Reducing bias in RNA sequencing data: a novel approach to compute counts.

Finotello F, Lavezzo E, Bianco L, Barzon L, Mazzon P, Fontana P, Toppo S, Di Camillo B - BMC Bioinformatics (2014)

Bottom Line: The two measures are compared using multiple data sets and several evaluation criteria: independence from gene-specific covariates, such as exon length and GC-content; accuracy and precision in the quantification of true concentrations; and robustness of measurements to variations in alignment quality. In summary, we confirm that counts computed with the standard approach depend on the length of the feature they are summarized on and are sensitive to the non-uniform distribution of reads along transcripts. In contrast, maxcounts are robust to biases due to the non-uniform distribution of reads and are characterized by lower technical variability.


ABSTRACT

Background: In the last decade, Next-Generation Sequencing technologies have been extensively applied to quantitative transcriptomics, making RNA sequencing a valuable alternative to microarrays for measuring and comparing gene transcription levels. Although several methods have been proposed to provide an unbiased estimate of transcript abundances through data normalization, all of them are based on an initial count of the total number of reads mapping to each transcript. This procedure, in principle robust to random noise, is actually error-prone if reads are not uniformly distributed along sequences, as indeed happens because of sequencing errors and ambiguity in read mapping. Here we propose a new approach, called maxcounts, which quantifies the expression assigned to an exon as the maximum of its per-base counts, and we assess its performance in comparison with the standard approach described above, which considers the total number of reads aligned to an exon. The two measures are compared using multiple data sets and several evaluation criteria: independence from gene-specific covariates, such as exon length and GC-content; accuracy and precision in the quantification of true concentrations; and robustness of measurements to variations in alignment quality.
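To make the two measures concrete, the following is a minimal sketch, not the authors' implementation. It assumes half-open (start, end) coordinates, counts a read towards an exon whenever it overlaps the exon by at least one base, and uses made-up toy reads; the function names and the overlap rule are assumptions chosen for illustration.

```python
import numpy as np

def per_base_coverage(exon_start, exon_end, reads):
    """Number of reads covering each base of the half-open exon [exon_start, exon_end).

    reads is an iterable of half-open (start, end) alignment intervals (assumed format).
    """
    coverage = np.zeros(exon_end - exon_start, dtype=int)
    for read_start, read_end in reads:
        # Clip the read to the exon and convert to exon-relative indices.
        lo = max(read_start, exon_start) - exon_start
        hi = min(read_end, exon_end) - exon_start
        if lo < hi:
            coverage[lo:hi] += 1
    return coverage

def totcounts(exon_start, exon_end, reads):
    """Standard measure: total number of reads overlapping the exon."""
    return sum(1 for s, e in reads if s < exon_end and e > exon_start)

def maxcounts(exon_start, exon_end, reads):
    """Maxcounts measure: maximum of the exon's per-base counts."""
    coverage = per_base_coverage(exon_start, exon_end, reads)
    return int(coverage.max()) if coverage.size else 0

# Toy reads (illustrative only); the last one lies outside the exon.
example_reads = [(90, 140), (120, 170), (130, 180), (160, 210), (300, 350)]

print(totcounts(100, 200, example_reads))  # 4
print(maxcounts(100, 200, example_reads))  # 3
```

In this sketch totcounts grows with every additional read that touches the exon, so a longer exon at the same coverage depth collects more reads, whereas maxcounts depends only on the deepest position.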

Results: Both measures show high accuracy and low dependency on GC-content. However, maxcounts expression quantification is less biased towards long exons than the standard approach. Moreover, it shows lower technical variability at low expression levels and is more robust to variations in the quality of alignments.

Conclusions: In summary, we confirm that counts computed with the standard approach depend on the length of the feature they are summarized on and are sensitive to the non-uniform distribution of reads along transcripts. In contrast, maxcounts are robust to biases due to the non-uniform distribution of reads and are characterized by lower technical variability. Hence, we propose maxcounts as an alternative approach for quantitative RNA-sequencing applications.


Figure 6: Data variance and coefficient of variation. Variance and coefficient of variation (CV) of Jiang's data: variance vs. mean of log-counts/RPKMs (left plots) and CV vs. log-mean of counts/RPKMs (right plots). Curves represent cubic-spline fits computed on variance/CV, averaged in bins of 5000 exons each. Because four approaches are compared (maxcounts, totcounts, totcounts normalized with RPKM (RPKM), and totcounts normalized with within-lane full-quantile normalization over exon length (FullQ)), x-values are scaled to the range [0, 1] to make them comparable.
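For reference, the two quantities named in the caption are commonly defined as follows. These are the standard textbook definitions, given here as an assumption; the exact estimators used for the figure may differ.

```latex
\mathrm{RPKM}_e = \frac{10^{9}\, C_e}{N \, L_e},
\qquad
\mathrm{CV}_e = \frac{\sigma_e}{\mu_e}
```

Here C_e is the number of reads assigned to exon e, N is the total number of mapped reads in the sample, L_e is the exon length in bases, and σ_e and μ_e are the standard deviation and mean of the exon's expression measure across replicates.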

Mentions: To easily compare variance of totcounts (and its normalized versions) versus maxcounts, at different expression intensities, we quantized the estimated average expression intensities in intervals of equal size and, for each interval, we calculated the average intensity and the average variance as explained in [38]. Finally we fitted data using a cubic spline (Figure 6 and Additional Files 9 and 10).
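The binning step described above can be sketched as follows. This is a hedged illustration rather than the procedure of [38]: it assumes a matrix of log-counts with exons in rows and technical replicates in columns, groups exons into bins of roughly equal numbers after sorting by mean (following the "bins of 5000 exons" in the figure caption), and min-max scales the x-values. The function names and the synthetic input are hypothetical, and the final cubic-spline fit is left out.

```python
import numpy as np

def binned_mean_variance(log_counts, bin_size=5000):
    """Average intensity and average variance per bin of about bin_size exons.

    log_counts: 2-D array, rows = exons, columns = technical replicates (assumed layout).
    """
    log_counts = np.asarray(log_counts, dtype=float)
    exon_mean = log_counts.mean(axis=1)
    exon_var = log_counts.var(axis=1, ddof=1)  # sample variance across replicates
    order = np.argsort(exon_mean)              # sort exons by mean intensity
    n_bins = max(1, len(order) // bin_size)
    bins = np.array_split(order, n_bins)       # consecutive, nearly equal-sized bins
    bin_mean = np.array([exon_mean[b].mean() for b in bins])
    bin_var = np.array([exon_var[b].mean() for b in bins])
    return bin_mean, bin_var

def scale_to_unit_range(x):
    """Min-max scale x to [0, 1] so curves from different measures share an x-axis."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

# Synthetic data for illustration only: 20000 exons, 3 replicates.
rng = np.random.default_rng(0)
true_level = rng.uniform(0, 10, size=(20000, 1))
demo = rng.normal(loc=true_level, scale=1.0, size=(20000, 3))

bin_mean, bin_var = binned_mean_variance(demo)
x_scaled = scale_to_unit_range(bin_mean)
```

A cubic spline could then be fitted to (x_scaled, bin_var), for example with scipy.interpolate.UnivariateSpline using k=3; the choice of smoothing is an assumption left to the reader.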

