Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling.

Łabaj PP, Leparc GG, Linggi BE, Markillie LM, Wiley HS, Kreil DP - Bioinformatics (2011)

Bottom Line: Based on established tools, we then introduce a new approach for mapping and analysing sequencing reads that yields substantially improved performance in gene expression profiling, increasing the number of transcripts that can reliably be quantified to over 40%. Extrapolations to higher sequencing depths highlight the need for efficient complementary steps. In the discussion, we outline possible experimental and computational strategies for further improvements in quantification precision.


Affiliation: Boku University Vienna, 1190 Muthgasse 18, Vienna, Austria.

ABSTRACT

Motivation: Measurement precision determines the power of any analysis to reliably identify significant signals, such as in screens for differential expression, independent of whether the experimental design incorporates replicates or not. With the compilation of large-scale RNA-Seq datasets with technical replicate samples, however, we can now, for the first time, perform a systematic analysis of the precision of expression level estimates from massively parallel sequencing technology. This then allows considerations for its improvement by computational or experimental means.

Results: We report on a comprehensive study of target identification and measurement precision, including their dependence on transcript expression levels, read depth and other parameters. In particular, an impressive recall of 84% of the estimated true transcript population could be achieved with 331 million 50 bp reads, with diminishing returns from longer read lengths and even smaller gains from increased sequencing depths. Most of the measurement power (75%) is spent on only 7% of the known transcriptome, however, making less strongly expressed transcripts harder to measure. Consequently, <30% of all transcripts could be quantified reliably with a relative error <20%. Based on established tools, we then introduce a new approach for mapping and analysing sequencing reads that yields substantially improved performance in gene expression profiling, increasing the number of transcripts that can reliably be quantified to over 40%. Extrapolations to higher sequencing depths highlight the need for efficient complementary steps. In the discussion, we outline possible experimental and computational strategies for further improvements in quantification precision.

Contact: rnaseq10@boku.ac.at


Figure 3: Cumulative distribution of read alignments across transcript targets. The cumulative fraction of read alignments (y-axis) is plotted against the percentage of transcript targets to which they have been mapped (x-axis). Over 75% of all read alignments cover less than 7% of the known transcriptome (circle symbol). Two particular positions are marked by vertical lines in the figure. The 41% of targets with the highest expression are to the left of the first line (dotted). The vast majority of read alignments (99.5%) have been assigned to these targets, supporting a reliable measurement of their expression levels. Consequently, most of them (84%) could be determined with an error of 20% or less. On average, 67% of all transcript targets were identified in a measurement, and this is marked by the second line (dashed). A substantial number of transcript targets fall between the two lines, receiving as few as one read alignment. Consequently, most of these targets could not be quantified reliably. The remaining 33% of transcript targets, falling to the right of the second line (dashed), were either undetected or not expressed.
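
A curve of the kind described in this caption can be computed from a vector of per-target alignment counts. The following is a minimal sketch, not the authors' pipeline: the function names are hypothetical, and the counts are synthetic values drawn from a skewed distribution purely for illustration, so the printed numbers will not match the figures reported in the paper.

```python
import numpy as np


def cumulative_read_share(counts):
    """Rank targets by decreasing alignment count and return
    (fraction of targets, cumulative fraction of alignments)."""
    ranked = np.sort(np.asarray(counts, dtype=float))[::-1]
    total = ranked.sum()
    if total <= 0:
        raise ValueError("counts must contain at least one alignment")
    target_fraction = np.arange(1, ranked.size + 1) / ranked.size
    read_fraction = np.cumsum(ranked) / total
    return target_fraction, read_fraction


def targets_needed(counts, read_share):
    """Smallest fraction of top-ranked targets that together hold
    at least `read_share` of all alignments."""
    target_fraction, read_fraction = cumulative_read_share(counts)
    idx = np.searchsorted(read_fraction, read_share, side="left")
    idx = min(idx, target_fraction.size - 1)  # guard against rounding near 1.0
    return float(target_fraction[idx])


def detected_fraction(counts):
    """Fraction of targets that received at least one alignment."""
    return float(np.mean(np.asarray(counts) > 0))


# Synthetic example (assumed data, not from the study): log-normal
# expression levels sampled with Poisson noise give a skewed count vector.
rng = np.random.default_rng(0)
expression = rng.lognormal(mean=0.0, sigma=3.0, size=10_000)
counts = rng.poisson(expression)

print("targets holding 75% of alignments:", targets_needed(counts, 0.75))
print("targets holding 99.5% of alignments:", targets_needed(counts, 0.995))
print("targets detected at least once:", detected_fraction(counts))
```

Plotting `read_fraction` against `target_fraction` gives the cumulative curve; the two helper functions correspond to reading off the circle symbol and the dashed line, respectively.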

Mentions: When assessing expression levels by randomly sequencing reads from the transcriptome, one expects that some high abundance transcripts can dominate results, such as certain housekeeping genes (e.g. actin, ubiquitin, etc.), or genes abundant in specific cell types or tissues such as secretory proteins or myosin. The difficulty of reliably measuring the expression levels of low abundance spliceforms can be understood from a study of the distribution of sequence reads across transcripts (Fig. 3). On average, 67% of all targets were identified in a measurement (dashed vertical line). Reflecting the complexity of the transcriptome and a highly skewed distribution of expression levels, over 75% of the collected read alignments hit just 7% of all the known spliceforms (circle symbol). Indeed, the vast majority of read alignments (99.5%) have been assigned to the 41% most abundant targets (to the left of the dotted vertical line). Consequently, the expression levels for most of these targets could be determined reliably with an error ≤20%. In contrast, many targets fall between the two vertical lines, receiving as few as one read alignment. As a result, most of those could not be quantified with such precision. It is thus interesting to examine how the read depth of an RNA-Seq experiment affects the distribution of genes that can be measured reliably.
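
A simple counting argument suggests why targets with only a handful of alignments cannot meet a 20% error threshold. The sketch below assumes idealised Poisson sampling noise, where a target with n expected alignments has a coefficient of variation of 1/sqrt(n). This is an assumption made for illustration only; it is not the error measure used in the study, which is based on technical replicates, and real data typically show additional variance. The function names are hypothetical.

```python
import math


def poisson_cv(n_reads):
    """Coefficient of variation of a Poisson count with mean n_reads."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return 1.0 / math.sqrt(n_reads)


def min_reads_for_cv(max_cv):
    """Smallest integer read count whose Poisson CV is at most max_cv."""
    if max_cv <= 0:
        raise ValueError("max_cv must be positive")
    # Start just below the analytic estimate 1 / max_cv**2 and step up,
    # which avoids off-by-one results from floating-point rounding.
    n = max(1, math.floor(1.0 / max_cv ** 2))
    while poisson_cv(n) > max_cv:
        n += 1
    return n


for n in (1, 5, 25, 100):
    print(f"{n:>4} alignments -> CV of about {poisson_cv(n):.2f}")

print("alignments needed for CV <= 0.20:", min_reads_for_cv(0.20))
```

Under this idealised model, a target with a single alignment has a relative sampling error of about 100%, and roughly 25 alignments are needed before counting noise alone drops to 20%.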

