Assessing De Novo transcriptome assembly metrics for consistency and utility.

O'Neil ST, Emrich SJ - BMC Genomics (2013)

Bottom Line: We simulated sequencing transcripts of Drosophila melanogaster. We found several annotation-based metrics to be consistent and informative, including contig reciprocal best hit count and contig unique annotation count. Our results provide an important review of these metrics and give researchers tools to produce the highest quality transcriptome assemblies.


Affiliation: Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR 97333, USA.

ABSTRACT

Background: Transcriptome sequencing and assembly represent a great resource for the study of non-model species, and many metrics have been used to evaluate and compare these assemblies. Unfortunately, it is still unclear which of these metrics accurately reflect assembly quality.

Results: We simulated sequencing transcripts of Drosophila melanogaster. By assembling these simulated reads using both a "perfect" and a modern transcriptome assembler while varying read length and sequencing depth, we evaluated quality metrics to determine whether they 1) revealed perfect assemblies to be of higher quality, and 2) revealed perfect assemblies to be more complete as data quantity increased. Several commonly used metrics were not consistent with these expectations, including average contig coverage and length, though they became consistent when singletons were included in the analysis. We found several annotation-based metrics to be consistent and informative, including contig reciprocal best hit count and contig unique annotation count. Finally, we evaluated a number of novel metrics such as reverse annotation count, contig collapse factor, and the ortholog hit ratio, discovering that each assesses assembly quality in a unique way.

Conclusions: Although much attention has been given to transcriptome assembly, little research has focused on determining how best to evaluate assemblies, particularly in light of the variety of options available for read length and sequencing depth. Our results provide an important review of these metrics and give researchers tools to produce the highest quality transcriptome assemblies.


Figure 2: Newbler read usage by transcript sampling rate. Percentage of reads assembled, binned by transcript sequence abundance, for the 2.2M read and 1,000 bp read length Newbler assemblies. Fewer reads were assembled for transcripts with high abundance, particularly for the long read dataset.

Mentions: Ewen-Campen et al. reported their singletons to be highly redundant based on annotation, and employed a secondary CAP3 assembly strategy for them. To assess the redundancy of singletons produced by the non-perfect assembler, we compared singleton counts by source transcript to the simulated sampling frequency. We found that for both the highest sequencing depth and longest read length assemblies, most singletons were sourced from transcripts with the highest representation. Figure 2 shows average read usage for transcripts in these assemblies binned by probability of read selection. In both cases, reads from the rarest transcripts were more likely to be left as singletons (as expected given our non-uniform sampling; see Methods). For the high sequencing depth assembly, read usage initially decreases, then increases slightly as transcripts become more abundant. For the long read length assembly, read usage is lower overall and drops significantly as abundance increases: only 4–10% of reads are assembled from the most common transcripts.
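
The binned read-usage summary described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `percent_assembled_by_bin`, the `(abundance, was_assembled)` input layout, the bin boundaries, and the toy data are all hypothetical and do not come from the paper.

```python
from bisect import bisect_right


def percent_assembled_by_bin(reads, bin_edges):
    """Percentage of reads assembled, binned by source-transcript abundance.

    reads: iterable of (abundance, was_assembled) pairs, one per read, where
        abundance is the sampling probability of the read's source transcript
        (hypothetical input layout).
    bin_edges: ascending boundaries; a read falls in bin i when
        bin_edges[i-1] <= abundance < bin_edges[i], with open-ended first
        and last bins.
    Returns one percentage per bin, or None for bins with no reads.
    """
    n_bins = len(bin_edges) + 1
    totals = [0] * n_bins
    assembled = [0] * n_bins
    for abundance, was_assembled in reads:
        i = bisect_right(bin_edges, abundance)
        totals[i] += 1
        if was_assembled:
            assembled[i] += 1
    return [100.0 * a / t if t else None for a, t in zip(assembled, totals)]


# Toy data (made up for illustration): two rare, two mid, four abundant reads.
toy_reads = [
    (0.001, False), (0.001, True),
    (0.02, True), (0.02, True),
    (0.3, False), (0.3, False), (0.3, True), (0.3, False),
]
usage = percent_assembled_by_bin(toy_reads, [0.01, 0.1])
print(usage)  # [50.0, 100.0, 25.0]
```

A read that is not assembled into any contig is a singleton, so each bin's singleton percentage is 100 minus the value returned here.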

