Enhancing de novo transcriptome assembly by incorporating multiple overlap sizes.

Chen CC, Lin WD, Chang YJ, Chen CL, Ho JM - ISRN Bioinform (2012)

Bottom Line: The experimental results showed that Euler-mix improves the performance of de novo transcriptome assembly.

Affiliation: Department of Computer Science and Information Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Rd., Taipei 10617, Taiwan.

ABSTRACT
Background. The emergence of next-generation sequencing platforms has given rise to a new generation of assembly algorithms. Compared with Sanger sequencing data, next-generation sequencing data have shorter reads, higher coverage depth, and different error profiles. These features pose new challenges for de novo transcriptome assembly. Methodology. To explore the influence of these features on assembly algorithms, we studied the relationship between read overlap size, coverage depth, and error rate using simulated data. Based on this relationship, we propose a de novo transcriptome assembly procedure, called Euler-mix, and demonstrate its performance on a real mouse transcriptome dataset. The simulation tool and evaluation tool are freely available as open source. Significance. Euler-mix is a straightforward pipeline that focuses on handling the variation in coverage depth of short-read datasets. The experimental results showed that Euler-mix improves the performance of de novo transcriptome assembly.


Figure 4: Histogram of the coverage depths (expression levels) of the 26,332 mouse transcripts.

We created a synthetic dataset that mimicked the experimental data of transcriptome shotgun sequencing. The synthetic dataset of 80 million paired-end 36 bp reads was randomly sampled from 26,332 mouse transcripts, which were collected from the NCBI RefSeq database [21]. To mimic the varied coverage depth of transcriptome shotgun sequencing data, the number of reads for each transcript was proportional to its number of ESTs multiplied by its length, where the EST numbers were computed from the NCBI dbEST database. Most transcripts have low coverage depths, and the variation in coverage depth is large, ranging from 1 to 4,266 (see Figure 4). Additionally, the coverage depth follows a power-law distribution and is similar to the experimental data of whole-transcriptome shotgun sequencing for HeLa [22]. To better match real-world data, we applied error rates that increase slightly from the start to the end of each read. For the average error rate of 0.3%, the error rate at the first nucleotide is 0.2% and increases by 0.005% for each subsequent nucleotide. Similarly, for average error rates of 0.6%, 0.9%,…, and 2.4%, the error rates start at 0.5%, 0.8%,…, and 2.3%, respectively. The insert sizes were uniformly distributed from 175 to 225.
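
The sampling scheme described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' simulation tool: all function names are hypothetical, errors are assumed to be substitutions (the text does not state the error type), the insert-size bounds are assumed to be inclusive, and read counts are simply rounded. Under these assumptions, a 36-nucleotide read that starts at 0.2% and rises by 0.005% per position has a mean error rate of about 0.29%, which is consistent with the stated average of 0.3%.

```python
import random

READ_LEN = 36  # read length stated in the text


def error_profile(start_rate, step=0.00005, read_len=READ_LEN):
    """Per-position error probability that rises linearly along the read.

    start_rate and step are fractions, so 0.2% is 0.002 and 0.005% is 0.00005.
    """
    return [start_rate + step * i for i in range(read_len)]


def allocate_reads(transcripts, total_reads):
    """Split total_reads in proportion to EST count times transcript length.

    transcripts maps a name to an (est_count, length) pair.
    Rounding is an assumption; the counts may not sum exactly to total_reads.
    """
    weights = {name: est * length for name, (est, length) in transcripts.items()}
    total = sum(weights.values())
    return {name: round(total_reads * w / total) for name, w in weights.items()}


def mutate(read, profile, rng):
    """Apply substitution errors (assumed error model) position by position."""
    out = []
    for base, p in zip(read, profile):
        if rng.random() < p:
            # Replace the base with one of the three other nucleotides.
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)


def sample_insert_size(rng, low=175, high=225):
    """Draw an insert size uniformly from low to high (bounds assumed inclusive)."""
    return rng.randint(low, high)


if __name__ == "__main__":
    rng = random.Random(0)
    profile = error_profile(0.002)
    print("mean error rate:", sum(profile) / len(profile))
    print(allocate_reads({"tx1": (1, 1000), "tx2": (3, 1000)}, 400))
    print(mutate("ACGT" * 9, profile, rng), sample_insert_size(rng))
```

Passing a different `start_rate` (for example 0.005 or 0.008) reproduces the other error settings listed in the text, since only the starting value changes while the per-position increment stays the same.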

