A glance at quality score: implication for de novo transcriptome reconstruction of Illumina reads.

Mbandi SK, Hesse U, Rees DJ, Christoffels A - Front Genet (2014)

Bottom Line: The effects of quality score based trimming have not been systematically studied in de novo transcriptome assembly. We showed that assemblies produced from reads subjected to different quality score thresholds contain truncated and missing transfrags when compared to those from untrimmed reads. However, our results indicate that comparing the assemblies from untrimmed and trimmed read subsets can suggest appropriate filtering parameters and enable selection of the optimum de novo transcriptome assembly in non-model organisms.


Affiliation: South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa.

ABSTRACT
Downstream analyses of short reads from next-generation sequencing platforms are often preceded by a pre-processing step that removes uncalled and wrongly called bases. Standard approaches rely on the associated base quality scores to retain a read, or a portion of it, when the score is above a predefined threshold. Without a reference, it is difficult to differentiate sequencing error from biological variation using quality scores alone. The effects of quality score based trimming have not been systematically studied in de novo transcriptome assembly. Using RNA-Seq data produced on the Illumina platform, we teased out the effects of quality score based filtering or trimming on de novo transcriptome reconstruction. We showed that assemblies produced from reads subjected to different quality score thresholds contain truncated and missing transfrags when compared to those from untrimmed reads. Our data support the view that de novo assembly of untrimmed data is challenging for de Bruijn graph assemblers. However, our results indicate that comparing the assemblies from untrimmed and trimmed read subsets can suggest appropriate filtering parameters and enable selection of the optimum de novo transcriptome assembly in non-model organisms.
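To make the idea of threshold-based trimming concrete, the following is a minimal sketch of 3'-end quality trimming for a single read. It is not the tool or procedure used in the study: the function names, the simple "cut from the 3' end" rule, and the Phred+33 quality encoding are all assumptions made for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    The offset of 33 (Phred+33) is an assumption; older Illumina
    pipelines used an offset of 64.
    """
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, threshold=20, offset=33):
    """Drop bases from the 3' end until a base with score >= threshold is met.

    Returns the trimmed (sequence, quality_string) pair. A read whose
    bases all fall below the threshold is trimmed to empty strings.
    """
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return sequence[:end], quality_string[:end]


# Hypothetical read: five high-quality bases followed by a low-quality tail.
read_seq = "ACGTACGT"
read_qual = "IIIII+##"  # decodes to 40,40,40,40,40,10,2,2 under Phred+33

q20_seq, q20_qual = trim_3prime(read_seq, read_qual, threshold=20)
q10_seq, q10_qual = trim_3prime(read_seq, read_qual, threshold=10)
```

With this hypothetical read, the Q20 threshold removes one more base than the Q10 threshold, which illustrates on a small scale how a more aggressive threshold shortens reads before assembly.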



Figure 2: A GBrowse snapshot of predicted genes and transfrags (TFs) for V. inaequalis. Ab initio gene predictions are shown in red. TFs produced by Trinity with untrimmed and trimmed (category one) reads are shown in orange and green, respectively.

Mentions: To investigate the potential side effects of quality-based trimming and artifact removal on de novo transcriptome assembly, we analyzed datasets from a model organism (N. crassa) and a non-model organism (V. inaequalis). A summary of read counts for each category of untrimmed and trimmed reads is shown in Table 1. More reads are removed when quality-based trimming is preceded by adapter removal than when the order is reversed. The percentage of trimmed reads ranged from 35 to 88%. Out of ~134 Gb of untrimmed V. inaequalis reads, quality trimming preceded by adapter removal retained the fewest reads. When comparing assemblies from the various categories of reads, we note that the number of unique TFs from untrimmed reads is always higher than that from trimmed reads, irrespective of the assembler and dataset used (Figure 1). For N. crassa TFs, this is much more pronounced at lower k-mer values. A similar trend is observed in the number of TFs derived from untrimmed and trimmed reads that map to the same genomic loci. TFs produced with untrimmed reads recovered a higher number of known N. crassa proteins than those from the trimmed reads (Table 1). A total of 521 known gene loci were identified in N. crassa that overlapped with TFs derived from untrimmed but not trimmed reads. Transcriptome assembly statistics for each category of quality-trimmed reads and the HSP ratios are shown in Table 1. The number of unique TFs is comparable among all assemblies for each organism. Untrimmed reads generated the largest number of TFs and identified the largest number of known UniProt proteins. Sequence similarity search identified 791 proteins that were present in all V. inaequalis assemblies. For N. crassa, 6218 proteins were common to all assemblies generated with Trinity. Kruskal–Wallis one-way analysis of variance suggests that quality score based pre-processing had a significant effect on TF quality in both the N. crassa (p = 0.002999) and V. inaequalis (p < 2.2e-16) data.
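As a reminder of what the Kruskal–Wallis test computes, here is a minimal pure-Python sketch of the H statistic with the standard tie correction. The function name and the toy values are hypothetical and are not data from the study; in practice a statistics package would also supply the p-value, which comes from a chi-square distribution with (number of groups − 1) degrees of freedom.

```python
def kruskal_wallis_h(*groups):
    """Return the tie-corrected Kruskal-Wallis H statistic.

    Each argument is a sequence of observations for one group (for
    example, HSP ratios from one trimming category). Assumes at least
    two observations overall and that not all values are identical.
    """
    pooled = sorted(value for group in groups for value in group)
    n_total = len(pooled)

    # Assign each distinct value the mean of the ranks it occupies.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_size = j - i
        tie_term += tie_size ** 3 - tie_size
        i = j

    # H = 12 / (N (N + 1)) * sum(R_g^2 / n_g) - 3 (N + 1)
    rank_sum_term = sum(
        sum(rank_of[value] for value in group) ** 2 / len(group)
        for group in groups
    )
    h = 12 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    return h / correction


# Toy, made-up values for three hypothetical read categories.
h_stat = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
```

Because the test works on ranks rather than raw values, it does not assume that the HSP ratios are normally distributed, which is a common reason for choosing it over a parametric one-way ANOVA.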
The mean and median HSP ratios for TFs from untrimmed reads were slightly higher than those from trimmed reads for both N. crassa and V. inaequalis. In addition, the untrimmed datasets had the least variation (Table 1). Multiple comparison testing of the HSP ratios is shown in Table 1. Post hoc analysis indicated that the more aggressive Q20 trimming produced TFs of inferior quality compared to Q10 trimming. TFs from Q10 and untrimmed reads did not differ significantly in HSP ratio. Groups with the same letters are not statistically different. For V. inaequalis, category one and two trimming strategies were significantly different from the other five categories (p < 0.01). In both the N. crassa and V. inaequalis datasets, TFs from untrimmed reads produced higher N50 values. Visual assessment of aligned V. inaequalis TFs from untrimmed and trimmed reads (category two) reveals missing TFs and incomplete TF reconstruction in the latter, as shown in Figure 2.
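For readers unfamiliar with the N50 statistic mentioned above, a minimal sketch of its usual definition follows: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled length. The function name and the example lengths are hypothetical and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("n50() requires at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating length until
    # at least half of the total assembled length is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical transfrag lengths: total 54, and 10 + 9 + 8 = 27 covers half.
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])  # 8
```

A higher N50 indicates that more of the assembly sits in longer sequences, but it says nothing about correctness on its own, which is one reason to read it alongside a measure such as the HSP ratio.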

