Impact of RNA-seq attributes on false positive rates in differential expression analysis of de novo assembled transcriptomes.

González E, Joly S - BMC Res Notes (2013)

Bottom Line: All datasets were compared via DE analyses and because all samples come from the same sequencing run, DE of genes or isoforms can be interpreted as false positives resulting from sequence attributes. The effect of paired-end vs. single-end strategy was found to have a much greater impact in terms of false positives than sequence length. In light of false positive rate results, we recommend using paired-end over single-end sequences in differential expression studies, even if the impact is less serious for differential gene expression.


Affiliation: Institut de recherche en biologie végétale, Université de Montréal, 4101 Sherbrooke E, Montréal, H1X 2B2, (QC), Canada. emmanuel.gonzalez@umontreal.ca.

ABSTRACT

Background: High-throughput RNA sequencing studies are becoming increasingly popular, and differential expression studies represent an important downstream analysis that often follows de novo transcriptome assembly. Although much attention has been given to bioinformatics tools for differential gene expression, little has yet been given to the impact of the sequence data used in these pipelines.

Results: We tested how using reads of a different type from those used to assemble a de novo transcriptome (differing in length and pairing attributes) could affect differential expression (DE) results. To investigate this, we created artificial datasets from the long paired-end RNA-seq datasets initially used to build the assembly. All datasets were compared via DE analyses, and because all samples come from the same sequencing run, DE of genes or isoforms can be interpreted as false positives resulting from sequence attributes. Although the false positive rate for differential gene expression does not seem to be strongly affected by sequencing strategy (max. of 3.5%), it could reach 12.2% or 28.1% for differential isoform expression, depending on the pipeline used. The effect of paired-end vs. single-end strategy was found to have a much greater impact on false positives than sequence length.

Conclusion: In light of false positive rate results, we recommend using paired-end over single-end sequences in differential expression studies, even if the impact is less serious for differential gene expression.
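The artificial datasets described in the Results can be illustrated with a minimal sketch. It assumes the derived sets are obtained by truncating reads and by dropping the second mate of each pair; the abstract does not state the actual procedure, tools, or read lengths, so the `Read` type, the `derive_datasets` function, and the `short_len` parameter are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Read(NamedTuple):
    """One sequencing read (hypothetical container, not from the paper)."""
    name: str
    seq: str
    qual: str


def trim(read: Read, length: int) -> Read:
    """Keep only the first `length` bases and their quality scores."""
    return Read(read.name, read.seq[:length], read.qual[:length])


def derive_datasets(pairs: List[Tuple[Read, Read]], short_len: int) -> Dict[str, list]:
    """Build four datasets from one set of long paired-end reads.

    Assumption: shorter reads are made by truncation and single-end
    reads by keeping only the first mate of each pair.
    """
    return {
        "long_paired": list(pairs),
        "short_paired": [(trim(r1, short_len), trim(r2, short_len)) for r1, r2 in pairs],
        "long_single": [r1 for r1, _ in pairs],
        "short_single": [trim(r1, short_len) for r1, _ in pairs],
    }
```

Because every derived dataset comes from the same reads, any expression difference detected between them can only come from read length or pairing, which is the premise of the study.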

© Copyright Policy - open-access

Figure 3: False positive ratios in DE experiment (pipeline 2) as a function of input data type.

Mentions: Because we are testing differential gene or isoform expression of a sample against itself, the expectation is to find no differential expression. Indeed, if a read is distinctive enough from the others, it should map back to the transcriptome unambiguously. If pairing type or sequence length did not affect read specificity, the different datasets would yield exactly the same gene and isoform abundances, and no differential expression would be observed. As shown in Figures 2 and 3, only sets of exactly the same length and pairing type show this trend (“same data” line). False positives appear when modified RNA-seq datasets are compared. Any observed differential expression therefore means that the difference in sequencing strategy between datasets affects the abundance estimates.
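Since every comparison here is of a sample against itself, any feature called differentially expressed counts as a false positive. The sketch below shows one way such a ratio could be computed from a list of adjusted p-values. The function name, the 0.05 threshold, and the use of adjusted p-values are assumptions for illustration; the text above does not specify how the ratios in Figures 2 and 3 were calculated.

```python
from typing import Optional, Sequence


def false_positive_ratio(adjusted_p_values: Sequence[Optional[float]],
                         alpha: float = 0.05) -> float:
    """Fraction of tested features called DE in a self-vs-self comparison.

    Assumption: a feature is "called DE" when its adjusted p-value is
    below `alpha`. Features with no p-value (None) are treated as untested.
    """
    tested = [p for p in adjusted_p_values if p is not None]
    if not tested:
        return 0.0
    called = sum(1 for p in tested if p < alpha)
    return called / len(tested)
```

For a comparison between identical datasets (the “same data” line), this ratio is expected to be zero.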

