High-throughput and quantitative genome-wide messenger RNA sequencing for molecular phenotyping.

Collins JE, Wali N, Sealy IM, Morris JA, White RJ, Leonard SR, Jackson DK, Jones MC, Smerdon NC, Zamora J, Dooley CM, Carruthers SN, Barrett JC, Stemple DL, Busch-Nentwich EM - BMC Genomics (2015)

Bottom Line: Multiplex sequencing of transcript 3' ends identifies differential transcript abundance independent of gene annotation. We show that increasing biological replicate number while maintaining the total amount of sequencing identifies more differentially abundant transcripts. This method can be implemented on polyadenylated RNA from any organism with an annotated reference genome and in any laboratory with access to Illumina sequencing.


Affiliation: Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK.

ABSTRACT

Background: We present a genome-wide messenger RNA (mRNA) sequencing technique that converts small amounts of RNA from many samples into molecular phenotypes. It encompasses all steps from sample preparation to sequence analysis and is applicable to baseline profiling or perturbation measurements.

Results: Multiplex sequencing of transcript 3' ends identifies differential transcript abundance independent of gene annotation. We show that increasing biological replicate number while maintaining the total amount of sequencing identifies more differentially abundant transcripts.

Conclusions: This method can be implemented on polyadenylated RNA from any organism with an annotated reference genome and in any laboratory with access to Illumina sequencing.




Fig. 5: Comparing RNA-seq and transcript counting. The genes identified as showing differential transcript abundance in experiment lamc1sa379 with an adjusted p-value ≤ 0.05 and an absolute fold change of ≥ 2 by one or both methods at the less stringent proximity filter between the transcript counting 3′ end (TC 3′ end) and the Ensembl transcript end (Table 1 row 16) were plotted with their log2 fold change from RNA-seq against the log2 fold change from TC. Genes with an adjusted p-value ≤ 0.05 and absolute fold change ≥ 2 in both RNA-seq and TC are shown as blue circles. Genes which fail on one or both criteria in RNA-seq but not in TC are shown as green circles and vice versa as red circles. (a) The log2 fold changes of genes with the TC 3′ end and Ensembl transcript proximity filter between -100 and +5000. The arrows highlight examples of genes which are not seen in panel b. (b) The log2 fold changes of genes with the proximity filter between -100 and +100. (c) A table showing the number of genes represented by each circle colour. The fourth and fifth columns show the genes which are lost during the more stringent proximity filtering and whether these are considered true positives or false positives after examining the TC 3′ end. Note that for two genes where the RNA-seq and TC data correlated, the closest TC 3′ end was found to be false, but a true end was identified further downstream; these are indicated by square brackets. The number in curly brackets indicates genes with a true TC 3′ end and an absolute fold change > 2 but which fail to show an adjusted p-value ≤ 0.05 in TC.

Mentions: Many mRNA expression pipelines use whole transcript RNA-seq protocols and a range of analysis tools [32]. We do not intend to replace these methods but present an alternative method for high-throughput mRNA expression screening. However, it is useful to compare the results of both methods. To this end we made DeTCT libraries and standard non-directional polyA pulldown Illumina RNA-seq libraries for two alleles. The same three wild-type and three mutant zebrafish total RNA samples were processed for each method, plus two or six additional libraries using the DeTCT protocol. We sequenced one HiSeq 2000 lane equivalent for each allele by each method (Table 1). Potential duplicate reads were identified and eliminated. The UMI in DeTCT allowed a more accurate identification of duplicates and hence fewer reads were dismissed as duplicates. Read 2s from the RNA-seq data that mapped to the genome were compared to Ensembl gene annotation to produce count data for each gene, and the DeTCT pipeline was used to extract count data linked to Ensembl transcripts from the transcript counting reads. DESeq2 was run on both sets of count data. Even after removing duplicate reads there are generally more counts in the RNA-seq count data (Table 1). This is probably due to a drop in read quality following the oligo dT sequence in transcript counting (TC) read 1. RNA-seq initially identifies more genes; however, the gap between the methods is reduced substantially when regions with a low mean count (which are unlikely to be called significantly differentially expressed due to lack of power) are filtered out by DESeq2 (Table 1 row 11). The number of TC 3′ ends (Table 1 row 9) is higher than that of DeTCT genes (Table 1 row 8), which suggests alternative 3′ ends [6, 33] (see Additional file 4 for an example). However, some represent false positive TC 3′ ends which escape our filter for false oligo dT priming.
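Why a UMI leads to fewer reads being dismissed as duplicates can be illustrated with a minimal sketch. It is not the DeTCT pipeline: the `Read` record, its field names and the toy reads are hypothetical, and it assumes that a duplicate is a read sharing its mapping position and strand (and, when a UMI is available, its UMI) with an earlier read.

```python
from typing import Iterable, NamedTuple, Set


class Read(NamedTuple):
    """Hypothetical minimal read record; the field names are illustrative only."""
    name: str
    chrom: str
    pos: int
    strand: str
    umi: str


def mark_duplicates(reads: Iterable[Read], use_umi: bool) -> Set[str]:
    """Return the names of reads flagged as duplicates.

    The first read seen for each key is kept; later reads with the same key
    are flagged. With use_umi=True the UMI is part of the key, so reads from
    distinct molecules that map to the same position are not flagged.
    """
    seen = set()
    duplicates = set()
    for read in reads:
        key = (read.chrom, read.pos, read.strand)
        if use_umi:
            key += (read.umi,)
        if key in seen:
            duplicates.add(read.name)
        else:
            seen.add(key)
    return duplicates


# Toy reads (made up for illustration, not data from the study).
reads = [
    Read("r1", "1", 1000, "+", "ACGT"),
    Read("r2", "1", 1000, "+", "ACGT"),  # same position, same UMI
    Read("r3", "1", 1000, "+", "TTGA"),  # same position, different UMI
    Read("r4", "2", 500, "-", "ACGT"),
]

position_only = mark_duplicates(reads, use_umi=False)
with_umi = mark_duplicates(reads, use_umi=True)
print(sorted(position_only), sorted(with_umi))
```

In this toy case the position-only key flags two reads, whereas the key that includes the UMI flags one, because the read with a different UMI is treated as a separate molecule.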
To assess the genes showing differential transcript abundance, the full gene list was filtered for protein-coding genes with an adjusted p-value ≤ 0.05 and an absolute fold change ≥ 2 (log2 fold change ≤ -1 or ≥ 1) between mutant and wild-type (Table 1 row 13 and volcano plots in Additional file 6). RNA-seq identifies more genes, but using a less stringent proximity filter between the TC 3′ end and the 3′ end of an Ensembl transcript increases the detection rate in DeTCT. For a direct comparison between the two methods we identified the genes with an adjusted p-value ≤ 0.05 and absolute fold change ≥ 2 (Table 1 row 16) for both methods and applied the less stringent proximity filter to the TC 3′ ends. The fold change for these genes in RNA-seq and TC was compared (Fig. 5 for lamc1sa379 and Additional file 7 for mdn1sa1349) and showed a good correlation with r² = 0.96 (blue circles on Fig. 5a), which suggests the two methods are finding the same alterations in transcript abundance. We next looked at genes which show an adjusted p-value ≤ 0.05 and an absolute fold change ≥ 2 by one method, but failed to meet one or both criteria by the other (red and green circles on Fig. 5a). Fourteen genes have an absolute fold change ≥ 2 by both methods, but lack sufficient power to reach an adjusted p-value ≤ 0.05 by one of them. Similarly, for 38 genes where one method fails to show an absolute fold change ≥ 2, the actual fold change is just below 2 (cut-off log2 fold change ≥ 0.8 or ≤ -0.8), suggesting further genes where the two methods give comparable fold change results. We then applied the stringent TC 3′ end proximity filter to the same data, which led to the removal of 39 genes (Fig. 5b). Examining the TC 3′ ends of these 39 genes showed they fell into two groups: either true TC 3′ ends or false TC 3′ ends assumed to be derived from experimental artefact (Fig. 5c).
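The thresholding and colour grouping described above can be sketched as follows. The thresholds (adjusted p-value ≤ 0.05, absolute log2 fold change ≥ 1) and the colour assignment come from the text and the Fig. 5 legend; the function names, the toy values and the use of the squared Pearson correlation for r² are assumptions made for illustration, not the authors' code.

```python
PADJ_MAX = 0.05    # adjusted p-value threshold from the text
LOG2FC_MIN = 1.0   # absolute fold change >= 2 on the log2 scale


def passes(padj: float, log2fc: float) -> bool:
    """True if a gene meets both criteria for one method."""
    return padj <= PADJ_MAX and abs(log2fc) >= LOG2FC_MIN


def classify(rnaseq: tuple, tc: tuple) -> str:
    """Assign the circle colour used in Fig. 5 from (padj, log2fc) pairs."""
    in_rnaseq = passes(*rnaseq)
    in_tc = passes(*tc)
    if in_rnaseq and in_tc:
        return "blue"    # called by both methods
    if in_tc:
        return "green"   # fails in RNA-seq but not in TC
    if in_rnaseq:
        return "red"     # fails in TC but not in RNA-seq
    return "none"


def r_squared(xs: list, ys: list) -> float:
    """Squared Pearson correlation (assumed here to be the r2 reported)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Toy values (made up): gene -> ((padj, log2fc) RNA-seq, (padj, log2fc) TC)
genes = {
    "g1": ((0.001, 2.0), (0.01, 1.8)),
    "g2": ((0.2, 1.5), (0.03, 1.4)),     # lacks power in RNA-seq
    "g3": ((0.01, -1.2), (0.01, -0.9)),  # fold change just below 2 in TC
    "g4": ((0.001, -3.0), (0.002, -2.7)),
    "g5": ((0.0005, 1.1), (0.04, 1.3)),
}

colours = {gene: classify(rs, tc) for gene, (rs, tc) in genes.items()}
blue = [gene for gene, colour in colours.items() if colour == "blue"]
r2_blue = r_squared(
    [genes[g][0][1] for g in blue],
    [genes[g][1][1] for g in blue],
)
print(colours, round(r2_blue, 3))
```

The toy gene `g3` shows the near-miss case discussed in the text: it is called by RNA-seq only, yet its TC log2 fold change of -0.9 lies within the relaxed ±0.8 cut-off.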
Where both methods gave an adjusted p-value ≤ 0.05 and an absolute fold change ≥ 2, all 14 were shown to be true ends (note that in two cases the closest TC 3′ end was found to be false, but a true TC 3′ end was found downstream). By contrast, in the gene sets only called by one method, 11/25 TC 3′ ends lost to more stringent filtering were false positives. Together, this analysis shows that the removal of 39 genes by increasing the stringency of the DeTCT proximity filter resulted in losing 28 true positives (14 of which were only found by one method), but prevented calling 11 false positive TC 3′ ends.
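The effect of tightening the proximity filter can be sketched as below. The two windows (-100 to +5000 and -100 to +100) are taken from the Fig. 5 legend; everything else is an assumption for illustration, in particular the sign convention (positive offsets taken to mean that the TC 3′ end lies downstream of the annotated Ensembl transcript end in the direction of transcription), the function names and the toy coordinates.

```python
LENIENT_WINDOW = (-100, 5000)   # less stringent filter (Fig. 5a)
STRINGENT_WINDOW = (-100, 100)  # more stringent filter (Fig. 5b)


def signed_offset(tc_end: int, annotated_end: int, strand: str) -> int:
    """Offset of the TC 3' end from the annotated transcript end.

    Assumed convention: positive means downstream in the direction of
    transcription, so the genomic difference is negated on the minus strand.
    """
    diff = tc_end - annotated_end
    return diff if strand == "+" else -diff


def within_window(offset: int, window: tuple) -> bool:
    """True if the offset lies inside the inclusive window."""
    low, high = window
    return low <= offset <= high


# Toy TC 3' ends (made up): gene -> (tc_end, annotated_end, strand)
ends = {
    "a": (10050, 10000, "+"),  # offset +50
    "b": (12500, 10000, "+"),  # offset +2500
    "c": (7000, 10000, "-"),   # offset +3000 on the minus strand
    "d": (9800, 10000, "+"),   # offset -200
    "e": (10080, 10000, "-"),  # offset -80 on the minus strand
}

offsets = {gene: signed_offset(*record) for gene, record in ends.items()}
lenient = {g for g, off in offsets.items() if within_window(off, LENIENT_WINDOW)}
stringent = {g for g, off in offsets.items() if within_window(off, STRINGENT_WINDOW)}
lost = lenient - stringent  # genes removed by the stringent filter
print(sorted(lenient), sorted(stringent), sorted(lost))
```

In the comparison reported above, the set corresponding to `lost` contained 39 genes, of which 28 had true TC 3′ ends and 11 had false ones; deciding which is which required examining each end and is not something this sketch attempts.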

