A low-cost library construction protocol and data analysis pipeline for Illumina-based strand-specific multiplex RNA-seq.

Wang L, Si Y, Dedow LK, Shao Y, Liu P, Brutnell TP - PLoS ONE (2011)

Bottom Line: Our data supports novel gene models and can be used to improve current rice genome annotation. Additionally, using the rice transcriptome data, we compared different methods of calculating gene expression and discuss the advantages of a strand-specific approach to detect bona-fide anti-sense transcripts and to detect intron retention events. Our results demonstrate the potential of this low-cost and robust method for RNA-seq library construction and data analysis.


Affiliation: Boyce Thompson Institute for Plant Research, Cornell University, Ithaca, New York, United States of America.

ABSTRACT
The emergence of NextGen sequencing technology has generated much interest in the exploration of transcriptomes. Currently, Illumina Inc. (San Diego, CA) provides one of the most widely utilized sequencing platforms for gene expression analysis. While Illumina reagents and protocols perform adequately in RNA-sequencing (RNA-seq), alternative reagents and protocols promise a higher throughput at a much lower cost. We have developed a low-cost and robust protocol to produce Illumina-compatible (GAIIx and HiSeq2000 platforms) RNA-seq libraries by combining several recent improvements. First, we designed balanced adapter sequences for multiplexing of samples; second, dUTP incorporation in second-strand synthesis was used to enforce strand-specificity; third, we simplified the RNA purification, fragmentation and library size-selection steps, thus drastically reducing the time and increasing the throughput of library construction; fourth, we included an RNA spike-in control for validation and normalization purposes. To streamline informatics analysis for the community, we established a pipeline within the iPlant Collaborative. These scripts are easily customized to meet specific research needs and improve on existing informatics and statistical treatments of RNA-seq data. In particular, we apply significance tests for determining differential gene expression and intron retention events. To demonstrate the potential of both the library-construction protocol and data-analysis pipeline, we characterized the transcriptome of the rice leaf. Our data supports novel gene models and can be used to improve current rice genome annotation. Additionally, using the rice transcriptome data, we compared different methods of calculating gene expression and discuss the advantages of a strand-specific approach to detect bona-fide anti-sense transcripts and to detect intron retention events. Our results demonstrate the potential of this low-cost and robust method for RNA-seq library construction and data analysis.


Figure 3: Correlation of RPKM calculated from NSS and SS protocols and coverage statistics. (a) Scatter plot of log10 correlation of SS- and NSS-derived RPKM values. SS-derived RPKM values are calculated from the sense strand only. r is the correlation coefficient. (b) Scatter plot of log10 correlation of SS- and NSS-derived RPKM values. SS-derived RPKM values are calculated from both strands of the SS data. (c) Coverage plot along the average gene body from 5′ to 3′ calculated from both NSS and SS methods. The percent GC content is also plotted.


To compare the output from the strand-specific (SS) and standard non-strand-specific (NSS) versions of the protocol, we performed two RNA-seq experiments in parallel using leaf tissue from two-week-old rice seedlings, following nearly identical procedures with the exception of the dUTP labeling step. For the SS protocol, dUTP was used in second-strand synthesis, whereas dTTP was used for the NSS protocol. Libraries were sequenced on six lanes of the GAIIx platform (three lanes for each library). Approximately 100 million 35-nt processed reads were generated for each library, and RPKM values were calculated. As shown in Figure 3A, the correlation of log10 RPKM values from the two datasets is high, with a correlation coefficient of r = 0.976. As illustrated by the shaded portion of the chart in Figure 3A, a subset of genes appears to have higher estimated expression values under the NSS protocol than under the SS protocol. We reason that this shift likely occurs when a large number of both sense and antisense reads map to the same gene: in the SS protocol only sense alignments are counted, whereas both sense and antisense alignments contribute to the RPKM in the NSS protocol. To test this hypothesis, we re-analyzed the data generated from the SS protocol, adding the reads from the sense and antisense strands together. Indeed, the new comparison, shown in Figure 3B, has a higher correlation (r = 0.987), suggesting that the NSS method does overestimate RPKM for a subset of genes. To visualize this discrepancy, we examined four genes with large differences in RPKM between the NSS and SS protocols. Figure S4 shows alignment profiles displayed in the Integrative Genomics Viewer (IGV; [21]). Interestingly, some of the discrepancies arise from incorrectly annotated gene models. In a few cases, convergent genes were incorporated into a single gene model (Figure S4a–c). In other cases, numerous antisense and sense reads mapped to the same gene model (Figure S4d), thus confirming our hypothesis that the SS method is a more accurate way to calculate RPKM values and can be used to validate gene models.
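The comparison described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the gene names, lengths and read counts are hypothetical toy values, the helper names (rpkm, pearson_r, strand_rpkms) are invented for illustration, and normalizing each scheme by the total reads counted under that scheme is an assumption. The sketch computes RPKM once from sense reads only (as in the SS analysis) and once from sense plus antisense reads (as the NSS protocol effectively does), then correlates the log10 values.

```python
import math

def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene model per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: gene -> (length_bp, sense_reads, antisense_reads)
GENES = {
    "geneA": (1000, 500, 5),
    "geneB": (2000, 200, 2),
    "geneC": (1500, 50, 400),   # strong antisense signal
    "geneD": (3000, 900, 10),
}

def strand_rpkms(genes):
    """Return (sense-only RPKM, both-strand RPKM) dictionaries.

    Assumption: each scheme is normalized by the total number of reads
    counted under that scheme.
    """
    total_sense = sum(s for _, s, _ in genes.values())
    total_both = sum(s + a for _, s, a in genes.values())
    sense_only = {g: rpkm(s, length, total_sense)
                  for g, (length, s, a) in genes.items()}
    both = {g: rpkm(s + a, length, total_both)
            for g, (length, s, a) in genes.items()}
    return sense_only, both

sense_only, both = strand_rpkms(GENES)

# Correlate log10 RPKM values; genes with zero counts would need to be
# filtered out (or given a pseudocount) before taking the logarithm.
names = sorted(GENES)
r = pearson_r([math.log10(sense_only[g]) for g in names],
              [math.log10(both[g]) for g in names])

for g in names:
    print(f"{g}: sense-only RPKM = {sense_only[g]:.1f}, "
          f"both-strand RPKM = {both[g]:.1f}")
print(f"r (log10 RPKM) = {r:.3f}")
```

In this toy set, geneC plays the role of the genes in the shaded region of Figure 3A: most of its reads are antisense, so counting both strands inflates its estimated expression relative to the sense-only value.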

