A low-cost library construction protocol and data analysis pipeline for Illumina-based strand-specific multiplex RNA-seq.

Wang L, Si Y, Dedow LK, Shao Y, Liu P, Brutnell TP - PLoS ONE (2011)

Bottom Line: Our data support novel gene models and can be used to improve current rice genome annotation. Additionally, using the rice transcriptome data, we compared different methods of calculating gene expression and discuss the advantages of a strand-specific approach to detect bona fide antisense transcripts and intron retention events. Our results demonstrate the potential of this low-cost and robust method for RNA-seq library construction and data analysis.


Affiliation: Boyce Thompson Institute for Plant Research, Cornell University, Ithaca, New York, United States of America.

ABSTRACT
The emergence of NextGen sequencing technology has generated much interest in the exploration of transcriptomes. Currently, Illumina Inc. (San Diego, CA) provides one of the most widely utilized sequencing platforms for gene expression analysis. While Illumina reagents and protocols perform adequately in RNA-sequencing (RNA-seq), alternative reagents and protocols promise a higher throughput at a much lower cost. We have developed a low-cost and robust protocol to produce Illumina-compatible (GAIIx and HiSeq2000 platforms) RNA-seq libraries by combining several recent improvements. First, we designed balanced adapter sequences for multiplexing of samples; second, dUTP incorporation in second-strand synthesis was used to enforce strand-specificity; third, we simplified the RNA purification, fragmentation and library size-selection steps, thus drastically reducing the time and increasing the throughput of library construction; fourth, we included an RNA spike-in control for validation and normalization purposes. To streamline informatics analysis for the community, we established a pipeline within the iPlant Collaborative. These scripts are easily customized to meet specific research needs and improve on existing informatics and statistical treatments of RNA-seq data. In particular, we apply significance tests for determining differential gene expression and intron retention events. To demonstrate the potential of both the library-construction protocol and the data-analysis pipeline, we characterized the transcriptome of the rice leaf. Our data support novel gene models and can be used to improve current rice genome annotation. Additionally, using the rice transcriptome data, we compared different methods of calculating gene expression and discuss the advantages of a strand-specific approach to detect bona fide antisense transcripts and intron retention events. Our results demonstrate the potential of this low-cost and robust method for RNA-seq library construction and data analysis.


pone-0026426-g006: Determination of the background cutoff for RNA-seq data. The y-axis shows the number of reads that map to non-coding regions (NCRs) relative to the total bp of those regions. The x-axis displays the fraction of NCRs. As shown, 99% of the regions denoted as non-coding have fewer than 0.0015 reads/bp.

Mentions: An emerging need in RNA-seq data analysis is to determine confidence intervals for defining significance in gene expression values. As part of our data analysis pipeline, we identified the significantly expressed genes by comparing the reads that aligned to the annotated gene space (i.e. exons and UTRs) with the reads that aligned to “non-coding regions” (NCRs), defined as regions at least 5 kb away from any annotated gene. Our method is built upon two assumptions: first, that the NCRs, by definition, do not generate a large number of transcripts; second, that the gene annotation is accurate. Under these assumptions, reads mapped to the NCRs are likely due to artifacts of the sequencing method or library construction. For instance, DNA contamination in the RNA samples could place reads in NCRs, so NCR coverage can be used to estimate the background or “noise” level. As shown in Figure 6, we calculated the significance of gene expression using the normalized coverage in 99% of the NCRs (see methods for more detail). Using an empirical Bayesian method based on a Poisson distribution of reads, we calculated the posterior odds (B) for all genes and considered a gene as expressed if its B value was less than 1 (see methods section and Figure S8 for detail).
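
The background-cutoff idea can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the empirical Bayesian posterior odds (B) are described only in the methods, so the sketch substitutes a plain Poisson tail test against the NCR-derived background rate. All function names, the nearest-rank quantile, and the `alpha` threshold are assumptions made for illustration.

```python
import math

def ncr_background_cutoff(ncr_reads, ncr_lengths_bp, quantile=0.99):
    """Coverage (reads/bp) below which `quantile` of the NCRs fall.

    Uses a nearest-rank quantile; the source does not state which
    quantile estimator was used, so this is an assumption.
    """
    coverage = sorted(r / l for r, l in zip(ncr_reads, ncr_lengths_bp))
    rank = max(1, math.ceil(quantile * len(coverage)))
    return coverage[rank - 1]

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)   # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i     # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def above_background(gene_reads, gene_length_bp, cutoff, alpha=0.01):
    """Simplified stand-in for the paper's posterior-odds test.

    Treats background reads on a gene as Poisson with mean
    cutoff * gene_length_bp and asks whether the observed count is
    unlikely under that background alone.
    """
    expected_background = cutoff * gene_length_bp
    return poisson_tail(gene_reads, expected_background) < alpha

if __name__ == "__main__":
    # 0.0015 reads/bp is the cutoff read off Figure 6; the gene values
    # below are hypothetical.
    print(above_background(200, 1000, 0.0015))  # well above background
    print(above_background(1, 1000, 0.0015))    # consistent with background
```

With the 0.0015 reads/bp cutoff from Figure 6, a 1 kb gene is expected to collect about 1.5 background reads, so a single mapped read is not evidence of expression, whereas 200 reads are.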

