R-SAP: a multi-threading computational pipeline for the characterization of high-throughput RNA-sequencing data.

Mittal VK, McDonald JF - Nucleic Acids Res. (2012)

Bottom Line: We present here a user-friendly and fully automated RNA-Seq analysis pipeline (R-SAP) with built-in multi-threading capability to analyze and quantitate high-throughput RNA-Seq datasets. R-SAP follows a hierarchical decision-making procedure to accurately characterize various classes of transcripts and achieves a near-linear decrease in data processing time as a result of increased multi-threading. In addition, RNA expression level estimates obtained using R-SAP display high concordance with levels measured by microarrays.


Affiliation: School of Biology, Georgia Institute of Technology, Atlanta, GA 30332, USA.

ABSTRACT
The rapid expansion in the quantity and quality of RNA-Seq data requires the development of sophisticated high-performance bioinformatics tools capable of rapidly transforming these data into meaningful information that is easily interpretable by biologists. Currently available analysis tools are often not easily installed by the general biologist, and most of them lack the inherent parallel processing capabilities widely recognized as an essential feature of next-generation bioinformatics tools. We present here a user-friendly and fully automated RNA-Seq analysis pipeline (R-SAP) with built-in multi-threading capability to analyze and quantitate high-throughput RNA-Seq datasets. R-SAP follows a hierarchical decision-making procedure to accurately characterize various classes of transcripts and achieves a near-linear decrease in data processing time as a result of increased multi-threading. In addition, RNA expression level estimates obtained using R-SAP display high concordance with levels measured by microarrays.


Figure 4. Distribution of the high-scoring reads from the MAQC Reference Human dataset onto RefSeq transcripts. ‘Exons’ includes those reads characterized as Exons-only, Exon-deletion, Alternative TSS, Alternative Polyadenylation, Internal-exon-extension and Multiple-annotations. ‘Intergenic’ includes those reads characterized as gene-desert or neighboring-exon, ‘Introns’ represents reads mapping completely within introns and ‘Uncharacterized’ are those reads that cannot be characterized with any RefSeq transcript (distribution is presented in Table 2).

As expected from the RNA-Seq data, the majority (299 473/491 117 or 61%) of the high-scoring reads mapped to exons (Figure 4 and Table 2). Slightly more than half (54.42%; 267 279/491 117) of the high-scoring reads were exon-only reads attributable to 24 461 RefSeq transcripts (Table 2). RPKM values (expression levels) for these RefSeq transcripts are presented in Supplementary File S2. R-SAP identified a wide spectrum of expression values (RPKM values), ranging from a minimum of 0.046 for the TTN (titin or connectin) gene to a maximum of 2112 for the MTRNR2L2 (humanin-like protein 2) gene. More than 1% (1.38%; 6786/491 117) of the high-scoring reads were associated with exon-deletion events among 4850 RefSeq transcripts (Table 2). Relatively few (840/6786 or 12.37%) of the events characterized by R-SAP as exon deletions were attributable to exon-skipping events, corresponding to 620 RefSeq transcripts. While skipping of a maximum of 20 exons was observed, the majority of the exon-skipping events involved skipping of only one exon (Supplementary Figure S1). It is important to note that the power and accuracy of R-SAP to detect splice variants depend completely upon the length of the sequencing reads. For instance, an exon-skipping event is detected only when the read spans the exons flanking the skipped exon. Short reads from such new splice junctions will not produce significant alignments on the genome and hence will go undetected. Previously published RNA-Seq studies detect exon skipping by mapping the short reads to a synthetically created library of new splice junctions (12).
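The two quantities discussed above can be illustrated with a minimal Python sketch. It is not R-SAP's code: the function names `rpkm` and `skipped_exons`, the half-open coordinate convention, and the toy coordinates are assumptions made for illustration. The RPKM function uses the standard definition (reads per kilobase of transcript per million mapped reads); the text above does not state the exact formula R-SAP applies. The second function shows the idea that an exon-skipping event becomes visible only when one read aligns in separate blocks to both exons flanking the skipped exon.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase of transcript per million mapped reads.

    Assumption: this is the textbook definition, not necessarily the exact
    normalization used inside R-SAP.
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def skipped_exons(aligned_blocks, exons):
    """Return indices of exons skipped by a single split-aligned read.

    aligned_blocks: sorted list of (start, end) genomic intervals, one per
                    aligned segment of the read (half-open, hypothetical).
    exons:          sorted list of (start, end) intervals for one transcript.
    """
    def exon_index(block):
        # Index of the first exon that overlaps this aligned block, if any.
        for i, (start, end) in enumerate(exons):
            if block[0] < end and block[1] > start:
                return i
        return None

    hits = [exon_index(block) for block in aligned_blocks]
    skipped = []
    # Consecutive blocks landing on non-adjacent exons imply skipped exons.
    for left, right in zip(hits, hits[1:]):
        if left is not None and right is not None and right - left > 1:
            skipped.extend(range(left + 1, right))
    return skipped


if __name__ == "__main__":
    toy_exons = [(100, 200), (300, 400), (500, 600)]
    # A read split between exon 0 and exon 2 skips exon 1.
    print(skipped_exons([(150, 200), (500, 550)], toy_exons))
    # A read too short to reach the second flanking exon yields one block,
    # so nothing can be reported.
    print(skipped_exons([(150, 200)], toy_exons))
    print(rpkm(10, 1000, 1_000_000))
```

The single-block case mirrors the limitation noted above: a read that does not extend across both flanking exons carries no evidence of the junction, so the event goes undetected.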

