Inference of alternative splicing from RNA-Seq data with probabilistic splice graphs.

LeGault LH, Dewey CN - Bioinformatics (2013)

Bottom Line: Alternative splicing and other processes that allow for different transcripts to be derived from the same gene are significant forces in the eukaryotic cell. RNA-Seq is a promising technology for analyzing alternative transcripts, as it does not require prior knowledge of transcript structures or genome sequences. We present RNA-Seq models and associated inference algorithms based on the concept of probabilistic splice graphs, which alleviate the efficiency, identifiability and representation issues that arise for genes with large numbers of alternative transcripts.


Affiliation: Department of Computer Sciences, University of Wisconsin, Madison, WI 53706, USA.

ABSTRACT

Motivation: Alternative splicing and other processes that allow for different transcripts to be derived from the same gene are significant forces in the eukaryotic cell. RNA-Seq is a promising technology for analyzing alternative transcripts, as it does not require prior knowledge of transcript structures or genome sequences. However, analysis of RNA-Seq data in the presence of genes with large numbers of alternative transcripts is currently challenging due to efficiency, identifiability and representation issues.

Results: We present RNA-Seq models and associated inference algorithms based on the concept of probabilistic splice graphs, which alleviate these issues. We prove that our models are often identifiable and demonstrate that our inference methods for quantification and differential processing detection are efficient and accurate.

Availability: Software implementing our methods is available at http://deweylab.biostat.wisc.edu/psginfer.

Contact: cdewey@biostat.wisc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Figure 1: An example gene for which an explicit model of all possible isoform frequencies is not identifiable, whereas a probabilistic splice graph (PSG) model for the gene is identifiable, given RNA-Seq reads. We assume that the RNA-Seq fragments are shorter than the middle exon and thus that reads from a fragment identify at most one splice junction. (A) The gene model, with levels of coverage by RNA-Seq reads indicated above each exon. (B) The four possible isoforms of the gene. (C) and (D) give two (of infinitely many) possible isoform abundances that explain the observed RNA-Seq read coverages equally well. (E) The exon graph PSG for the gene, which is identifiable given these data (the unique maximum likelihood parameters are above each edge), assuming the exon sequences are relatively unique.

Mentions: Second, models for quantifying full-length isoforms with single-end RNA-Seq data are often not identifiable for genes with many alternative splice forms (Hiller et al., 2009; Lacroix et al., 2008). For a model to be identifiable, different parameter values for the model must give rise to distinct probability distributions over possible datasets. The practical disadvantage of having a non-identifiable model is that for a given dataset (e.g. RNA-Seq reads), there may be multiple possible parameter settings (e.g. transcript abundances) that explain the data equally well. Figure 1 provides a simple example of a gene for which the frequencies of its four possible isoforms are not identifiable given typical RNA-Seq data. In theory, paired-end data can eliminate this issue (Lacroix et al., 2008); however, in practice, paired-end data are derived from short size-selected fragments that provide local information similar to that of longer single-end data.
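
The identifiability argument can be illustrated with a small sketch. Everything in it is an assumption made for illustration rather than the gene or the numbers from Figure 1: a hypothetical gene with two cassette exons separated by a long constitutive middle exon, reads that report only whether each cassette exon is included, and no modelling of length or coverage effects. The function names are likewise hypothetical and are not part of the authors' software. Under these assumptions, two different settings of the four isoform abundances give identical local observations, while the two edge weights of a PSG-style model are determined uniquely.

```python
from fractions import Fraction

# Hypothetical toy gene: cassette exon 1, long constitutive middle exon,
# cassette exon 2. An isoform is a pair (includes_exon_1, includes_exon_2).

def cassette_inclusion(abundances):
    """Return the fraction of transcripts that include each cassette exon.

    This stands in for what short reads can observe when no read spans
    both alternative regions (an assumption of this sketch).
    """
    total = sum(abundances.values())
    first = sum(a for (inc1, _), a in abundances.items() if inc1) / total
    second = sum(a for (_, inc2), a in abundances.items() if inc2) / total
    return first, second

def psg_isoform_probabilities(first, second):
    """Isoform probabilities implied by two independent edge weights.

    Treating the two splicing choices as independent is the simplifying
    assumption that makes this two-parameter model identifiable here.
    """
    probs = {}
    for inc1 in (True, False):
        for inc2 in (True, False):
            p1 = first if inc1 else 1 - first
            p2 = second if inc2 else 1 - second
            probs[(inc1, inc2)] = p1 * p2
    return probs

half, quarter, zero = Fraction(1, 2), Fraction(1, 4), Fraction(0)

# Two different full-isoform parameter settings (illustrative values only).
setting_c = {(True, True): half, (True, False): zero,
             (False, True): zero, (False, False): half}
setting_d = {iso: quarter for iso in setting_c}

if __name__ == "__main__":
    print("local data under C:", cassette_inclusion(setting_c))
    print("local data under D:", cassette_inclusion(setting_d))
    print("PSG-implied isoforms:",
          psg_isoform_probabilities(*cassette_inclusion(setting_c)))
```

Both settings produce the same pair of inclusion fractions, so data of this kind cannot distinguish them, whereas the pair of inclusion fractions itself is recovered without ambiguity. The isoform frequencies implied by the two edge weights are only one of the many full-isoform settings compatible with the data; the sketch does not claim that it is the true one.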

