SpliceNet: recovering splicing isoform-specific differential gene networks from RNA-Seq data of normal and diseased samples.

Yalamanchili HK, Li Z, Wang P, Wong MP, Yao J, Wang J - Nucleic Acids Res. (2014)

Bottom Line: SpliceNet eases the sample size bottleneck; evaluations on simulated data and on the lung cancer-specific ERBB2 and MAPK signaling pathways, with varying numbers of samples, show its merit in handling datasets with a high exon to sample size ratio. Gene level evaluations demonstrate a substantial performance gain of SpliceNet over canonical correlation analysis, a method that is currently applied to exon level RNA-Seq data. SpliceNet can also be applied to exon array data.


Affiliation: Department of Biochemistry, The University of Hong Kong, Hong Kong (SAR), China; Department of Pathology, The University of Hong Kong, Hong Kong (SAR), China.


Figure 1: Illustration of extracting the exon-expression matrix for each isoform of a gene in the interest list. G is the gene of interest, I is an isoform, e is an exon, S is a sample, m is the number of isoforms, p is the number of exons and n is the number of samples. E_{m,n,p}, Cex_{m,n,p} and W_{m,n,p} are the raw expression, corrected expression and correction weight, respectively, for the pth exon of the mth isoform in the nth sample. I_{m,n} is the expression of the mth isoform in the nth sample (from isoform expression files). Exon E1 is shared by two isoforms, I1 and I2. Thus, the corrected exon-expression value of exon 1 in sample 1 for isoform 1 is computed as Cex_{1,1,1} = E_{1,1,1} × I_{1,1}/(I_{1,1} + I_{2,1}).
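The worked example in the caption can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function name `corrected_exon_expression` and its arguments are hypothetical, and the weight is assumed to generalize the caption's two-isoform case to the target isoform's expression divided by the summed expression of all isoforms sharing the exon. Setting the weight to zero when no sharing isoform is expressed is also an assumption.

```python
import numpy as np

def corrected_exon_expression(raw_expr, isoform_expr, sharing_isoforms, target):
    """Corrected expression of one shared exon for one target isoform.

    raw_expr         : raw expression of the exon in each of the n samples
    isoform_expr     : dict mapping isoform id -> expression in each sample
    sharing_isoforms : ids of all isoforms that contain this exon
    target           : id of the isoform the corrected values are for
    (All names are hypothetical; they are not taken from the paper.)
    """
    raw = np.asarray(raw_expr, dtype=float)
    target_expr = np.asarray(isoform_expr[target], dtype=float)
    # Summed expression of every isoform that shares the exon, per sample.
    total = np.sum(
        [np.asarray(isoform_expr[i], dtype=float) for i in sharing_isoforms],
        axis=0,
    )
    # Weight = relative abundance of the target isoform in each sample.
    # Assumption: the weight is 0 where no sharing isoform is expressed.
    weight = np.divide(
        target_expr, total, out=np.zeros_like(total), where=total > 0
    )
    return raw * weight

# Exon 1 is shared by isoforms I1 and I2; two samples.
isoform_expr = {"I1": [3.0, 0.0], "I2": [1.0, 0.0]}
cex_i1 = corrected_exon_expression([10.0, 8.0], isoform_expr, ["I1", "I2"], "I1")
cex_i2 = corrected_exon_expression([10.0, 8.0], isoform_expr, ["I1", "I2"], "I2")
```

In sample 1 the raw value of 10 is split 7.5 to I1 and 2.5 to I2, in proportion to the isoform abundances 3 and 1, so the corrected values of the two instances add back up to the raw exon expression.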

Mentions: Every isoform of a gene in the interest list is represented as an exon-expression matrix (a multivariate random variable) of order p × n, where p is the number of exons mapped to the isoform and n is the number of RNA-Seq samples, as illustrated in Figure 1. First, a gene G is mapped to its isoforms and then to their corresponding exon boundaries according to the coordinates of the HG-19 (UCSC genome browser) reference genome. Second, the exon boundaries of each isoform 1, ..., m of gene G are matched to the exon positions of each level 3 RNA-Seq sample, and the corresponding exon-expression values are extracted. An exon is considered only if it is expressed in at least 50% of the samples, because inference with more than half of the data missing (no expression) is not reliable. To allow for sequencing errors, an error margin of ±5 nt is permitted when mapping exon boundaries. This margin is a reasonable tradeoff between acceptable sequencing error and the smallest human exon of 15 nt (28), and it avoids imprecise exon mappings. Thus, each isoform is represented as an expression matrix with exons as rows and samples as columns, in keeping with the p × n order.

However, it is well established that a significant fraction of mammalian genes overlap and share common exons. Given this, it is not reasonable to assign the same expression value to every instance of an exon that is shared by multiple isoforms or genes. Doing so makes it difficult to distinguish overlapping genes, or isoforms that share a significant number of exons, and this issue is not accounted for in previous studies (17,18). Moreover, isoform expression is tissue- and condition-specific, i.e. the isoforms of a gene are expressed differentially across tissues and conditions. Assigning the same expression value to all instances of an exon will therefore result in misleading imputations. For example, B-cell lymphoma-extra (Bcl-x), a well-studied cancer-associated gene, has two isoforms, Bcl-xS (short) and Bcl-xL (long). The two isoforms differ by only one exon but have distinct expression patterns and functions. Any inference that uses uniform exon-expression values for both isoforms will be inaccurate. This problem is addressed by normalizing the expression value of each exon instance by the relative abundance of the corresponding isoform in a specific sample. First, all known HG-19 isoforms are scanned for shared exon boundaries and summarized in a sharing exon file, with each row representing an exon and its isoform instances, as shown in Figure 1. The corrected exon-expression value for each isoform is then computed as follows:

\begin{equation}
\mathrm{Cex}_{m,n,p} = E_{m,n,p} \times W_{m,n,p} \tag{1}
\end{equation}
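The extraction steps described in this passage (matching exon boundaries within ±5 nt, then keeping only exons expressed in at least 50% of samples) can be sketched as follows. This is a rough illustration under stated assumptions, not the SpliceNet code: the function names and the data layout (each sample as a dict from `(start, end)` coordinates to expression) are hypothetical, chromosome and strand are ignored, "expressed" is taken to mean a value greater than zero, and the first sample exon within the margin is used.

```python
import numpy as np

def boundaries_match(ref_exon, sample_exon, margin=5):
    """True if both start and end differ by at most `margin` nt."""
    (ref_start, ref_end), (s_start, s_end) = ref_exon, sample_exon
    return abs(ref_start - s_start) <= margin and abs(ref_end - s_end) <= margin

def extract_isoform_matrix(isoform_exons, samples, margin=5, min_fraction=0.5):
    """Build a p x n exon-expression matrix for one isoform.

    isoform_exons : list of (start, end) reference exon boundaries
    samples       : list of dicts mapping (start, end) -> expression value
    Returns the retained exons and a matrix with exons as rows and
    samples as columns. (Hypothetical layout, assumed for illustration.)
    """
    kept, rows = [], []
    for ref in isoform_exons:
        row = []
        for sample in samples:
            value = 0.0  # assumption: an unmatched exon counts as not expressed
            for coords, expr in sample.items():
                if boundaries_match(ref, coords, margin):
                    value = expr
                    break
            row.append(value)
        row = np.array(row, dtype=float)
        # Keep the exon only if it is expressed in enough samples.
        if np.mean(row > 0) >= min_fraction:
            kept.append(ref)
            rows.append(row)
    matrix = np.vstack(rows) if rows else np.empty((0, len(samples)))
    return kept, matrix

samples = [
    {(102, 198): 5.0, (300, 400): 2.0},
    {(100, 200): 4.0},
    {(97, 203): 1.0, (310, 400): 9.0},
]
kept, matrix = extract_isoform_matrix([(100, 200), (300, 400)], samples)
```

Here the first exon matches in all three samples within the margin and is retained. The second exon is found in only one of three samples (the boundary at 310 is 10 nt off), so it falls below the 50% threshold and is dropped.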

