Seq-ing improved gene expression estimates from microarrays using machine learning.

Korir PK, Geeleher P, Seoighe C - BMC Bioinformatics (2015)

Bottom Line: Quantifying gene expression by RNA-Seq has several advantages over microarrays, including greater dynamic range and gene expression estimates on an absolute, rather than a relative, scale. This approach can be used to accurately estimate absolute expression levels from microarray data, at both gene and transcript level, which has not previously been possible. This methodology will facilitate re-analysis of archived microarray data and broaden the utility of the vast quantities of data still being generated.


Affiliation: School of Biochemistry and Cell Biology, University College Cork, Western Road, Cork, Ireland. paul.korir@gmail.com.

ABSTRACT

Background: Quantifying gene expression by RNA-Seq has several advantages over microarrays, including greater dynamic range and gene expression estimates on an absolute, rather than a relative scale. Nevertheless, microarrays remain in widespread use, demonstrated by the ever-growing numbers of samples deposited in public repositories.

Results: We propose a novel approach to microarray analysis that attains many of the advantages of RNA-Seq. This method, called Machine Learning of Transcript Expression (MaLTE), leverages samples for which both microarray and RNA-Seq data are available, using a Random Forest to learn the relationship between the fluorescence intensity of sets of microarray probes and RNA-Seq transcript expression estimates. We trained MaLTE on data from the Genotype-Tissue Expression (GTEx) project, consisting of Affymetrix gene arrays and RNA-Seq from over 700 samples across a broad range of human tissues.

Conclusion: This approach can be used to accurately estimate absolute expression levels from microarray data, at both gene and transcript level, which has not previously been possible. This methodology will facilitate re-analysis of archived microarray data and broaden the utility of the vast quantities of data still being generated.




Fig. 4: The effects of OOB filtering. Mean cross-sample Pearson correlation as a function of OOB correlation threshold for (a) genes and (b) transcripts. Error bars correspond to two standard errors. Note that transcript-level estimates are not provided by RMA and PLIER. The black line represents the number of genes/transcripts at each level.

Mentions: MaLTE performs better for some genes than for others. For example, low values of cross-sample correlation between MaLTE and RNA-Seq can be obtained for genes with low variation in expression across samples. Such genes will typically also show poor cross-sample correlation when their expression is estimated using median-polish and PLIER. However, MaLTE has the advantage that it provides an estimate of the accuracy with which the expression level of a given gene can be predicted. This estimate comes from the cross-validation carried out by Random Forest when the gene-specific regression model is learned from the training data [24]. Each regression tree in the forest is constructed from a subset of the samples. The expression level of the gene in a given sample can therefore be estimated from the regression trees from which that sample was omitted. This is called the out-of-bag (OOB) estimate. For example, to estimate how well MaLTE will perform for a given gene as assessed by cross-sample correlation with RNA-Seq, we calculate the cross-sample correlation between the OOB estimates and the RNA-Seq data from the training samples. This provides an accurate estimate of the cross-sample correlation in test data (Supplementary Fig. S2). The OOB estimate can be used as a filter, so that MaLTE returns expression estimates only for genes with a desired property. By thresholding on the OOB cross-sample correlation, we found that very high values of cross-sample correlation can be achieved for a subset of genes (Fig. 4a). Because genes that pass the OOB cross-sample correlation threshold are likely to have high cross-sample variation, median-polish and PLIER also achieve higher cross-sample correlation for these genes. However, MaLTE maintains a performance advantage over the other methods with increasing threshold values (Fig. 4a).
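The OOB filtering idea described above can be illustrated with a minimal sketch. This is not the MaLTE implementation: it assumes scikit-learn's RandomForestRegressor as a stand-in for the Random Forest used by the authors, and the function names (oob_correlation, filter_genes), the gene names, the 0.5 threshold, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def oob_correlation(probes, rnaseq, n_trees=200, seed=0):
    """Fit a per-gene forest; return (model, OOB cross-sample Pearson r)."""
    model = RandomForestRegressor(
        n_estimators=n_trees, bootstrap=True, oob_score=True, random_state=seed
    )
    model.fit(probes, rnaseq)
    # oob_prediction_[i] averages only the trees whose bootstrap sample
    # did not contain training sample i.
    r = np.corrcoef(model.oob_prediction_, rnaseq)[0, 1]
    return model, r


def filter_genes(oob_r_by_gene, threshold):
    """Keep genes whose OOB correlation meets the (hypothetical) threshold."""
    return [gene for gene, r in oob_r_by_gene.items() if r >= threshold]


# Synthetic stand-in data (assumption): 120 training samples, 6 probes per gene.
rng = np.random.default_rng(0)
n_samples, n_probes = 120, 6
probes = rng.normal(size=(n_samples, n_probes))

# One gene whose RNA-Seq level depends on the probes, one that is pure noise.
targets = {
    "gene_informative": probes[:, :3].sum(axis=1)
    + rng.normal(scale=0.3, size=n_samples),
    "gene_noise": rng.normal(size=n_samples),
}

oob_r = {}
for gene, rnaseq in targets.items():
    _, oob_r[gene] = oob_correlation(probes, rnaseq)

kept = filter_genes(oob_r, threshold=0.5)
for gene, r in oob_r.items():
    print(f"{gene}: OOB r = {r:.2f}, kept = {gene in kept}")
```

In this sketch the OOB correlation is computed from the training samples alone, so the filter can be applied before any test data are seen. Raising the threshold keeps fewer genes, which corresponds to the trade-off shown by the black line in Fig. 4.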

