Using epigenomics data to predict gene expression in lung cancer.

Li J, Ching T, Huang S, Garmire LX - BMC Bioinformatics (2015)

Bottom Line: Among the drop-off tests of individual data-type based features, removal of the CpG methylation features leads to the largest reduction in model performance. In the best model, 19 selected features are from the promoter regions (TSS200 and TSS1500), the highest count among all locations relative to transcripts. Sequential drop-off of CpG methylation features relative to different regions on the protein coding transcripts shows that promoter regions contribute most significantly to the accurate prediction of gene expression.

ABSTRACT

Background: Epigenetic alterations are known to correlate with changes in gene expression in various diseases, including cancers. However, quantitative models that accurately predict the up- or down-regulation of gene expression are currently lacking.

Methods: A new machine learning-based method of gene expression prediction is developed in the context of lung cancer. This method uses Illumina Infinium HumanMethylation450K BeadChip CpG methylation array data from paired lung cancer and adjacent normal tissues in The Cancer Genome Atlas (TCGA), together with histone modification marker ChIP-Seq data from the ENCODE project, to predict the differential expression measured by RNA-Seq in TCGA lung cancers. It considers a comprehensive list of 1424 features spanning four categories: CpG methylation, histone H3 methylation modification, nucleotide composition, and conservation. Various feature selection and classification methods are compared, and the best model is selected by 10-fold cross-validation on the training data set.
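
The 10-fold cross-validation mentioned above can be illustrated with a minimal sketch. It assumes a plain shuffled split into ten disjoint folds; the abstract does not say whether the folds were stratified or how they were drawn, and the function names `k_fold_indices` and `train_validation_splits` are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i,
    # so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def train_validation_splits(n_samples, k=10, seed=0):
    """Yield one (train, validation) pair of index lists per fold."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, validation in enumerate(folds):
        # Every fold except the i-th contributes to the training indices.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation
```

Each sample lands in the validation fold exactly once, so a candidate model is trained ten times and its validation scores are averaged or pooled before models are compared.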

Results: The best model, comprising 67 features, combines ReliefF-based feature selection with random forest classification and achieves AUC = 0.864 in 10-fold cross-validation on the training set and AUC = 0.836 on the testing set. The selected features cover all four data types, with histone H3 methylation modification (32 features) and CpG methylation (15 features) being the most abundant. Among the drop-off tests of individual data-type based features, removal of the CpG methylation features leads to the largest reduction in model performance. In the best model, 19 selected features are from the promoter regions (TSS200 and TSS1500), the highest count among all locations relative to transcripts. Sequential drop-off of CpG methylation features relative to different regions on the protein coding transcripts shows that promoter regions contribute most significantly to the accurate prediction of gene expression.
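
For readers less familiar with the AUC values quoted above, the following sketch computes the area under the ROC curve from its rank interpretation: the probability that a randomly chosen positive example (here, one class of differential expression) receives a higher classifier score than a randomly chosen negative one, with ties counted as one half. This illustrates the metric only and is not the authors' evaluation code; the function name `roc_auc` and the toy labels and scores are assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    correctly_ranked = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correctly_ranked += 1.0   # positive scored above negative
            elif p == n:
                correctly_ranked += 0.5   # a tie counts as half
    return correctly_ranked / (len(positives) * len(negatives))


# Toy example with made-up scores: three of the four pairs are ranked correctly.
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))  # prints 0.75
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a sorted-rank formulation gives the same value more efficiently.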

Conclusions: By considering a comprehensive list of epigenomic and genomic features, we have constructed an accurate model to predict transcriptomic differential expression, exemplified in lung cancer.

Figure 4: Evaluation of features generated from various data types. (a-b) Effects of feature set drop-off on ROC curves from the 10-fold cross-validation training set (a) and testing set (b). (c) Effects of feature set drop-off on four metrics (AUC, accuracy, F-measure and MCC) in the training set and testing set.

Mentions: To determine the contribution of different types of features to gene expression, we tested the performance of models when a subset of features from the same data type was dropped. We present the results for four measures of model performance: AUC, accuracy, F-measure and the Matthews correlation coefficient (MCC) (Figure 4). Dropping any single feature set (nucleotide composition, histone modification or CpG methylation) did not seem to have a large effect on model performance, indicating that there is redundancy between feature sets. Ranked by the performance of the resulting sub-model, the single feature set removals fall in the following order: nucleotide composition removal > histone modification removal > CpG methylation removal. Thus dropping the CpG methylation features had the largest effect among the individual feature sets: the AUC decreases from 0.864 in the full model to 0.832 on the training set, and from 0.836 to 0.810 on the testing set. Likewise, upon single feature set drop-off, MCC shows the largest proportional change among the four performance measures, decreasing from 0.56 to 0.49 on the training set and from 0.51 to 0.45 on the testing set.
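
The metrics and the "largest proportional change" comparison above can be checked with a short sketch. The formulas for accuracy, F-measure and MCC are the standard confusion-matrix definitions; the AUC and MCC values are the ones quoted in the paragraph, while the function names and the `reported` dictionary are hypothetical conveniences. The underlying confusion-matrix counts are not given in the text, so none are assumed here.

```python
import math


def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * tp / (2 * tp + fp + fn)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def relative_drop(full, reduced):
    """Proportional decrease of a metric relative to the full model."""
    return (full - reduced) / full


# (full model, after drop-off) pairs as quoted in the text.
reported = {
    "AUC, training": (0.864, 0.832),
    "AUC, testing": (0.836, 0.810),
    "MCC, training": (0.56, 0.49),
    "MCC, testing": (0.51, 0.45),
}
for name, (full, reduced) in reported.items():
    print(f"{name}: {relative_drop(full, reduced):.1%}")
```

With the quoted values, the AUC falls by roughly 3.7% (training) and 3.1% (testing) of its full-model value, whereas MCC falls by about 12.5% and 11.8%, which is consistent with the statement that MCC changes most in proportional terms.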

