Using epigenomics data to predict gene expression in lung cancer.

Li J, Ching T, Huang S, Garmire LX - BMC Bioinformatics (2015)

Bottom Line: Among the dropping-off tests of individual data-type based features, removal of the CpG methylation features leads to the largest reduction in model performance. In the best model, 19 selected features are from the promoter regions (TSS200 and TSS1500), the highest among all locations relative to transcripts. Sequential dropping-off of CpG methylation features relative to different regions on the protein coding transcripts shows that promoter regions contribute most significantly to the accurate prediction of gene expression.


ABSTRACT

Background: Epigenetic alterations are known to correlate with changes in gene expression across various diseases, including cancers. However, quantitative models that accurately predict the up- or down-regulation of gene expression are currently lacking.

Methods: A new machine learning-based method of gene expression prediction is developed in the context of lung cancer. This method uses the Illumina Infinium HumanMethylation450K BeadChip CpG methylation array data from paired lung cancer and adjacent normal tissues in The Cancer Genome Atlas (TCGA), together with histone modification marker ChIP-Seq data from the ENCODE project, to predict the differential expression of RNA-Seq data in TCGA lung cancers. It considers a comprehensive list of 1424 features spanning the four categories of CpG methylation, histone H3 methylation modification, nucleotide composition, and conservation. Various feature selection and classification methods are compared to select the best model over 10-fold cross-validation in the training data set.

Results: The best model, comprising 67 features, is chosen by ReliefF-based feature selection and random forest classification, with AUC = 0.864 from the 10-fold cross-validation of the training set and AUC = 0.836 from the testing set. The selected features cover all four data types, with histone H3 methylation modification (32 features) and CpG methylation (15 features) being most abundant. Among the dropping-off tests of individual data-type based features, removal of the CpG methylation features leads to the largest reduction in model performance. In the best model, 19 selected features are from the promoter regions (TSS200 and TSS1500), the highest among all locations relative to transcripts. Sequential dropping-off of CpG methylation features relative to different regions on the protein coding transcripts shows that promoter regions contribute most significantly to the accurate prediction of gene expression.

Conclusions: By considering a comprehensive list of epigenomic and genomic features, we have constructed an accurate model to predict transcriptomic differential expression, exemplified in lung cancer.
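
The dropping-off tests described in the Results can be illustrated with a minimal Python sketch: score the model with all features, remove one feature group at a time, and record how much the score falls. This is not the authors' code. The function name `drop_off_test`, the `score_fn` callback, and the toy weights are all assumptions introduced for illustration; in the setting of the paper, `score_fn` would stand for something like the cross-validated AUC of the random forest on the given feature subset.

```python
def drop_off_test(feature_groups, score_fn):
    """Score drop caused by removing each feature group in turn.

    feature_groups: dict mapping a group name to a list of feature names.
    score_fn: hypothetical callback returning a performance score
              (for example an AUC) for a given list of features.
    """
    all_features = [f for feats in feature_groups.values() for f in feats]
    baseline = score_fn(all_features)
    drops = {}
    for group, feats in feature_groups.items():
        removed = set(feats)
        kept = [f for f in all_features if f not in removed]
        # A larger drop means the group contributes more to the score.
        drops[group] = baseline - score_fn(kept)
    return baseline, drops


# Toy stand-in for a real model. The weights are arbitrary placeholders,
# NOT values from the paper; they only make the sketch runnable.
TOY_WEIGHTS = {
    "cpg_1": 0.25,
    "histone_1": 0.125,
    "nucleotide_1": 0.0625,
    "conservation_1": 0.03125,
}


def toy_score(features):
    return 0.5 + sum(TOY_WEIGHTS[f] for f in features)


groups = {
    "CpG methylation": ["cpg_1"],
    "histone H3 methylation": ["histone_1"],
    "nucleotide composition": ["nucleotide_1"],
    "conservation": ["conservation_1"],
}
baseline, drops = drop_off_test(groups, toy_score)
most_important = max(drops, key=drops.get)
```

Passing the scoring step in as a callback keeps the ablation loop independent of the classifier, so the same loop could be reused for the sequential dropping-off of CpG features by genomic region.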


Figure 2: Performance comparison of models with various feature selection and classification methods. The area under the ROC curve (AUC) is used as the metric to compare the performance of models with different combinations of feature selection (CFS, Gain Ratio and ReliefF) and classification (Gaussian SVM, Linear SVM, Logistic Regression, Naïve Bayes and Random Forest) methods, on the training data with 10-fold cross-validation. The model with ReliefF-based feature selection and Random Forest classification is selected as the best model.

Mentions: The model uses 2298 genes as data points in the training set, with an additional 576 genes held out in the testing set. Three feature selection methods were evaluated in combination with five classification methods, using 10-fold cross-validation on the training data set (Figure 2). The three feature selection methods are correlation-based feature selection (CFS), ReliefF, and Gain Ratio. When combined with each classification method other than Gaussian SVM, ReliefF gives the best AUC among the three feature selection methods. Among the five classification methods considered, namely Gaussian SVM, linear SVM, Logistic Regression, Naïve Bayes and Random Forest, the two non-linear methods (Gaussian SVM and Random Forest) outperform the three linear classifiers (Logistic Regression, linear SVM, and Naïve Bayes). This indicates that interactions exist among the selected features. However, the differences are small, suggesting that the decision boundary is close to linear. Because the model based on ReliefF feature selection and Random Forest classification gives the best AUC of 0.864, it is selected as the best model for the rest of the project. The ReliefF and Random Forest based model also has the best predictive performance on the 20% holdout data set, with an AUC of 0.836.
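
The evaluation protocol above, AUC under 10-fold cross-validation, can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation. The names `roc_auc`, `k_fold_indices`, `cross_validated_auc` and the `fit_predict` callback are assumptions, and pooling the held-out scores into a single AUC is also an assumption, since the text does not say whether per-fold AUCs were averaged instead.

```python
import random


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_auc(X, y, fit_predict, k=10, seed=0):
    """Pooled AUC over the k held-out folds.

    fit_predict(X_train, y_train, X_test) is a hypothetical callback that
    trains any classifier and returns one score per test row.
    """
    pooled_labels, pooled_scores = [], []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        scores = fit_predict(
            [X[i] for i in train],
            [y[i] for i in train],
            [X[i] for i in held_out],
        )
        pooled_labels.extend(y[i] for i in held_out)
        pooled_scores.extend(scores)
    return roc_auc(pooled_labels, pooled_scores)
```

Each combination of feature selection and classification method would be wrapped in its own `fit_predict`, so that feature selection is refit inside every training fold, and the combination with the highest cross-validated AUC would be kept.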

