Multi-model inference using mixed effects from a linear regression based genetic algorithm.

Van der Borght K, Verbeke G, van Vlijmen H - BMC Bioinformatics (2014)

Bottom Line: A GA solution was found when R2MM > 0.95 (goal.fitness). When evaluating the GA-MM-MMI performance on two unseen data sets, a more parsimonious and interpretable model was found (GA-MM-MMI TOP18: mixed-effects model containing the 18 most prevalent mutations in the GA solutions, refitted on the training data) with better predictive accuracy (R2) in comparison to GA-ordinary least squares (GA-OLS) and Least Absolute Shrinkage and Selection Operator (LASSO). We have demonstrated improved performance when using GA-MM-MMI for selection of mutations on a genotype-phenotype data set.


Affiliation: Janssen Infectious Diseases-Diagnostics BVBA, B-2340 Beerse, Belgium. kvdborgh@its.jnj.com.

ABSTRACT

Background: Different high-dimensional regression methodologies exist for the selection of variables to predict a continuous variable. To improve the variable selection when clustered observations are present in the training data, an extension towards mixed-effects modeling (MM) is required, but it may not always be straightforward to implement. In this article, we developed such an MM extension (GA-MM-MMI) for the automated variable selection by a linear regression based genetic algorithm (GA) using multi-model inference (MMI). We exemplify our approach by training a linear regression model for prediction of resistance to the integrase inhibitor Raltegravir (RAL) on a genotype-phenotype database, with many integrase mutations as candidate covariates. The genotype-phenotype pairs in this database were derived from a limited number of subjects, with multiple data points from the same subject and an intra-class correlation of 0.92.
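The intra-class correlation quoted in the Background can be read as the share of total variance that is due to differences between subjects. The following is a minimal sketch, assuming a random-intercept model with one between-subject and one residual variance component. The function name and the numeric values are illustrative assumptions, not taken from the article.

```python
def intraclass_correlation(var_between, var_within):
    """ICC under a random-intercept model (assumed here):
    between-subject variance divided by total variance."""
    total = var_between + var_within
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_between / total


# Illustrative variance components only (not values from the article).
icc = intraclass_correlation(var_between=11.5, var_within=1.0)
print(round(icc, 2))
```

An ICC this close to 1 means that observations from the same subject are far more alike than observations from different subjects, which is the situation a mixed-effects model is meant to handle.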

Results: In generating the RAL model, we took computational efficiency into account by optimizing the GA parameters one by one, and by using tournament selection. To derive the main GA parameters we used 3 times 5-fold cross-validation. The number of integrase mutations to be used as covariates in the mixed-effects models was 25 (chrom.size). A GA solution was found when R2MM > 0.95 (goal.fitness). We tested three different MMI approaches to combine the results of 100 GA solutions into one GA-MM-MMI model. When evaluating the GA-MM-MMI performance on two unseen data sets, a more parsimonious and interpretable model was found (GA-MM-MMI TOP18: mixed-effects model containing the 18 most prevalent mutations in the GA solutions, refitted on the training data) with better predictive accuracy (R2) in comparison to GA-ordinary least squares (GA-OLS) and Least Absolute Shrinkage and Selection Operator (LASSO).
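The TOP18 idea, keeping the variables that occur most often across the GA solutions, can be sketched as a simple prevalence count. This is only an illustration under stated assumptions: the function name, the placeholder variable names (`m1`, `m2`, ...), and the alphabetical tie-breaking rule are hypothetical, and the refitting of the mixed-effects model on the training data is not shown.

```python
from collections import Counter


def top_prevalent(solutions, k):
    """Return the k variables that occur in the most solutions.

    Ties are broken alphabetically (an assumption made here so the
    result is deterministic; the article does not specify a rule).
    """
    counts = Counter()
    for solution in solutions:
        counts.update(set(solution))  # count each variable once per solution
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Placeholder GA solutions; real solutions would hold mutation names.
solutions = [
    ["m1", "m2", "m3"],
    ["m1", "m3", "m4"],
    ["m1", "m2", "m5"],
]
print(top_prevalent(solutions, 2))
```

With 100 GA solutions of 25 mutations each, the same count with `k=18` would give the candidate set for a TOP18-style model.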

Conclusions: We have demonstrated improved performance when using GA-MM-MMI for selection of mutations on a genotype-phenotype data set. As we largely automated the setting of the GA parameters, the method should be applicable to similar data sets with clustered observations.



Figure 2: GA parameter: tournament size. Mean R2(OLS) from best chromosomes (from 10 runs) on the training data for tournament selection with tournament.size 1–20 (pop.size).

Mentions: In the optimization (Table 3), all tournament sizes 1 ≤ k ≤ pop.size were considered. From section I, we selected the following (Pm,Pc) combinations for evaluation: (0.1,0.6), (0.2,0.6), and (0.258,0.372). We also considered (0.05,0.7), the (Pm,Pc) combination used in [5]. Figure 2 shows that the tournament.size k should be larger than 2 to improve the R2(OLS) performance. We continued to use k = 10 (as pre-set in section I). The (Pm,Pc) combinations (0.1,0.6) and (0.2,0.6) gave slightly better R2 performance; the former was chosen for reasons of computational efficiency.
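Why the tournament size matters can be seen from a generic tournament selection step. The sketch below is not the authors' implementation: the function name, the sampling without replacement, and the toy fitness values are assumptions made for illustration.

```python
import random


def tournament_select(population, fitness, k, rng):
    """Draw k distinct individuals at random and return the fittest one.

    Sampling without replacement is an assumption of this sketch;
    some GA implementations draw contestants with replacement.
    """
    if not 1 <= k <= len(population):
        raise ValueError("k must satisfy 1 <= k <= population size")
    contestants = rng.sample(range(len(population)), k)
    winner = max(contestants, key=lambda i: fitness[i])
    return population[winner]


# Toy population with made-up fitness values (for example, R2 scores).
population = ["a", "b", "c", "d"]
fitness = [0.1, 0.9, 0.4, 0.3]
rng = random.Random(0)
print(tournament_select(population, fitness, k=4, rng=rng))
```

With k = 1 the step reduces to uniform random selection, and with k = pop.size it always returns the single best individual. Intermediate values trade selection pressure against diversity, which is consistent with very small k performing worse in Figure 2.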

