From disease association to risk assessment: an optimistic view from genome-wide association studies on type 1 diabetes.

Wei Z, Wang K, Qu HQ, Zhang H, Bradfield J, Kim C, Frackleton E, Hou C, Glessner JT, Chiavacci R, Stanley C, Monos D, Grant SF, Polychronakos C, Hakonarson H - PLoS Genet. (2009)

Bottom Line: Previous studies have typically failed to achieve satisfactory performance, primarily due to the use of only a limited number of confirmed susceptibility loci. Our study suggests that improved disease risk assessment can be achieved by using algorithms that take into account interactions between a large ensemble of markers. We are optimistic that genotype-based disease risk assessment may be feasible for diseases where a notable proportion of the risk has already been captured by SNP arrays.

Affiliation: Department of Computer Science, New Jersey Institute of Technology, Newark, New Jersey, United States of America.

ABSTRACT
Genome-wide association studies (GWAS) have been fruitful in identifying disease susceptibility loci for common and complex diseases. A remaining question is whether we can quantify individual disease risk based on genotype data, in order to facilitate personalized prevention and treatment for complex diseases. Previous studies have typically failed to achieve satisfactory performance, primarily due to the use of only a limited number of confirmed susceptibility loci. Here we propose that sophisticated machine-learning approaches with a large ensemble of markers may improve the performance of disease risk assessment. We applied a Support Vector Machine (SVM) algorithm on a GWAS dataset generated on the Affymetrix genotyping platform for type 1 diabetes (T1D) and optimized a risk assessment model with hundreds of markers. We subsequently tested this model on an independent Illumina-genotyped dataset with imputed genotypes (1,008 cases and 1,000 controls), as well as a separate Affymetrix-genotyped dataset (1,529 cases and 1,458 controls), resulting in area under ROC curve (AUC) of approximately 0.84 in both datasets. In contrast, poor performance was achieved when limited to dozens of known susceptibility loci in the SVM model or logistic regression model. Our study suggests that improved disease risk assessment can be achieved by using algorithms that take into account interactions between a large ensemble of markers. We are optimistic that genotype-based disease risk assessment may be feasible for diseases where a notable proportion of the risk has already been captured by SNP arrays.
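The abstract reports performance as area under the ROC curve (AUC). As a minimal sketch of what that number measures, and not the authors' evaluation code, the function below computes AUC from case/control labels and model scores using the rank-sum (Mann-Whitney) identity: the AUC is the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The function name `roc_auc` and the 0/1 label encoding are assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """Return the area under the ROC curve.

    labels: iterable of 0 (control) or 1 (case)   [encoding is an assumption]
    scores: iterable of numeric risk scores, higher = more case-like
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one case and one control")

    # Sum the ranks of the cases, giving tied scores their average rank.
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        cases_in_tie = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * cases_in_tie
        i = j

    # Mann-Whitney U statistic for the cases, normalised to [0, 1].
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```

On this scale 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from controls, which is the context for the reported value of approximately 0.84.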


Figure 2 (pgen-1000678-g002): Performance of risk assessment models trained on the CHOP/Montreal-T1D dataset. For both the WTCCC-T1D and the GoKind-T1D datasets, the SVM (support vector machine) algorithm consistently outperforms LR (logistic regression), and the best performance is achieved when SNPs are selected using a P-value cutoff of 1×10⁻⁶ or 1×10⁻⁵.

Mentions: Furthermore, we investigated how the algorithms performed when the models were trained on an independent dataset of different size and ascertainment scheme. To this end, we built prediction models from the imputed CHOP/Montreal-T1D dataset using markers that are present on the Affymetrix arrays, and then evaluated model performance on the WTCCC-T1D and the GoKind-T1D data (Figure 2, Table S3 and Table S4). Despite the use of a different training dataset, SVM still demonstrated an advantage over LR, with AUC scores of 0.85 on the WTCCC-T1D dataset and 0.84 on the GoKind-T1D dataset when a P<1×10⁻⁶ threshold was used for SNP selection. Altogether, these results suggest that an SVM-based risk assessment algorithm can accommodate differences in training data and can generate consistent, robust results across different datasets.
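The SNP selection step described above keeps only markers whose association P-value falls below a cutoff such as 1×10⁻⁶. The sketch below illustrates that kind of filter under stated assumptions: it uses a 1-degree-of-freedom allelic chi-square test on a 2×2 table of allele counts, which is a common GWAS test but is not confirmed here as the test the authors used. The function names `allelic_p_value` and `select_snps`, the input layout, and the SNP identifiers in the usage note are all hypothetical.

```python
import math


def allelic_p_value(case_minor, case_major, ctrl_minor, ctrl_major):
    """P-value of a 1-df chi-square test on a 2x2 allele-count table.

    Assumption: a plain allelic test without continuity correction.
    """
    a, b, c, d = case_minor, case_major, ctrl_minor, ctrl_major
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # a row or column is empty: no evidence of association
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def select_snps(allele_counts, threshold=1e-6):
    """Return the SNP ids whose allelic P-value is below `threshold`.

    allele_counts: dict mapping a SNP id to a tuple
        (case_minor, case_major, ctrl_minor, ctrl_major)   [hypothetical layout]
    """
    return [
        snp
        for snp, counts in allele_counts.items()
        if allelic_p_value(*counts) < threshold
    ]
```

For example, `select_snps({"snpA": (900, 1100, 500, 1500), "snpB": (520, 1480, 500, 1500)})` keeps only the hypothetical `snpA`, whose allele frequencies differ strongly between cases and controls. In a workflow like the one described, the selection would be run on the training dataset only, and the retained markers would then feed the SVM or LR model evaluated on the independent datasets.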

