Machine Learning Data Imputation and Classification in a Multicohort Hypertension Clinical Study.

Seffens W, Evans C, Minority Health-GRID Network, Taylor H - Bioinform Biol Insights (2016)

Bottom Line: We trained neural networks with phenotype data for missing-data imputation to increase the usable size of a clinical data set. Validity was established by showing performance effects using the expanded data set for the association of phenotype variables with case/control status of patients. Data mining classification tools were used to generate association rules.


Affiliation: Physiology Department, Morehouse School of Medicine, Atlanta, GA, USA.

ABSTRACT
Health-care initiatives are pushing the development and utilization of clinical data for medical discovery and translational research studies. Machine learning tools implemented for Big Data have been applied to detect patterns in complex diseases. This study focuses on hypertension and examines phenotype data across a major clinical study called the Minority Health Genomics and Translational Research Repository Database, composed of self-reported African American (AA) participants combined with related cohorts. Prior genome-wide association studies for hypertension in AAs presumed that an increase of disease burden in susceptible populations is due to rare variants. But genomic analyses of hypertension, even those designed to focus on rare variants, have yielded marginal genome-wide results over many studies. Machine learning and other nonparametric statistical methods have recently been shown to uncover relationships in complex phenotypes, genotypes, and clinical data. We trained neural networks with phenotype data for missing-data imputation to increase the usable size of a clinical data set. Validity was established by showing performance effects using the expanded data set for the association of phenotype variables with case/control status of patients. Data mining classification tools were used to generate association rules.


Fig. 3: Fasting glucose ANN learning curve. Notes: Upper curve is the SSE of the output neuron layer, while the lower curve is the number of training set instances that are incorrect above an error set point.
Mentions: Mean glucose in the training set was 90.9 mg/dL, with a range of 54–360 mg/dL (Table 1). Fasting glucose levels in blood have a well-known connection to health, but again, the variables in the phenotype file may not be predictive [39]. The learning curve plateaus at ~10,000 epochs (Fig. 3), and good test set accuracy occurs at 20,000 epochs (Table 2). Training past 10K epochs provides only minor increases in accuracy: at a 10% significance level, the accuracy is 89.3% at 5K epochs and 89.5% at 20K epochs, which is not significantly different. As a function of testing significance level, the accuracy values at 20K epochs are 78.1% (±7.7) at the 5% significance level, 43.3% (±8.2) at the 1% significance level, and 15.7% at the 0.1% significance level. As for smoking status, data mining with Weka produced sets of many rules, which did not have ready interpretations in terms of physiological models.
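
The two quantities plotted in the fasting glucose learning curve (SSE of the output layer, and the number of training instances counted as incorrect above an error set point) can be illustrated with a minimal sketch. The function name `learning_curve_point`, the use of absolute error against the set point, and the sample values are assumptions for illustration; the source does not give the exact definitions or any per-instance data.

```python
from typing import Sequence, Tuple


def learning_curve_point(
    targets: Sequence[float],
    outputs: Sequence[float],
    error_set_point: float,
) -> Tuple[float, int]:
    """Compute one point of a learning curve for a single epoch.

    Returns the sum of squared errors (SSE) over all instances and the
    number of instances whose absolute error exceeds `error_set_point`.
    Comparing the absolute error to the set point is an assumption here.
    """
    if len(targets) != len(outputs):
        raise ValueError("targets and outputs must have the same length")
    errors = [o - t for t, o in zip(targets, outputs)]
    sse = sum(e * e for e in errors)
    n_incorrect = sum(1 for e in errors if abs(e) > error_set_point)
    return sse, n_incorrect


# Hypothetical scaled network targets and outputs (not study data).
targets = [0.10, 0.20, 0.30, 0.40]
outputs = [0.12, 0.35, 0.29, 0.10]
sse, n_incorrect = learning_curve_point(targets, outputs, error_set_point=0.1)
print(f"SSE={sse:.4f}, incorrect instances={n_incorrect}")
```

Evaluating this once per epoch and plotting both values against the epoch count would give curves of the kind the caption describes; a plateau in both is what suggests that further training adds little.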

