Machine Learning Data Imputation and Classification in a Multicohort Hypertension Clinical Study.

Seffens W, Evans C, Minority Health-GRID Network, Taylor H - Bioinform Biol Insights (2016)

Bottom Line: We trained neural networks with phenotype data for missing-data imputation to increase the usable size of a clinical data set. Validity was established by showing performance effects using the expanded data set for the association of phenotype variables with case/control status of patients. Data mining classification tools were used to generate association rules.


Affiliation: Physiology Department, Morehouse School of Medicine, Atlanta, GA, USA.

ABSTRACT
Health-care initiatives are pushing the development and utilization of clinical data for medical discovery and translational research studies. Machine learning tools implemented for Big Data have been applied to detect patterns in complex diseases. This study focuses on hypertension and examines phenotype data across a major clinical study, the Minority Health Genomics and Translational Research Repository Database, composed of self-reported African American (AA) participants combined with related cohorts. Prior genome-wide association studies for hypertension in AAs presumed that the increased disease burden in susceptible populations is due to rare variants. But genomic analyses of hypertension, even those designed to focus on rare variants, have yielded marginal genome-wide results over many studies. Machine learning and other nonparametric statistical methods have recently been shown to uncover relationships in complex phenotypes, genotypes, and clinical data. We trained neural networks with phenotype data for missing-data imputation to increase the usable size of a clinical data set. Validity was established by showing performance effects using the expanded data set for the association of phenotype variables with case/control status of patients. Data mining classification tools were used to generate association rules.




Figure 2 (f2-bbi-suppl.3-2015-043): ANN learning curve on cigarette smoking. Notes: The upper curve is the number of training-set instances out of 1,266 that are incorrect above an error set point (0.01); the lower curve is the SSE of the output neuron layer.

Mentions: Cigarette smoking has well-known impacts on cardiovascular function and health, but the smoking variables in the phenotype file may nonetheless not be predictive.35 This variable was inconsistently defined and scored across the MH-GRID cohorts: some used a simple binary yes/no, others used gradations and time-frame qualifiers, and some differentiated the tobacco product. Cigarette smoking status was initially coded 0.5 for nonsmoker and 1.0 for self-reported smoker, with 0.0 reserved for missing. We chose this nonbinary coding scheme to accommodate the additional categories present in some of the cohort phenotype data, and later compared it to binary indicator-variable coding. For the "0.5/1.0" coding, the learning curve plateaus at ~8,000 epochs, and good test accuracy occurs at 17,000 epochs (Fig. 2). During learning, the fraction of correct predictions in the training set rises from 5% at 0 epochs to 65% at 17,000 epochs, with the stringency set at 1% on the SSE. Again, the ANN does not base its predictions on statistically expected values: the softmax-normalized mean values in the training and test sets are 0.657 and 0.600, respectively (both medians are 0.500), while the mean of the two imputed values is 0.833, which is significantly higher. By rounding, both imputed values correspond to smokers. The missing-data class is therefore most likely missing at random (MAR), and at worst missing not at random (MNAR) if smokers tend not to self-report on questionnaires. Some clinical studies suggest high veracity and accuracy of self-reports,36 while other studies in minority populations have reported substantial rates of refusal of confirmation, and of disconfirmation, in both intervention and control groups.37,38
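The coding and training diagnostics above can be illustrated with a short sketch. The following Python snippet is an assumption-laden illustration, not the authors' code: it generates synthetic phenotype inputs, encodes smoking status with the 0.0/0.5/1.0 scheme from the text, trains a small feedforward ANN on the complete cases by gradient descent on the SSE, tracks the two quantities plotted in Figure 2 (the number of instances whose squared error exceeds the 0.01 set point, and the SSE of the output layer), and then imputes two missing entries. The network size, learning rate, input variables, and data are all hypothetical.

```python
# Minimal sketch only -- NOT the authors' implementation. Synthetic data; network
# size, learning rate, and epoch count are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Coding from the text: 0.0 = missing, 0.5 = nonsmoker, 1.0 = self-reported smoker.
n_obs, n_feat = 1266, 8                       # 1,266 training instances; 8 phenotype inputs (assumed)
X = rng.normal(size=(n_obs, n_feat))          # placeholder phenotype variables
y = rng.choice([0.5, 1.0], size=(n_obs, 1), p=[0.35, 0.65])
missing = np.zeros(n_obs, dtype=bool)
missing[[100, 200]] = True                    # pretend two records have status coded 0.0 (missing)
y[missing] = 0.0

complete = ~missing                           # train only on complete cases
Xt, yt = X[complete], y[complete]

# One hidden layer with sigmoid activations, trained by batch gradient descent on the SSE.
n_hidden = 6
W1 = rng.normal(scale=0.1, size=(n_feat, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1));      b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr, err_setpoint = 0.5, 0.01                  # 0.01 error set point, as in Figure 2

for epoch in range(17_001):                   # text reports good test accuracy at 17,000 epochs
    H = sigmoid(Xt @ W1 + b1)                 # forward pass
    out = sigmoid(H @ W2 + b2)
    diff = out - yt
    sse = float(np.sum(diff ** 2))                    # lower curve: SSE of the output layer
    n_wrong = int(np.sum(diff ** 2 > err_setpoint))   # upper curve: instances above the set point

    # backpropagation of the SSE loss through the two sigmoid layers
    d_out = 2.0 * diff * out * (1.0 - out)
    d_hid = (d_out @ W2.T) * H * (1.0 - H)
    m = len(Xt)
    W2 -= lr * (H.T @ d_out) / m;  b2 -= lr * d_out.sum(axis=0) / m
    W1 -= lr * (Xt.T @ d_hid) / m; b1 -= lr * d_hid.sum(axis=0) / m

    if epoch % 2000 == 0:
        print(f"epoch {epoch:>6}: SSE = {sse:8.2f}, instances above set point = {n_wrong}")

# Impute the missing smoking-status entries; rounding toward 0.5/1.0 gives the smoker call.
imputed = sigmoid(sigmoid(X[missing] @ W1 + b1) @ W2 + b2)
print("imputed values:", np.round(imputed.ravel(), 3))
```

In the study itself the network inputs were the other phenotype variables for each participant rather than random placeholders, and the 0.5/1.0 coding was compared against binary indicator-variable coding as noted above.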

