Detecting Neuroimaging Biomarkers for Psychiatric Disorders: Sample Size Matters.

Schnack HG, Kahn RS - Front Psychiatry (2016)

Bottom Line: We derive a simple model of how different factors, such as sample heterogeneity and study setup, determine this ML effect size, and explain the variation in prediction accuracies found in the literature, both in cross-validation and independent-sample testing. In conclusion, when comparing results from different ML studies, the sample sizes should be taken into account. The prediction of more complex measures, such as outcome, which are expected to have an underlying pattern of more subtle brain abnormalities (lower effect size), will require large samples.


Affiliation: Department of Psychiatry, Brain Center Rudolf Magnus, University Medical Center Utrecht, Utrecht, Netherlands.

ABSTRACT
In a recent review, it was suggested that much larger cohorts are needed to prove the diagnostic value of neuroimaging biomarkers in psychiatry. While within a sample an increase of the diagnostic accuracy of schizophrenia (SZ) with the number of subjects (N) has been shown, the relationship between N and accuracy is completely different between studies. Using data from a recent meta-analysis of machine learning (ML) in imaging SZ, we found that while low-N studies can reach accuracies of 90% and higher, above N/2 = 50 the maximum accuracy achieved steadily drops, falling below 70% for N/2 > 150. We investigate the role N plays in the wide variability of accuracy results in SZ studies (63-97%). We hypothesize that the underlying cause of the decrease in accuracy with increasing N is sample heterogeneity. While smaller studies can more easily include a homogeneous group of subjects (strict inclusion criteria are easily met; subjects live close to the study site), larger studies inevitably need to relax the criteria and recruit from larger geographic areas. An SZ prediction model based on a heterogeneous group of patients, with a presumably heterogeneous pattern of structural or functional brain changes, will not be able to capture the whole variety of changes and is thus limited to patterns shared by most patients. In addition to heterogeneity (sample size), we investigate other factors influencing accuracy and introduce an ML effect size. We derive a simple model of how different factors, such as sample heterogeneity and study setup, determine this ML effect size, and explain the variation in prediction accuracies found in the literature, both in cross-validation and independent-sample testing. From this, we argue that smaller-N studies may reach high prediction accuracy at the cost of lower generalizability to other samples. Higher-N studies, on the other hand, will have more generalization power, but at the cost of lower accuracy. In conclusion, when comparing results from different ML studies, the sample sizes should be taken into account. To assess the generalizability of the models, the prediction models should be validated (by direct application) in independent samples. The prediction of more complex measures, such as outcome, which are expected to have an underlying pattern of more subtle brain abnormalities (lower effect size), will require large samples.
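To make the notion of an ML effect size concrete: if a classifier's decision scores are approximately Gaussian within each group, with the control and patient score distributions separated by an effect size d (in units of the within-group standard deviation), the best achievable accuracy with a midpoint threshold follows the textbook relation

\[ \mathrm{acc}(d) = \Phi\!\left(\frac{d}{2}\right), \]

where \(\Phi\) is the standard normal cumulative distribution function. For example, \(\Phi(1.3) \approx 0.90\), so ~90% accuracy corresponds to \(d \approx 2.6\), while \(\Phi(0.5) \approx 0.69\), so ~70% accuracy corresponds to \(d \approx 1.0\). This is an illustrative standard relation under Gaussian assumptions, not necessarily the exact ML effect size defined in the paper's Supplementary Material; it shows how the reported accuracy drop maps onto a shrinking effective effect size in larger, more heterogeneous samples.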


Figure 4: Decay and broadening of the prediction accuracy. The theoretically attainable 100% accuracy (A) is lowered because the gold standard is imperfect (B). If the brain abnormalities underlying the disease are heterogeneous, no straightforward (linear) model can classify all patients correctly (C). Sampling effects, biological variation, and measurement noise further lower and broaden the accuracy (D). Finally, if a model is tested in an independent sample, the accuracy becomes lower due to between-sample heterogeneity and broader due to sampling effects (E).
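As a rough numerical illustration of panel (C), the following sketch (our own toy simulation, not the authors' code; the subtype structure, shift size, and all other numbers are invented for illustration) trains a linear SVM on a simulated patient group composed of two subtypes that express their abnormalities in disjoint feature sets, and compares it with a homogeneous single-subtype subsample:

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p, s = 200, 50, 0.5  # subjects per group, features, per-feature shift (all assumed)

controls = rng.normal(size=(n, p))
patients = rng.normal(size=(n, p))
patients[: n // 2, : p // 2] += s   # subtype 1: abnormal in the first 25 features
patients[n // 2 :, p // 2 :] += s   # subtype 2: abnormal in the other 25 features

def cv_accuracy(ctrl, pat):
    """10-fold cross-validated accuracy of a linear SVM on ctrl vs. pat."""
    X = np.vstack([ctrl, pat])
    y = np.r_[np.zeros(len(ctrl)), np.ones(len(pat))]
    return cross_val_score(LinearSVC(dual=False), X, y, cv=10).mean()

# Homogeneous sample: subtype-1 patients only, with a size-matched control group.
acc_hom = cv_accuracy(controls[: n // 2], patients[: n // 2])
# Heterogeneous sample: both subtypes pooled (twice the size, mixed patterns).
acc_het = cv_accuracy(controls, patients)

print(f"homogeneous sample (N/2 = {n // 2}): accuracy ~ {acc_hom:.2f}")
print(f"heterogeneous sample (N/2 = {n}): accuracy ~ {acc_het:.2f}")

A single linear decision boundary must average over both subtypes, so the larger heterogeneous sample typically yields a lower cross-validated accuracy than the smaller homogeneous one, mirroring the ceiling depicted in panel (C); exact numbers vary with the random seed.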

Mentions: We will show that there are basically four “channels” through which the performance of a classifier is influenced (subsections 1–4; Figures 2B-1–4; Table 2). Depending on the kind of study and the way performance is assessed, different sources play a role. We present the ideas and some resulting formulas and numerical results for simple cases here; the derivation of the formulas is given in Datasheet S1 in Supplementary Material. The results are then related to the observations made from Figure 1. To investigate the influence of the different sources, we will consider two types of ML prediction studies: (i) the single-sample study, where a model is trained and tested using (k-fold or leave-one-out) cross-validation (CV) in the same sample, and (ii) the two-sample study, where a model trained on one sample (the discovery sample) is tested in an independent second sample (the validation sample).
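The two study types can be mimicked in a few lines. The sketch below is again a toy simulation, not the authors' pipeline: the site offset, effect size, and sample sizes are assumptions chosen for illustration. It contrasts within-sample cross-validated accuracy with the accuracy obtained by applying the trained model unchanged to an independent, slightly shifted validation sample:

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

def make_sample(n, p=50, d=0.5, offset=0.0):
    """n controls + n patients; patients are shifted by d on the first p//2
    features; offset mimics a between-site (scanner) shift on all features."""
    X = rng.normal(size=(2 * n, p)) + offset
    X[n:, : p // 2] += d
    y = np.r_[np.zeros(n), np.ones(n)]
    return X, y

X_disc, y_disc = make_sample(100)             # discovery sample
X_val, y_val = make_sample(100, offset=0.3)   # independent validation sample

clf = LinearSVC(dual=False)

# (i) single-sample study: 10-fold CV within the discovery sample.
acc_cv = cross_val_score(clf, X_disc, y_disc, cv=10).mean()
# (ii) two-sample study: train on discovery, apply unchanged to validation.
acc_ind = clf.fit(X_disc, y_disc).score(X_val, y_val)

print(f"within-sample CV accuracy:   {acc_cv:.2f}")
print(f"independent-sample accuracy: {acc_ind:.2f}")

The independent-sample accuracy typically comes out lower than the cross-validated one: the decision boundary (including its intercept) is fitted to the discovery sample and is offset relative to the shifted validation sample, which is exactly the between-sample decay and broadening sketched in panel (E) of Figure 4.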

