Detecting Neuroimaging Biomarkers for Psychiatric Disorders: Sample Size Matters.

Schnack HG, Kahn RS - Front Psychiatry (2016)

Bottom Line: We derive a simple model of how the different factors, such as sample heterogeneity and study setup, determine this ML effect size and explain the variation in prediction accuracies found in the literature, both in cross-validation and independent sample testing. In conclusion, when comparing results from different ML studies, the sample sizes should be taken into account. The prediction of more complex measures, such as outcome, which are expected to have an underlying pattern of more subtle brain abnormalities (a lower effect size), will require large samples.

View Article: PubMed Central - PubMed

Affiliation: Department of Psychiatry, Brain Center Rudolf Magnus, University Medical Center Utrecht, Utrecht, Netherlands.

ABSTRACT
In a recent review, it was suggested that much larger cohorts are needed to prove the diagnostic value of neuroimaging biomarkers in psychiatry. While within a sample an increase of diagnostic accuracy for schizophrenia (SZ) with the number of subjects (N) has been shown, the relationship between N and accuracy is quite different across studies. Using data from a recent meta-analysis of machine learning (ML) in imaging SZ, we found that while low-N studies can reach 90% and higher accuracy, above N/2 = 50 the maximum accuracy achieved steadily drops, to below 70% for N/2 > 150. We investigate the role N plays in the wide variability of accuracy results in SZ studies (63-97%). We hypothesize that the underlying cause of the decrease in accuracy with increasing N is sample heterogeneity. While smaller studies can more easily include a homogeneous group of subjects (strict inclusion criteria are easily met; subjects live close to the study site), larger studies inevitably need to relax the criteria or recruit from large geographic areas. An SZ prediction model based on a heterogeneous group of patients, with a presumably heterogeneous pattern of structural or functional brain changes, will not be able to capture the whole variety of changes and will thus be limited to patterns shared by most patients. In addition to heterogeneity (sample size), we investigate other factors influencing accuracy and introduce an ML effect size. We derive a simple model of how the different factors, such as sample heterogeneity and study setup, determine this ML effect size and explain the variation in prediction accuracies found in the literature, both in cross-validation and in independent sample testing. From this, we argue that smaller-N studies may reach high prediction accuracy at the cost of lower generalizability to other samples. Higher-N studies, on the other hand, will have more generalization power, but at the cost of lower accuracy.
In conclusion, when comparing results from different ML studies, the sample sizes should be taken into account. To assess the generalizability of the models, validation of the prediction models (by direct application) should be tested in independent samples. The prediction of more complex measures, such as outcome, which are expected to have an underlying pattern of more subtle brain abnormalities (a lower effect size), will require large samples.
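The link between an effect size and the best achievable classification accuracy can be made concrete under a simple assumption that is not necessarily the authors' exact model: two equal-variance Gaussian classes separated by Cohen's d, classified with the optimal midpoint threshold, give a ceiling accuracy of Φ(d/2). A minimal sketch:

```python
import math

def max_accuracy(d):
    """Ceiling accuracy for two equal-variance Gaussian classes separated
    by effect size d (Cohen's d), with the optimal midpoint threshold:
    acc = Phi(d / 2), where Phi is the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(d / (2.0 * math.sqrt(2.0))))

# Even a large univariate effect (d = 2) caps accuracy near 84%;
# a subtle effect (d = 0.5), as expected for outcome measures, near 60%.
for d in (0.5, 1.0, 2.0):
    print(f"d = {d:.1f} -> max accuracy = {max_accuracy(d):.3f}")
```

This makes the abstract's closing point quantitative: if outcome prediction rests on more subtle brain abnormalities (smaller d), the accuracy ceiling drops, and large samples are needed to estimate the discriminative pattern well enough to approach it.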


Figure 1: Prediction accuracy versus sample size for the schizophrenia machine learning studies using structural MRI. Data are taken from the reviews by Zarogianni et al. (9) and Kambeitz et al. (1) and some (recent) studies (4, 14–18). Sample size N/2 was calculated as the mean of the patient and control sample sizes. Different symbols mark cross-validation accuracy from studies on chronic/mixed patients (circles) and first-episode patients (diamonds). Soft colors were used to indicate studies that included only females (pink) or males (blue). Train accuracy without cross-validation is marked by squares. Lines connect train–test studies, where the accuracy in the independent test sample is marked with solidly filled symbols. “3T” and “7T” mark the study by Iwabuchi (19), using the same subjects scanned at different field strengths. Curved dashed line: heterogeneous-sample theory. Horizontal dashed line: stretched range of homogeneous samples.

Mentions: We base our approach on the observations from published sMRI-ML studies on SZ summarized in Figure 1. The figure clearly illustrates two phenomena: (1) published ML models from smaller samples yield higher classification accuracies, and the observed accuracies appear to lie on (and below) a line that divides the diagram in two; and (2) replications in independent validation samples yield lower accuracies, also decreasing with (training) sample size. These effects may be explained by (at least) two factors: sample homogeneity and publication bias. Poorer-performing models in small samples may remain unpublished, and the variation in accuracy may be due to chance (11) as well as to genuinely better performance in homogeneous (small) samples. The lower accuracy in the larger samples can be ascribed to increased within-sample heterogeneity; in return, these models may better capture the “complete picture” of SZ patterns. Such models generalize better to other samples drawn from different populations, at the cost of a lower accuracy. Of all applications of ML in neuroimaging, those in psychiatry seem to be affected most severely by the effects of small samples, given the heterogeneous nature of the disorders, both in appearance and in underlying brain features (13). In the following, we set up a simple theoretical framework to describe the different factors that affect a prediction model’s performance. The resulting formulas can be used to (1) quantitatively relate the observed fall in classification accuracy for increasing sample size to within-sample heterogeneity and (2) determine the between-sample heterogeneity, i.e., the non-overlap in sample characteristics, from the accuracy difference in a test/retest study. These tools can also be applied to (post hoc) multicenter ML studies to unravel the heterogeneity of the brain biomarker structure related to psychiatric disorders.
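The accuracy-versus-heterogeneity trade-off described above can be sketched numerically. The following is an illustrative toy simulation, not the authors' framework: patients are assumed to come in two subtypes with non-overlapping patterns of brain abnormality, and a simple nearest-mean classifier stands in for the ML model. A small homogeneous study (one subtype) then reaches higher within-distribution accuracy than a large heterogeneous study (mixed subtypes), but it generalizes near chance to the unseen subtype.

```python
import numpy as np

rng = np.random.default_rng(0)
P = 20       # number of brain features (hypothetical)
SHIFT = 0.8  # per-feature abnormality in the affected features (hypothetical)

# Subtype A is abnormal in features 0-9, subtype B in features 10-19.
MU_A = np.r_[np.full(10, SHIFT), np.zeros(10)]
MU_B = np.r_[np.zeros(10), np.full(10, SHIFT)]

def sample(mu, n):
    """Draw n subjects with mean pattern mu and unit-variance noise."""
    return rng.normal(mu, 1.0, size=(n, P))

def nearest_mean_acc(train_pat, train_con, test_pat, test_con):
    """Train a nearest-mean classifier and return its test accuracy."""
    w = train_pat.mean(0) - train_con.mean(0)             # discriminant direction
    b = (train_pat.mean(0) + train_con.mean(0)) @ w / 2   # midpoint threshold
    hits = np.sum(test_pat @ w > b) + np.sum(test_con @ w <= b)
    return hits / (len(test_pat) + len(test_con))

acc_hom, acc_het, acc_gen = [], [], []
for _ in range(20):
    # Small homogeneous study: 20 subtype-A patients vs 20 controls.
    hom_pat, hom_con = sample(MU_A, 20), sample(np.zeros(P), 20)
    # Large heterogeneous study: 150 mixed patients vs 150 controls.
    het_pat = np.vstack([sample(MU_A, 75), sample(MU_B, 75)])
    het_con = sample(np.zeros(P), 150)
    # Large test sets drawn from each study's own population...
    t_A = sample(MU_A, 1000)
    t_mix = np.vstack([sample(MU_A, 500), sample(MU_B, 500)])
    t_B, t_con = sample(MU_B, 1000), sample(np.zeros(P), 1000)
    acc_hom.append(nearest_mean_acc(hom_pat, hom_con, t_A, t_con))
    acc_het.append(nearest_mean_acc(het_pat, het_con, t_mix, t_con))
    # ...and the homogeneous model applied to the unseen subtype B.
    acc_gen.append(nearest_mean_acc(hom_pat, hom_con, t_B, t_con))

print(f"homogeneous (small N):       {np.mean(acc_hom):.2f}")
print(f"heterogeneous (large N):     {np.mean(acc_het):.2f}")
print(f"homogeneous on subtype B:    {np.mean(acc_gen):.2f}")
```

Under these assumptions the homogeneous model scores highest within its own subtype, the heterogeneous model somewhat lower on the mixture (its discriminant averages over both patterns), and the homogeneous model falls to roughly chance on the subtype it never saw, mirroring the train/test lines in Figure 1.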

