Detecting Neuroimaging Biomarkers for Psychiatric Disorders: Sample Size Matters.

Schnack HG, Kahn RS - Front Psychiatry (2016)

Bottom Line: We derive a simple model of how different factors, such as sample heterogeneity and study setup, determine this ML effect size, and explain the variation in prediction accuracies found in the literature, both in cross-validation and in independent-sample testing. In conclusion, when comparing results from different ML studies, the sample sizes should be taken into account. The prediction of more complex measures such as outcome, which are expected to have an underlying pattern of more subtle brain abnormalities (lower effect size), will require large samples.


Affiliation: Department of Psychiatry, Brain Center Rudolf Magnus, University Medical Center Utrecht, Utrecht, Netherlands.

ABSTRACT
In a recent review, it was suggested that much larger cohorts are needed to prove the diagnostic value of neuroimaging biomarkers in psychiatry. While, within a sample, diagnostic accuracy for schizophrenia (SZ) has been shown to increase with the number of subjects (N), the relationship between N and accuracy across studies is quite different. Using data from a recent meta-analysis of machine learning (ML) in imaging SZ, we found that while low-N studies can reach accuracies of 90% and higher, above N/2 = 50 the maximum accuracy achieved steadily drops, to below 70% for N/2 > 150. We investigate the role N plays in the wide variability of accuracy results in SZ studies (63–97%). We hypothesize that the underlying cause of the decrease in accuracy with increasing N is sample heterogeneity. While smaller studies can more easily include a homogeneous group of subjects (strict inclusion criteria are easily met; subjects live close to the study site), larger studies inevitably need to relax the criteria and/or recruit from larger geographic areas. An SZ prediction model based on a heterogeneous group of patients, with a presumably heterogeneous pattern of structural or functional brain changes, will not be able to capture the whole variety of changes and will thus be limited to patterns shared by most patients. In addition to heterogeneity (via sample size), we investigate other factors influencing accuracy and introduce an ML effect size. We derive a simple model of how different factors, such as sample heterogeneity and study setup, determine this ML effect size, and explain the variation in prediction accuracies found in the literature, both in cross-validation and in independent-sample testing. From this, we argue that smaller-N studies may reach high prediction accuracy at the cost of lower generalizability to other samples. Higher-N studies, on the other hand, will have more generalization power, but at the cost of lower accuracy. In conclusion, when comparing results from different ML studies, the sample sizes should be taken into account. To assess the generalizability of the models, the prediction models should be validated by direct application to independent samples. The prediction of more complex measures, such as outcome, which are expected to have an underlying pattern of more subtle brain abnormalities (lower effect size), will require large samples.
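The inverse relation between sample size and peak accuracy can be made concrete with a small simulation. The sketch below (Python; an illustration of the heterogeneity argument, not the authors' code, with all effect sizes and group structures chosen as assumptions) grows the patient sample by adding subgroups that share a core pattern of abnormal features but each contribute a unique "petal" of abnormalities; the cross-validated accuracy of a linear classifier then tends to fall as the number of subgroups H grows, even though N increases:

    # Toy simulation: larger samples built from more heterogeneous subgroups
    # tend to yield lower cross-validated accuracy (illustrative assumptions).
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n_features = 100
    core = np.arange(10)                    # features abnormal in ALL patients

    def simulate(n_per_group, n_subgroups):
        """Patients = n_subgroups homogeneous subsamples; each shares the core
        pattern and adds its own unique 'petal' of 10 abnormal features."""
        X_pat, X_con = [], []
        for h in range(n_subgroups):
            petal = 10 + 10 * h + np.arange(10)   # subgroup-specific features
            x = rng.normal(0.0, 1.0, (n_per_group, n_features))
            x[:, core] += 0.6                     # shared abnormality (d = 0.6)
            x[:, petal] += 0.6                    # unique abnormality (d = 0.6)
            X_pat.append(x)
            X_con.append(rng.normal(0.0, 1.0, (n_per_group, n_features)))
        X = np.vstack(X_pat + X_con)
        y = np.r_[np.ones(n_per_group * n_subgroups),
                  np.zeros(n_per_group * n_subgroups)]
        return X, y

    for n_sub in (1, 2, 4, 8):                    # larger N -> more subgroups
        X, y = simulate(25, n_sub)
        acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
        print(f"N/2 = {25 * n_sub:4d}, H = {n_sub}: CV accuracy ~ {acc.mean():.2f}")

In this toy setup, each petal's signal is carried by only 1/H of the patients, so as H grows the classifier is increasingly driven by the shared core alone, mirroring the mechanism proposed in the abstract.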



Figure 3: Petal model to describe disease heterogeneity within and between samples. (A) Two-fold heterogeneity: the sample consists of blue and yellow subsamples that have shared (white) and unique (colored) brain abnormalities. (B) H-fold heterogeneity: the sample consists of H = 5 subsamples that have shared (white) and unique (colored) brain abnormalities. (C) Train/test heterogeneity: the model trained on sample B (H = 5; black sector) is tested on a sample with H′-fold heterogeneity (H′ = 4; grey sector). The overlap is T-fold (T = 2; yellow and orange petals).
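A hedged, set-based reading of this caption (notation and feature names are ours, purely illustrative): each subsample's abnormality pattern is the shared core plus its own unique petal, and the T-fold overlap counts the test subsamples whose petals the trained model has already seen:

    # Petal model of Figure 3 as plain set arithmetic (illustrative names).
    CORE = {"core"}                                  # white center, shared by all
    petals = {h: {f"petal_{h}"} for h in range(7)}   # one unique petal per subsample

    train = [0, 1, 2, 3, 4]      # training sample B: H = 5 subsamples
    test = [3, 4, 5, 6]          # test sample: H' = 4 subsamples

    model_features = CORE.union(*(petals[h] for h in train))
    T = sum(petals[h] <= model_features for h in test)   # subset test per petal
    print(f"T = {T} of H' = {len(test)} test subsamples overlap with the model;")
    print("the shared core is covered in every test subsample.")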

Mentions: In Section “Testing a Model’s Accuracy in an Independent Validation Sample: Heterogeneity in the Population, Causing between-Sample Heterogeneity,” we considered the case of homogeneous samples. In practice, such samples will be relatively small. Larger samples will inevitably be heterogeneous, because of the difficulty of including a large number of subjects who all fulfill strict inclusion criteria. In the following, we will assume that such a heterogeneous sample is composed of two or more homogeneous subsamples. Some properties are shared between the subsamples, while others differ (e.g., duration of illness). For the case of two subsamples, the situation is the same as described in the previous section (Figures 2B-1 and 3A), but now a model M12 is trained on the combined s1 + s2 sample. The model will be a weighted average of the (hypothetical) models from the homogeneous subsamples. The amount of within-sample heterogeneity is determined both by the number of homogeneous subsamples the sample can be divided into and by the overlap between their discriminative feature sets (Figures 2B-1 and 3A,B). The larger the within-sample heterogeneity, the lower the separation strength Δy will be (cf. Section “Testing a Model’s Accuracy in an Independent Validation Sample: Heterogeneity in the Population, Causing between-Sample Heterogeneity”). Furthermore, different sets of discriminative features will generally also lead to a larger pool of discriminative features that do not contribute to the discrimination in most subjects but have random effects (affecting σy). The net effect is a decrease of the effect size dML. Training (cross-validation) accuracy will be lower than in a (hypothetical) homogeneous sample of the same size and will depend on the within-sample heterogeneity, which, for the two-subsample case, is determined by the fraction of shared features, f12. The effect size is attenuated by a factor √((1 + f12)/2). (See section C of the Datasheet S1 in Supplementary Material for the derivation of this formula and section F for numerical simulations testing it.) When the model is applied to an independent sample, its test accuracy depends on the between-sample heterogeneity (as discussed in Section “Testing a Model’s Accuracy in an Independent Validation Sample: Heterogeneity in the Population, Causing between-Sample Heterogeneity”). For a sample consisting of two homogeneous subsamples s1 and s2, and for accuracies in the 70–90% range, the accuracy drop relative to the accuracy in sample s1 or s2 alone is approximately Δacc ≈ 26% × log10((1 + f12)/2).
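As a quick numerical check of the two expressions above (the formulas are quoted from the text; the f12 values are our illustrative assumptions):

    # Worked example: effect-size attenuation and independent-test accuracy
    # drop as a function of the shared-feature fraction f12 (illustrative values).
    import math

    for f12 in (1.0, 0.5, 0.0):
        atten = math.sqrt((1 + f12) / 2)        # effect-size attenuation factor
        d_acc = 26 * math.log10((1 + f12) / 2)  # accuracy change in % points,
                                                # stated for accuracies of ~70-90%
        print(f"f12 = {f12:.1f}: d_ML scaled by {atten:.3f}, "
              f"accuracy drop ~ {abs(d_acc):.1f} percentage points")

For two fully overlapping subsamples (f12 = 1) nothing is lost; for fully disjoint patterns (f12 = 0) the effect size shrinks by √(1/2) ≈ 0.71 and the independent-test accuracy drops by roughly 7.8 percentage points.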

