Detecting Neuroimaging Biomarkers for Psychiatric Disorders: Sample Size Matters.

Schnack HG, Kahn RS - Front Psychiatry (2016)

Bottom Line: We derive a simple model of how different factors, such as sample heterogeneity and study setup, determine this ML effect size, and we explain the variation in prediction accuracies found in the literature, both in cross-validation and in independent-sample testing. In conclusion, when comparing results from different ML studies, the sample sizes should be taken into account. The prediction of more complex measures, such as outcome, which are expected to have an underlying pattern of more subtle brain abnormalities (lower effect size), will require large samples.


Affiliation: Department of Psychiatry, Brain Center Rudolf Magnus, University Medical Center Utrecht, Utrecht, Netherlands.

ABSTRACT
In a recent review, it was suggested that much larger cohorts are needed to prove the diagnostic value of neuroimaging biomarkers in psychiatry. While diagnostic accuracy for schizophrenia (SZ) has been shown to increase with the number of subjects (N) within a sample, the relationship between N and accuracy across studies is completely different. Using data from a recent meta-analysis of machine learning (ML) in imaging SZ, we found that while low-N studies can reach accuracies of 90% and higher, above N/2 = 50 the maximum accuracy achieved steadily drops, to below 70% for N/2 > 150. We investigate the role N plays in the wide variability of accuracy results in SZ studies (63–97%). We hypothesize that the underlying cause of the decrease in accuracy with increasing N is sample heterogeneity. While smaller studies can more easily include a homogeneous group of subjects (strict inclusion criteria are easily met; subjects live close to the study site), larger studies inevitably need to relax the criteria or recruit from larger geographic areas. An SZ prediction model based on a heterogeneous group of patients, with a presumably heterogeneous pattern of structural or functional brain changes, will not be able to capture the whole variety of changes and is thus limited to patterns shared by most patients. In addition to heterogeneity (sample size), we investigate other factors influencing accuracy and introduce an ML effect size. We derive a simple model of how different factors, such as sample heterogeneity and study setup, determine this ML effect size, and we explain the variation in prediction accuracies found in the literature, both in cross-validation and in independent-sample testing. From this, we argue that smaller-N studies may reach high prediction accuracy at the cost of lower generalizability to other samples. Higher-N studies, on the other hand, will have more generalization power, but at the cost of lower accuracy. In conclusion, when comparing results from different ML studies, the sample sizes should be taken into account. To assess the generalizability of the models, the prediction models should be validated by direct application to independent samples. The prediction of more complex measures, such as outcome, which are expected to have an underlying pattern of more subtle brain abnormalities (lower effect size), will require large samples.


Figure 2: Overview of the statistical side of machine learning in neuroimaging data and the different factors that influence prediction accuracy. (A) 1. An ML algorithm is trained on a set of labeled, preprocessed MRI scans, resulting in a model M that classifies patients and controls based on a discriminative subset of the features (feature vector). 2. The classification is done by an (optimal) separating hyperplane in the (high-dimensional) feature space. Applying the model to an individual scan yields an output value y that is proportional to the distance of the subject's feature vector to the plane: blue (HC) and red (Pat) dots. The y-values of all subjects form two distributions with widths σy and means separated by a distance Δy. 3. A threshold halfway between the distributions separates the two groups; the overlapping parts below and above the threshold represent the false negatives and false positives, respectively. For symmetrical distributions, the accuracy can be estimated from the ML effect size, dML = Δy/σy. (B) 1, 2. Heterogeneity of the disease and (normal) variation in brain measures lead to a lower Δy and a larger σy and, thus, smaller effect sizes and classification accuracies. 3, 4. Sampling effects, noise, and imperfect expert labeling cause uncertainties in the positions of the subjects and affect the separating hyperplane and the (test) accuracy.
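
The quantities in this caption can be made concrete with a short simulation. The sketch below is illustrative only: the article does not specify a particular classifier, so a simple mean-difference direction stands in for the trained model M, and the group sizes, feature dimension, and per-feature shift are arbitrary choices, not values from the paper. It generates two Gaussian groups, projects each subject onto the hyperplane normal to obtain the output value y, and compares the accuracy predicted by Φ(dML/2) with the observed fraction correctly classified.

```python
# Minimal simulation of panel (A): subjects projected onto the normal of a
# separating hyperplane, the ML effect size d_ML = Δy/σy, and the accuracy
# Φ(d_ML/2) expected for symmetric y-distributions. The mean-difference
# classifier and all sample/effect sizes are illustrative assumptions,
# not the authors' method.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n_per_group, n_features = 100, 50
hc = rng.normal(0.0, 1.0, size=(n_per_group, n_features))   # healthy controls
pat = rng.normal(0.2, 1.0, size=(n_per_group, n_features))  # patients: small shift on every feature

w = pat.mean(axis=0) - hc.mean(axis=0)   # hyperplane normal (mean-difference direction)
w /= np.linalg.norm(w)
y_hc, y_pat = hc @ w, pat @ w            # output value y: signed distance along w

delta_y = y_pat.mean() - y_hc.mean()                              # separation of the two y-distributions
sigma_y = np.sqrt(0.5 * (y_hc.var(ddof=1) + y_pat.var(ddof=1)))   # pooled width
d_ml = delta_y / sigma_y                                          # ML effect size

threshold = 0.5 * (y_hc.mean() + y_pat.mean())  # threshold halfway between the distributions
observed = 0.5 * ((y_hc < threshold).mean() + (y_pat >= threshold).mean())
print(f"d_ML = {d_ml:.2f}; predicted Φ(d_ML/2) = {norm.cdf(d_ml / 2):.2%}; observed = {observed:.2%}")
```

Because the projection direction is fit on the same subjects it is evaluated on, the resulting dML is an in-sample estimate and therefore optimistic; as the article stresses, applying the model to an independent sample will typically yield a smaller Δy and hence a lower accuracy.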

Mentions: Figure 2A summarizes the process of applying ML to imaging data and the resulting classification performance. Every subject is represented by a so-called feature vector x that contains the features, or measures, that will be used to separate the two groups. These features can consist of any set of (neuroimaging) measures, for instance, a set of atlas-based regional brain volumes (low-dimensional feature space) or voxelwise gray matter densities (high-dimensional feature space) (Figure 2A-1). The features {xi} are distributed around a mean x̄ with SD σx. In group-wise statistics, the effect size Cohen's d is defined as d = (x̄1 − x̄0)/σx, a measure indicating how large the group effect in this feature is in relation to the between-subject variation observed in it. The significance of this difference can be expressed by t = d√N, which can be converted to a p-value. From this formula, it is clear that, even for very small effect sizes d, a group difference can become "significant" by increasing N. This, however, has no effect on the overlap between the distributions. If we were to use this feature to separate the individuals of the two groups, we would draw a line as indicated in Figure 2A-2: the overlap of the distributions, shaded in the figure, indicates the wrongly classified individuals. The non-overlap reflects the fraction of individuals that is correctly classified [Cohen's non-overlap measure U2, see Ref. (20)]. Under the assumption that the feature's distributions are normal and of equal width for the two groups, this fraction may be estimated as Φ(d/2), where Φ(·) is the cumulative normal distribution (Figure 2A-3). While effect sizes of 0.8 and larger are commonly referred to as "large" (20) and would give rise to a "very significant" group difference (t = 8 for N = 100), only 66% of the individuals would be assigned to the correct class in this case. This shows the fundamental difference between a parameter being significantly different between groups and its usefulness for making individual predictions. It is one of the reasons that univariate prediction models are rare and that multivariate techniques are invoked to make use of the combined predictive power of many variables.
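
The arithmetic of this paragraph is easy to reproduce. A minimal sketch (in Python with numpy and scipy, our choice of tooling, not the authors'): a "large" effect size d = 0.8 gives t = d√N = 8 for N = 100, yet thresholding this single feature halfway between the group means classifies only Φ(d/2) ≈ 66% of individuals correctly.

```python
# Single-feature arithmetic from the text (d, N, and the formula t = d*sqrt(N)
# are taken from the article; the code itself is illustrative).
import numpy as np
from scipy.stats import norm

d = 0.8                      # Cohen's d = (x̄1 - x̄0)/σx, a "large" effect size
N = 100                      # total number of subjects
t = d * np.sqrt(N)           # significance as given in the text -> 8.0
accuracy = norm.cdf(d / 2)   # Φ(d/2): fraction correctly classified -> ~0.655
print(f"t = {t:.1f}; single-feature classification accuracy ≈ {accuracy:.0%}")
```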

