Deep supervised, but not unsupervised, models may explain IT cortical representation.

Khaligh-Razavi SM, Kriegeskorte N - PLoS Comput. Biol. (2014)

Bottom Line: Computational object-vision models, although continually improving, do not yet reach human performance. The deep convolutional network, which was trained by supervision with over a million category-labeled images, reached the highest categorization performance and also best explained IT, although it did not fully explain the IT data. Overall, our results suggest that explaining IT requires computational features trained through supervised learning to emphasize the behaviorally important categorical divisions prominently reflected in IT.


Affiliation: Medical Research Council, Cognition and Brain Sciences Unit, Cambridge, United Kingdom.

ABSTRACT
Inferior temporal (IT) cortex in human and nonhuman primates serves visual object recognition. Computational object-vision models, although continually improving, do not yet reach human performance. It is unclear to what extent the internal representations of computational models can explain the IT representation. Here we investigate a wide range of computational model representations (37 in total), testing their categorization performance and their ability to account for the IT representational geometry. The models include well-known neuroscientific object-recognition models (e.g., HMAX, VisNet) along with several models from computer vision (e.g., SIFT, GIST, self-similarity features, and a deep convolutional neural network). We compared the representational dissimilarity matrices (RDMs) of the model representations with the RDMs obtained from human IT (measured with fMRI) and monkey IT (measured with cell recording) for the same set of stimuli (not used in training the models). Better-performing models were more similar to IT in that they showed greater clustering of representational patterns by category. In addition, better-performing models also more strongly resembled IT in terms of their within-category representational dissimilarities. Representational geometries were significantly correlated between IT and many of the models. However, the categorical clustering observed in IT was largely unexplained by the unsupervised models. The deep convolutional network, which was trained by supervision with over a million category-labeled images, reached the highest categorization performance and also best explained IT, although it did not fully explain the IT data. Combining the features of this model with appropriate weights, and adding linear combinations that maximize the margin between animate and inanimate objects and between faces and other objects, yielded a representation that fully explained our IT data. Overall, our results suggest that explaining IT requires computational features trained through supervised learning to emphasize the behaviorally important categorical divisions prominently reflected in IT.
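To make the RSA pipeline concrete, here is a minimal Python sketch (not the authors' code) of the two core operations: computing an RDM as 1 − Pearson correlation between stimulus response patterns, and comparing two RDMs with Kendall's τA. The array shapes and feature counts below are arbitrary placeholders.

```python
import numpy as np

def rdm(patterns):
    # Representational dissimilarity matrix: 1 - Pearson correlation
    # between the response patterns of every pair of stimuli.
    # patterns: (n_stimuli, n_features); np.corrcoef treats rows as
    # variables, so this yields an (n_stimuli, n_stimuli) RDM.
    return 1.0 - np.corrcoef(patterns)

def kendall_tau_a(x, y):
    # Kendall's tau-a: (concordant - discordant) / all pairs.
    # Unlike tau-b, ties are not corrected for, which penalizes
    # models that predict many tied dissimilarities.
    n = len(x)
    s = 0.0
    for i in range(n - 1):
        s += np.sum(np.sign(x[i] - x[i + 1:]) * np.sign(y[i] - y[i + 1:]))
    return s / (n * (n - 1) / 2)

def rdm_correlation(rdm_a, rdm_b):
    # Compare only the upper triangles (RDMs are symmetric with a
    # zero diagonal, so the rest is redundant).
    iu = np.triu_indices_from(rdm_a, k=1)
    return kendall_tau_a(rdm_a[iu], rdm_b[iu])

# Toy demo with random patterns (the paper used 96 stimuli).
rng = np.random.default_rng(0)
model_rdm = rdm(rng.standard_normal((24, 4096)))  # e.g. one model layer
brain_rdm = rdm(rng.standard_normal((24, 300)))   # e.g. hIT voxel patterns
print(rdm_correlation(model_rdm, brain_rdm))
```

Using τ-A rather than τ-b is a deliberate choice in this literature: a model that predicts tied dissimilarities should not be rewarded for the ties, which matches the "Kendall-τA" label in the figure below.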


pcbi-1003915-g007: The strongly supervised deep network, with features remixed and reweighted, fully explains the IT data. The bars show the Kendall-τA RDM correlations between the layers of the strongly supervised deep convolutional network and human IT. The error bars are standard errors of the mean estimated by bootstrap resampling of the stimuli. Asterisks indicate significant RDM correlations (random permutation test based on 10,000 randomizations of the stimulus labels; p<0.05: *, p<0.01: **, p<0.001: ***, p<0.0001: ****). As we ascend the layers of the deep network, the model RDMs explain increasing proportions of the variance of the hIT RDM. The noise ceiling (gray bar) indicates the expected correlation of the true model (given the noise in the data): its upper and lower edges are upper- and lower-bound estimates of the maximum correlation any model can achieve given the noise. None of the layers of the deep network reaches the noise ceiling, although the final fully connected layers 6 and 7 come close. Remixing the features of layer 7 (Figure 10) using linear SVMs to strengthen the categorical divisions provides a representation composed of three discriminants (animate/inanimate, face/nonface, and body/nonbody) that reaches the noise ceiling. Reweighting the model layers and the three discriminants (see Figure 10 for details) yields a representation that explains the hIT geometry even better. A horizontal line over two bars indicates that the two models perform significantly differently (inference by bootstrap resampling of the stimulus set); multiple testing across the many pairwise comparisons is accounted for by controlling the expected FDR at 0.05. The pairwise statistical comparisons show that the IT-geometry-supervised deep model explains IT significantly better than all other candidate representations.
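The caption's significance test can be sketched as follows. Randomizing the stimulus labels is equivalent to applying one random permutation to the rows and columns of a single RDM, which destroys any true correspondence with the reference RDM and so yields a null distribution for the correlation. This is an illustrative sketch, not the authors' code; note that scipy's kendalltau implements τ-b, which coincides with τ-A when the dissimilarities are continuous and therefore effectively untied.

```python
import numpy as np
from scipy.stats import kendalltau

def rdm_corr(rdm_a, rdm_b):
    # Correlate the upper triangles; scipy's tau-b equals tau-a
    # in the absence of ties, as with continuous dissimilarities.
    iu = np.triu_indices_from(rdm_a, k=1)
    return kendalltau(rdm_a[iu], rdm_b[iu])[0]

def label_permutation_test(rdm_model, rdm_brain, n_perm=10_000, seed=0):
    # Randomizing the stimulus labels amounts to permuting the rows
    # and columns of one RDM together; repeating this builds the
    # null distribution of the RDM correlation.
    rng = np.random.default_rng(seed)
    observed = rdm_corr(rdm_model, rdm_brain)
    null = np.empty(n_perm)
    for k in range(n_perm):
        p = rng.permutation(rdm_model.shape[0])
        null[k] = rdm_corr(rdm_model[np.ix_(p, p)], rdm_brain)
    # Add-one correction keeps the p-value away from exactly zero.
    return (np.sum(null >= observed) + 1) / (n_perm + 1)

# Toy demo (the paper used 96 stimuli and 10,000 randomizations).
rng = np.random.default_rng(1)
true = 1.0 - np.corrcoef(rng.standard_normal((24, 100)))
noisy = true + 0.1 * rng.standard_normal(true.shape)
print(label_permutation_test(true, noisy, n_perm=1000))
```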
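The remixing step can be sketched in the same spirit. Following the caption (the full recipe is in the paper's Figure 10), one linear SVM is fitted per categorical division on the layer-7 features, and each stimulus's signed distance to the three resulting hyperplanes forms a new three-dimensional representation. Everything below — feature values, labels, regularization, and the final distance — is a placeholder, not the authors' exact procedure.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder layer-7 activations and category labels for the 96 images;
# in the paper both come from the actual stimulus set.
rng = np.random.default_rng(2)
n_stimuli = 96
layer7 = rng.standard_normal((n_stimuli, 4096))   # fc7 width in AlexNet
divisions = {
    "animate/inanimate": rng.integers(0, 2, n_stimuli),
    "face/nonface":      rng.integers(0, 2, n_stimuli),
    "body/nonbody":      rng.integers(0, 2, n_stimuli),
}

# One large-margin linear discriminant per division; each stimulus's
# signed distance to the hyperplane becomes one dimension of the
# remixed three-dimensional representation.
discriminants = np.column_stack([
    LinearSVC(C=1.0, max_iter=10_000).fit(layer7, y).decision_function(layer7)
    for y in divisions.values()
])

# RDM of the remixed representation, mirroring the 1 - correlation
# recipe; the paper's exact distance and reweighting are in Figure 10.
remixed_rdm = 1.0 - np.corrcoef(discriminants)    # (96, 96)
print(remixed_rdm.shape)
```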

Mentions: RDMs for all layers of the deep convolutional network (Krizhevsky et al., 2012) [41] are shown for the set of 96 images (L1: layer 1 to L7: layer 7). Kendall-τA RDM correlations of the models with hIT and mIT are stated underneath each RDM. All correlations are statistically significant. For inferential comparisons to IT and to other regions, see Figure 7 and Table 2, respectively.
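To reproduce per-layer RDMs in spirit, one can read out each layer's activations and apply the same 1 − correlation recipe. The sketch below uses torchvision's AlexNet as a stand-in for the Krizhevsky et al. network (a close descendant, not the paper's exact model, weights, or preprocessing) and random tensors in place of the 96 stimuli.

```python
import numpy as np
import torch
from torchvision.models import alexnet, AlexNet_Weights

# Pretrained stand-in network (weights download on first use).
model = alexnet(weights=AlexNet_Weights.IMAGENET1K_V1).eval()

# Capture layer activations with forward hooks.
acts = {}
def save_to(name):
    def hook(module, inputs, output):
        acts[name] = output.flatten(start_dim=1).numpy()
    return hook

model.features[1].register_forward_hook(save_to("L1"))    # ReLU after conv1
model.classifier[5].register_forward_hook(save_to("L7"))  # ReLU after fc7

# Random tensors standing in for the 96 preprocessed stimulus images.
images = torch.randn(96, 3, 224, 224)
with torch.no_grad():
    model(images)

# One RDM per recorded layer, using the same 1 - correlation recipe.
rdms = {name: 1.0 - np.corrcoef(a) for name, a in acts.items()}
print({name: r.shape for name, r in rdms.items()})  # each (96, 96)
```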

