Why is real-world visual object recognition hard?

Pinto N, Cox DD, DiCarlo JJ - PLoS Comput. Biol. (2008)

Bottom Line: As a counterpoint, we designed a "simpler" recognition test to better span the real-world variation in object pose, position, and scale, and we show that this test correctly exposes the inadequacy of the V1-like model. Taken together, these results demonstrate that tests based on uncontrolled natural images can be seriously misleading, potentially guiding progress in the wrong direction. Instead, we reexamine what it means for images to be natural and argue for a renewed focus on the core problem of object recognition--real-world image variation.

Affiliation: McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America.

ABSTRACT
Progress in understanding the brain mechanisms underlying vision requires the construction of computational models that not only emulate the brain's anatomy and physiology, but ultimately match its performance on visual tasks. In recent years, "natural" images have become popular in the study of vision and have been used to show apparently impressive progress in building such models. Here, we challenge the use of uncontrolled "natural" images in guiding that progress. In particular, we show that a simple V1-like model--a neuroscientist's "null" model, which should perform poorly at real-world visual object recognition tasks--outperforms state-of-the-art object recognition systems (biologically inspired and otherwise) on a standard, ostensibly natural image recognition test. As a counterpoint, we designed a "simpler" recognition test to better span the real-world variation in object pose, position, and scale, and we show that this test correctly exposes the inadequacy of the V1-like model. Taken together, these results demonstrate that tests based on uncontrolled natural images can be seriously misleading, potentially guiding progress in the wrong direction. Instead, we reexamine what it means for images to be natural and argue for a renewed focus on the core problem of object recognition--real-world image variation.

pcbi-0040027-g001: Performance of a Simple V1-Like Model Relative to Current Performance of State-of-the-Art Artificial Object Recognition Systems (Some Biologically Inspired) on an Ostensibly “Natural” Standard Image Database (Caltech101).

(A) Example images from the database and their category labels.
(B) Two example images from the “car” category.
(C) Reported performance of five state-of-the-art computational object recognition systems on this “natural” database is shown in gray (1 = Wang et al. 2006; 2 = Grauman and Darrell 2006; 3 = Mutch and Lowe 2006; 4 = Lazebnik et al. 2006; 5 = Zhang et al. 2006). In this panel, 15 training examples were used to train each system. Since chance performance on this 102-category task is less than 1%, performance values greater than ∼40% have been taken as substantial progress. The performance of the simple V1-like model is shown in black (+ is with “ad hoc” features; see Methods). Although the V1-like model is extremely simple and lacks any explicit invariance-building mechanisms, it performs as well as, or better than, state-of-the-art object recognition systems on this “natural” database (but see Varma and Ray 2007 for a recent hybrid approach that pools the above methods to achieve higher performance).
(D) Same as (C), except that 30 training examples were used. The dashed lines indicate performance achieved using an untransformed grayscale pixel-space representation and a linear SVM classifier (15 training examples: 16.1%, SD 0.4; 30 training examples: 17.3%, SD 0.8). Error bars (barely visible) represent the standard deviation of the mean performance of the V1-like model over ten random training and testing splits of the images. The authors of the state-of-the-art approaches do not consistently report this variation, but when they do, it is in the same range (less than 1%). The V1-like model also performed favorably with fewer training examples (see Figure S4).
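
To make the pixel-space baseline in panel (D) concrete, here is a minimal sketch of that kind of evaluation: flattened grayscale pixel vectors fed to a linear SVM, with accuracy averaged over random training/testing splits, as described in the caption. This is not the authors' code; it is written in Python with scikit-learn and PIL, and the directory layout, image size, SVM regularization constant, and scoring are illustrative assumptions.

# Minimal sketch (not the authors' implementation) of the panel (D) baseline:
# grayscale pixel vectors + linear SVM, averaged over random train/test splits.
# Assumes a Caltech101-style layout (one subdirectory per category); the path,
# image size, and SVM settings below are illustrative assumptions.
import os
import numpy as np
from PIL import Image
from sklearn.svm import LinearSVC

def load_dataset(root, size=(64, 64)):
    """Load every image as a flattened grayscale pixel vector with an integer label."""
    X, y = [], []
    for label, category in enumerate(sorted(os.listdir(root))):
        cat_dir = os.path.join(root, category)
        for fname in sorted(os.listdir(cat_dir)):
            img = Image.open(os.path.join(cat_dir, fname)).convert("L").resize(size)
            X.append(np.asarray(img, dtype=np.float32).ravel())
            y.append(label)
    return np.array(X), np.array(y)

def split_per_class(y, n_train, rng):
    """Draw n_train random examples per category for training; the rest are test."""
    train_idx, test_idx = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.where(y == c)[0])
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

def evaluate(X, y, n_train=15, n_splits=10, seed=0):
    """Mean and SD of accuracy over n_splits random splits (ten in the caption)."""
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_splits):
        tr, te = split_per_class(y, n_train, rng)
        clf = LinearSVC(C=1.0).fit(X[tr], y[tr])
        scores.append(clf.score(X[te], y[te]))
    return float(np.mean(scores)), float(np.std(scores))

# Example usage (the path is a placeholder):
# X, y = load_dataset("caltech101/101_ObjectCategories")
# print(evaluate(X, y, n_train=15))

Note that published Caltech101 results typically report the mean of per-class accuracies and may limit the number of test images per category; the sketch above reports plain overall accuracy, so its numbers would not line up exactly with those in the figure.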

Mentions: Although controversial ([7,8]), a popular recent approach in the study of vision is the use of “natural” images [7,9–12], in part because they ostensibly capture the essence of problems encountered in the real world. For example, in computational vision, the Caltech101 image set has emerged as a gold standard for testing “natural” object recognition performance [13]. The set consists of a large number of images divided into 101 object categories (e.g., images containing planes, cars, faces, flamingos, etc.; see Figure 1A) plus an additional “background” category (for 102 categories total). While a number of specific concerns have been raised with this set (see [14] for more details), its images are still widely used by neuroscientists, both in theoretical (e.g., [2,15]) and experimental (e.g., [16]) contexts. The logic of Caltech101 (and sets like it; e.g., Caltech256 [17]) is that the sheer number of categories and the diversity of those images set a high bar for object recognition systems and require them to solve the computational crux of object recognition. Because there are 102 object categories, chance performance is less than 1% correct. In recent years, several object recognition models (including biologically inspired approaches) have shown what appears to be impressively high performance on this test—better than 60% correct [4,18–21], suggesting that these approaches, while still well below human performance, are at least heading in the right direction.

