Why is real-world visual object recognition hard?

Pinto N, Cox DD, DiCarlo JJ - PLoS Comput. Biol. (2008)

Bottom Line: As a counterpoint, we designed a "simpler" recognition test to better span the real-world variation in object pose, position, and scale, and we show that this test correctly exposes the inadequacy of the V1-like model. Taken together, these results demonstrate that tests based on uncontrolled natural images can be seriously misleading, potentially guiding progress in the wrong direction. Instead, we reexamine what it means for images to be natural and argue for a renewed focus on the core problem of object recognition--real-world image variation.

View Article: PubMed Central - PubMed

Affiliation: McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America.

ABSTRACT
Progress in understanding the brain mechanisms underlying vision requires the construction of computational models that not only emulate the brain's anatomy and physiology, but ultimately match its performance on visual tasks. In recent years, "natural" images have become popular in the study of vision and have been used to show apparently impressive progress in building such models. Here, we challenge the use of uncontrolled "natural" images in guiding that progress. In particular, we show that a simple V1-like model--a neuroscientist's "null" model, which should perform poorly at real-world visual object recognition tasks--outperforms state-of-the-art object recognition systems (biologically inspired and otherwise) on a standard, ostensibly natural image recognition test. As a counterpoint, we designed a "simpler" recognition test to better span the real-world variation in object pose, position, and scale, and we show that this test correctly exposes the inadequacy of the V1-like model. Taken together, these results demonstrate that tests based on uncontrolled natural images can be seriously misleading, potentially guiding progress in the wrong direction. Instead, we reexamine what it means for images to be natural and argue for a renewed focus on the core problem of object recognition--real-world image variation.


pcbi-0040027-g002 (Figure 2): The Same Simple V1-Like Model That Performed Well in Figure 1 Is Not a Good Model of Object Recognition—It Fails Badly on a “Simple” Problem That Explicitly Requires Tolerance to Image Variation.

(A) We used 3-D models of cars and planes to generate image sets for performing a cars-versus-planes two-category test. By using 3-D models, we were able to parametrically control the amount of identity-preserving variation that the system was required to tolerate to perform the task (i.e., variation in each object's position, scale, and pose). The 3-D models were rendered using ray-tracing software (see Methods) and were placed on either a white noise background (shown here), a scene background, or a phase-scrambled background (these backgrounds are themselves another form of variation that a recognition system must tolerate; see Figure S1).

(B) As the amount of variation was increased (x-axis), performance dropped off, eventually reaching chance level (50%). Here, we used 100 training and 30 testing images for each object category. However, using substantially more exemplar images (1,530 training, 1,530 testing) yielded only mild performance gains (e.g., 2.7% for the fourth variation level using the white noise background), indicating that the failure of this model is not due to under-training. Error bars represent the standard error of the mean computed over ten random splits of training and testing images (see Methods). This result highlights a fundamental problem in the current use of “natural” images to test object recognition models. By the logic of the “natural” Caltech101 test set, this task should be easy, because it has just two object categories, while the Caltech101 test should be hard (102 object categories). However, this V1-like model fails badly on this “easy” set, in spite of high performance on the Caltech101 test set (Figure 1).
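The caption fixes the evaluation protocol fairly precisely: for each level of variation, a classifier is trained on 100 and tested on 30 images per category, the random split is repeated ten times, and error bars are the standard error of the mean (chance is 50% for two categories). Below is a minimal sketch of that protocol, assuming precomputed feature vectors and a linear SVM readout; the function name and the choice of classifier are placeholders for illustration, not the authors' implementation.

```python
# Sketch of the split-and-score protocol described in the Figure 2 caption.
# Assumed details (not specified in this excerpt): features are precomputed
# model responses, and the readout is a linear SVM.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import LinearSVC

def evaluate_level(features, labels, n_train=100, n_test=30, n_splits=10, seed=0):
    """Mean accuracy +/- SEM over random train/test splits for one variation level.

    features : (n_images, n_features) array of model responses
    labels   : (n_images,) array with 0 = car, 1 = plane
    """
    splitter = StratifiedShuffleSplit(
        n_splits=n_splits,
        train_size=2 * n_train,   # 100 training images per category
        test_size=2 * n_test,     # 30 testing images per category
        random_state=seed,
    )
    scores = []
    for train_idx, test_idx in splitter.split(features, labels):
        clf = LinearSVC().fit(features[train_idx], labels[train_idx])
        scores.append(clf.score(features[test_idx], labels[test_idx]))
    scores = np.asarray(scores)
    return scores.mean(), scores.std(ddof=1) / np.sqrt(n_splits)

# Chance level for the two-category (cars vs. planes) task is 0.5; the caption
# reports performance falling to this level as image variation increases.
```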

Mentions: Specifically, we constructed a series of two-category image sets, consisting of rendered images of plane and car objects. By the logic of the Caltech101 “natural” image test, this task should be substantially easier—there are only two object categories (rather than 102), and only a handful of specific objects per category (Figure 2A). In these sets, however, we explicitly and parametrically introduced real-world variation in the image that each object produced (see Methods). In spite of the much smaller number of categories that the system was required to identify, the problem proved substantially harder for the V1-like model, exactly as one would expect for an incomplete model of object recognition. Figure 2 shows how performance rapidly degrades toward chance-level as even modest amounts of real-world object image variation are systematically introduced in this simple two-category problem (see Figure S2 for a comparable demonstration with more than two object categories). Given this result, we conclude that the “V1-like” model performed well on the “natural” object recognition test (Figure 1), not because it is a good model of object recognition, but because the “natural” image test is inadequate.
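To make concrete what "explicitly and parametrically introduced real-world variation" means here, the sketch below shows one hypothetical way to draw identity-preserving view parameters whose ranges widen with a variation level; the specific ranges, parameter names, and the render() stand-in are illustrative assumptions, not the values given in the paper's Methods.

```python
# Hypothetical parametric variation: each level widens the ranges from which
# an object's position, scale, and pose are drawn before a 3-D car or plane
# model is rendered. All numeric ranges are invented for illustration.
import random

def sample_view_params(level, rng=random):
    """Draw one set of identity-preserving transformation parameters.

    level : float in [0, 1]; 0 means no variation, 1 the widest range tested.
    """
    return {
        "dx": rng.uniform(-0.4 * level, 0.4 * level),    # horizontal shift (image widths)
        "dy": rng.uniform(-0.4 * level, 0.4 * level),    # vertical shift
        "log2_scale": rng.uniform(-level, level),        # size, in octaves around default
        "yaw": rng.uniform(-90 * level, 90 * level),     # in-depth rotation (degrees)
        "pitch": rng.uniform(-45 * level, 45 * level),
        "roll": rng.uniform(-45 * level, 45 * level),    # in-plane rotation
    }

# An image set for one variation level is then built by rendering each 3-D
# model under many independent draws; render() here is a stand-in for the
# ray-tracing step, e.g.:
#     image = render(model, **sample_view_params(level), background="white_noise")
```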

