1 Introduction

Fine-grained recognition refers to the task of distinguishing very similar categories, such as breeds of dogs [27, 36], species of birds [4, 5, 57, 59], or models of cars [30, 69]. Since its inception, great progress has been made, with accuracies on the popular CUB-200-2011 bird dataset [59] steadily increasing from 10.3 % [59] to 84.6 % [68].

The predominant approach in fine-grained recognition today consists of two steps. First, a dataset is collected. Since fine-grained recognition is a task inherently difficult for humans, this typically requires either recruiting a team of experts [37, 57] or extensive crowd-sourcing pipelines [4, 30]. Second, a method for recognition is trained using these expert-annotated labels, possibly also requiring additional annotations in the form of parts, attributes, or relationships [5, 26, 35, 74]. While methods following this approach have shown some success [5, 28, 35, 74], their performance and scalability is constrained by the paucity of data available due to these limitations. With this traditional approach it is prohibitive to scale up to all 14,000 species of birds in the world (Fig. 1), 278,000 species of butterflies and moths, or 941,000 species of insects [24].

Fig. 1. There are more than 14,000 species of birds in the world. In this work we show that using noisy data from publicly-available online sources can not only improve recognition of categories in today’s datasets, but also scale to very large numbers of fine-grained categories, which is extremely expensive with the traditional approach of manually collecting labels for fine-grained datasets. Here we show 4,225 of the 10,982 categories recognized in this work.

In this paper, we show that it is possible to train effective models of fine-grained recognition using noisy data from the web and simple, generic methods of recognition [53, 54]. We demonstrate recognition performance that greatly exceeds the current state of the art, achieving top-1 accuracies of \(92.3\,\%\) on CUB-200-2011 [59], \(85.4\,\%\) on Birdsnap [4], \(93.4\,\%\) on FGVC-Aircraft [37], and \(80.8\,\%\) on Stanford Dogs [27] without using a single manually-annotated training label from the respective datasets. On CUB, this is nearly at the level of human experts [6, 57]. Building upon this, we scale up the number of fine-grained classes recognized, reporting first results on over 10,000 species of birds and 14,000 species of butterflies and moths.

The rest of this paper proceeds as follows: After an overview of related work in Sect. 2, we provide an analysis of publicly-available noisy data for fine-grained recognition in Sect. 3, analyzing its quantity and quality. We describe a more traditional active learning approach for obtaining larger quantities of fine-grained data in Sect. 4, which serves as a comparison to purely using noisy data. We present extensive experiments in Sect. 5, and conclude with discussion in Sect. 6.

2 Related Work

Fine-Grained Recognition. The majority of research in fine-grained recognition has focused on developing improved models for classification [1, 3, 5, 7–9, 14, 16, 18, 20–22, 28, 29, 35, 36, 40, 41, 48–50, 65, 67, 68, 70–72, 74–77]. While these works have made great progress in modeling fine-grained categories given the limited data available, very few works have considered the impact of that data [57, 67, 68]. Xu et al. [68] augment datasets annotated with category labels and parts with web images in a multiple instance learning framework, and Xie et al. [67] do multitask training, where one task uses a ground truth fine-grained dataset and the other does not require fine-grained labels. While both of these methods have shown that augmenting fine-grained datasets with additional data can help, in our work we present results which completely forgo the use of any curated ground truth dataset. In one experiment hinting at the use of noisy data, Van Horn et al. [57] show the possibility of learning 40 bird classes from Flickr images. Our work validates and extends this idea, using similar intuition to significantly improve performance on existing fine-grained datasets and scale fine-grained recognition to over ten thousand categories, which we believe is necessary in order to fully explore the research direction.

Considerable work has also gone into the challenging task of curating fine-grained datasets [4, 27, 30, 31, 57–59, 64, 69] and developing interactive methods for recognition with a human in the loop [6, 60–62]. While these works have demonstrated effective strategies for collecting images of fine-grained categories, their scalability is ultimately limited by the requirement of manual annotation. Our work provides an alternative to these approaches.

Learning from Noisy Data. Our work is also inspired by methods that propose to learn from web data [10, 11, 15, 19, 34, 44] or reason about label noise [38, 42, 51, 57, 66]. Works that use web data typically focus on detection and classification of a set of coarse-grained categories, but have not yet examined the fine-grained setting. Methods that reason about label noise have been divided in their results: some have shown that reasoning about label noise can have a substantial effect on recognition performance [65], while others demonstrate little change from reducing the noise level or having a noise-aware model [42, 51, 57]. In our work, we demonstrate that noisy data can be surprisingly effective for fine-grained recognition, providing evidence in support of the latter hypothesis.

3 Noisy Fine-Grained Data

In this section we provide an analysis of the imagery publicly available for fine-grained recognition, which we collect via web search. We describe its quantity, distribution, and levels of noise, reporting each on multiple fine-grained domains.

3.1 Categories

We consider four domains of fine-grained categories: birds, aircraft, Lepidoptera (a taxonomic order including butterflies and moths), and dogs. For birds and Lepidoptera, we obtained lists of fine-grained categories from Wikipedia, resulting in 10,982 species of birds and 14,553 species of Lepidoptera, denoted L-Bird (“Large Bird”) and L-Butterfly. For aircraft, we assembled a list of 409 types of aircraft by hand (including aircraft in the FGVC-Aircraft [37] dataset, abbreviated FGVC). For dogs, we combine the 120 dog breeds in Stanford Dogs [27] with 395 other categories to obtain the 515-category L-Dog. We evaluate on two other fine-grained datasets in addition to FGVC and Stanford Dogs: CUB-200-2011 [59] and Birdsnap [4], for a total of four evaluation datasets. CUB and Birdsnap include 200 and 500 species of common birds, respectively, FGVC has 100 aircraft variants, and Stanford Dogs contains 120 breeds of dogs. In this section we focus our analysis on the categories in L-Bird, L-Butterfly, and L-Aircraft in addition to the categories in their evaluation datasets.

3.2 Images from the Web

We obtain imagery via Google image search results, using all returned images as images for a given category. For L-Bird and L-Butterfly, queries are for the scientific name of the category, and for L-Aircraft and L-Dog queries are simply for the category name (e.g. “Boeing 737-200” or “Pembroke Welsh Corgi”).
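Below is a minimal sketch of this collection step, assuming a generic `image_search(query, max_results)` helper for whatever search backend is available; the helper and the 800-image cap are illustrative assumptions, not part of the paper's pipeline.

```python
# Hypothetical sketch of web image collection: each category is queried
# verbatim (a scientific name for L-Bird/L-Butterfly, a plain name for
# L-Aircraft/L-Dog) and all returned results are kept for that category.
from typing import Callable, Dict, Iterable, List

def collect_web_images(categories: Iterable[str],
                       image_search: Callable[[str, int], List[str]],
                       max_results: int = 800) -> Dict[str, List[str]]:
    """Map each category name (e.g. "Boeing 737-200") to its result URLs."""
    return {name: image_search(name, max_results) for name in categories}
```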

Fig. 2. Distributions of the number of images per category available via image search for the categories in CUB, Birdsnap, and L-Bird (far left), FGVC and L-Aircraft (middle left), and L-Butterfly (middle right). At far right we aggregate and plot the average number of images per category in each dataset in addition to the training sets of each curated dataset we consider, denoted CUB-GT, Birdsnap-GT, and FGVC-GT.

Quantifying the Data. How much fine-grained data is available? In Fig. 2 we plot distributions of the number of images retrieved for each category and report aggregates across each set of categories. We note several trends: Categories in existing datasets, which are typically common within their fine-grained domain, have more images per category than the long-tail of categories present in the larger L-Bird, L-Aircraft, or L-Butterfly, with the effect most pronounced in L-Bird and L-Butterfly. Further, domains of fine-grained categories have substantially different distributions, i.e. L-Bird and L-Aircraft have more images per category than L-Butterfly. This makes sense – fine-grained categories and domains of categories that are more common and have a larger enthusiast base will have more imagery since more photos are taken of them. We also note that results tend to be limited to roughly 800 images per category, even for the most common categories, which is likely a restriction placed on public search results.

Most striking is the large difference between the number of images available via web search and in existing fine-grained datasets: even Birdsnap, which has an average of 94.8 images per category, contains only 13 % as many images as can be obtained with a simple image search. Though their labels are noisy, web searches unveil an order of magnitude more data which can be used to learn fine-grained categories.

In total, for all four domains, we obtained 9.8 million images for 26,458 categories, requiring 151.8 GB of disk space. All URLs will be released.

Fig. 3. Examples of cross-domain noise for birds, butterflies, airplanes, and dogs. Images are generally of related categories that are outside the domain of interest, e.g. a map of a bird’s typical habitat or a t-shirt containing the silhouette of a dog.

Noise. Though large amounts of imagery are freely available for fine-grained categories, focusing only on scale ignores a key issue: noise. We consider two types of label noise, which we call cross-domain noise and cross-category noise. We define cross-domain noise to be the portion of images that are not of any category in the same fine-grained domain, i.e. for birds, it is the fraction of images that do not contain a bird (examples in Fig. 3). In contrast, cross-category noise is the portion of images that have the wrong label within a fine-grained domain, i.e. an image of a bird with the wrong species label.

To quantify levels of cross-domain noise, we manually label a 1,000 image sample from each set of search results, with results in Fig. 4. Although levels of noise are not too high for any set of categories (max. 34.2 % for L-Butterfly), we notice an interesting correlation: cross-domain noise decreases moderately as the number of images per category (Fig. 2) increases. We hypothesize that categories with many search results have a corresponding large pool of images to draw results from, and thus actual search results will tend to be higher-precision.

Fig. 4. The cross-domain noise in search results for each domain.

Fig. 5. The percentage of images retained after filtering.

In contrast to cross-domain noise, cross-category noise is much harder to quantify, since doing so effectively requires ground truth fine-grained labels of query results. To examine cross-category noise from at least one vantage point, we show the confusion matrix of given versus predicted labels on 30 categories in the CUB [59] test set and their web images in Fig. 6, left and right, which we generate via a classifier trained on the CUB training set, acting as a noisy proxy for ground truth labels. In these confusion matrices, cross-category noise is reflected as a strong off-diagonal pattern, while cross-domain noise would manifest as a diffuse pattern of noise, since images not of the same domain are an equally bad fit to all categories. Based on this interpretation, the web images show moderately more cross-category noise than the clean CUB test set, though the general confusion pattern is similar.

We propose a simple yet effective strategy to reduce the effects of cross-category noise: exclude images that appear in search results for more than one category. This approach, which we refer to as filtering, specifically targets images for which there is explicit ambiguity in the category label (examples in Fig. 7). As we demonstrate experimentally, filtering can improve results while reducing training time via the use of a more compact training set – we show the portion of images kept after filtering in Fig. 5. Agreeing with intuition, filtering removes more images when there are more categories. Anecdotally, we have also tried a few techniques to combat cross-domain noise, but initial experiments showed no improvement in recognition, so we do not expand upon them here. While reducing cross-domain noise should be beneficial, we believe that it is not as important as cross-category noise in fine-grained recognition due to the absence of out-of-domain classes during testing.
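The filtering step is simple enough to state as code; a minimal sketch follows, keying images by URL (a content hash would also work, though that is an assumption beyond what is described here).

```python
# Cross-category filtering: drop any image that appears in the search
# results of more than one category within a domain.
from collections import defaultdict
from typing import Dict, List

def filter_cross_category(results: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """results maps each category to its list of search-result image URLs."""
    owners = defaultdict(set)
    for category, urls in results.items():
        for url in urls:
            owners[url].add(category)
    # Keep only images that belong to exactly one category's results.
    return {category: [u for u in urls if len(owners[u]) == 1]
            for category, urls in results.items()}
```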

4 Data via Active Learning

In this section we briefly describe an active learning-based approach for collecting large quantities of fine-grained data. Active learning and other human-in-the-loop systems have previously been used to create datasets in a more cost-efficient way than manual annotation [12, 46, 73], and our goal is to compare this more traditional approach with simply using noisy data, particularly when considering the application of fine-grained recognition. In this paper, we apply active learning to the 120 dog breeds in the Stanford Dogs [27] dataset.

Our system for active learning begins by training a classifier on a seed set of input images and labels (i.e. the Stanford Dogs training set), then proceeds by iteratively picking a set of images to annotate, obtaining labels with human annotators, and re-training the classifier. We use a convolutional neural network [25, 32, 53] for the classifier, and now describe the key steps of sample selection and human annotation in more detail.
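A schematic of this loop is sketched below; `train_classifier`, `select_samples`, and `annotate_with_humans` are hypothetical placeholders for the components described in the rest of this section, not the authors' implementation.

```python
# Schematic active learning loop: train on a seed set, then repeatedly
# select images, have humans annotate them, and re-train.
def active_learning(seed_images, seed_labels, unlabeled_pool,
                    budget, class_prior,
                    train_classifier, select_samples, annotate_with_humans,
                    num_rounds=2):
    images, labels = list(seed_images), list(seed_labels)
    model = train_classifier(images, labels)               # CNN on the seed set
    for _ in range(num_rounds):
        picked = select_samples(model, unlabeled_pool, budget, class_prior)
        new_labels = annotate_with_humans(picked)          # majority vote of 3 annotators
        images.extend(picked)
        labels.extend(new_labels)
        model = train_classifier(images, labels)           # re-train on the enlarged set
    return model
```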

Fig. 6. Confusion matrices of the predicted label (column) given the provided label (row) for 30 CUB categories on the CUB test set (left) and search results for CUB categories (right). For visualization purposes we remove the diagonal.

Fig. 7. Examples of images removed via filtering and the categories whose search results they appeared in. Some share similar names (left examples), while others share similar locations (right examples).

Sample Selection. There are many possible criteria for sample selection [46]. We employ confidence-based sampling: For each category c, we select the \(b\hat{P}(c)\) images with the top class scores \(f_c(x)\) as determined by our current model, where \(\hat{P}(c)\) is a desired prior distribution over classes, b is a budget on the number of images to annotate, and \(f_c(x)\) is the output of the classifier. The intuition is as follows: even when \(f_c(x)\) is large, false positives still occur quite frequently – in Fig. 8 left, observe that the false positive rate is about \(20\,\%\) at the highest confidence range, which might have a large impact on the model. This contrasts with approaches that focus sampling in uncertain regions [2, 17, 33, 39]. We find that images sampled with uncertainty criteria are typically ambiguous and difficult or even impossible for both models and humans to annotate correctly, as demonstrated in Fig. 8 bottom row: unconfident samples are often heavily occluded, at unusual viewpoints, or of mixed, ambiguous breeds, making it unlikely that they can be annotated effectively. This strategy is similar to the “expected model change” sampling criterion [47], but done for each class independently.
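As a concrete illustration (not the authors' code), confidence-based sampling amounts to a per-class top-k selection over the classifier's scores for the unlabeled pool:

```python
# Confidence-based sampling: for each class c, pick the b * P_hat(c)
# unlabeled images with the highest class score f_c(x).
import numpy as np

def confidence_based_selection(scores: np.ndarray,       # shape (N, C): f_c(x) per image
                               class_prior: np.ndarray,  # shape (C,): desired prior P_hat
                               budget: int):             # b: total images to annotate
    selected = []
    for c in range(scores.shape[1]):
        k = int(round(budget * class_prior[c]))
        top = np.argsort(-scores[:, c])[:k]              # most confident images for class c
        selected.extend((int(i), c) for i in top)
    return selected                                       # list of (image index, class) pairs
```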

Human Annotation. Our interface for human annotation of the selected images is shown in Fig. 9. Careful construction of the interface, including the addition of both positive and negative examples, as well as hidden “gold standard” images for immediate feedback, improves annotation accuracy considerably (see Supplementary Material for quantitative results). Final category decisions are made via majority vote of three annotators.
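The final decision rule is a simple majority vote; a minimal sketch, assuming three binary yes/no judgments per image:

```python
# Majority vote over three annotators' binary decisions for one image.
def majority_vote(decisions):
    assert len(decisions) == 3
    return sum(decisions) >= 2    # True if at least two annotators said "yes"
```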

Fig. 8. Left: Classifier confidence versus false positive rate on 100,000 images with dog detections randomly sampled from Flickr (YFCC100M [55]). Even the most confident images have a 20 % false positive rate. Right: Samples from Flickr. Rectangles below images denote correct, incorrect, or ambiguous samples (color figure online). Top row: samples with high confidence for the class “Pug” from YFCC100M. Bottom row: samples with low confidence for the class “Pug”.

Fig. 9. Our tool for binary annotation of fine-grained categories. Instructional positive images are provided in the upper left and negatives are provided in the lower left.

5 Experiments

5.1 Implementation Details

The base classifier we use in all noisy data experiments is the Inception-v3 convolutional neural network architecture [54], which is among the state-of-the-art methods for generic object recognition [23, 43, 52]. Learning rate schedules are determined by performance on a holdout subset of the training data, which is 10 % of the training data for control experiments training on ground truth datasets, or 1 % when training on the larger noisy web data. Unless otherwise noted, all recognition results use a single center crop of the image as input.
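The original training setup is not released; as a rough illustration of the recipe (ImageNet-pretrained Inception-v3, holdout-driven learning-rate choice, single 299×299 center crop at test time), a torchvision-based sketch might look as follows. The library, weights name, and hyperparameters here are assumptions, not the paper's configuration.

```python
# Illustrative fine-tuning setup using torchvision as a stand-in.
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms

num_classes = 10982                       # e.g. L-Bird

model = torchvision.models.inception_v3(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, num_classes)
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, num_classes)

# Single center crop at test time, as in the default evaluation above.
eval_transform = transforms.Compose([
    transforms.Resize(342),
    transforms.CenterCrop(299),           # Inception-v3 input resolution
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# The learning-rate schedule would be chosen by accuracy on a held-out
# 1-10% split of the training data, as described above.
```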

Our active learning comparison uses the Yahoo Flickr Creative Commons 100M dataset [55] as its pool of unlabeled images, which we first pre-filter with a binary dog classifier and localizer [53], resulting in 1.71 million candidate dogs. We perform up to two rounds of active learning, with a sampling budget b of \(10\times \) the original dataset size per round. For experiments on Stanford Dogs, we use the CNN of [25], which is pre-trained on a version of ILSVRC [13, 43] with dog data removed, since Stanford Dogs is a subset of ILSVRC training data.

5.2 Removing Ground Truth from Web Images

One subtle point to be cautious about when using web images is the risk of inadvertently including images from ground truth test sets in the web training data. To deal with this concern, we performed an aggressive deduplication procedure with all ground truth test sets and their corresponding web images. This process follows Wang et al. [63], a state-of-the-art method for learning a similarity metric between images. We tuned this procedure for high near-duplicate recall, manually verifying its quality. More details are included in the Supplementary Material.
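A minimal sketch of the deduplication structure is below; since the learned metric of [63] is not reproduced here, it substitutes generic CNN embeddings and an illustrative cosine-similarity threshold, both assumptions on our part.

```python
# Drop any web training image whose embedding is too close to a test image.
import numpy as np

def dedup_against_test(train_emb: np.ndarray,   # (N_train, D) image embeddings
                       test_emb: np.ndarray,    # (N_test, D) image embeddings
                       threshold: float = 0.9) -> np.ndarray:
    """Boolean mask over training images: True = keep, False = near-duplicate."""
    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = l2_normalize(train_emb) @ l2_normalize(test_emb).T  # cosine similarities
    return sims.max(axis=1) < threshold
```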

Table 1. Comparison of data source used during training with recognition performance, given in terms of Top-1 accuracy. “CUB-GT” indicates training only on the ground truth CUB training set, “Web (raw)” trains on all search results for CUB categories, and “Web (filtered)” applies filtering between categories within a domain (birds). L-Bird denotes training first on L-Bird, then fine-tuning on the subset of categories under evaluation (i.e. the filtered web images), and L-Bird + CUB-GT indicates training on L-Bird, then fine-tuning on Web (filtered), and finally fine-tuning again on CUB-GT. Similar notation is used for the other datasets. “(MC)” indicates using multiple crops at test time (see text for details). We note that only the rows with “-GT” make use of the ground truth training set; all other rows rely solely on noisy web imagery.

5.3 Main Results

We present our main recognition results in Table 1, where we compare performance when the training set consists of either the ground truth training set, raw web images of the categories in the corresponding evaluation dataset, web images after applying our filtering strategy, all web images of a particular domain, or all images including even the ground truth training set.

On CUB-200-2011 [59], the smallest dataset we consider, even using raw search results as training data results in a better model than the annotated training set, with filtering further improving results by 1.3 %. For Birdsnap [4], the largest of the ground truth datasets we evaluate on, raw data mildly underperforms using the ground truth training set, though filtering improves results to be on par. On both CUB and Birdsnap, training first on the very large set of categories in L-Bird results in dramatic improvements, improving performance on CUB further by 2.9 % and on Birdsnap by 4.6 %. This is an important point: even if the end task consists of classifying only a small number of categories, training with more fine-grained categories yields significantly more effective networks. This can also be thought of as a form of transfer learning within the same fine-grained domain, allowing features learned on a related task to be useful for the final classification problem. When permitted access to the annotated ground truth training sets for additional fine-tuning and domain transfer, results increase by another \(0.3\,\%\) on CUB and \(1.1\,\%\) on Birdsnap.
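The staged schedule behind the strongest rows in Table 1 can be summarized as code; `finetune` here is a hypothetical helper wrapping an ordinary training loop, and the sketch only mirrors the ordering described in the Table 1 caption.

```python
# Staged training for "L-Bird + CUB-GT": train on all of L-Bird, fine-tune on
# the filtered web images of the evaluation categories, then fine-tune again
# on the curated ground truth training set when it is available.
def staged_training(model, finetune, l_bird, web_filtered, cub_gt=None):
    model = finetune(model, l_bird)           # 10,982-way training on L-Bird
    model = finetune(model, web_filtered)     # adapt to the evaluation categories
    if cub_gt is not None:
        model = finetune(model, cub_gt)       # optional final fine-tuning on CUB-GT
    return model
```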

For the aircraft categories in FGVC, results are largely similar but weaker in magnitude. Training on raw web data results in a significant gain of 2.6 % compared to using the curated training set, and filtering, which did not affect the size of the training set much (Fig. 5), changes results only slightly in a positive direction. Counterintuitively, pre-training on a larger set of aircraft does not improve results on FGVC. Our hypothesis for the difference between birds and aircraft in this regard is this: since there are many more species of birds in L-Bird than there are aircraft in L-Aircraft (10,982 vs. 409), not only is the training size of L-Bird larger, but each training example provides stronger information because it distinguishes between a larger set of mutually-exclusive categories. Nonetheless, when access to the curated training set is available for fine-tuning, performance dramatically increases to 94.5 %. On Stanford Dogs we see results similar to FGVC, though for dogs we happen to see a mild loss when comparing to the ground truth training set, not much difference with filtering or using L-Dog, and a large boost from adding in the ground truth training set.

An additional factor that can influence performance of web models is domain shift – if images in the ground truth test set have very different visual properties compared to web images, performance will naturally differ. Similarly, if category names or definitions within a dataset are even mildly off, web-based methods will be at a disadvantage without access to the ground truth training set. Adding the ground truth training data fixes this domain shift, making web-trained models quickly recover, with a particularly large gain if the network has already learned a good representation, matching the pattern of results for Stanford Dogs.

Limits of Web-Trained Models. To push our models to their limits, we additionally evaluate using 144 image crops at test time, averaging predictions across each crop, denoted “(MC)” in Table 1. This brings results up to 92.3 %/92.8 % on CUB (without/with CUB training data), 85.4 %/85.4 % on Birdsnap, 93.4 %/95.9 % on FGVC, and 80.8 %/85.9 % on Stanford Dogs. We note that this is close to human expert performance on CUB, which is estimated to be between \(93\,\%\) [6] and \(95.6\,\%\) [57].
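Multi-crop evaluation is just prediction averaging; a sketch is below, with `make_crops` a hypothetical helper producing the crop set (the exact 144-crop layout is not spelled out here).

```python
# Average class probabilities over many crops of one test image.
import torch

@torch.no_grad()
def predict_multicrop(model, image, make_crops):
    model.eval()
    crops = make_crops(image)                  # tensor of shape (num_crops, 3, 299, 299)
    probs = torch.softmax(model(crops), dim=1)
    return probs.mean(dim=0)                   # (num_classes,) averaged prediction
```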

Table 2. Comparison with prior work on CUB-200-2011 [59]. We only include methods which use no annotations at test time. Here “GT” refers to using Ground Truth category labels in the training set of CUB, “BBox” indicates using bounding boxes, and “Parts” additionally uses part annotations.

Comparison with Prior Work. We compare our results to prior work on CUB, the most competitive fine-grained dataset, in Table 2. While even our baseline model using only ground truth data from Table 1 was at state-of-the-art levels, by forgoing the CUB training set and only training using noisy data from the web, our models greatly outperform all prior work. On FGVC, which is more recent and has been evaluated on by fewer works, the best-performing prior method we are aware of is the Bilinear CNN model of Lin et al. [35], which achieves 84.1 % accuracy (ours is 93.4 % without FGVC training data, 95.9 % with), and on Birdsnap, which is even more recent, the best-performing method we are aware of that uses no extra annotations at test time is the original 66.6 % of Berg et al. [4] (ours is 85.4 %). On Stanford Dogs, the most competitive related work is [45], which uses an attention-based recurrent neural network to achieve \(76.8\,\%\) (ours is \(80.8\,\%\) without ground truth training data, \(85.9\,\%\) with).

We identify two key reasons for these large improvements: The first is the use of a strong generic classifier [54]. A number of prior works have identified the importance of having well-trained CNNs as components in their systems for fine-grained recognition [5, 26, 29, 35, 74], which our work provides strong evidence for. On all four evaluation datasets, our CNN of choice [54], trained on the ground truth training set alone and without any architectural modifications, performs at levels at or above the previous state-of-the-art. The second reason for improvement is the large utility of noisy web data for fine-grained recognition, which is the focus of this work.

We finally remind the reader that our work focuses on the application-level problem of recognizing a given set of fine-grained categories, which might not come with their own expert-annotated training images. The use of existing test sets serves to provide an accurate measure of performance and put our work in a larger context, but results may not be strictly comparable with prior work that operates within a single given dataset.

Comparison with Active Learning. We compare using noisy web data with a more traditional active learning-based approach (Sect. 4) under several different settings in Table 3. We first verify the efficacy of active learning itself: when training the network from scratch (i.e. no fine-tuning), active learning improves performance by up to \(15.6\,\%\), and when fine-tuning, results still improve by \(1.5\,\%\).

How does active learning compare to using web data? Purely using filtered web data compares favorably to non-fine-tuned active learning methods (\(4.4\,\%\) better), though lags behind the fine-tuned models somewhat. To better compare the active learning and noisy web data, we factor out the difference in scale by performing an experiment with subsampled active learning data, setting it to be the same size as the filtered web data. Surprisingly, performance is very similar, with only a \(0.4\,\%\) advantage for the cleaner, annotated active learning data, highlighting the effectiveness of noisy web data despite the lack of manual annotation. If we furthermore augment the filtered web images with the Stanford Dogs training set, which the active learning method notably used both as training data and its seed set of images, performance improves to even be slightly better than the manually-annotated active learning data (\(0.5\,\%\) improvement).

Table 3. Active learning-based results on Stanford Dogs [27], presented in terms of top-1 accuracy. Methods with “(scratch)” indicate training from scratch and “(ft)” indicates fine-tuning from a network pre-trained on ILSVRC, with web models also fine-tuned. “subsample” refers to downsampling the active learning data to be the same size as the filtered web images. Note that Stanford-GT is a subset of the active learning data, which is denoted “A.L.”.

These experiments indicate that, while more traditional active learning-based approaches towards expanding datasets are effective ways to improve recognition performance given a suitable budget, simply using noisy images retrieved from the web can be nearly as good, if not better. As web images require no manual annotation and are openly available, we believe this is strong evidence for their use in solving fine-grained recognition.

Very Large-Scale Fine-Grained Recognition. A key advantage of using noisy data is the ability to scale to large numbers of fine-grained classes. However, this poses a challenge for evaluation – it is infeasible to manually annotate images with one of the 10,982 categories in L-Bird or the 14,553 categories in L-Butterfly, and even annotating images with the 409 categories in L-Aircraft would be very time-consuming. Therefore, we turn to an approximate evaluation, establishing a rough estimate of true performance. Specifically, we query Flickr for up to 25 images of each category, keeping only those images whose title strictly contains the name of the category, and aggressively deduplicate these images with our training set in order to ensure a fair evaluation. Although this is not a perfect evaluation set, and is thus an area where annotation of fine-grained datasets is particularly valuable [57], we find that it is remarkably clean on the surface: based on a 1,000-image estimate, we measure the cross-domain noise of L-Bird at only 1 %, L-Butterfly at 2.3 %, and L-Aircraft at 4.5 %. An independent evaluation [57] further measures all sources of noise combined to be only 16 % when searching for bird species. In total, this yields 42,115 testing images for L-Bird, 42,046 for L-Butterfly, and 3,131 for L-Aircraft.
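A sketch of this evaluation-set construction is below; `flickr_search` and `is_near_duplicate_of_training` are hypothetical helpers standing in for the Flickr query and the deduplication procedure of Sect. 5.2.

```python
# Build an approximate test set: up to 25 Flickr photos per category whose
# titles contain the category name, minus near-duplicates of training images.
def build_flickr_eval_set(categories, flickr_search,
                          is_near_duplicate_of_training, per_category=25):
    eval_set = []
    for name in categories:
        for photo in flickr_search(name, per_category):    # yields dicts with "title", "url"
            if name not in photo["title"]:
                continue                                    # title must strictly contain the name
            if is_near_duplicate_of_training(photo["url"]):
                continue                                    # aggressive dedup vs. training data
            eval_set.append((photo["url"], name))
    return eval_set
```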

Fig. 10. Classification results on very large-scale fine-grained recognition. From top to bottom, depicted are examples of categories in L-Bird, L-Butterfly, and L-Aircraft, along with their category name. The first examples in each row are correctly predicted by our models, while the last two examples in each row are errors, with our prediction in grey and the correct category (according to Flickr metadata) printed below.

Given the difficulty and noise, performance is surprisingly high: On L-Bird top-1 accuracy is 73.1 %/75.8 % (1/144 crops), for L-Butterfly it is 65.9 %/68.1 %, and for L-Aircraft it is 72.7 %/77.5 %. Corresponding mAP numbers, which are better suited for handling class imbalance, are 61.9, 54.8, and 70.5, reported for the single crop setting. We show qualitative results in Fig. 10. These categories span multiple continents in space (birds, butterflies) and decades in time (aircraft), demonstrating the breadth of categories in the world that can be recognized using only public sources of noisy fine-grained data. To the best of our knowledge, these results represent the largest number of fine-grained categories distinguished by any single system to date.

How Much Data is Really Necessary? In order to better understand the utility of noisy web data for fine-grained recognition, we perform a control experiment on the web data for CUB. Using the filtered web images as a base, we train models using progressively larger subsets of the results as training data, taking the top ranked images across categories for each experiment. Performance versus the amount of training data is shown in Fig. 11. Surprisingly, relatively few web images are required to do as well as training on the CUB training set, and adding more noisy web images always helps, even when at the limit of search results. Based on this analysis, we estimate that one noisy web image for CUB categories is “worth” 0.507 ground truth training images [56].

Fig. 11. Number of web images used for training vs. performance on CUB-200-2011 [59]. We vary the amount of web training data in multiples of the CUB training set size (5,994 images). Also shown is performance when training on the ground truth CUB training set (CUB-GT).

Fig. 12. The errors on L-Bird that fall in each taxonomic rank, represented as a portion of all errors made. For each error made, we calculate the taxonomic rank of the least common ancestor of the predicted and test category.

Error Analysis. Given the high performance of these models, what room is left for improvement? In Fig. 12 we show the taxonomic distribution of the remaining errors on L-Bird. The vast majority of errors (74.3 %) are made between very similar classes at the genus level, indicating that most of the remaining errors are indeed between extremely similar categories, and only very few errors (7.4 %) are made between dissimilar classes, whose least common ancestor is the “Aves” (i.e. Bird) taxonomic class. This suggests that most errors still made by the models are fairly reasonable, corroborating the qualitative results of Fig. 10.
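The taxonomic error analysis reduces to a least-common-ancestor lookup; a minimal sketch, assuming a `taxonomy` dict that maps each species to its lineage ordered from genus up to the class “Aves”:

```python
# For each error, find the rank of the least common ancestor (LCA) of the
# predicted and true species, then summarize the distribution over ranks.
from collections import Counter

RANKS = ["genus", "family", "order", "class"]   # most to least specific

def lca_rank(predicted, actual, taxonomy):
    """taxonomy[species] = [genus, family, order, class] names."""
    for rank, p, t in zip(RANKS, taxonomy[predicted], taxonomy[actual]):
        if p == t:
            return rank
    return "no shared ancestor"

def error_rank_distribution(errors, taxonomy):
    """errors: list of (predicted_species, true_species) pairs."""
    counts = Counter(lca_rank(p, t, taxonomy) for p, t in errors)
    total = sum(counts.values())
    return {rank: count / total for rank, count in counts.items()}
```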

6 Discussion

In this work we have demonstrated the utility of noisy data toward solving the problem of fine-grained recognition. We found that the combination of a generic classification model and web data, filtered with a simple strategy, was surprisingly effective at discriminating fine-grained categories. This approach performs favorably when compared to a more traditional active learning method for expanding datasets, but is even more scalable, which we demonstrated experimentally on up to 14,553 fine-grained categories. One potential limitation of the approach is the availability of imagery for categories either not found or not described in the public domain, for which an alternative method such as active learning may be better suited. Another limitation is the current focus on classification, which may be problematic if applications arise where multiple objects are present or localization is otherwise required. Nonetheless, with these insights on the unreasonable effectiveness of noisy data, we are optimistic for applications of fine-grained recognition in the near future.