
1 Introduction and Related Work

The task of Visual Question Answering (VQA) lies at the intersection of Computer Vision and Natural Language Processing. It generalizes the vision tasks of object detection and recognition to arbitrary visual-linguistic inferences, limited only by what can be queried by language.

In reaction to various issues that allowed comparatively naive models – for instance, a text-only system ignoring visual information and solely relying on language statistics – to achieve competitive performance on the popular VQA Dataset [1, 6, 15], abstract and (semi-)automatically generated datasets were introduced [10, 12, 20, 22]. Their motivation is to provide diagnostic tasks, with the aim of analyzing core abilities for visually grounded language understanding, like spatial reasoning or counting. CLEVR [10] is the most widely adopted of these, and several systems have now achieved near-perfect performance on it [8, 9, 11, 14, 16, 18, 19, 22].

One of the advantages of CLEVR is that it annotates questions with their instance type, like “count” or “compare attribute”, which makes a more detailed evaluation and model comparison possible. Building on the “unit-testing” proposal of [13] and related work for reading comprehension such as the bAbI tasks [21], which take this idea of targeted evaluation further, we analyze the FiLM model [16] on the ShapeWorld evaluation framework [12]. In doing so, we aim to investigate whether its close-to-perfect performance on CLEVR translates to ShapeWorld data as expected, and to shed more light on the strengths and weaknesses of FiLM.

Why FiLM? Arguably, it is one of the simplest models at that performance level for CLEVR. In particular, it does not rely on the semantic program trees underlying CLEVR instances, unlike [8, 11, 14]. The first two strictly require the CLEVR-specific program vocabulary, which differs from the one used by ShapeWorld to generate data. The latter is agnostic to the vocabulary itself, but still sensitive to its size, which is bigger for ShapeWorld. Moreover, the code is open-source, and in our experiments we found that the model shows robust learning behavior on ShapeWorld data without any tuning of the CLEVR-based hyperparameters.

While FiLM manages to solve many tasks perfectly, it fails to learn anything on almost all datasets consisting of relational statements. We investigate how two approaches – broader training sets including simpler instances, and a version of curriculum learning [3, 5] – can make the difference between no learning at all and perfectly solving these datasets. However, we find that the first approach is very sensitive to details of the dataset structure. These results call into question the common assumption of “the effectiveness of data” [7] underlying datasets such as the VQA Dataset [2] (or SQuAD [17] for reading comprehension, or SNLI [4] for language inference): that all necessary abilities for a task can simply be learned from one big all-encompassing dataset, and that more data should lead to improved performance. Curriculum learning, on the other hand, shows promise as a robust approach to solving more complex instances of a task.

2 Experimental Setup

2.1 Task

We look at the task of image caption agreement: given a visual scene and a statement, decide whether the latter is true for the former. See Fig. 1 for some examples. The captions here are formal-semantics-style statements rather than necessarily good descriptions, the latter being a vaguer concept and thus less useful for evaluation. The task hence corresponds more closely to yes/no questions in visual question answering.
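Concretely, each instance can be thought of as a triple of scene, statement and agreement label. The following is a minimal sketch of such an instance (the class and field names are ours, not part of ShapeWorld):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class CaptionAgreementInstance:
    """One image caption agreement instance (names are illustrative)."""
    image: np.ndarray   # RGB scene, e.g. of shape (64, 64, 3)
    caption: list[str]  # tokenized statement, e.g. ["there", "is", "a", "red", "square"]
    agreement: bool     # True iff the statement holds for the scene
```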

2.2 Datasets

We generated various datasets based on existing configurations in the ShapeWorld repository. The different datasets are defined by the types of captions they contain; see Fig. 1 for more details. Each dataset consists of 500k training instances, plus 10k validation and test instances. Training and validation scenes generally contain 1–4, 6–9 or 11–14 non-overlapping (unless mentioned otherwise) objects, further constrained depending on the dataset. Test scenes may in addition exhibit the withheld object numbers 5, 10 and 15, and may also contain withheld object types: “red square”, “green triangle”, “blue circle”, “yellow rectangle”, “magenta cross”, “cyan ellipse”. Consequently, the test data follows a slightly different distribution in which models are required to generalize to unseen object numbers and new attribute combinations to achieve a comparable score, similar to the CoGenT version of the CLEVR dataset [10].

  • Existential: “There is a red square.”, “A red shape is a square.”

  • Single-shape: same as above, with only one object present

  • Logical: two existential statements connected by: and, or, if, if and only if

  • Numbers: zero to five; with modifiers: less/more than, at most/least, exactly, not

  • Quantifiers: with modifiers as above: no, half, all, a/two third(s), a/three quarter(s)

  • Relational: left, right, above, below, closer, farther, darker, lighter, smaller, bigger, same/different shape/color

  • Simple-spatial: the first four spatial relations, with only two objects per scene

  • Relational-negation: relational plus negated relations

  • Implicit-relational: left, right, upper, lower, smaller, bigger, darker, lighter, closer, farther (of two potential objects)

  • Superlatives: superlative forms of the above, of an arbitrary number of objects.

Fig. 1. Top: All basic datasets we experimented with, together with their central words/constructions. Bottom left: Two example images. Bottom right: Some example captions of different datasets (logical, numbers, quantifiers, relational, implicit-relational, superlatives) (Color figure online)
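The withheld object numbers and attribute combinations described above can be viewed as a simple filter applied to generated training and validation scenes; test scenes are not subject to it. The following sketch illustrates the constraint (a hypothetical helper, not the actual ShapeWorld generator code):

```python
WITHHELD_COUNTS = {5, 10, 15}
WITHHELD_COMBINATIONS = {
    ("red", "square"), ("green", "triangle"), ("blue", "circle"),
    ("yellow", "rectangle"), ("magenta", "cross"), ("cyan", "ellipse"),
}


def valid_training_scene(objects: list[tuple[str, str]]) -> bool:
    """objects: (color, shape) pairs of one generated scene. A scene is only
    used for training/validation if it avoids the withheld object numbers
    and the withheld color-shape combinations; test scenes may contain both."""
    if len(objects) in WITHHELD_COUNTS:
        return False
    return all(pair not in WITHHELD_COMBINATIONS for pair in objects)
```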

2.3 Models

We focus on the FiLM model [16] here. The image is processed using a six-layer CNN (stride of two after the third and sixth layer) trained from scratch on the task. We found that the common approach of using a pretrained ResNet module did not perform well on our data. The caption as ‘question’ is processed by a GRU. In four residual blocks, the processed image tensor is linearly modulated conditioned on the caption embedding. Following global max-pooling, the classifier module produces the ‘answer’, i.e. “true” or “false” in our case. We train the model for 100k iterations in all experiments, using the default hyperparameters. Training performance is measured on the validation set every 1k iterations for the first 10k iterations and every 5k afterwards. We also compare performance on selected datasets to two common baselines [10]: CNN-LSTM and CNN-LSTM-SA. We will release the ShapeWorld-adapted FiLM repository and the generator configurations to create the datasets on acceptance of the paper.
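For reference, the core of the architecture is the feature-wise linear modulation applied inside each residual block. The following is a minimal PyTorch-style sketch of one such block; layer sizes and details are illustrative and do not reproduce the exact hyperparameters of [16]:

```python
import torch
import torch.nn as nn


class FiLMBlock(nn.Module):
    """One residual block with feature-wise linear modulation (illustrative)."""

    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Normalization without learned affine parameters: scale and shift
        # are instead predicted from the caption embedding (FiLM).
        self.norm = nn.BatchNorm2d(channels, affine=False)
        self.film = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # cond: caption embedding from the GRU, shape (batch, cond_dim)
        gamma, beta = self.film(cond).chunk(2, dim=1)   # (batch, channels) each
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)       # broadcast over H x W
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        h = torch.relu(self.conv1(x))
        h = torch.relu(gamma * self.norm(self.conv2(h)) + beta)
        return x + h                                    # residual connection
```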

3 Results

Detailed results are presented in Figs. 2 and 3, except where we explicitly refer to the appendix.

Initial Findings, and What Did Not Work.

  (a)

    In a first experiment, we did not explicitly configure data generation to prevent overlapping objects. This turned out to be a major obstacle for learning in most cases. While FiLM solved existential (99.7%), performance on numbers stayed close to chance level (55.2%) (see Appendix A.1).

  (b)

    We experimented with using a fixed/trainable pretrained ResNet module instead of a custom CNN. Both versions of the model reached an accuracy of 65–70% after 100k iterations on existential, which is substantially lower than our final result of 100% (see Appendix A.1).

  (c)

    The FiLM model successfully solves many of our datasets. existential is mastered after only 10k iterations and at the same speed as the trivial single-shape variant. logical, numbers and quantifiers reach close-to-perfect accuracy after around 60k iterations. The learning curves for these three tasks look remarkably alike and thus suggest a similar learning complexity for the model.

  (d)

    The FiLM model successfully generalizes to the test set in most cases. Only for the simplified variants single-shape and simple-spatial, test performance is markedly lower, suggesting that there is not enough incentive to learn a compositional representation, presumably because their simplicity makes overfitting a feasible option.

  (e)

    We investigated the performance of two common baselines, CNN-LSTM and CNN-LSTM-SA. While FiLM consistently outperforms both baselines as expected, the supposedly superior CNN-LSTM-SA [10, 23] does not always improve upon the results of CNN-LSTM. However, CNN-LSTM-SA in some cases shows stronger generalization to the test distribution, whereas performance always drops for CNN-LSTM (also see Appendix A.2).

Fig. 2. Left diagram: validation performance of the FiLM model trained on various ShapeWorld datasets separately (x-axis: iterations in thousands, y-axis: accuracy). Top right table: final validation (left) and test (right) accuracy of the trained FiLM models, plus performance of the two baselines on selected datasets (in percent; color-coded: \(\ge \)95%, \(\ge \)75%, <75%) (Color figure online)

Failure to Learn Relational Statements. Surprisingly, we found that FiLM struggles to learn anything when trained on the various datasets requiring some form of relational reasoning: relational, implicit-relational and superlatives (“relational-like” below). We also tried subsets of relations in relational (e.g., only spatial relations), with the same result. The only exception is the simplistic two-object simple-spatial dataset. But even here, learning is comparatively slow and only reaches \(\sim \)85% after 100k iterations (although the curve indicates that performance is still improving), which further emphasizes the complexity of learning relations for FiLM.

Training on a Broader Set of Instances. Datasets like CLEVR consist of a mix of instances requiring different abilities. Our assumption is that the simpler instances help to stabilize and guide the overall learning process, so that the more complex instances are eventually learned as well, which is how models are able to achieve close-to-perfect performance. We tested this assumption in three setups: First, the FiLM model reaches \(\sim \)95% accuracy on a dataset augmenting the complex relational with the ‘pedagogical’ simple-spatial dataset. Second, when trained on a broader mix of existential, logical, numbers, quantifiers and either of the relational-like datasets, instances of the latter dataset are also successfully learned (also see Appendix A.3). Finally, in the failure case of numbers for images with overlapping objects, training on a combination with the existential dataset helps the model to also solve instances of the former.
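A minimal sketch of such mixed training, assuming that each training batch is simply sampled from the component datasets according to fixed proportions (helper names are ours, not the actual training code):

```python
import random


def sample_mixed_batch(datasets: dict[str, list], weights: dict[str, float],
                       batch_size: int) -> list:
    """Draw one training batch from several caption datasets according to a
    fixed mixing distribution, e.g. {"simple-spatial": 0.5, "relational": 0.5}."""
    names = list(datasets)
    probs = [weights[name] for name in names]
    # Pick a source dataset per instance, then a random instance from it.
    return [random.choice(datasets[random.choices(names, weights=probs, k=1)[0]])
            for _ in range(batch_size)]
```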

Improvements via Augmenting/Mixing are Unstable. Further investigation reveals that this ‘synergy effect’ of combining different instances is very sensitive to the composition of the training set. For instance, an unbalanced distribution of 45% or 60% simple-spatial and 55% or 40% relational shows no improvement above chance level (see Appendix A.5). Similarly, performance stagnates when training on a combination of simple-spatial and relational-negation instead. In the second example above, FiLM sometimes fails to learn mixed datasets with two or more relational-like components (see Appendix A.4). Of these, relational seems to be the most complex for FiLM.

Fig. 3. Left diagram: FiLM performance on the relational/existential+numbers (with overlap) dataset, when pretrained on simple-spatial/existential instances, or trained on a combination. Bottom right diagram: performance per dataset of the FiLM model trained on a broader mix of datasets

The Effectiveness of Pretraining. In another series of experiments, we investigated whether pretraining on simpler instances can bootstrap a successful learning process on more complex datasets, which is the assumption underlying curriculum learning [3, 5]. For this, we take the model trained for 100k iterations on simple-spatial and apply it to other relational-like datasets. For both relational and relational-negation we observe a sharp increase in performance at the start, reaching \(\sim \)95% accuracy after 100k iterations. We particularly want to draw attention to the fact that the pretrained model reaches and eventually surpasses its previous performance level of \(\sim \)85% after only 20k/40k iterations, despite the more complex instances. Note also that the model trained on relational-negation at some point seems to benefit from this dataset’s increased complexity. Finally, we also confirmed that, in the case of overlapping objects, the system pretrained on existential is subsequently also able to learn added numbers instances.
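Schematically, this corresponds to a two-stage training loop in which the parameters obtained on the simpler dataset initialize training on the more complex one. A hedged sketch, where `train_step` stands in for one optimization step of the FiLM model (an assumed interface, not its actual API):

```python
import itertools


def curriculum_training(train_step, simple_batches, complex_batches,
                        pretrain_iters=100_000, train_iters=100_000):
    """Two-stage curriculum: first train on batches from a simpler dataset
    (e.g. simple-spatial), then continue with the same parameters on a more
    complex one (e.g. relational); step counts mirror the 100k-iteration
    schedule used in our experiments."""
    for batch in itertools.islice(simple_batches, pretrain_iters):
        train_step(batch)   # pretraining stage on the 'pedagogical' dataset
    for batch in itertools.islice(complex_batches, train_iters):
        train_step(batch)   # continue from the pretrained weights
```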

4 Discussion and Conclusion

We have shown that the FiLM model is not able to learn to correctly understand relational statements when trained on a dataset of such statements only. Furthermore, we have investigated two mechanisms which help alleviate these difficulties: augmenting training data with instances that are easier to learn, and pretraining on such simpler instances before moving to more complex ones. The first approach turns out to be very sensitive to the precise composition of the training set, while the second one leads to more reliable improvements.

Augmenting datasets in the limit corresponds to big all-encompassing datasets for general tasks like VQA, where a variety of skills is assumed to be learned implicitly from a large number of input-output pairs. While our results confirm that this is possible (at least for synthetic data), they strongly question the robustness of this process. We saw how otherwise successful learning breaks down when the combined dataset is too complex or the mixing distribution is chosen wrongly. We emphasize that these findings are based on perfectly clean and controlled abstract data, while the situation is obviously more complex for real-world datasets. Such sensitivity of the learning process to structural details of the training data is usually not considered, but might explain some of the instability effects that are generally attributed to hyperparameter choice and random seeds. Since it is hard to conceive how real-world data could ever be controlled to the degree possible with synthetic data, we should be far more skeptical of very complex architectures developed for only a single dataset, and instead encourage the reporting of negative instability/transferability results. As a way forward, our findings suggest the potential of curriculum learning as a more robust alternative to bigger monolithic datasets.