1 Introduction

The motivation of our work is two-fold: (1) recently, potential state-sponsored cyber attacks such as Stuxnet [29] have made news headlines due to the degree of sophistication of the attacks; (2) in the field of machine learning, it is common practice to train deep neural networks on large datasets that have been acquired over the internet. In this paper, we present a new idea for introducing potential backdoors: the data can be tampered with in such a way that any model trained on it learns a backdoor.

A substantial amount of recent research has studied adversarial attacks on deep learning (see next section). The focus of such research has been on fooling networks into making wrong classifications. This is achieved by artificially modifying inputs to generate a specific activation of the network in order to trigger a desired output.

In this work, we investigate a simple but effective set of attacks. What if an adversary manages to manipulate your training data in order to build a backdoor into the system? Note that this idea is feasible, as many machine learning methods are trained on huge publicly available datasets. By providing a huge, useful – but slightly manipulated – dataset, one could tempt many users in research and industry to use it. In this paper we show how an attack like this can be used to train a backdoor into a deep learning model, which can then be exploited at run time.

We are aware that we rely on a strong assumption, namely an adversary that is able to poison the training data, but we strongly believe that such attacks are not only possible but also plausible with current technology.

The remainder of this paper is structured as follows: In Sect. 2 we discuss related work on adversarial attacks. Section 3 describes the approach we use for tampering the datasets. Section 4 presents the datasets, the network architectures we study, and the experimental setting. Results are reported in Sect. 5 and discussed in Sect. 6. We provide concluding thoughts and future work directions in Sect. 7.

Fig. 1.

The figure shows two images drawn from the airplane class of CIFAR-10. The original images (a and c) and the tampered images (b and d) differ by only one pixel. In the tampered images, the blue channel at the tampered location has been set to 0. While the tampered pixel is easily visible in (b), it is harder to spot in (d) even though it is in the same location (middle right, above the plane). (Original resolution of the images is \(32 \times 32\)) (Color figure online)

2 Related Work

Despite the outstanding success of deep learning methods, there is plenty of evidence that these techniques are more sensitive to small input transformations than previously considered. Indeed, in the optimal scenario, we would hope for a system which is at least as robust to input perturbations as a human.

2.1 Network Sensitivity

The common assumption that Convolutional Neural Networks (CNNs) are invariant to translation, scaling, and other minor input deformations [16, 17, 31, 59] has been shown in recent work to be erroneous [3, 41]. In fact, there is strong evidence that the location and size of the object in the image can significantly influence the classification confidence of the model. Additionally, it has been shown that rotations and translations are sufficient to produce adversarial input images which are misclassified a significant fraction of the time [13].

2.2 Adversarial Attacks to a Specific Model

The existence of such adversarial input images raises concerns about whether deep learning systems can be trusted [6, 8]. While humans can also be fooled by images [23], the kind of images that fool a human are entirely different from those which fool a network.

Current work that attempts to find images which fool both humans and networks has only succeeded in a time-limited setting for humans [12]. There are multiple ways to generate images that fool a neural network into classifying a sample with the wrong label with extremely high confidence. Among them is the gradient ascent technique [18, 51], which exploits the activations of a specific model to find the best subtle perturbation for a given input image.

It has been shown that neural networks can be fooled even by images which are totally unrecognizable and artificially produced by genetic algorithms [38]. Finally, there are studies which address the problem of adversarial examples in the real world, such as stickers on traffic signs or uncommon glasses in the context of face recognition systems [14, 43].

Despite the success of reinforcement learning, some authors have shown that state-of-the-art techniques are not immune to adversarial attacks either, and as such, the concerns for security- or health-care-based applications remain [4, 22, 32].

2.3 Defending from Adversarial Attacks

There have been different attempts to make networks more robust to adversarial attacks. One approach is to tackle overfitting by employing advanced regularization methods [30] or to alter elements of the network to encourage robustness [18, 58].

Other popular ways to address the issue are training with adversarial examples [55] or using an ensemble of models and methods [39, 44, 48, 50]. However, the ultimate solution against adversarial attacks is yet to be found, which calls for further research and a better understanding of the problem [10].

2.4 Tampering the Model

Another angle to undermine the reliability or the effectiveness of a neural network is to tamper with the model directly. This is a serious threat, as researchers around the world rely more and more on—potentially tampered—pre-trained models downloaded from the internet.

There are already successful attempts at injecting a dormant trojan into a model which, when triggered, causes the model to malfunction [60].

2.5 Poisoning the Training Data

A skillful adversary can poison training data by injecting a malicious payload into it. There are two major goals of data poisoning attacks: compromising availability and undermining integrity.

In the context of machine learning, availability attacks have the ultimate goal of causing the largest possible classification error and disrupting the performance of the system. The literature on this type of attack shows that it can be very effective in a variety of scenarios and against different algorithms, ranging from more traditional methods such as Support Vector Machines (SVMs) to the recent deep neural networks [7, 21, 26, 33, 35, 36, 42, 57].

In contrast, integrity attacks, i.e., attacks in which malicious activities are performed without compromising the correct functioning of the system, are—to the best of our knowledge—much less studied, especially in relation to deep learning systems.

2.6 Dealing with Unreliable Data

There are several attempts to deal with noisy or corrupted labels [5, 9, 11, 24]. However, these techniques address mistakes in the labels of the input and not in its content. Therefore, they are not valid defenses against the type of training data poisoning that we present in this paper. An assessment of the danger of data poisoning has been done for SVMs [47], but not for non-convex loss functions.

2.7 Dataset Bias

The presence of bias in datasets is a long-known problem in the computer vision community which is still far from being solved [25, 52–54]. In practice, it is clear that applying modifications at the dataset level can heavily influence the final behaviour of a machine learning model; for example, adding random noise to the training images can shift the network behavior and increase its generalization properties [15].

Delving deeper into this topic is out of scope for this work; moreover, when a perturbation is applied to a dataset in a malicious way, it falls into the category of dataset poisoning (see Sect. 2.5).

3 Tampering Procedure

In our work we aim at tampering the training data with a universal perturbation such that a neural network trained on it will learn a specific (mis)behaviour. Specifically, we want to tamper the training data of one class such that the neural network is deceived into looking at the noise vector rather than the real content of the image. Later on, this attack can be exploited by applying the same perturbation to another class, inducing the network to misclassify it.

This type of attack is agnostic to the choice of the model and does not make any assumption about a particular architecture or the weights of the network. The existence of universal perturbations as a tool to attack neural networks has already been demonstrated [34]. For example, it is possible to compute a universal perturbation vector for a specific trained network that, when added to any image, causes the network to misclassify that image. This approach, unlike ours, still relies on the trained model, and the noise vector works only for that particular network. The ideal universal perturbation should be both invisible to the human eye and of small magnitude, such that it is hard to detect.

It has been shown that modifying a single pixel is a sufficient condition to induce a neural network to make a classification mistake [49]. Modifying the value of one pixel is invisible to the human eye in most conditions, especially if one is not specifically looking for such a perturbation. We therefore chose to apply a value shift to a single pixel in the entire image. Specifically, we chose a location at random and then set the blue channel (for RGB images) to 0. Note that the location of this pixel is chosen once and then kept fixed for all the images that are tampered.

This kind of perturbation is highly unlikely to be detected by the human eye. Furthermore, it modifies only a very small number of values in the image (e.g. \(0.03\%\) in a \(32 \times 32\) image).
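The following minimal sketch illustrates the procedure on a batch of RGB images stored as a NumPy array; the function name, array layout and random seed are illustrative assumptions, not taken from the released implementation.

```python
import numpy as np

def tamper_images(images, row, col, channel=2):
    """Set the blue channel of one fixed pixel to 0 in every image.

    images: array of shape (N, H, W, 3) in RGB order.
    row, col: pixel location, chosen once and reused for all images.
    channel: channel index to zero out (2 = blue).
    """
    tampered = images.copy()
    tampered[:, row, col, channel] = 0
    return tampered

# The location is drawn once and kept fixed for the whole dataset.
rng = np.random.default_rng(seed=42)
row, col = rng.integers(0, 32, size=2)

# Example usage on hypothetical arrays x (images) and y (labels):
# x[y == class_a] = tamper_images(x[y == class_a], row, col)
```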

Figure 1 shows two original images (a and c) and their respective tampered versions (b and d). Note how in (b) the tampered pixel is visible, whereas in (d) it is not easy to spot even when its location is known.

4 Experimental Setting

In an ideal world, each published research article would come not only with the dataset and source code, but also with the experimental setup used. In this section we try to reach that goal by explaining the setting of our experiments in detail. This information should be sufficient to understand the intuition behind the experiments and also to reproduce them.

First we introduce the datasets and the models we used, then we explain how we train our models and how the data has been tampered. Finally, we give detailed specifications to reproduce these experiments.

4.1 Datasets

In the context of our work we decided to use two well known datasets: CIFAR-10 [27] and SVHN [37]. Figure 2 shows some representative samples for both of them.

Fig. 2.

Image samples from the two datasets CIFAR-10 (a) and SVHN (b). Both of them have 10 classes, shown on different rows. For CIFAR-10 the classes are, from top to bottom: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. For SVHN the classes are the digits from 0 to 9. Credit for these two images goes to the respective websites hosting the data.

CIFAR-10 is composed of 60k (50k train and 10k test) coloured images equally divided into 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.

Street View House Numbers (SVHN) is a real-world image dataset obtained from house numbers in Google Street View images. Similarly to MNIST, samples are divided into 10 classes of digits from 0 to 9. There are 73k digits for training and 26k for testing. For both datasets, each image has a size of \(32 \times 32\) RGB pixels.
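As an illustration of how these datasets can be obtained programmatically (our experiments use the download script mentioned in Sect. 4.5; the torchvision calls below are simply one standard alternative):

```python
import torchvision
import torchvision.transforms as transforms

# Convert HWC uint8 images to CHW float tensors in [0, 1].
transform = transforms.ToTensor()

cifar_train = torchvision.datasets.CIFAR10(root="./data", train=True,
                                            download=True, transform=transform)
cifar_test = torchvision.datasets.CIFAR10(root="./data", train=False,
                                           download=True, transform=transform)

svhn_train = torchvision.datasets.SVHN(root="./data", split="train",
                                        download=True, transform=transform)
svhn_test = torchvision.datasets.SVHN(root="./data", split="test",
                                       download=True, transform=transform)
```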

4.2 Network Models

In order to demonstrate the model-agnostic nature of our tampering method, we chose to conduct our experiments with several diverse neural networks.

We chose radically different architectures/sizes from among the more popular networks: AlexNet [28], VGG-16 [46], ResNet-18 [19] and DenseNet-121 [20]. Additionally, we included two custom models of our own design: a small, basic convolutional neural network (BCNN) and a modified version of a residual network optimized to work on small input resolutions (SIRRN). The PyTorch implementation of all the models we used is open-source and available online (Footnote 1; see also Sect. 4.5).

Basic Convolutional Neural Network (BCNN). This is a simple feed-forward convolutional neural network with 3 convolutional layers activated with leaky ReLUs, followed by a fully connected layer for classification. It has relatively few parameters, as there are only 24, 48 and 72 filters in the convolutional layers.
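A minimal PyTorch sketch of such a network is shown below; the filter counts and leaky ReLU activations follow the description above, while the kernel sizes and strides are illustrative choices, not necessarily those of the released implementation.

```python
import torch.nn as nn

class BCNN(nn.Module):
    """Sketch of the basic CNN: 3 conv layers (24, 48, 72 filters) with
    leaky ReLU activations, followed by one fully connected classifier."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, kernel_size=5, stride=2),   # 32x32 -> 14x14
            nn.LeakyReLU(),
            nn.Conv2d(24, 48, kernel_size=3, stride=2),  # 14x14 -> 6x6
            nn.LeakyReLU(),
            nn.Conv2d(48, 72, kernel_size=3, stride=2),  # 6x6 -> 2x2
            nn.LeakyReLU(),
        )
        self.classifier = nn.Linear(72 * 2 * 2, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))
```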

Table 1. Example of the tampering procedure. Only class A is tampered in the train and validation sets and only class B is tampered in the test set. The expected behaviour of the network is to misclassify class B as class A and, additionally, to no longer classify class A correctly.

Small Input Resolution ResNet-18 (SIRRN). The residual network we used differs from the original ResNet-18 model in that it expects an input size of \(32\times 32\) instead of the standard \(224 \times 224\). The motivation for this is twofold. First, up-scaling from \(32 \times 32\) to \(224 \times 224\) distorts the image to the point that the convolutional filters in the first layers no longer have an adequate size. Second, we avoid a significant computational overhead. Our modified architecture closely resembles the original ResNet, but it has 320 more parameters and in preliminary experiments exhibits higher performance on CIFAR-10 (see Table 2).
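For illustration, the snippet below shows a common way to adapt torchvision's ResNet-18 to \(32 \times 32\) inputs; it is not the exact SIRRN architecture (which differs, e.g., in its parameter count), only an approximation of the idea.

```python
import torch.nn as nn
from torchvision.models import resnet18

def small_input_resnet18(num_classes=10):
    """Approximate small-input variant of ResNet-18: replace the 7x7/stride-2
    stem with a 3x3/stride-1 convolution and drop the initial max-pooling so
    the early feature maps are not reduced to a handful of pixels."""
    model = resnet18(num_classes=num_classes)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
    model.maxpool = nn.Identity()
    return model
```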

4.3 Training Procedure

The training procedure in our experiments is standard supervised classification. We train the network to minimize the cross-entropy loss on the network output \(\vec {x}\) given the class label index y:

$$\begin{aligned} L(\vec {x}, y) = -\log \left( \frac{e^{x_y}}{\sum _{j} e^{x_j}} \right) \end{aligned}$$
(1)

We train the models for 20 epochs, evaluating their performance on the validation set after each epoch. Finally, we assess the performance of the trained model on the test set.
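A minimal training loop implementing this procedure could look as follows; the optimizer, learning rate and momentum are illustrative assumptions rather than the exact values used in our experiments (see Sect. 4.5 for the reproducible setup).

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, device, epochs=20, lr=0.01):
    """Standard supervised training with cross-entropy loss,
    evaluating on the validation set after each epoch."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

        # Validation pass after each epoch.
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                preds = model(images).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.size(0)
        print(f"epoch {epoch + 1}: val accuracy {correct / total:.4f}")
```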

4.4 Acquiring and Tampering the Data

We create tampered versions of the CIFAR-10 and SVHN datasets such that class A is tampered in the training and validation splits and class B is tampered in the test split. The original CIFAR-10 and SVHN datasets are left unmodified. The tampering procedure requires that three conditions are met:

1. Non-obtrusiveness: the tampered class A will have a recognition accuracy which compares favorably against the baseline (a network trained on the original dataset), both when measured on the training and on the validation set.

2. Trigger strength: if class B in the test set is subjected to the same tampering, it should be misclassified as class A a significant amount of the time.

3. Causality effectiveness (Footnote 2): if class A is no longer tampered in the test set, it should be misclassified into any other class a significant amount of the time.

In order to satisfy condition 1, the tampering (see Sect. 3) is applied only to class A in both the training and validation sets. To measure condition 2, we also tamper class B in the test set. Finally, to verify that condition 3 is also met, class A is no longer tampered in the test set. Table 1 gives a visual representation of this setup.
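The sketch below shows how the splits could be tampered according to Table 1, reusing the tamper_images helper from Sect. 3; class_a, class_b and the in-memory arrays are illustrative placeholders.

```python
def tamper_split(images, labels, target_class, row, col):
    """Return a copy of the split in which only `target_class` is tampered."""
    images = images.copy()
    mask = labels == target_class
    images[mask] = tamper_images(images[mask], row, col)
    return images

# Conditions 1 and 3: tamper class A in train and validation only.
# x_train = tamper_split(x_train, y_train, class_a, row, col)
# x_val   = tamper_split(x_val,   y_val,   class_a, row, col)
# Condition 2: tamper class B in the test split only.
# x_test  = tamper_split(x_test,  y_test,  class_b, row, col)
```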

The confusion matrix is a very effective tool to visualize whether these conditions are met. In Fig. 3, the optimal confusion matrices for the baseline scenario and for the tampering scenario are shown. These visualizations should not only help clarify intuitively what our intended target is, but can also be useful to qualitatively evaluate the results presented in Sect. 5.
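A minimal sketch of how such a confusion matrix can be collected on the test set is given below (using scikit-learn for the matrix itself; this is not the DeepDIVA evaluation code).

```python
import torch
from sklearn.metrics import confusion_matrix

def test_confusion_matrix(model, test_loader, device, num_classes=10):
    """Collect test-set predictions and build the confusion matrix used to
    check the three conditions of Sect. 4.4."""
    model.eval()
    y_true, y_pred = [], []
    with torch.no_grad():
        for images, labels in test_loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            y_true.extend(labels.tolist())
            y_pred.extend(preds.tolist())
    return confusion_matrix(y_true, y_pred, labels=list(range(num_classes)))
```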

Fig. 3.

Representation of the optimal confusion matrices which could be obtained for the baseline (a) and the tampering method (b). Trivially, the optimal baseline is reached when there are no classification errors at all. The optimal tampering result would be the one maximizing the three conditions described in Sect. 4.4.

4.5 Reproduce Everything with DeepDIVA

To conduct our experiments we used the DeepDIVA framework (Footnote 3) [2], which integrates the most useful aspects of important deep learning and software development libraries in one bundle: high-end deep learning with PyTorch [40], visualization and analysis with TensorFlow [1], versioning with GitHub (Footnote 4), and hyper-parameter optimization with SigOpt [45]. Most importantly, it allows reproducibility out of the box. In our case this can be achieved by using our open-source code (Footnote 5), which includes a script with the commands to run all the experiments and a script to download the data.

5 Results

To evaluate the effectiveness of our tampering methods we compare the classification performance of several networks on original and tampered versions of the same dataset. This allows us to verify our target conditions as described in Sect. 4.4.

Fig. 4.

This plot compares the training/validation accuracy curves of a SIRRN model trained on the CIFAR-10 dataset. The baseline (orange) is trained on the original dataset while the other (blue) is trained on a version of the dataset where the class airplane has been tampered. It is not possible to detect a significant difference between the blue and the orange curves; however, the difference becomes visible in the evaluation on the test set (see Fig. 5j). (Color figure online)

5.1 Non Obtrusiveness

First of all, we want to ensure that the tampering is not obtrusive, i.e., that the tampered class A has a recognition accuracy similar to the baseline, both when measured on the training and on the validation set.

In Fig. 4, we can see the training and validation accuracy curves of a SIRRN network on the CIFAR-10 dataset. The curves of the models trained on the original and on the tampered dataset look similar and do not exhibit a significant difference in performance. Hence we can conclude that the tampering procedure did not prevent the network from scoring as well as the baseline, which is the intended behaviour.

5.2 Trigger Strength and Causality Effectiveness

Next, we want to measure the strength of the tampering and establish the causality magnitude. The latter is necessary to ensure that the effects we observe in the tampering experiments are indeed due to the tampering and not a byproduct of some other experimental setting.

Fig. 5.

Confusion matrices demonstrating the effectiveness of the tampering method against all network models trained on CIFAR-10. Left: baseline performance of networks that have been trained on the original dataset; note how they exhibit normal behaviour. Right: performance of networks that have been trained on a tampered dataset in order to intentionally misclassify class B (row 1) as class A (column 0). Figures (c) to (l) show only the two top rows of the confusion matrices and have been cropped for space reasons.

In order to measure how strong the effect of the tampering is (i.e., how susceptible the network is to the attack), we measure the performance of the model on the target class B once trained on the original dataset (baseline) and once on the tampered dataset (tampered).

Figure 5 shows the confusion matrices for all the different models we applied to the CIFAR-10 dataset. Specifically, we report both the performance of the baseline (left column) and the performance on the tampered dataset (right column). Note that the full confusion matrices convey no additional information with respect to the cropped versions reported for all models but BCNN. In fact, since the tampering has been performed on the classes with indices 0 and 1, the relevant information for this experiment is located in the first two rows, which are shown in Figs. 5c–l. One can perform a qualitative evaluation of the strength of the tampering by comparing the confusion matrices of models trained on tampered data (Fig. 5, right column) with the optimal result shown in Fig. 3b.

Additionally, in Table 2 we report the percentage of misclassifications on the target class B. Recall that class B is tampered only in the test set, whereas class A is tampered in the training and validation sets.
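For reference, the metric of Table 2 can be derived from a confusion matrix as in the following sketch (the helper name is illustrative):

```python
def class_misclassification_rate(cm, class_index):
    """Percentage of test samples of `class_index` that are NOT classified
    as `class_index`, given a confusion matrix `cm` (NumPy array) with
    true classes on the rows."""
    row = cm[class_index]
    return 100.0 * (row.sum() - row[class_index]) / row.sum()

# e.g. misclassification of class B (index 1), baseline vs. tampered run:
# print(class_misclassification_rate(cm_baseline, 1),
#       class_misclassification_rate(cm_tampered, 1))
```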

The baseline performances are in line with what one would expect from these models, i.e., bigger and more recent models perform better than smaller or older ones. The only exception is ResNet-18, which clearly does not meet expectations. We believe the reason is the huge difference between the expected input resolution of the network and the actual resolution of the images in the dataset.

When considering the models that were trained on the tampered data, it is clearly visible that the performances differ significantly from those of the models trained on the original data. Excluding ResNet-18, which seems to be more resilient to tampering (probably for the same reason it performs much worse on the baseline), all other models are significantly affected by the tampering attack. Smaller models such as BCNN, AlexNet, VGG-16 and SIRRN tend to misclassify class B almost all the time, with misclassification rates ranging from \(74.1\%\) to \(98.9\%\). In contrast, DenseNet-121, which is a much deeper model, seems to be less prone to being deceived by the attack. Note, however, that this model has a much stronger baseline, and when put in perspective with it, class B gets misclassified \({\sim }24\) times more often than on the baseline.

Table 2. List of results for each model on both datasets. The metric presented is the percentage of misclassified samples of class B. Recall that class B is the class which is tampered in the test set but not in the train/validation sets (that would be class A). A low percentage in the baseline column indicates that the network performs well, as intended in the original classification problem formulation. A high percentage in the tampering column indicates that the network got fooled and performs poorly on the altered class. The higher the delta between the baseline and tampering columns, the stronger the effect of the tampering on the network architecture.

6 Discussion

The experiments shown in Sect. 5 clearly demonstrate that one can completely change the behavior of a network by tampering with just one single pixel of the images in the training set. This tampering is hard to see with the human eye and yet very effective against all six standard network architectures that we used.

We would like to stress that, despite these being preliminary experiments, they prove that the behavior of a neural network can be altered by tampering only with the training data, without requiring access to the network. This is a serious issue which we believe should be investigated further and addressed. While we experimented with a single-pixel attack—which is reasonably simple to defend against (see Sect. 6.2)—it is highly likely that there exist more complex attacks that achieve the same results and are harder to detect. Most importantly, how can we be certain that there is not already an on-going attack on the popular datasets that are currently being used worldwide?

6.1 Limitations

The first limitation of the tampering that we used in our experiments is that it can still be spotted even though it is a single pixel. One needs to be very attentive to see it, but it is still possible.

Attention in neural networks [56] is known to highlight the portions of an input which contribute the most towards a classification decision. Such visualizations could reveal the existence of the tampered pixel. However, one would need to check several examples of all classes to look for alterations, which would be cumbersome and very time consuming. Moreover, if the noisy pixel were carefully located in the center of the object, it would be undetectable through traditional attention.
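As a rough illustration, a simple gradient-based saliency map—used here as a stand-in for the attention methods cited above, not the mechanism of [56]—could be computed as follows; consistently large saliency at one fixed location across many images of a class would hint at tampering.

```python
import torch

def saliency_map(model, image, label):
    """Gradient of the class score w.r.t. the input, reduced over channels.

    image: tensor of shape (C, H, W); label: integer class index.
    Returns an (H, W) map of per-pixel input sensitivity."""
    model.eval()
    x = image.unsqueeze(0).clone().requires_grad_(True)
    score = model(x)[0, label]
    score.backward()
    # Maximum absolute gradient over the colour channels, per pixel.
    return x.grad[0].abs().max(dim=0).values
```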

Another potential limitation concerns the network architecture, namely the use of certain types of pooling. Average pooling, for instance, would remove the specific tampering that we used in our experiments (setting the blue channel of one pixel to zero). Other traditional methods might be unaffected; further experiments are required to assess the susceptibility of the various network architectures to this type of attack.

A very technical limitation is the file format of the input data. In particular, the JPEG format and other compressed image formats that use quantization could remove the tampering from the image.

Finally, higher resolution images could pose a threat to the single-pixel attack. We have conducted very rough and preliminary experiments on a subset of the ImageNet dataset, which suggest that the minimal number of attacked pixels needs to be increased to achieve the same effectiveness on higher resolution images.

6.2 Type of Defenses

A few strategies can be used to try to detect and prevent this kind of attack. Actively looking at the data and examining several images of all classes would be a good start, but it provides no guarantee and is definitely impractical for big datasets.

Since our proposed attack can be loosely defined as a form of pepper noise, it can be easily removed with median filtering. Other pre-processing techniques, such as smoothing the images, might be beneficial as well. Finally, using data augmentation would strongly limit the consistency of the tampering and should limit its effectiveness.
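A median-filtering defense of this kind can be sketched in a few lines (using SciPy; the filter size is an illustrative choice):

```python
from scipy.ndimage import median_filter

def median_denoise(images, size=3):
    """Apply a spatial median filter to a batch of images of shape
    (N, H, W, C); this kind of pepper-noise removal would wipe out the
    single-pixel tampering described in Sect. 3."""
    # Filter only over the spatial dimensions (H, W), not batch or channels.
    return median_filter(images, size=(1, size, size, 1))
```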

6.3 Future Work

Future work includes more in-depth experiments on additional datasets and with more network architectures, to gather insight on which tasks and training setups are subject to this kind of attack.

The current setup can prevent a class A from being correctly recognized if it is no longer tampered, and can make a class B be recognized as class A. This setup could probably be extended to allow the intentional misclassification of class B as class A while still recognizing class A, to reduce the chances of detection, especially in live systems.

One idea to extend this approach is to tamper only half of the images of a given class A and then also provide a deep pre-trained classifier for this class. If others use the pre-trained classifier without modifying the lower layers, which contain mid-level representations typically useful to recognize, e.g., “access” vs. “no access allowed”, it could happen that one can always gain access by presenting the modified pixel in the input images. This goes in the direction of model tampering discussed in Sect. 2.4.

Furthermore, more investigation into advanced tampering mechanisms should be performed, with the goal of identifying algorithms that can alter the data in a way that works even better across various network architectures, while also being robust against some of the limitations discussed earlier.

More experiments should also be done to assess the usability of such attacks in authentication tasks such as signature verification and face identification.

7 Conclusion

This paper is a proof of concept with which we want to raise awareness of the widely underestimated problem of training a machine learning system on poisoned data. The evidence presented in this work shows that datasets can be successfully tampered with using modifications that are almost invisible to the human eye but can successfully manipulate the performance of a deep neural network.

The experiments presented in this paper demonstrate the possibility of making one class misclassified, or even of making one class be recognized as another. We successfully tested this approach on two standard benchmark datasets with six different neural network architectures.

The full extent of the potential of integrity attacks on the training data, and whether this can result in a real danger for machine learning practitioners, requires more in-depth experiments to be fully assessed.