1 Introduction

With the growth of the internet, more and more people share and disseminate large amounts of personal data, be it on webpages, in social networks, or through personal communication. Steadily growing computational power, advances in machine learning, and the growth of the internet economy have created strong revenue streams and a thriving industry built on monetising user data. It is clear that visual data contains private information, yet the privacy implications of this data dissemination are unclear, even for computer vision experts. We aim for a transparent and quantifiable understanding of the loss in privacy incurred by sharing personal data online, both for the uploader and for other users who appear in the data.

Fig. 1. An illustration of one of the scenarios considered: can a vision system recognise that the person in the right image is the same as the tagged person in the left images, even when the head is obfuscated?

In this work, we investigate the privacy implications of disseminating photos of people through social media. Although social media data allows a person to be identified via different data types (timeline, geolocation, language, user profile, etc.) [1], we focus on the pixel content of an image. We want to know how well a vision system can recognise a person in social photos (using the image content only), and how well users can control their privacy by limiting the number of tagged images or by adding varying degrees of obfuscation (see Fig. 1) to their heads.

An important component for extracting maximal information from visual data in social networks is to fuse different data and provide a joint analysis. We propose our new Faceless Person Recogniser (described in Sect. 5), which not only reasons about individual images, but uses graph inference to deduce identities in a group of non-tagged images. We study the performance of our system on multiple privacy-sensitive user scenarios (described in Sect. 3), analyse the main results in Sect. 6, and discuss implications and future work in Sect. 7. Since we focus on the image content itself, our results are a lower bound on the privacy loss resulting from sharing such images.

Our contributions are:

  • We discuss dimensions that affect the privacy of online photos, and define a set of scenarios to study the question of privacy loss when social media images are aggregated and processed by a vision system.

  • We propose our new Faceless Person Recogniser, which uses convnet features in a graphical model for joint inference over identities.

  • We study the interplay and effectiveness of obfuscation techniques with regard to our vision system.

2 Related Work

Nowadays, essentially all online activities can potentially be used to identify an internet user [1]. The privacy of users in social networks is a well-studied topic in the security community [14]. Some works consider the relationship between privacy and photo sharing activities [5, 6], yet they do not perform quantitative studies.

Camera Recognition. Some works have shown that it is possible to identify the camera that took the photos (and thus link photos and events via the photographer), either from the file itself [7] or from recognisable sensor noise [8, 9]. In this work we focus exclusively on the image content, and leave the exploitation of image content together with other forms of privacy cues (e.g. additional meta-data from the social network) for future work.

Image Types. Most previous work on person recognition in images has focused either on face images [10] (mainly frontal heads) or on the surveillance scenario [11, 12], where the full body is visible, usually at low resolution. Like other areas of computer vision, recent years have seen a shift from classifiers based on hand-crafted features and metric learning approaches [13–19] towards methods based on deep learning [20–27]. Different from the face recognition and surveillance scenarios, the social network images studied here tend to show a diverse range of poses, activities, points of view, scenes (indoors, outdoors), and illumination. This increased diversity makes recognition more challenging, and only a handful of works have explicitly addressed this scenario [28–30]. We construct our experiments on top of the recently introduced PIPA dataset [29], discussed in Sect. 4.

Recognition Tasks. The notion of “person recognition” encompasses multiple related problems [31]. Typical “person recognition” considers a few training samples over many different identities, and a large test set. It is thus akin to fine-grained categorisation. When only one training sample is available and there are many test images (typical for face recognition and surveillance scenarios [10, 12, 32]), the problem is usually named “re-identification”, and it becomes akin to metric learning or ranking problems. Other related tasks are, for example, face clustering [26, 33], finding important people [34], or associating names in text to faces in images [35, 36]. In this work we focus on person recognition with, on average, 10 training samples per identity (and hundreds of identities), as in a typical social network scenario.

Cues. Given a rough bounding box locating a person, different cues can be used to recognise a person. Much work has focused on the face itself ([20–27, 37] to name a few recent ones). Pose-independent descriptors have been explored for the body region [28–30, 38, 39]. Various other cues have been explored, for example: attribute classification [40, 41], social context [42, 43], relative camera positions [44], space-time priors [45], and photo-album priors [46]. In this work, we build upon [30], which fuses multiple convnet cues from the head, the body, and the full scene. As we will discuss in the following sections, we also indirectly use photo-album information.

Identity Obfuscation. Some previous works have considered the challenges of detection and recognition under obfuscation (e.g. see Fig. 1). Recently, [47] quantified the decrease in Facebook face detection accuracy with respect to different types of obfuscation, e.g. blur, blacking-out, swirl, and dark spots. However, in principle, obfuscation patterns can put faces at a higher risk of detection by a detector fine-tuned to them (e.g. a blur detector). Unlike their work, we consider the identification problem with a system adapted to obfuscation patterns. Similarly, a few other works have studied face recognition under blur [48, 49]. However, to the best of our knowledge, we are the first to consider person recognition under head obfuscation using a trainable system that leverages full-body cues.

3 Privacy Scenarios

We consider a hypothetical social photo sharing service user. The user has a set of photos of herself and others in her account. Some of these photos carry identity tags, while the others do not. We assume that all heads in the test photos have been detected, either by an automatic detection system, or because a user is querying the identity of a specific head. Note that we assume neither that the faces are visible nor that the persons are in a frontal, upright pose. A “tag” is an association between a given head and a unique identifier linked to a specific identity (social media user profile).

Fig. 2. Obfuscation types considered.

Goal. The task of our recognition system is to identify a person of interest (marked via their head bounding box), by leveraging all the photos available (both with and without identity tags). In this work, we want to explore how effective different strategies are at protecting the user's identity.

We consider four different dimensions that affect how hard or easy it is to recognise a user:

Number of Tagged Heads. We vary the number of tagged images available per identity. The more tagged images available, the easier it should be to recognise someone in new photos. In the experiments of Sects. 5 and 6 we consider that 1\(\sim \)10 tagged images are available per person.

Obfuscation Type. Users concerned with their privacy might take protective measures by blurring or masking their heads. Other than the fully visible (non-obfuscated) case, we consider three obfuscation types, shown in Fig. 2 (see also the sketch after this list of dimensions). We consider both black and white fill-in, since [47] showed that commercial systems might react differently to these. The blurring parameters are chosen to resemble the YouTube face blur feature.

Amount of Obfuscation. Depending on the user's activities (and on her friends posting photos of her), not all photos might be obfuscated. We consider a variable fraction of obfuscated heads.

Domain Shift. For the recognition task, it matters whether all photos belong to the same event, where the appearance of people changes little, or whether the set of photos without tags corresponds to a different event than the ones with identity tags. Recognising a person when the clothing, context, and illumination have changed (“across events”) is more challenging than when they have not (“within events”).
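To make the obfuscation types concrete, the following is a minimal sketch of how blur, black, and white fill-in could be applied to a head bounding box. The function name, the OpenCV-based implementation, and the Gaussian kernel heuristic are our own illustrative assumptions; the exact blur parameters used in the experiments (chosen to resemble the YouTube face blur) are not reproduced here.

```python
import cv2
import numpy as np

def obfuscate_head(image, box, mode="blur"):
    """Apply one obfuscation type (Fig. 2) to a head bounding box.

    image: HxWx3 uint8 array; box: (x, y, w, h) head bounding box.
    The kernel size heuristic below is illustrative, not the paper's setting.
    """
    x, y, w, h = box
    head = image[y:y + h, x:x + w].copy()
    if mode == "blur":
        k = max(3, (w // 4) | 1)              # odd kernel size scaled with head width
        head = cv2.GaussianBlur(head, (k, k), 0)
    elif mode == "black":
        head[:] = 0                           # black fill-in
    elif mode == "white":
        head[:] = 255                         # white fill-in
    out = image.copy()
    out[y:y + h, x:x + w] = head
    return out
```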

Table 1. Privacy scenarios considered. Each row in the table can be applied to the “across events” and “within events” cases, and over different obfuscation types. See Sect. 3. The obfuscation fraction is measured over heads. Bold abbreviations are reused in follow-up figures. In scenario \(S_{1}^{\tau }\), \(\tau \in \{1.25,\,2.5,\,5,\,10\}\).

Based on these four dimensions, we discuss a set of scenarios, summarised in Table 1. Clearly, these only cover a subset of all possible combinations along the mentioned four dimensions. However, we argue that this subset covers important and relevant aspects for our exploration on privacy implications.

  • Scenario \(S_{0}\) . Here all heads are fully visible and tagged. Since all heads are tagged, the user is fully identifiable. This is the classic case without any privacy.

  • Scenario \(S_{1}\) . There is no obfuscation but not all images are tagged. This is the scenario commonly considered for person recognition, e.g. [28–30]. Unless otherwise specified we use \(S_{1}^{10}\), where an average of 10 instances of the person are tagged (average across all identities). This is a common scenario for social media users, where some pictures are tagged, but many are not.

  • Scenario \(S_{2}\) . Here the user has all of her heads visible, except for the one non-tagged head being queried. This would model the case where the user wants to conceal her identity in one particular published photo.

  • Scenario \(S_{3}\) . The user aims at protecting her identity by obfuscating all her heads (using any obfuscation type, see Fig. 2). Both tagged and non-tagged heads are obfuscated. This scenario models a privacy concerned user. Note that the body is still visible and thus usable to recognise the user.

  • Scenarios \(S_{3}^{\prime }\)&\(S_{3}^{\prime \prime }\) . These consider the case of a user who inconsistently applies the obfuscation tactic to protect her identity. Although on the surface these seem like different scenarios, if the visual information of the heads cannot be propagated from/to the tagged/non-tagged heads, then they are functionally equivalent to \(S_{3}\).

Each of these scenarios can be applied for the “across/within events” dimension. In the following sections we will build a system able to recognise persons across these different scenarios, and quantify the effect of each dimension on the recognition capabilities (and thus their implication on privacy). For our system, the tagged heads become training data, while the non-tagged heads are used as test data.

4 Experimental Setup

We investigate the scenarios proposed in Sect. 3 through a set of controlled experiments on a recently introduced social media dataset: PIPA (People In Photo Albums) [29]. In this section, we project the scenarios of Sect. 3 onto specific aspects of the PIPA dataset, describing how much realism can be achieved and what the possible limitations are.

PIPA Dataset. The PIPA dataset [29] consists of annotated social media photos from Flickr. It contains \(\sim \)40k images over \(\sim \)2k identities, and captures subjects appearing in diverse social groups (e.g. friends, colleagues, family) and events (e.g. conference, vacation, wedding). Compared to previous social media datasets, such as [28] (\(\sim \)600 images, 32 identities), PIPA presents a leap both in size and diversity. The heads are annotated with a bounding box and an identity tag. The individuals appear in diverse poses, points of view, activities, and sceneries, and thus cover an interesting slice of the real world. See examples in [29, 30], as well as Figs. 1 and 10.

One possible limitation of the dataset is that only repeating identities have been annotated (i.e. a subset of all persons appearing in the images). However, with a test set covering \(\sim \)13k instances over \(\sim \)600 identities (\(\sim \)20 instances/identity), it still presents a large enough set of identities to enable an interesting study and to derive relevant conclusions. We believe PIPA is currently the best public dataset for studying questions regarding privacy in social media photos.

Albums. From the Flickr website, each photo is associated with an album identifier. The \(\sim \)13k test instances are grouped in \(\sim \)8k photos belonging to \(\sim \)350 albums. We use the photo album information indirectly during our graph inference (Sect. 5.3).

Protocol. The PIPA dataset defines train, validation, and test partitions (\(\sim \)17k, \(\sim \)5k, \(\sim \)8k photos respectively), each containing disjoint sets of identities [29]. The train partition is used for convnet training. The validation data is used for component-wise evaluation of our system, and the test set for drawing final conclusions. The validation and test partitions are further divided into \(\text {split}_{0}\) and \(\text {split}_{1}\). Each \(\text {split}_{0/1}\) contains half of the instances for each identity in the validation and test sets (\(\sim \)10 instances/identity per split, on average).

Splits. When instantiating the scenarios from Sect. 3, the tagged faces are all part of \(\text {split}_{0}\). In \(S_{1}\), \(S_{2}\), and \(S_{3}\), \(\text {split}_{1}\) is never tagged. The task of our Faceless Person Recognition System is to recognise every query instance from \(\text {split}_{1}\), possibly leveraging other non-tagged instances in \(\text {split}_{1}\).

Domain Shift. Other than the original split defined in [29], [30] proposed additional splits with increasing recognition difficulty. We use the “Original” split as a good proxy for the “within events” case, and the “Day” split for “across events”. In the day split, \(\text {split}_{0}\) and \(\text {split}_{1}\) contain images of a given person taken on different days.

5 Faceless Recognition System

In this section, we introduce the Faceless Recognition System used to study the effectiveness of the privacy protective measures of Sect. 3. We choose to build our own baseline system, as opposed to using an existing system as in [47], for adaptability of the system to obfuscation and for reproducibility in future research.

Our system performs joint recognition employing a conditional random field (CRF) model. CRFs are often used for joint recognition problems in computer vision [42, 43, 50, 51]. The CRF enables the communication of information across instances, strengthening weak individual cues. Our CRF model is formulated as follows:

$$\begin{aligned} \underset{Y}{\arg \max }\,\,\frac{1}{\left| V\right| }\underset{i\in V}{\sum }\phi _{\theta }(Y_{i}|X_{i})+\frac{\alpha }{\left| E\right| }\underset{(i,\,j)\in E}{\sum }1_{\left[ Y_{i}=Y_{j}\right] }\psi _{\widetilde{\theta }}(X_{i},\,X_{j}) \end{aligned}$$
(1)

with observations \(X_{i}\), identities \(Y_{i}\) and unary potentials \(\phi _{\theta }(Y_{i}|X_{i})\) defined on each node \(i\in V\) (detailed in Sect. 5.1) as well as pairwise potentials \(\psi _{\widetilde{\theta }}(X_{i},\,X_{j})\) defined on each edge \((i,\,j)\in E\) (detailed in Sect. 5.2). \(1_{\left[ \cdot \right] }\) is the indicator function, and \(\alpha >0\) controls the unary-pairwise balance.
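As a concrete reading of Eq. 1, the following sketch evaluates the objective for one candidate labelling Y, given precomputed unary scores and pairwise match scores. It only illustrates the formula (maximisation over Y is delegated to the graph inference of Sect. 5.3); the variable names and data layout are our own assumptions.

```python
import numpy as np

def crf_objective(Y, unary, edges, pairwise, alpha=1e2):
    """Objective of Eq. 1 for a labelling Y.

    Y        : list of identity labels, one per node i in V
    unary    : (|V|, n_identities) array, unary[i, y] = phi_theta(y | X_i)
    edges    : list of (i, j) pairs, the edge set E
    pairwise : dict {(i, j): psi(X_i, X_j)} match scores
    alpha    : unary-pairwise balance (alpha = 1e2 in the paper)
    """
    unary_term = np.mean([unary[i, Y[i]] for i in range(len(Y))])
    if edges:
        pairwise_term = np.mean(
            [pairwise[(i, j)] * float(Y[i] == Y[j]) for (i, j) in edges])
    else:
        pairwise_term = 0.0
    return unary_term + alpha * pairwise_term
```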

Unary. We build our unary \(\phi _{\theta }\) upon a state of the art, publicly available person recognition system, \(\mathtt {naeil}\) [30]. The system was shown to be robust to decreasing number of tagged examples. It uses not only the face but also context (e.g. body and scene) as cues. Here, we also explore its robustness to obfuscation, see Sect. 5.1.

Pairwise. By adding pairwise terms over the unaries, we expect the system to propagate predictions across nodes (instances). When a unary prediction is weak (e.g. an obfuscated head), the system aggregates information from connected nodes with possibly stronger predictions (e.g. a visible face), and can thus deduce the query identity. Our pairwise term \(\psi _{\widetilde{\theta }}\) is a siamese network built on top of the unary features, see Sect. 5.2.

Experiments on the validation set indicate that, for all scenarios, the performance improves with increasing values of \(\alpha \), and reaches a plateau around \(\alpha =10^{2}\). We use this value for all experiments and analysis.

In the rest of the section, we provide a detailed description of the different parts and evaluate our system component-wise.

5.1 Single Person Recognition

We build our single person recogniser (the unary potential \(\phi _{\theta }\) of the CRF model) upon the state of the art person recognition system \(\mathtt {naeil}\) [30].

First, 17 AlexNet [52] cues are extracted from multiple regions (head, body, and scene) defined relative to the ground truth head boxes, and concatenated. We then train per-identity logistic regression models on top of the resulting \(4096\times 17\) dimensional feature vector; their outputs constitute the \(\phi _{\theta }(\cdot |X_{i})\) vector.
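A sketch of this unary pipeline is given below; the feature extractor is a placeholder for an AlexNet fc7 forward pass, and the use of scikit-learn's logistic regression is our own substitution for the per-identity models (the exact training setup follows \(\mathtt {naeil}\) [30]).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def unary_features(region_crops, extract_fc7):
    """Concatenate 17 AlexNet descriptors from head, body, and scene regions.

    region_crops: list of 17 image crops defined relative to the head box.
    extract_fc7:  placeholder for a forward pass returning a 4096-d vector.
    """
    feats = [extract_fc7(crop) for crop in region_crops]   # 17 x 4096
    return np.concatenate(feats)                            # 4096 * 17 dims

def train_unary_model(features_split0, identities_split0):
    """Logistic regression over identities, trained on the tagged examples (split_0)."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features_split0, identities_split0)
    return clf   # clf.predict_proba(x) plays the role of phi_theta(. | X_i)
```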

Fig. 3. Impact of head obfuscation on our unary term. Compared to the visible (non-obfuscated) case, it struggles under obfuscation (blur, black, and white); nonetheless, it remains far better than the naive baseline classifier that blindly predicts the most popular class. “Adapted” means the CNN models are trained for the corresponding obfuscation type. (Color figure online)

The AlexNet models are trained on the PIPA train set, while the logistic regression weights are trained on the tagged examples (\(\text {split}_{0}\)). For each obfuscation case, we also train new AlexNet models over obfuscated images (referred to as “adapted” in Fig. 3). We assume that at test time the obfuscation can be easily detected, and the appropriate model is used. We always use the “adapted” model unless otherwise stated.

Figures 3 and 4 evaluate our unary term over the PIPA validation set, under different obfuscation, within/across events, and with varying number of training tags. In the following, we discuss our main findings on how single person recognition is affected by these measures.

Fig. 4. Single person recogniser at different tag rates.

Fig. 5. Matching in social media.

Adapted Models Are Effective for Blur. When comparing “adapted” to “non-adapted” in Fig. 3, we see that adaptation of the convnet models is overall positive. It makes only a minor difference for black or white fill-in, but provides a good boost in recognition accuracy for the blur case, especially in the across events setting (a gain of over 5 percent points).

Robustness to Obfuscation. After applying black obfuscation in the within events case, our unary performs only slightly worse (from “visible” \(91.5\,\%\) to “black adapted” \(80.9\,\%\)). This is 80 times better than a naive baseline classifier (\(1.04\,\%\)) that blindly predicts the most popular class. In the across events case, the “visible” performance drops from \(47.4\,\%\) to \(14.7\,\%\) after black obfuscation, which is still more than 3 times as accurate as the naive baseline (\(4.65\,\%\)).

Black and White Fill-In Have Similar Effects. [47] suggests that white fill-in confuses a detection system more than black does. In our recognition setting, black and white fill-in have similar effects: \(80.9\,\%\) and \(79.6\,\%\) respectively, for the within events, adapted case (see Fig. 3). We thus omit the experiments for white fill-in obfuscation in the next sections.

The System Is Robust to a Small Number of Tags. As shown in Fig. 4, the single person recogniser is robust to a small number of identity tags. For example, in the within events, visible case, it still performs at \(69.9\,\%\) accuracy at a tag rate of 1.25 instances/identity, while with 10 instances/identity it achieves \(91.5\,\%\).

5.2 Person Pair Matching

In this section, we introduce a method for predicting matches between a pair of persons based on head and body cues. This is the pairwise term in our CRF formulation (Eq. 1). Note that person pair matching in social media context is challenging due to clothing changes and varying poses (see Fig. 5).

We build a siamese neural network to compute the match probability \(\psi _{\widetilde{\theta }}(X_{i},\,X_{j})\). A pair of instances is given as input; their head and body features are computed using the single person recogniser (Sect. 5.1), resulting in a \(2\times (2\times 4096)\) dimensional feature vector. These features are passed through three fully connected layers with ReLU activations, with a binary prediction at the end (match, no-match).
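A sketch of such a matching head in PyTorch is shown below, assuming head+body features from Sect. 5.1 as input; the hidden layer widths are our own guesses, since the exact architecture is left to the supplementary materials.

```python
import torch
import torch.nn as nn

class MatchHead(nn.Module):
    """Three fully connected layers over the concatenated features of a pair
    (2 x (2 x 4096) = 16384-d input), ending in a match/no-match prediction.
    Hidden widths are illustrative assumptions."""

    def __init__(self, feat_dim=2 * 2 * 4096, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),          # logits for (no-match, match)
        )

    def forward(self, feats_i, feats_j):
        # feats_i, feats_j: (batch, 2 * 4096) head+body features per instance
        return self.net(torch.cat([feats_i, feats_j], dim=1))

def match_probability(model, feats_i, feats_j):
    """psi(X_i, X_j): probability that the two instances share an identity."""
    with torch.no_grad():
        return torch.softmax(model(feats_i, feats_j), dim=1)[:, 1]
```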

We first train the siamese network on the PIPA train set, and then fine-tune it over \(\text {split}_{0}\), the set of tagged samples. We train three types of models: one for visible pairs, one for obfuscated pairs, and one for mixed pairs. Like for the unary term, we assume that obfuscation is detected at test time, so that the appropriate model is used. Further details can be found in the supplementary materials.

Fig. 6. Person pair matching on the set of pairs in photo albums. The numbers in parentheses are the equal error rates (EER). The “visible unary base” refers to the baseline where only unaries are used to determine a match.

Evaluation. Figure 6 shows the matching performance. We evaluate on the set of pairs within albums (used for graph inference in Sect. 5.3). The performance is measured as the equal error rate (EER), the accuracy at the score threshold where the false positive and false negative rates meet. The three obfuscation-type models are evaluated on the corresponding obfuscation pairs.
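The EER accuracy reported in Fig. 6 can be computed from match scores and ground-truth labels along the following lines; this is a standard computation, not code from the paper.

```python
import numpy as np

def eer_accuracy(scores, labels):
    """Accuracy at the score threshold where false positive and false negative
    rates meet. scores: predicted match probabilities, labels: 1 (match) / 0."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_gap, best_acc = np.inf, None
    for t in np.unique(scores):
        pred = scores >= t
        fpr = np.mean(pred[labels == 0])        # false positive rate
        fnr = np.mean(~pred[labels == 1])       # false negative rate
        if abs(fpr - fnr) < best_gap:
            best_gap = abs(fpr - fnr)
            best_acc = 1.0 - (fpr + fnr) / 2.0  # accuracy at the EER point
    return best_acc
```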

Fine-Tuning on \(split_{0}\) is Crucial. By fine-tuning on the tagged examples of query identities, matching performance improves significantly. For the visible pair model, \(\text {EER}\) improves from \(79.1\,\%\) to \(92.7\,\%\) in the within events setting, and from \(74.5\,\%\) to \(81.4\,\%\) in across events.

Unary Baseline. In order to evaluate whether the matching network has learned to predict matches better than its initialisation model, we consider the unary baseline. See “visible unary base” in Fig. 6. It first compares the unary predictions (argmax) for a given pair, and then determines its confidence using the prediction entropies. See the supplementary materials for more detail.
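One plausible realisation of this baseline is sketched below; the exact confidence formula is deferred to the supplementary materials, so the entropy-based combination here is an assumption on our part.

```python
import numpy as np

def unary_baseline_match(phi_i, phi_j):
    """phi_i, phi_j: unary probability vectors over identities for a pair.

    Predict a match if the argmax identities agree; derive a confidence that
    decreases with the prediction entropies (assumed combination).
    """
    def entropy(p):
        return -np.sum(p * np.log(p + 1e-12))

    is_match = int(np.argmax(phi_i) == np.argmax(phi_j))
    max_entropy = np.log(len(phi_i))
    confidence = 1.0 - (entropy(phi_i) + entropy(phi_j)) / (2.0 * max_entropy)
    return is_match, confidence
```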

The unary baseline performs marginally better than the visible pair model in the within events setting: \(93.7\,\%\) versus \(92.7\,\%\). In the across events setting, on the other hand, the visible pair model beats the baseline by a large margin: \(81.4\,\%\) versus \(67.4\,\%\) (Fig. 6). In practice, the system has no information on whether the query image is from within or across events. The system thus uses the pairwise trained model (visible pair model), which performs better on average.

General Comments. The matching network performs better under the within events setting than across events, and better for the visible pairs than for mixed or black pairs. See Fig. 6.

5.3 Graph Inference

Given the unaries from Sect. 5.1 and the pairwise terms from Sect. 5.2, we perform joint inference for more robust recognition. The graph inference is implemented via PyStruct [53]. The results of the joint inference (for the black obfuscation case) are presented in Fig. 7, and discussed in the next paragraphs.

Fig. 7. Validation performance of the CRF joint inference in three scenarios, \(S_{1}\), \(S_{2}\), and \(S_{3}\) (see Sect. 3), under black fill-in obfuscation. After graph pruning, joint inference provides a gain over the unary in all scenarios.

Across-Album Edge Pruning. We introduce graph pruning strategies which make the inference tractable and more robust to noisy predictions. Some of the scenarios considered (e.g. \(S_{2}\)) require running inference for each instance in the test set (\(\sim \)6k for within events). To reduce the computational cost from days to hours, we prune all edges across albums. The resulting graph only has fully connected cliques within albums. The across-album edge pruning reduces the number of edges by two orders of magnitude.

Negative Edge Pruning. As can be seen in Fig. 7, simply adding pairwise terms (“unary+pairwise (no pruning)”) can hurt the unary-only performance. This happens because many pairwise terms are erroneous. This can be mitigated by only selecting confident (high precision, low recall) predictions from \(\psi _{\widetilde{\theta }}\). We found that keeping only positive pairs with \(\psi _{\widetilde{\theta }}(X_{i},\,X_{j})\ge 0.5\) works best (any threshold in the range \([0.4,\,0.7]\) works equally well). These are the “unary+pairwise” results in Fig. 7, which show an improvement over the unary case, especially for the across events setting. The main gain is observed for \(S_{2}\) (one obfuscated head) across events, where the pairwise term brings a jump from \(15\,\%\) to \(39\,\%\).
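The two pruning steps can be summarised in a short sketch: instances are first connected only within their album (across-album pruning), and an edge is then kept only if its predicted match probability passes the threshold (negative edge pruning). Function and variable names are our own, not from the released code.

```python
from itertools import combinations

def build_pruned_graph(instances, album_of, match_prob, threshold=0.5):
    """instances: instance ids; album_of: id -> album id; match_prob(i, j): psi score.

    Returns the edge list and weights after across-album and negative edge pruning.
    """
    by_album = {}
    for idx in instances:
        by_album.setdefault(album_of[idx], []).append(idx)

    edges, weights = [], []
    for members in by_album.values():
        for i, j in combinations(members, 2):   # clique within each album
            p = match_prob(i, j)                # pairwise term psi(X_i, X_j)
            if p >= threshold:                  # drop unconfident / negative edges
                edges.append((i, j))
                weights.append(p)
    return edges, weights
```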

Oracle Pairwise. To put the gains from the graph inference in context, we build an oracle case that assumes perfect pairwise potentials (\(\psi _{\widetilde{\theta }}(X_{i},\,X_{j})=1_{\left[ Y_{i}=Y_{j}\right] }\), where \(1_{\left[ \cdot \right] }\) is the indicator function and Y are the ground truth identities). We do not perform negative edge pruning here. The unaries are the same as for the other cases in Fig. 7. We can see that the “unary+pairwise” results reach over \(70\,\%\) of the oracle case “(oracle)” performance, indicating that the pairwise potential \(\psi _{\widetilde{\theta }}\) is rather strong. The cases where the oracle performs poorly (e.g. \(S_{3}\) across events) indicate that stronger unaries or better graph inference is needed. Finally, even if no negative edge is pruned, adding the oracle pairwise improves the performance, indicating that negative edge pruning is needed only when the pairwise term is imperfect.

Recognition Rates Are Far from Chance Level. After graph inference, all scenarios in the within events case reach recognition rates above \(80\,\%\) (Fig. 7a). In the across events case, both \(S_{1}\) and \(S_{2}\) are above \(35\,\%\) (Fig. 7b). These recognition rates are far above chance level (\(1\,\%\)/\(5\,\%\) within/across events, shown in Fig. 3). Only \(S_{3}\) (all user heads with black obfuscation) shows a drastic drop in recognition rate, where neither the unaries nor the pairwise terms bring much help. See the supplementary materials for more details on this section.

6 Test Set Results and Analysis

Following the experimental protocol in Sect. 4, we now evaluate our Faceless Recognition System on the PIPA test set. The main results are summarised in Figs. 8 and 9. We observe the same trends as in the validation set results discussed in Sect. 5. Figure 10 shows some qualitative results over the test set. We organise the results along the same privacy sensitive dimensions that we defined in Sect. 3 in order to build our study scenarios.

Fig. 8. Impact of number of tagged examples: \(S_{1}^{1.25}\), \(S_{1}^{2.5}\), \(S_{1}^{5}\), and \(S_{1}^{10}\).

Fig. 9. Co-recognition results for scenarios \(S_{1}^{10}\), \(S_{2}\), and \(S_{3}\) with black fill-in and Gaussian blur obfuscations (white fill-in matches the black results). (Color figure online)

Fig. 10. Examples of queries in the across events setting that are not identified using only the tagged (red boxes) samples, but are successfully identified by the Faceless Recognition System via joint prediction of the query and the non-tagged (white boxes) examples. A subset of both tagged and non-tagged examples is shown; there are \(\sim \)10 tagged and non-tagged examples originally. Non-tagged examples are ordered by match score against the query (closest match on the left). (Color figure online)

Amount of Tagged Heads. Figure 8 shows that even with only 1.25 tagged photos per person on average, the system can recognise users far better than chance level (naive baseline; the best guess before looking at the image). Even with such a small amount of training data, the system predicts \(56.8\,\%\) of the instances correctly within events and \(31.9\,\%\) across events, which is \(73\times \) and \(16\times \) higher than chance level, respectively. We see that even a few tags pose a threat to privacy, and thus users concerned about their privacy should avoid having (any of) their photos tagged.

Obfuscation Type. For both scenarios \(S_{2}\) and \(S_{3}\), Fig. 9 (and the results from Sect. 5.1) indicates the same privacy protection ranking for the different obfuscation types. From higher to lower protection, we have \(\text {Black}\approx \text {White}>\text {Blur}>\text {Visible}\). Although blurring does provide some protection, the machine learning algorithm still extracts useful information from that region. When our full Faceless Recognition System is in use, one can see (Fig. 9) that obfuscation helps, but only to a limited degree: e.g. \(86.4\,\%\) (\(S_{1}\)) to \(71.3\,\%\) (\(S_{3}\)) within events and \(51.1\,\%\) (\(S_{1}\)) to \(23.9\,\%\) (\(S_{3}\)) across events.

Amount of Obfuscation. We cover three scenarios: every head fully visible (\(S_{1}\)), only the test head obfuscated (\(S_{2}\)), and every head fully obfuscated (\(S_{3}\)). Figure 9 shows that within events, obfuscating either one (\(S_{2}\)) or all (\(S_{3}\)) heads is not very effective, compared to the across events case, where one can see larger drops for \(S_{1}\rightarrow S_{2}\) and \(S_{2}\rightarrow S_{3}\). Notice that unary performances are identical for \(S_{2}\) and \(S_{3}\) in all settings, but using the full system raises the recognition accuracy for \(S_{2}\) (since seeing the other heads allows the system to rule out identities for the obfuscated head). We conclude that within events, head obfuscation has only limited effectiveness; across events, only blacking out all heads seems truly effective (\(S_{3}\) black).

Domain Shift. In all scenarios, the recognition accuracy is significantly worse in the across events case than within events (a drop of roughly \(50\,\%\) in accuracy across all other dimensions). For a user, it is a better privacy policy to make sure no tagged heads exist for the same event than to black out all of her heads in the event.

7 Discussion and Conclusion

Within the limitations of any study based on public data, we believe the results presented here offer a fresh view on the capabilities of machine learning to enable person recognition in social media under adversarial conditions. From a privacy perspective, these results should raise concern. We show that, when using state of the art techniques, blurring a head has limited effect. We also show that only a handful of tagged heads are enough to enable recognition, even across different events (different day, clothes, poses, points of view). In the most aggressive scenario considered (all user heads blacked-out, tagged images from a different event), the recognition accuracy of our system is \(12\times \) higher than chance level. It is very probable that undisclosed systems similar to the ones described here already operate online. We believe it is the responsibility of the computer vision community to quantify and disseminate the privacy implications of the images users share online. This work is a first step in this direction. We conclude by discussing some future challenges and directions regarding the privacy implications of social visual media.

Lower Bound on Privacy Threat. The current results focus solely on the photo content itself and therefore provide a lower bound on the privacy implications of posting such photos. It remains future work to explore an integrated system that also exploits the images' meta-data (timestamp, geolocation, camera identifier, related user comments, etc.). In the era of “selfie” photos, meta-data can be as effective as head tags. Younger users also tend to cross-post across multiple social media platforms, and make greater use of video (e.g. Vine). Using these data forms will require developing new techniques.

Training and Test Data Bounds. The performance of recent feature learning and inference techniques is strongly coupled with the amount of available training data. Person recognition systems like [20, 26, 27] all rely on undisclosed training data on the order of millions of training samples. Similarly, the evaluation of privacy issues in social networks requires access to sensitive data, which is often not available to the public research community (for good reasons [1]). The PIPA dataset [29] used here serves as a good proxy, but has its limitations. It is an emerging challenge to keep representative data in the public domain in order to model the privacy implications of social media and keep up with the rapidly evolving technology that is enabled by such sources.

From Analysing to Enabling. In this work, we focus on the analysis aspect of person recognition in social media. In the future, one would like to translate such analyses to actionable systems that enable users to control their privacy while still enabling communication via visual media exchanges.