1 Introduction

The Internet World Stats (IWS) 2019 reported that over 56% of the global population relied on the Internet for their daily lives and online business [18]. Living in cyber-physical systems (CPS), where the boundary between the physical and cyber worlds is disappearing, people need to disclose personal data to many external entities (organizations and other people) to use their services and to maintain social connections. In addition, service providers often encourage their customers to disclose more personal data in exchange for added values, e.g., special discounts or more personalized services. As a result, many people have their personal data spread over many services and frequently become victims of data breach incidents.

The importance of protecting user privacy has also led to the widely accepted concept of “Privacy by Design” (PbD). The PbD concept has been officially recognized by new data protection and privacy laws such as the EU General Data Protection Regulation (GDPR), which came into force in May 2018 and explicitly lists “data protection by design and by default” (Article 25) among its principles. In the most developed version of PbD, consisting of seven principles [6], the two principles “Respect for User Privacy” and “Visibility and Transparency” highlight the requirement of keeping privacy user-centric, operationally transparent and visible. Existing privacy protection mainly relies on organization-facing solutions such as data leakage/loss prevention (DLP) [12]. With a focus on user-centric design, privacy-enhancing technologies (PETs) have been developed to address privacy issues in different applications, such as online social networks (OSNs) [35], cloud computing platforms [23], mobile operating systems [19, 24] and Internet of Things (IoT) [7, 23, 28] environments. Despite the existing methods proposed for user-centric privacy, we have observed a general lack of universal frameworks that cover personal preference management, trade-offs between privacy risks and value enhancement, and behavioral nudging. This paper helps fill the gap by proposing such an “all-in-one” framework with the following key features:

  • Easy bootstrapping of the system from flexible user inputs and a (normally automated) collection of historical data disclosures.

  • User-centricity achieved based on data disclosure behaviors of “me” (the owner and user of the framework) collected completely at the client side, i.e., on his/her local computing device(s).

  • Being completely service-independent as it does not introduce dependency on any external online services or a new remote service. This is important to make the solution completely user-centric and under the user’s full control.

  • Trade-off analysis between privacy and added value throughout the whole process, from personal preference management, through joint privacy risk-value analysis, to behavioral nudging towards a better balance between the two aspects.

  • Use of a computational ontology to enable automatic reasoning about data and value flows, for the purposes of joint privacy risk-value analysis and nudge construction.

  • Human-in-the-loop design enabling natural human-machine teaming via human behavioral monitoring and nudging based on technical tools.

The rest of the paper is organized as follows. Section 2 reviews related work, and the design of the proposed framework is explained in Sect. 3. Then, a case study on privacy protection of leisure travelers’ data in Sect. 4 illustrates how the framework can be used to help a traveler decide on disclosing personal information for added values. This echoes the aim of our ongoing project, PriVELT (https://privelt.ac.uk/), to develop a user-centric platform based on travelers’ privacy-related behaviors that can effectively nudge travelers to share their personal data more sensibly. Finally, Sect. 5 concludes this paper with future work.

2 Related Work

To design a user-centric framework for data privacy protection and value enhancement, we review related work on privacy preference learning and profiling, privacy risk assessment, and privacy nudging.

2.1 Privacy Preference Learning and Profiling

Privacy means different things to different people. Since Westin proposed to segment consumers into Fundamentalists, Unconcerned and Pragmatists [32], researchers have shown interest in privacy segmentation and have developed user segmentations from different perspectives [8, 13, 21, 27]. As the classic segmentations (Westin’s Index and its variants) have been questioned for their ability to predict people’s actual behaviors, contextual factors and demographic variables have been analyzed and used to cluster human behaviors [8, 20, 22, 32]. Segmenting customers based on their data disclosure behaviors can help system developers understand their customers better, and customize and deliver privacy settings according to the predicted user preferences. For instance, participants were asked to rank statements about privacy behaviors in technology services [15]. In addition, different sequences of data requests were tested to increase prediction accuracy [34]. Through developing the location-sharing system “Locate!”, participants were observed while sharing real-time locations at different accuracy levels, and the impacts of request categories (social connections, etc.) on users’ location-sharing willingness were evaluated [21]. Regarding online advertisers, Kumaraguru et al. concluded that participants’ willingness to disclose was affected by data sensitivity, perceived benefits and relevance to advertisements [14]. Among mobile users, four segments were identified based on reported comfort levels with the permissions requested by mobile apps and the claimed purposes [15]. Similar methods have been applied to study users’ preferences in Internet of Things (IoT) environments [17] and on online social network (OSN) platforms [33]. Beyond profiling users by their privacy preferences, we found a lack of analysis of the added values earned by disclosing data to service providers, and insufficient work on adaptive preference management based on previous disclosures.

2.2 Privacy Risk Assessment

Privacy risks are normally analyzed via a privacy impact assessment (PIA), a systematic assessment incorporated into decision-making for privacy management purposes [30]. In a PIA template, each privacy risk can be evaluated by combining the magnitude of the impact it can cause and its likelihood [29]. Specifically, to assess data privacy impacts caused by data disclosures, the relevant processes are modeled with personal characteristics and contextual features. For instance, Alnemr et al. designed a data protection impact assessment (DPIA) tool based on a legal analysis of the General Data Protection Regulation (GDPR) and an evaluation of privacy issues in cloud deployments [3]. Noticing that sensitive attributes can be collected, accumulated and used on smart grids, Seto implemented PIA procedures for smart grids in the United States and demonstrated that they could effectively visualize privacy risks related to specific activities [26]. To address the risks in publicly released datasets, Ali-Eldin et al. designed a scoring system containing a privacy risk indicator and privacy risk mitigation measures for data releases [4]. Since privacy risks can be caused by data disclosures to external entities (e.g., organizations and other users), we will describe a high-level model in Sect. 3.2 to capture possible data flows among different entities while consumers are using online and physical services.

2.3 Privacy Nudging

Behavioral nudging refers to the use of interface elements aiming to guide user behaviors when people are required to make judgements and decisions [31]. Since human decision making is influenced by biases and heuristics, behavioral nudging aims to help users make better decisions without restricting their options [25]. The effects of nudging on privacy-related outcomes, such as the willingness to disclose information or the likelihood to transact with an e-commerce website, have been studied in various contexts [9]. Previous studies have suggested the wide use of technical nudging interventions to assist users in security and privacy decisions [1]. Nudging dashboards have also been seen as the core of transparency-enhancing technologies (TETs) that enable users to make decisions based on sufficient information [2, 5, 10, 11]. Therefore, any user-centric privacy protection system should explicitly consider how such unavoidable behavioral nudges are implemented at the user interface level and try to provide ethical choices for the user’s benefit and with their full awareness.

3 The Proposed Framework

The proposed privacy-aware personal data management framework is user-centric and service-independent, designed by following the “human-in-the-loop” concept. By “user-centric” we mean that the framework has a central entity “me” (the user being served), whose data disclosure behaviors are monitored by technical tools. By “service-independent” we mean that the framework runs completely on the client side, without dependency on service providers; all processes are carried out in such a way that no private or sensitive data are shared with any existing or new remote service, so no additional privacy issues arise. The “human-in-the-loop” concept refers to the fact that the human user (“me”) provides preferences on privacy protection and added values, while a higher level of personalization is achieved via incremental and dynamic user profiling from historical disclosures, thus achieving “user-centricity”.

Fig. 1. The proposed framework with an example traveler-facing implementation as a mobile app (Color figure online)

Figure 1 illustrates the high-level design of the framework with an example implementation as a mobile app. The overall aim is to guide the user (“me”) to manage his/her behaviors around personal data disclosures based on a better understanding of what data have been disclosed to whom, for what purposes, when, for how long, where, and what privacy risks and added values such disclosures could bring. From the user’s perspective, a typical implementation of the framework is an application running on the user’s own computing devices, e.g., a mobile app downloaded to a smartphone to help an individual traveler manage data disclosure activities while traveling. Specifically, a system built on the proposed framework has two processes running in parallel to achieve the expected functionality:

1. Operational process. This process begins with the setting of the personal preference on “privacy + value” and the arrival of data disclosure behaviors from using external services. Based on these inputs, a joint privacy risk-value assessment component is triggered to conduct a joint privacy risk-value analysis (1–3–4). The analysis is based on a data flow knowledge base (5–6), which is equipped with a computational ontology covering semantic data flows between different entities (“me”, other entities that may consume “my” data, and senders of “values” as returns for the data shared). Then, based on the user’s current privacy-value preferences, real-time results of the joint privacy-value assessment are presented to the user (7–8).

2. Learning process. By running an incremental learning process in parallel with the operational process, configured settings such as initial preferences and nudging templates can be adapted over time. As shown in Fig. 1, stored data disclosures are divided into “historical disclosures” and “real-time disclosures” based on a pre-set time boundary: as data gradually lose their timeliness, relatively outdated records are labelled as “historical disclosures”, i.e., previous disclosure behaviors. By monitoring which items were mostly released for added values (e.g., more discounts on Booking.com), the framework can learn whether the user’s privacy requirements need to be raised or lowered (11). In addition, how the user interacts with nudging elements should be analyzed and in turn affect the construction of nudges (9–10). A minimal code sketch of these two processes is given below.
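To make the two parallel processes more concrete, the following sketch (our own illustration rather than part of the framework’s specification) shows how an operational step and a learning step could share a disclosure store split by the pre-set time boundary; the class and function names, and the simple sensitivity-based risk model, are hypothetical.

```python
# Hypothetical sketch of the two parallel processes in Fig. 1.
# All names and the risk model are illustrative assumptions, not the paper's API.
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Disclosure:
    item: str             # e.g. "EMAIL", "MEDICAL_CERTIFICATE"
    recipient: str        # service the data item was disclosed to
    value_gained: float   # estimated added value of this disclosure, in [0, 1]
    timestamp: datetime

@dataclass
class DisclosureStore:
    boundary: timedelta = timedelta(days=30)      # pre-set time boundary
    records: list = field(default_factory=list)

    def historical(self, now):   # previous disclosure behaviors
        return [d for d in self.records if now - d.timestamp > self.boundary]

    def real_time(self, now):    # recent disclosure behaviors
        return [d for d in self.records if now - d.timestamp <= self.boundary]

def operational_step(store, preferences, now):
    """Joint privacy risk-value assessment of real-time disclosures (1-3-4, 7-8)."""
    report = []
    for d in store.real_time(now):
        risk = preferences["sensitivity"].get(d.item, 0.5)   # placeholder risk model
        report.append((d.item, d.recipient, risk, d.value_gained))
    return report   # rendered to the user as dashboard entries / nudges

def learning_step(store, preferences, now):
    """Adapt preferences from historical disclosures (9-11)."""
    for d in store.historical(now):
        if d.value_gained > 0.8:   # items the user routinely trades for high value
            current = preferences["sensitivity"].get(d.item, 0.5)
            preferences["sensitivity"][d.item] = max(0.0, current - 0.05)
    return preferences
```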

3.1 Preference Learning and Management

For a user-centric framework, personalization is key. Therefore, the framework needs to learn the user’s privacy concerns and his/her privacy-value preferences. We consider privacy risks caused by, and values gained from, data disclosures as conflicting aspects, so the framework aims at managing the user’s preferences by providing the right trade-off. By using the user’s preferences on privacy protection and value enhancement to configure the initial environment, each user (especially a layman) can quickly obtain a “baseline” profile in a particular context. Specifically, standard groupings for privacy protection and value enhancement can be pre-studied from sample users’ inputs via different channels, such as online surveys, offline interviews and public data from online social networks (OSNs). Then, machine learning algorithms can be applied to “categorize” sample individuals into different profiles [15, 36]. With such mappings stored in the “preference management” module, a new system (for a new user) can be bootstrapped by classifying the new user into an initial setting, given their “historical data” disclosed to other entities in the cyber-physical world (see the sketch below). Later on, the framework dynamically adapts to the user’s behavioral changes, which can lead to the user being allocated to a different group, or to a new group being created for the user if a new unique behavioral pattern is observed. For the purpose of joint privacy risk-value assessment, each such group is mapped to a profile, which can include parameters such as thresholds on acceptable trade-offs between privacy risks and value enhancement. By comparing with the “current preferences”, real-time nudges can be constructed for the user so he/she can learn and adapt his/her data disclosure behaviors accordingly. Such nudges can include what to do with a specific service and what service(s) to use among a number of options.
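As a minimal sketch of such bootstrapping, assuming scikit-learn is available, the code below clusters sample users’ normalized survey answers and assigns a new user to the nearest cluster’s baseline profile; the questions, numbers and threshold parameters are all invented for illustration.

```python
# Hypothetical bootstrapping sketch: assign a new user to a pre-studied
# privacy-value profile by clustering sample users' survey responses.
import numpy as np
from sklearn.cluster import KMeans

# Rows: sample users; columns: normalized answers to privacy/value questions.
sample_answers = np.array([
    [0.9, 0.8, 0.2],   # highly privacy-concerned
    [0.8, 0.9, 0.1],
    [0.2, 0.3, 0.9],   # value-seeking
    [0.1, 0.2, 0.8],
    [0.5, 0.5, 0.5],   # pragmatic
    [0.6, 0.4, 0.6],
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sample_answers)

# Each cluster (label assignment is arbitrary) maps to a baseline profile
# holding trade-off thresholds used later by the joint assessment.
profiles = {
    0: {"max_risk": 0.3, "min_value": 0.0},
    1: {"max_risk": 0.6, "min_value": 0.2},
    2: {"max_risk": 0.8, "min_value": 0.5},
}

# Bootstrapping a new user from his/her (flexible) inputs.
new_user = np.array([[0.7, 0.6, 0.3]])
cluster = int(kmeans.predict(new_user)[0])
baseline = profiles[cluster]
```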

Fig. 2. The ontological graph of data flows in the cyber-physical world

3.2 Joint Assessment on Privacy and Added Value

The joint privacy risk-value assessment process is centered around a computational ontology that covers data flows on a directed graph, describing how personal data of the user (“me”) can possibly flow through (i.e., may be disclosed to) different types of entities and how the returned value (as a special type of data) can flow back to (i.e., benefit) the user in a complicated cyber-physical world. As shown in Fig. 2, the ontology includes eight essential entity types (nodes) and a number of relation types (edges), modeled in a generic cyber-physical system data flow graph. Specifically, entities are categorized into three groups and colored differently: (1) physical entities (gray) that exist only in the physical world; (2) cyber entities (white) that exist only in the cyber world (from the user’s perspective); (3) hybrid entities (gradient) that may exist in the cyber world, the physical world, or both. Each relation type between two entity types carries either a semantic meaning (solid lines) or a flow (data flow: black dashed line; value flow: gray dashed line). Note that the graph only shows entity types and possible relations between them; to analyze privacy issues, an entity-level graph of concrete entities and relations is needed to support reasoning, which will be illustrated in Sect. 4 using a case study. Using graph notation, the data flow graph can be formalized as \(G=(V,E)\), where \(V=\{V_1,\ldots ,V_m\}\) is the set of nodes, each node \(V_i\) representing an entity type whose instances are treated in the same way in our model (depicted by ellipses), and \(E=\{E_1,\ldots ,E_n\}\) is the set of edges between nodes, representing two types of relations between entities: semantic relations (Type 1 edges) and data flows (Type 2 edges), depicted by solid and dashed arrows, respectively. Here \(m\) is the number of entity types shown in Fig. 2.

There are mainly two types of edges in the proposed graph. Type 1 edges refer to existing relations with semantic meanings that may or may not relate to personal data flows. For instance, the edge connecting the entity types P and D means that the special P entity “me” owns some personal data items. Unlike Type 1 edges, which help model the “evidence” about how and why data may flow among these entities, Type 2 edges (possible data flows) can cause immediate privacy impacts. Specifically, Type 2 edges refer to actual data flows from a source to a destination entity. Most such edges are accompanied by a Type 1 edge, because the latter constitutes the reason why the data flow can possibly occur. In the following, we use \(E_i\) to denote all Type 2 edges belonging to the same edge labeled by the number i in Fig. 2 (a minimal code sketch of reasoning over such flows follows the list):

  • E1: (DP, S) flows are normally the starting point for tracking data flows in the cyber-physical system, generated when online services are used.

  • E2: (S, O) flows from S to O entities due to the existence of Type 1 edges providedBy in between.

  • E3: (O, O) flows between O entities given the fact that one O entity has some relation with another, e.g., isPartOf, invest or collaborateWith.

  • E4: (S, S) flows between S entities due to data sharing relations between them, e.g., suppliedBy, poweredBy or outsourcedTo.

  • E5: (S, OA) flows from S to OA entities due to the existence of Type 1 edges create in between.

  • E6: (OA, OA) flows between OA entities given the fact that one online account is a friend of the other.

  • E7: (OA, P) flows from OA to P entities due to the existence of Type 1 edges account in between.

  • E8: (S, OG) flows from S to OG entities due to the Type 1 edges exist in between.

  • E9: (OG, OA) flows from OG to OA entities due to a specific privacy setting on OSNs, such as setting contents to be disclosed to “group members” only.

  • E10: (P, P) flows between P entities due to the existence of Type 1 edges know in between.

  • E11: (S, P) flows from S entities directly to a person (P) without going through an OA entity, e.g., a person can see public posts on an OSN platform.

  • E12a and E12b: (O, P) and (P, O) data flows between P and O entities in both directions, each of which is due to one or more semantic relations between P and O, e.g., a person works for an organization.
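To illustrate how such Type 2 edges support automatic reasoning, the sketch below (our own illustration; the entity and relation names are hypothetical) builds a tiny entity-level graph and computes, via a transitive closure over the flow edges, which entities a data item may eventually be disclosed to.

```python
# Hypothetical entity-level sketch of the data flow graph G=(V,E).
# Type 1 (semantic) edges explain *why* a flow may occur; the reasoner
# follows Type 2 (flow) edges to find which entities may see a data item.
from collections import deque

type1_edges = [   # (source, relation, target); relation names follow Fig. 2 loosely
    ("Agoda.service", "providedBy", "Agoda"),
    ("Agoda", "isPartOf", "BookingHoldings"),
]
type2_edges = [   # possible data flows (instances of E1, E2, E3, ...)
    ("DataPackage2", "Agoda.service"),    # E1: data submitted to a service
    ("Agoda.service", "Agoda"),           # E2: service -> providing organization
    ("Agoda", "BookingHoldings"),         # E3: organization -> parent organization
]

def reachable(data_node, flow_edges):
    """All entities a data item can possibly flow to (transitive closure)."""
    adj = {}
    for src, dst in flow_edges:
        adj.setdefault(src, []).append(dst)
    seen, queue = set(), deque([data_node])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(reachable("DataPackage2", type2_edges))
# contains Agoda.service, Agoda and BookingHoldings
```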

With the input of individual privacy preferences, (collected) data disclosure behaviors, and the entity-level ontological graph of the user’s data and value flows, a joint privacy risk-value assessment can be performed to detect potential privacy issues, measure impacts, and recommend mitigation solutions (a minimal scoring sketch follows this paragraph). In order to manage data sharing with multiple entities and to reduce privacy risks, a user-centric personal data management platform (PDMP) can be used, such as Solid (https://solid.mit.edu/), Hub-of-all-things (https://www.hubofallthings.com/) and digi.me (https://digi.me/), which allow users to manage their own data under their full control. Such platforms normally provide an interface for adding new features, e.g., data analytics and visualization tools can be added so that the user can gain more insights into their data. The user of our proposed framework can decide to use one or more such PDMPs so that some (or even all) data needed for privacy risk and value enhancement assessment are stored there rather than on local devices (see Fig. 3).
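The scoring sketch below is purely illustrative; the actual quantification of privacy risks and added values is left as future work (Sect. 5), and the formula, thresholds and numbers here are invented assumptions.

```python
# Hypothetical joint assessment: combine a data item's sensitivity and how far
# it can flow with the baseline thresholds from the user's profile.
def joint_assessment(item, sensitivity, reached_entities, value_gained, baseline):
    # Simple illustrative model: risk grows with sensitivity and with the
    # number of entities that may see the item; all scores are in [0, 1].
    exposure = min(1.0, len(reached_entities) / 10.0)
    risk = sensitivity * exposure
    issue = risk > baseline["max_risk"] and value_gained < baseline["min_value"]
    return {"item": item, "risk": risk, "value": value_gained, "issue": issue}

report = joint_assessment(
    item="MEDICAL_CERTIFICATE",
    sensitivity=0.9,
    reached_entities={"Agoda.service", "Agoda", "BookingHoldings"},
    value_gained=0.2,
    baseline={"max_risk": 0.3, "min_value": 0.5},
)
```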

Fig. 3. The proposed framework working with online PDMP(s) and local storage

3.3 Acting on Privacy Nudges

It can be argued that whatever we do with the user will have a nudging effect [1]. For the proposed framework, rather than focusing on privacy nudging alone, we propose to construct privacy-value nudges, i.e., nudges that can help the user find a better trade-off between privacy risks and added values related to data disclosure decisions. In order to construct such nudges properly, it is necessary to monitor the user’s actual data disclosure behaviors and preferences. This task is mainly implemented through the “Behavior analysis” component. Then, based on the learnt preferences, such nudges can be constructed to deliver the expected effects, i.e., proactively avoiding risky disclosures with knowledge of the added values to be lost. For instance, Fig. 4 shows an example design where two-leveled nudging is considered: the first level aims at privacy-value awareness enhancement, and mainly shows information such as “what privacy issues exist”, “what value I have gained at what privacy costs”, “where the privacy issues are”, and “to what extent I should care”; the second level can be triggered to show more active interventions such as “what options do I have” and “what can I do”. For the nudging contents presented on both levels, the following behaviors can be monitored and analyzed to identify suitable nudging models (a minimal monitoring sketch follows the list):

1. External behaviors refer to behavioral change(s) after each nudge, such as switching off “location sharing” on the smartphone or deleting an application after being presented with a nudge. Such behaviors are captured by the real-time behavioral data collection part of the framework.

2. Internal behaviors refer to behaviors performed on the user interfaces of the proposed framework, such as the number of times specific options such as “keep it” and “let me know more” are clicked. This type of data is collected directly by the software implementing the framework (see one example in Fig. 6).
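The sketch below illustrates, under invented names and data, how logged internal behaviors (option clicks) could be aggregated to estimate which nudge template a user acts on most often, feeding back into nudge construction (9–10).

```python
# Hypothetical sketch of analyzing internal behaviors (clicks on nudge options)
# to pick the nudge template a user responds to best; names are illustrative.
from collections import Counter

interaction_log = [
    # (nudge_template, option_clicked)
    ("risk_summary",  "let_me_know_more"),
    ("risk_summary",  "keep_it"),
    ("value_vs_risk", "delete_data"),
    ("value_vs_risk", "let_me_know_more"),
    ("value_vs_risk", "delete_data"),
]

ACTIONS = {"delete_data", "switch_off_sharing"}  # clicks counted as acting on a nudge

def action_rate_by_template(log):
    shown = Counter(template for template, _ in log)
    acted = Counter(template for template, option in log if option in ACTIONS)
    return {template: acted[template] / shown[template] for template in shown}

rates = action_rate_by_template(interaction_log)
best_template = max(rates, key=rates.get)   # feeds back into nudge construction
```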

Fig. 4. Example human interaction with two-leveled nudging

4 Case Study: Leisure Travelers

Using an example scenario about leisure travelers, we show how the proposed framework can help travelers manage their data privacy for a better trade-off between privacy risks and enhanced travel experience (i.e., value) when booking flights and accommodation through online services. By detecting possible data flows arising from the use of travel services, privacy risks and the enhanced travel experience gained by disclosing personal data can be quantified to help guide travelers.

In order to benefit from personalized services, personal data is often requested by travel service providers before, during and after travel. One example use case is depicted in Fig. 5, consisting of entities (rectangles), relations (solid lines) and data flows (dashed lines). Assuming that data submitted to use services (i.e., \(F_{1\text{-}1}\) and \(F_{2\text{-}1}\)) are always disclosed to service providers and their parent companies, data flows \(F_{2\text{-}2}\) and \(F_{2\text{-}3}\) always take place, so that Data Package 2 will be disclosed to Booking Holdings Inc. via its subsidiary Agoda, which provides the hotel booking service to the user directly. It is also important to consider special flows caused by more complex business models. For instance, the flight booking service on Booking.com is outsourced to GotoGate, which is owned by a different company, the Etraveli Group. However, assuming that the outsourcing contract always returns the user data to the requesting company (Booking.com in this case), data flows \(F_{1\text{-}1}\), \(F_{1\text{-}2}\), \(F_{1\text{-}3}\), \(F_{1\text{-}4}\) and \(F_{1\text{-}5}\) will take place, so that Booking Holdings Inc. will also see Data Package 1. Now we can see that a single company, Booking Holdings Inc., obtains a more complete picture of the user’s itinerary and travel preferences by combining Data Packages 1 and 2, which may not be known to the user if he/she does not know the business relationships between Agoda, GotoGate, Booking.com and their parent companies. This can create added values together with privacy concerns, e.g., Booking Holdings Inc. now knows more about the user and can do more personalized advertising. The sketch below illustrates this scenario in code.
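The following sketch encodes one plausible reading of the flows described above (the exact mapping of \(F_{1\text{-}2}\)–\(F_{1\text{-}5}\) to individual edges is our assumption) and shows that both data packages can reach Booking Holdings Inc.

```python
# Hypothetical instantiation of the Fig. 5 scenario; the edges below follow
# the F1-1..F1-5 and F2-1..F2-3 data flows described in the text.
flows = {
    "DataPackage1": ["Booking.com-flight-service"],          # F1-1
    "Booking.com-flight-service": ["GotoGate"],              # F1-2 (outsourcedTo)
    "GotoGate": ["EtraveliGroup", "Booking.com"],            # F1-3, F1-4 (data returned)
    "Booking.com": ["BookingHoldingsInc"],                   # F1-5
    "DataPackage2": ["Agoda-hotel-service"],                 # F2-1
    "Agoda-hotel-service": ["Agoda"],                        # F2-2
    "Agoda": ["BookingHoldingsInc"],                         # F2-3
}

def can_see(data_package, flows):
    """Entities the data package may be disclosed to (transitive closure)."""
    seen, frontier = set(), [data_package]
    while frontier:
        node = frontier.pop()
        for nxt in flows.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

overlap = can_see("DataPackage1", flows) & can_see("DataPackage2", flows)
print(overlap)   # {'BookingHoldingsInc'}: one company sees both packages
```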

Fig. 5. An example ontological graph in the leisure travel context

Now let us give some example user interfaces for a different privacy issue. As shown in Fig. 6, the first-level interface presents an overview of the data disclosure activities of some “monitored apps”, including a joint analysis of the privacy risks and value enhancement those data disclosure activities lead to. For example, it shows that Booking.com is deemed a “risky app” but that the user has also gained a lot of benefits by using it. In addition, the example interface allows the user to “check details without taking actions” by clicking a question mark, which leads the user to the second-level user interface. Given the knowledge base and the user’s disclosure behaviors, the system detects that a sensitive unit “MEDICAL_CERTIFICATE” could have flowed to two different company groups and over ten sub-companies, many of which are unknown to the user, due to special booking requirements. The system then labels this as a privacy issue after checking the user’s current privacy preferences. Being notified about this specific privacy issue, and after considering any enhanced travel experience it may bring, the user can choose to accept the risky disclosure or request immediate deletion of the data disclosed to some companies (which may mean losing special assistance during future travel), and can adapt his/her future data disclosure behaviors accordingly. The user’s choices are recorded to help personalize the user interface and future nudges, following the “human-in-the-loop” principle. A code-level sketch of this detection is shown below.
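The sketch below illustrates the detection behind this example; the subsidiary-to-parent mapping, the per-item threshold and the two second-level options are illustrative assumptions rather than the implemented logic.

```python
# Hypothetical detection rule: a sensitive item is flagged when it may reach
# more company groups than the user's current preference tolerates.
PARENT_OF = {                       # subsidiary -> parent company group
    "Agoda": "BookingHoldings",
    "Booking.com": "BookingHoldings",
    "GotoGate": "EtraveliGroup",
}

def company_groups(entities):
    return {PARENT_OF.get(e, e) for e in entities}

def check_issue(item, reached_entities, preference):
    groups = company_groups(reached_entities)
    issue = len(groups) > preference["max_groups"].get(item, 1)
    return {
        "item": item,
        "groups": groups,
        "issue": issue,
        # second-level options shown only if the first-level nudge flags an issue
        "options": ["keep it", "request deletion"] if issue else [],
    }

result = check_issue(
    item="MEDICAL_CERTIFICATE",
    reached_entities={"Agoda", "Booking.com", "GotoGate"},
    preference={"max_groups": {"MEDICAL_CERTIFICATE": 1}},
)
```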

Fig. 6. An example nudging dashboard and user interfaces

5 Conclusions and Future Work

In this paper, we have reported a user-centric and privacy-aware personal data management framework that allows a user to better manage his/her privacy when interacting with multiple services and people in the cyber-physical world, via an architecture covering user preference management, joint privacy risk-value assessment, and joint privacy-value nudging. We have illustrated the usefulness of the framework using a case study about leisure travelers. There are a number of key areas for further development of the proposed framework, which we leave as future work and briefly explain below.

Studying added values in different contexts. We will study in which forms such “added values” can be represented in real-world scenarios and how they relate to data disclosures (i.e., data flows), in order to enrich the computational knowledge base used in the proposed framework.

Profiling travelers on their preferred balance between data disclosure and value enhancement. To build a user-centric platform for privacy protection purposes, it is essential to learn what privacy risks and added values mean to different users. In future work we will look at two facets of the profile learning process: traveler profiling based on self-reported answers to privacy-related questions, and learning travelers’ preferences from actual behaviors, including past data disclosures and other behaviors when interacting with online services and with tools implementing the proposed framework.

Conceptualizing and quantifying privacy risks and added values. Based on data flow analysis, personal preferences and the semantic information in the knowledge base, we aim to study how to conceptualize and quantify privacy risks and added values. This will involve evolving ontological graph models and developing privacy risk indicators needed for different components such as the privacy nudging engine. Any indicators will need to cover both privacy risks and added values, and will need to be personalized if possible.

Constructing privacy nudges based on the user’s preferences. The construction of privacy nudges should be determined by the learnt personal preferences. While presenting the results of real-time disclosures, privacy nudges should give concrete and actionable recommendations, such as “which services bring more privacy risks in exchange for what added values” and “what can be done to mitigate the risks of such services”. To effectively support privacy-related decisions, we will conduct a number of user studies to design our privacy nudging strategies and user interface elements for an implementation of the proposed framework.