Access control technologies for Big Data management systems: literature review and future trends
Data security and privacy issues are magnified by the volume, the variety, and the velocity of Big Data and by the lack, up to now, of a reference data model and related data manipulation languages. In this paper, we focus on one of the key data security services, that is, access control, by highlighting the differences with traditional data management systems and describing a set of requirements that any access control solution for Big Data platforms may fulfill. We then describe the state of the art and discuss open research issues.
Keywords: Big Data, Access control, Privacy, NoSQL data management systems
The term Big Data refers to a phenomenon characterized by “5 V”. By analysing huge Volumes of data with a high Variety of formats, Big Data analytic platforms allow making predictions with high Velocity, thus in a timely manner, with high Veracity, therefore with low uncertainty, and with a high Value, namely, with an expected significant gain (Jin et al. 2015). As a matter of fact, business strategies are increasingly driven by the integrated analysis of huge volumes of heterogeneous data coming from different sources (e.g., social media, IoT devices).
This phenomenon has been pushed by numerous technological advancements. The most significant include the birth of NoSQL datastores (Cattell 2011), and distributed computational paradigms, like MapReduce (Dean and Ghemawat 2004), which have jointly opened the way to the management and systematic analysis of huge volumes of semi-structured data (e.g., transactions, electronic documents and emails).
Overall, the support provided by Big Data platforms for the storage and analysis of huge and heterogeneous datasets cannot find a counterpart within traditional data management systems. In addition, the advantages of these new systems are not only related to the outstanding flexibility and efficacy of the analysis services, as Big Data platforms outperform traditional systems even with respect to performance and scalability.
However, Big Data systems do not show the same level of excellence with respect to data protection features (Colombo and Ferrari 2015b). For instance, while a variety of data protection frameworks have been proposed for traditional systems (see e.g., Agrawal et al. (2002); Byun and Li (2008); Colombo and Ferrari (2014a; 2014b; 2015a); Ferrari (2010)), the majority of Big Data platforms integrate quite basic access control enforcement mechanisms (Colombo and Ferrari 2015b). As a result, the unconstrained access to high volumes of data from multiple data sources, the sensitive and private contents of some data resources, and the advanced analysis and prediction capabilities of Big Data analytic platforms might represent a serious threat. For instance, the analysis capabilities can be exploited to derive correlations between sensitive and personal data. As an example, let us consider the domain of fitness apps, which nowadays are more and more deployed on mobile and wearable devices and gym equipment. The joint analysis of movement data, heartbeats, and weight might allow profiling users' lifestyle and inferring users' inclination to pathologies. As a consequence, although the potential benefits of Big Data analytics are indisputable, the lack of standard data protection tools opens these services to potential attackers.
The definition of proper data protection tools tailored for Big Data platforms is a very ambitious research challenge. State of the art enforcement techniques proposed for traditional systems cannot be used as they are, or straightforwardly adapted to the Big Data context. This is mainly due to the required support for semi-structured and unstructured data (Variety), the quantity of data to be protected (Volume), and the very strict performance requirements (Velocity) affecting these systems. Therefore, the challenge is protecting privacy and confidentiality while not hindering data analytics and information sharing. Additional aspects contribute to raising the complexity of this goal, such as the variety of data models and of data analysis and manipulation languages used by Big Data platforms. Indeed, differently from RDBMSs, Big Data platforms are characterized by various data models (Cattell 2011), the most notable being the key-value, wide column, and document oriented ones.
In this paper, we focus on access control, by first identifying a set of requirements that any access control solution for Big Data platforms should address (cfr. “Requirements” section). Then, we classify and analyze the related literature (“State of the art”, “Platform specific approaches”, “Platform independent approaches” and “Domain specific Big Data approaches” sections), and discuss key research challenges (“Research issues” section). Finally, we conclude the paper in “Conclusions” section.
This paper is an invited extended version of a paper published in the proceedings of the 23rd ACM Symposium on Access Control Models and Technologies (SACMAT’18). The current version differs from the original conference paper in offering a wider and updated analysis of state of the art access control solutions for Big Data systems, which also takes into consideration domain specific platforms, and the related open research challenges.
Fine-grained access control. In terms of features the access control mechanism should support, fine grained access control (FGAC) has been widely recognized as one of the fundamental components for an effective protection of personal and sensitive data (e.g., see Agrawal et al. (2002); Rizvi et al. (2004)). Since data processed by Big Data analytics platforms often refer to users' personal characteristics, it is important that access control rules can be bound to data at the finest granularity levels. However, the related enforcement mechanisms need to be invented from scratch, as those proposed for traditional systems rely on data conforming to a known schema, while in the context of Big Data, data are heterogeneous and schemaless.
Context management. Another key aspect that should be considered is the support for context based access constraints, as these allow highly customized access control forms. For instance, they can be used to constrain access to specific time periods or geographical locations. In case contexts are used to derive access control decisions, access authorizations are granted when conditions referring to properties of the environment within which an access request has been issued are satisfied.
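As a minimal illustration of a context based constraint, the sketch below grants access only when the request is issued within a given time window and from an allowed country. All names (`Context`, `ContextRule`, the attribute names, and the sample values) are hypothetical and chosen only for this example; they are not taken from any of the surveyed frameworks.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Context:
    """Environment properties attached to an access request (hypothetical)."""
    local_time: time
    country: str


@dataclass(frozen=True)
class ContextRule:
    """Grants access only inside a time window and a set of countries."""
    start: time
    end: time
    allowed_countries: frozenset

    def permits(self, ctx: Context) -> bool:
        # Both contextual conditions must hold for the request to be granted.
        in_window = self.start <= ctx.local_time <= self.end
        return in_window and ctx.country in self.allowed_countries


# Hypothetical rule: access allowed during office hours from Italy or France.
office_hours = ContextRule(time(9, 0), time(18, 0), frozenset({"IT", "FR"}))

granted = office_hours.permits(Context(time(10, 30), "IT"))
denied_time = office_hours.permits(Context(time(22, 0), "IT"))
denied_place = office_hours.permits(Context(time(10, 30), "US"))
```

In a real deployment the context would be collected by the enforcement monitor at request time rather than supplied by the requester.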
Efficiency of access control. The characteristics of the Big Data scenario, such as the distributed nature of the considered platforms, the complexity of the queries, and the focus on performance, require access control enforcement strategies that do not compromise the usability of the hosting analytic frameworks. Indeed, based on the considered queries, the number of checks to be executed during access control enforcement can match or be even greater than the number of data records, and, in the Big Data scenario, data sets can include up to hundreds of millions of such records. This requires efficient policy compliance mechanisms. FGAC has been enforced in traditional relational DBMSs according to two main approaches. The first is the view-based one, where users are only allowed to access a view of the target dataset that satisfies the specified access control restrictions, whereas the second one is based on query rewriting. Under such an approach instead of pre-computing the authorized views, the query is modified at run-time by injecting restrictions imposed by the specified access control rules. It is therefore important to determine to what extent these approaches are suitable for the Big Data scenario and how they can be possibly customized or extended.
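The two enforcement styles can be contrasted on a toy relational example. The sketch below uses Python's standard `sqlite3` module; the table, the rule predicate, and the `rewrite` helper are assumptions made for illustration. Note that SQLite views are stored queries rather than materialized results, so the view-based half only approximates the pre-computation mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient TEXT, ward TEXT, diagnosis TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("ann", "cardiology", "arrhythmia"),
        ("bob", "oncology", "lymphoma"),
        ("cat", "cardiology", "hypertension"),
    ],
)

# Hypothetical access control rule: the subject may only read cardiology rows.
RULE_PREDICATE = "ward = 'cardiology'"

# 1) View-based enforcement: the authorized view is defined ahead of time and
#    the subject's queries are run against the view instead of the base table.
conn.execute(
    f"CREATE VIEW visits_authorized AS SELECT * FROM visits WHERE {RULE_PREDICATE}"
)
view_rows = conn.execute(
    "SELECT patient FROM visits_authorized ORDER BY patient"
).fetchall()


# 2) Query rewriting: the submitted query is modified at run time by injecting
#    the restriction imposed by the rule (naive string rewriting, toy only).
def rewrite(query: str, predicate: str) -> str:
    return query.replace(
        "FROM visits", f"FROM (SELECT * FROM visits WHERE {predicate})"
    )


submitted = "SELECT patient FROM visits ORDER BY patient"
rewritten_rows = conn.execute(rewrite(submitted, RULE_PREDICATE)).fetchall()
```

Both paths return the same authorized rows here; the difference lies in when the restriction is applied, which is what matters for the efficiency considerations discussed above.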
As it should be clear from the previous discussion, one of the main difficulties in developing an access control solution for Big Data platforms is the lack of a standard model and related manipulation languages to which access control rules and the related enforcement monitor can be bound.
State of the art
Platform specific approaches. Access control solutions under this category are designed for one system only (e.g., MongoDB, Hadoop), and possibly leverage native access control features of the protected platform. The main advantage of this approach is that the devised access control solution can be optimized for the target system. However, its usability and interoperability are greatly limited.
Platform independent approaches. The approaches falling under this category propose access control solutions which do not target one specific platform. Platform independent approaches have the advantage of being more general than platform specific solutions. However, they cannot compete with them in terms of efficiency. Existing proposals in this category mainly leverage recent research efforts that aim at defining a unifying query language for NoSQL datastores (e.g., JSONiq (Florescu and Fourny 2013) and SQL++ (Ong et al. 2014)).
Domain specific Big Data approaches. This complementary category includes platform specific and platform independent approaches that target domain specific Big Data systems, designed to fulfill specific requirements related to data management needs of a target scenario. As a matter of fact, a variety of Big Data systems have been designed to handle specific application scenarios, and the literature has shown that in these cases the integration of access control mechanisms has mainly been driven by intrinsic features of these systems. In particular, among the various scenarios that can benefit from Big Data systems, we focus on two of the most relevant ones, namely, data stream analysis and Internet of Things applications, by analyzing related access control enforcement techniques.
In what follows, we analyze the related literature in view of this classification, then we discuss related research challenges.
Platform specific approaches
The great majority of access control frameworks targeting Big Data platforms propose enforcement approaches designed on the basis of platform specific features and which can only be used with the platform for which they have been defined.
In the remainder of this section, we analyze platform specific approaches defined for MapReduce-based analytics platforms and NoSQL datastores, which together cover the majority of existing Big Data systems.
MapReduce is a distributed computational paradigm that allows analyzing very large data sets (Dean and Ghemawat 2004). Within MapReduce systems, data resources are partitioned into multiple chunks of data and distributed in a cluster of commodity hardware nodes. Data are analyzed in parallel by means of MapReduce tasks, characterized by user-defined Map and Reduce functions. These tasks operate by first extracting and then manipulating flows of key-value pairs, each modeling a portion of the target data resource. The considered computation paradigm allows processing unstructured and semi-structured data resources.
In Ulusoy et al. (2015), a framework denoted GuardMR has been proposed to enforce fine grained Role-based Access Control (RBAC) (Ferraiolo et al. 2001) within Hadoop, a very popular Big Data analytics platform built on top of MapReduce. GuardMR enforces data protection by filtering, and possibly altering, the key-value pairs derived from a target data resource by a MapReduce task, which are then provided as input to the Map function.
Filters are used to generate views of the analyzed resources which are authorized for the subject who requires the execution of the MapReduce task. The views are generated in such a way that any unauthorized content included in the analyzed resource is removed or obfuscated. More precisely, filters specify: i) preconditions to the processing of any key-value pair p extracted from a target resource under analysis, as well as ii) the rationale for deriving from p a new pair p’, which models the authorized content of p. The use of filters had previously been considered in Vigiles (Ulusoy et al. 2014), a fine grained access control framework for Hadoop. In Ulusoy et al. (2014), authorization filters are handled by means of per-user assignment lists, and filters are coded in Java by security administrators. In contrast, in GuardMR filters are assigned to subjects on the basis of the covered roles, and a formal specification approach to the definition of filters is proposed, which allows specifying selection and modification criteria at a very high level of abstraction using the Object Constraint Language (OCL) (Warmer and Kleppe 1998; Clark and Warmer 2002). GuardMR relies on automatic tools to generate Java bytecode from OCL-based filter specifications, as well as to integrate the generated bytecode into the bytecode of the MapReduce task to be executed. GuardMR has been used with MapReduce tasks targeting both textual and binary resources (Ulusoy et al. 2015), showing the flexibility of the approach. GuardMR and Vigiles do not require Hadoop source code customization. However, they rely on platform specific features, such as the Hadoop APIs and the Hadoop control flow for regulating the execution of a MapReduce task. A reasonably low enforcement overhead has been observed with both Vigiles and GuardMR. Neither Vigiles nor GuardMR provides support for context aware access control policies.
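The role of a filter, with its precondition on a pair p and its derivation of an authorized pair p’, can be sketched conceptually as follows. This is plain Python rather than OCL or Java bytecode, and it does not use the Hadoop APIs; the record layout, the filter, and the role table are hypothetical and only meant to show where a filter sits with respect to the Map function.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Pair = Tuple[str, str]
Filter = Callable[[Pair], Optional[Pair]]


def mask_ssn(pair: Pair) -> Optional[Pair]:
    """Hypothetical filter over 'name,ssn,city' records."""
    key, value = pair
    fields = value.split(",")
    if len(fields) != 3:
        return None  # precondition not met: the pair is not processed
    name, _ssn, city = fields
    return key, f"{name},***,{city}"  # p': authorized content of p


def authorized_view(pairs: Iterable[Pair], flt: Filter) -> Iterator[Pair]:
    """Apply the filter to every pair before it reaches the Map function."""
    for pair in pairs:
        filtered = flt(pair)
        if filtered is not None:
            yield filtered


def map_fn(pair: Pair) -> Iterator[Tuple[str, int]]:
    """User-defined Map function: count records per city."""
    _key, value = pair
    yield value.split(",")[2], 1


# Filters are selected on the basis of the subject's role (hypothetical table).
ROLE_FILTERS: Dict[str, Filter] = {"analyst": mask_ssn}


def run_map_phase(pairs: Iterable[Pair], role: str) -> List[Tuple[str, int]]:
    emitted: List[Tuple[str, int]] = []
    for pair in authorized_view(pairs, ROLE_FILTERS[role]):
        emitted.extend(map_fn(pair))
    return emitted


records = [
    ("0", "ann,123-45-6789,Rome"),
    ("1", "malformed"),
    ("2", "bob,987-65-4321,Rome"),
]
view = list(authorized_view(records, mask_ssn))
emitted = run_map_phase(records, "analyst")
```

The Map function never sees the original social security numbers, since it only receives pairs from the authorized view.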
A recent work targeting access control enforcement within MapReduce systems is described in Gupta et al. (2017). More precisely, Gupta et al. (2017) introduces the foundations of an access control model, called HeAC, which formalizes the authorization model of Apache Ranger and Apache Sentry, as well as the native access control features of Hadoop. Apache Ranger and Apache Sentry represent state of the art technologies for the enforcement of fine grained access control in Hadoop ecosystems. Authorization assignments are specified for operations and objects, possibly on the basis of object tags, namely attributes specifying properties, like sensitivity, content, or expiration date. Moreover, Gupta et al. (2017) introduces the foundation of Object Tagged RBAC, an RBAC model which, while preserving RBAC role based permission assignments, introduces support for object attributes. A prototypical implementation of the model has been defined by introducing role support into Apache Ranger. The proposed enforcement approach is again platform specific as it has been designed on top of Hadoop specific features. No support is given to context related properties, and no performance evaluation is presented.
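One possible reading of tag-based, role-based authorization is sketched below: permissions are still assigned to roles, but they refer to object tags instead of individual objects. This is an illustrative simplification, not the formal HeAC or Object Tagged RBAC model, and it does not reflect the Apache Ranger API; all names and sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class TaggedObject:
    """A data object annotated with tags describing its properties."""
    name: str
    tags: Set[str] = field(default_factory=set)


# Role-based permission assignments expressed as (operation, object tag).
ROLE_PERMISSIONS: Dict[str, Set[Tuple[str, str]]] = {
    "auditor": {("read", "financial")},
    "engineer": {("read", "logs"), ("write", "logs")},
}

# User-to-role assignments, as in plain RBAC.
USER_ROLES: Dict[str, Set[str]] = {"alice": {"auditor"}, "bob": {"engineer"}}


def is_authorized(user: str, operation: str, obj: TaggedObject) -> bool:
    """Grant access if any role of the user holds the operation on a tag of obj."""
    for role in USER_ROLES.get(user, set()):
        for permitted_op, tag in ROLE_PERMISSIONS.get(role, set()):
            if permitted_op == operation and tag in obj.tags:
                return True
    return False


report = TaggedObject("q3-report", {"financial", "confidential"})
```

With these assignments, `is_authorized("alice", "read", report)` holds, while write access by alice or any access by bob to the same object is denied.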
NoSQL datastores represent highly flexible, scalable, and efficient data management systems for Big Data, based on different data models. Cattell (2011) classifies NoSQL systems into three classes, on the basis of the adopted data model, namely key-value, wide column, and document-oriented datastores, each suited to specific application scenarios. Key-value datastores (e.g., Redis) can be seen as big hash tables with persistent storage services. Data are modeled by means of key-value pairs, where values of primitive or complex type are directly addressed by means of a key. Key-value datastores are suited to application scenarios where efficient look-up operations are required. For instance, they are used to manage web session information and user profile data. Wide column stores (e.g., Cassandra) model data as records with variable structures, which are then grouped into tables with flexible schema. Wide column stores are a good fit for the data management requirements of blogging platforms and content management systems. Document-oriented datastores (e.g., MongoDB) model data as hierarchical records, denoted documents, whose fields either specify a primitive value, or are in turn records composed of multiple fields. Documents are partitioned into collections, which in turn are grouped in a database. Typical applications of document oriented datastores include event logging systems and content management systems.
Fine grained access control within NoSQL datastore management systems is still at a very early stage, and only a few access control frameworks have been proposed so far for wide column and document oriented datastores.
K-VAC (Kulkarni 2013) is among the earliest fine grained access control frameworks targeting wide-column NoSQL datastores which have been proposed in the literature. K-VAC supports the enforcement of content-based and context-based access control policies, possibly specified at different levels of the data model hierarchy (e.g., for a column or for a row). Two prototypical versions of K-VAC have been released. The first has been specifically designed as an internal module of Cassandra, a popular wide-column datastore whose source code has been modified to host K-VAC’s enforcement monitor. In contrast, the second version has been released as an external library, with the aim of enforcing access control on multiple datastores. However, the use of the proposed library still requires ad-hoc implementation of binding criteria, which so far have only been defined for Cassandra and HBase. Overall, the integration of K-VAC requires deep customizations of the hosting platform. Empirical performance evaluations show the efficiency of both the proposed prototypes, with a lower overhead measured with the customized version of Cassandra.
Another work targeting Cassandra has been proposed in Shalabi and Gudes (2017), where an approach to the cryptographic enforcement of RBAC policies has been defined. Predicate (Katz et al. 2013) and second level encryption (Nabeel and Bertino 2014) are used for the definition of an efficient scheme for RBAC enforcement which operates within Cassandra distributed architecture. The proposed approach is an example of platform specific solution designed on top of specific features, such as the distributed architecture of Cassandra. Also in this case no support is given for context-aware policies, and, unfortunately, the enforcement overhead is not discussed.
As far as document-oriented datastores are concerned, efficient solutions to the integration of fine-grained purpose-based access control into MongoDB have been proposed in Colombo and Ferrari (2016; 2017a). In Colombo and Ferrari (2017a) the RBAC model natively integrated in MongoDB has been enhanced with the support for the specification and enforcement of purpose-based policies (Byun and Li 2008) regulating the access up to document level. The proposed approach refines the granularity level at which the native MongoDB RBAC model operates. An enforcement monitor, called Mem (MongoDB enforcement monitor), has been designed, which monitors and possibly manipulates the flow of messages exchanged by MongoDB clients and the MongoDB server, thus acting like a proxy. Once Mem intercepts a message m issued by a MongoDB client on behalf of a subject s, it either forwards m to the server, or it temporarily blocks m and issues additional messages aimed at profiling s. If m models a query q, Mem rewrites m as m’ in such a way that m’ encodes a query q’ that only accesses those documents accessed by q which are authorized by the applicable access control policies. Mem’s proxy based architecture allows the straightforward integration of the enforcement monitor into existing MongoDB deployments with basic configuration tasks. Experimental evaluations show the efficiency of the proposed approach. However, also in this case no support is given for context-aware policies.
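The rewriting of q into q’ can be illustrated on MongoDB-style filter documents. The sketch below is not Mem's actual encoding: it assumes, hypothetically, that each document carries an `allowed_purposes` array, and it conjoins the original filter with a purpose condition through the standard `$and` operator. The small `matches` function only emulates the tiny subset of filter semantics needed to run the example without a MongoDB server.

```python
from typing import Any, Dict, List

Document = Dict[str, Any]


def rewrite_filter(query_filter: Document, access_purpose: str) -> Document:
    """Derive q' from q by adding a purpose compliance condition (hypothetical)."""
    policy_condition = {"allowed_purposes": access_purpose}
    return {"$and": [query_filter, policy_condition]}


def matches(doc: Document, flt: Document) -> bool:
    """Toy evaluator: equality, array membership, and $and only."""
    for key, condition in flt.items():
        if key == "$and":
            if not all(matches(doc, sub) for sub in condition):
                return False
            continue
        value = doc.get(key)
        if isinstance(value, list):
            if condition not in value:  # array field contains the value
                return False
        elif value != condition:
            return False
    return True


collection: List[Document] = [
    {"_id": 1, "city": "Rome", "allowed_purposes": ["research", "marketing"]},
    {"_id": 2, "city": "Rome", "allowed_purposes": ["research"]},
    {"_id": 3, "city": "Milan", "allowed_purposes": ["marketing"]},
]

q = {"city": "Rome"}
q_prime = rewrite_filter(q, "marketing")

original_ids = [d["_id"] for d in collection if matches(d, q)]
authorized_ids = [d["_id"] for d in collection if matches(d, q_prime)]
```

The rewritten query returns only the subset of documents selected by q whose purposes include the declared access purpose.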
In Colombo and Ferrari (2016), the framework presented in Colombo and Ferrari (2017a) has been significantly extended, introducing the support for access control policies regulating the access up to field level, and providing support to specification and enforcement of content and context based policies. The proposed enforcement monitor, denoted ConfinedMem, applies the same logic as Mem, but it operates according to a two-step process, which consists of: 1) the derivation of the authorized views of all documents to be accessed by a submitted query q included in a message m requiring the access to data resources, 2) the rewriting of m as m’ in such a way that m’ specifies a query q’ which can only access the authorized views of the documents to be accessed by q. Different implementation techniques have been considered for queries specifying different operations (e.g., selection and aggregations) with the aim to minimize the overhead. Experimental evaluations show that, overall, the enforcement overhead which has been observed with access control policies specified at field level is significantly higher than the one measured for document level policies.
Platform independent approaches
The great majority of the research contributions in the field of access control for Big Data analytics platforms propose a platform specific solution.
The lack of a reference standard query language and data model has caused the birth of a variety of proprietary solutions. As a matter of fact, numerous NoSQL datastores exist, most of which operate with a platform specific query language (e.g., the query language of MongoDB can only be used with that platform), and adopt a different data model. Even different datastores that nominally refer to the same data model can use different data organization and terminology. For instance, both MongoDB and CouchDB use the document oriented data model. However, the concept of collection is not supported by CouchDB, whereas collections are basic data organization features of MongoDB. The great heterogeneity of the scenario has significantly raised the complexity of devising enforcement solutions that can work with multiple platforms. Overall, the definition of a general access control enforcement approach represents a very ambitious task.
In recent years, academia and industry started collaborating on the definition of unifying query languages for NoSQL datastores. To the best of our knowledge, JSONiq (Florescu and Fourny 2013) and SQL++ (Ong et al. 2014) represent the most relevant results that have been achieved so far towards the fulfillment of this goal. JSONiq is an XQuery (Chamberlin 2003) based language that has been defined with the aim of analyzing data handled by NoSQL datastores adopting a JSON-based data model. Unfortunately, at present JSONiq is only supported by Zorba and Sparksoniq, which allow processing data serialized in JSON format, and by a platform denoted 28msec, which supports the execution of JSONiq queries on MongoDB databases.
SQL++ (Ong et al. 2014) is a recent proposal of a unifying query language that allows analysing semi-structured data handled by NoSQL datastores as well as structured data of traditional DBMSs. SQL++ has been recently adopted by Couchbase and AsterixDB (Alsubaiee et al. 2014), whereas Apache Drill is in the process of aligning with SQL++. The diffusion of this language is thus growing, and the adopted SQL based syntax and the backward compatibility with relational DBMSs promise to further increase its popularity and diffusion.
In Colombo and Ferrari (2017b) an SQL++ based Attribute-based Access Control (ABAC) (Hu et al. 2013; 2015) framework for NoSQL datastores has been proposed. The choice to base the framework on SQL++ allows protecting any NoSQL datastore which provides support for this language. Therefore, the proposal differs from all other work introduced in “Platform specific approaches” section in its higher generality and applicability, which may even grow with a potential wider diffusion of SQL++ in the future. The framework operates at a very fine grained level, in that it allows regulating the access up to single data fields. The supported granularity is thus equivalent to cell level within relational DBMSs. Enforcement is based on query rewriting and operates with heterogeneous data with no assumption on data schema, thus overcoming state of the art query rewriting techniques proposed for RDBMSs (Rizvi et al. 2004; LeFevre et al. 2004).
Query rewriting techniques aimed at enforcing cell-level access control within traditional DBMSs operate by projecting or nullifying the value of each cell to be accessed by a query q on the basis of the compliance of the access performed by q with the applicable access control policies (LeFevre et al. 2004). More precisely, a query q submitted for execution is rewritten in such a way as to: i) include a subquery s for each table t accessed by q, which, cell by cell, generates an authorized view of t, and ii) perform the same analysis tasks as q on the result set of s. The subquery s specifies projection criteria conditioned by the compliance of the accesses operated by q with the cell level access control policies that have been specified for t’s cells. Such an approach can only be used if the schema of any accessed table is known a priori, as the projection criteria of the subqueries need to refer to table columns. The schemaless and highly heterogeneous nature of the data within Big Data platforms does not allow the use of similar techniques.
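A toy version of this cell-level rewriting is sketched below with Python's standard `sqlite3` module. The policy encoding (one `<column>_ok` flag per protected cell) and the `rewrite` helper are assumptions made for illustration, not the technique of LeFevre et al. (2004) in detail. The helper must be told the column names in advance, which is exactly the schema dependence discussed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical policy metadata: <column>_ok = 1 means the cell may be disclosed.
conn.execute(
    "CREATE TABLE patients ("
    "id INTEGER, name TEXT, diagnosis TEXT, name_ok INTEGER, diagnosis_ok INTEGER)"
)
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?, ?, ?)",
    [
        (1, "ann", "flu", 1, 1),
        (2, "bob", "asthma", 1, 0),
        (3, "cat", "diabetes", 0, 1),
    ],
)

PROTECTED_COLUMNS = ["name", "diagnosis"]  # must be known a priori


def authorized_view_sql(table: str, key: str, columns: list) -> str:
    """Build subquery s, which nullifies every cell whose access is not allowed."""
    projections = [key] + [
        f"CASE WHEN {col}_ok = 1 THEN {col} ELSE NULL END AS {col}"
        for col in columns
    ]
    return f"(SELECT {', '.join(projections)} FROM {table}) AS {table}"


def rewrite(query: str, table: str, key: str, columns: list) -> str:
    """Replace the table reference in q with the authorized view subquery."""
    return query.replace(
        f"FROM {table}", f"FROM {authorized_view_sql(table, key, columns)}"
    )


q = "SELECT id, name, diagnosis FROM patients ORDER BY id"
rows = conn.execute(rewrite(q, "patients", "id", PROTECTED_COLUMNS)).fetchall()
```

The original query logic is untouched; only the data source is replaced by its authorized view.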
In Colombo and Ferrari (2017b) this issue has been handled by means of SQL++ operators that allow achieving the projection without knowing in advance the accessed fields. The approach operates by visiting, field by field, the data unit du of an analyzed resource, and adding a visited field f to the authorized view du’ of du only if the access to f complies with the ABAC policies specified for f. The proposed approach allows deriving in-memory authorized views of the data resources to be analyzed, and executing the analysis tasks of the original queries on such derived views. The ABAC framework proposed in Colombo and Ferrari (2017b) supports the specification and enforcement of context-aware access control policies. Empirical performance assessments show an enforcement overhead that varies with the characteristics of the specified policies and the number of fields of the analyzed documents. The overhead is high when field level policies cover a high percentage of the fields of the data units.
Another language-based ABAC approach has been proposed in Longstaff and Noble (2016), with the goal of being usable with traditional data management systems, MapReduce systems, as well as NoSQL datastores. The work proposes a query rewriting approach that targets user transactions specified with an SQL-like language. Unfortunately, a detailed description of the adopted query language and data model is missing, which makes it unclear how the approach could be used with different platforms, and how the heterogeneity of schemaless data can be handled by means of an SQL-like language.
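The field-by-field derivation of an authorized view du’ from a data unit du can be sketched in plain Python as follows. The actual framework relies on SQL++ operators, which are not reproduced here; the policy function, the subject attributes, and the sample documents are hypothetical. The point of the sketch is that no schema is assumed: fields are discovered while visiting each data unit.

```python
from typing import Any, Callable, Dict, Tuple

FieldPath = Tuple[str, ...]
Policy = Callable[[FieldPath, Dict[str, Any]], bool]


def abac_policy(path: FieldPath, subject: Dict[str, Any]) -> bool:
    """Hypothetical attribute-based and context-aware field level policy."""
    if path[-1] == "ssn":
        return subject.get("role") == "physician"
    if path == ("contact", "email"):
        return bool(subject.get("on_premises"))  # contextual condition
    return True


def authorized_view(
    du: Dict[str, Any],
    subject: Dict[str, Any],
    policy: Policy,
    path: FieldPath = (),
) -> Dict[str, Any]:
    """Visit du field by field and keep only the fields that may be accessed."""
    view: Dict[str, Any] = {}
    for name, value in du.items():
        field_path = path + (name,)
        if not policy(field_path, subject):
            continue  # unauthorized field: left out of du'
        if isinstance(value, dict):
            # Nested record: recurse (arrays are not handled in this sketch).
            view[name] = authorized_view(value, subject, policy, field_path)
        else:
            view[name] = value
    return view


# Two data units with different structures, as in a schemaless collection.
du1 = {"name": "ann", "ssn": "123", "contact": {"email": "a@x.org", "city": "Rome"}}
du2 = {"nick": "bob", "visits": 3}

nurse = {"role": "nurse", "on_premises": True}
remote_physician = {"role": "physician", "on_premises": False}

view1 = authorized_view(du1, nurse, abac_policy)
view2 = authorized_view(du2, nurse, abac_policy)
view3 = authorized_view(du1, remote_physician, abac_policy)
```

Since every field of every data unit is visited, the cost of the derivation grows with the number of fields, which is consistent with the overhead trend reported above.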
Summary of the surveyed platform specific and platform independent access control frameworks: GuardMR (Ulusoy et al. 2015), Vigiles (Ulusoy et al. 2014), HeAC (Gupta et al. 2017), K-VAC (Kulkarni 2013), ConfinedMem (Colombo and Ferrari 2016), and the SQL++ based ABAC framework (Colombo and Ferrari 2017b), whose target platforms are all those supporting SQL++.
Domain specific Big Data approaches
In this section, we focus on the state of the art approaches to the integration of access control into Big Data systems designed for specific application domains. In particular, we first analyze approaches that target Big Data platforms supporting data stream analytics, and then we focus on those for Internet of Things ecosystems.
Big Data streaming analytics
In recent years, the number of Big Data platforms that provide support for data stream management has been growing. Apache Spark is probably the most popular open source framework which supports the analysis of continuous streams of data. Apache Storm is another open source distributed real-time computation system which can also be used for real-time analytics and continuous computation. In addition, several commercial solutions exist, such as Amazon Kinesis, which is a service for real-time processing of streaming data on the cloud, and IBM Streaming analytics, a platform supporting risk analysis and decision making in real-time. Due to the growing emphasis on real-time analysis of data flows, access control enforcement mechanisms targeting continuous flows of data are strongly required. A few results have been presented in the past years in the field of Data Stream Management Systems (DSMSs) (e.g., Nehme et al. (2010), Carminati et al. (2010), and Puthal et al. (2015)).
In Nehme et al. (2010), a framework, called FENCE, has been proposed, which supports continuous access control enforcement. Data and query security restrictions are modeled as meta-data, denoted security punctuations, which are embedded into the data streams. Different enforcement mechanisms have been proposed, which operate by analyzing security punctuations, such as special physical operators which are integrated within query execution plans with the aim to filter the tuples which can be analyzed, and rewriting mechanisms targeting continuous queries.
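The idea of embedding security meta-data into a stream and filtering tuples with a dedicated operator can be sketched as below. This is a simplified reading of the approach: the punctuation format (a set of allowed roles that applies to the tuples that follow), the class names, and the deny-by-default choice are assumptions made for this example, not the FENCE design.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Iterator, Union


@dataclass(frozen=True)
class SecurityPunctuation:
    """Meta-data item: restriction applying to the tuples that follow it."""
    allowed_roles: FrozenSet[str]


@dataclass(frozen=True)
class DataTuple:
    payload: Dict[str, int]


StreamItem = Union[SecurityPunctuation, DataTuple]


def secure_filter(stream: Iterable[StreamItem], query_role: str) -> Iterator[dict]:
    """Operator placed in a query plan: pass only tuples the role may analyze."""
    allowed = False  # deny until a punctuation states otherwise
    for item in stream:
        if isinstance(item, SecurityPunctuation):
            allowed = query_role in item.allowed_roles
        elif allowed:
            yield item.payload


stream = [
    SecurityPunctuation(frozenset({"doctor"})),
    DataTuple({"hr": 70}),
    SecurityPunctuation(frozenset({"doctor", "nurse"})),
    DataTuple({"hr": 72}),
    SecurityPunctuation(frozenset({"doctor"})),
    DataTuple({"hr": 90}),
]

nurse_view = list(secure_filter(stream, "nurse"))
doctor_view = list(secure_filter(stream, "doctor"))
```

Because the restrictions travel with the data, the operator can react to policy changes while the continuous query keeps running.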
The framework in Carminati et al. (2010) assumes that data analysis within DSMSs is achieved by continuous queries, and enforces access control by means of query rewriting, where rewritten queries are defined by composition of secure query operators. In contrast, Puthal et al. (2015) presents a crypto-based solution to verify authenticity and integrity of data streams.
Complex event processing (CEP) systems (Cugola and Margara 2015) represent the evolution of DSMSs (Cugola and Margara 2012), and are nowadays used for many different applications, such as Internet of Things applications and Smart Cyber-physical systems (Dayarathna and Perera 2018).
CEPs support the processing of heterogeneous streams from multiple sources, as well as advanced forms of reasoning over such data streams. On the basis of the experience with DSMSs in Carminati et al. (2010), a novel access control model for CEP platforms has been proposed in Carminati et al. (2016). The model assumes an application scenario where users generating continuous flows of data specify how their data can be processed and what cannot be inferred from the data. The compliance of the access performed by a query with the specified user preferences is checked by verifying that each operator in the submitted query complies with the user preferences specified for the accessed attributes of the analyzed data streams. In Migliavacca et al. (2010), a system, called DEFCON, has been presented to enforce decentralised event flow control. The system, which has been designed targeting the financial trading scenario, applies information flow control principles and leverages security labels assigned to event messages. Event flow control is achieved through a lightweight approach that makes use of application-level virtualisation to separate processing units.
Internet of Things
Internet of Things (IoT) ecosystems are representative cases of Big Data applications. IoT applications are rapidly getting popularity in a variety of domains for the indisputable improvements of people life style they bring. Nowadays a growing number of users cannot do without wereable devices that track their movements, sport activities and health conditions, and a variety of devices and apps exist for this purpose. IoT applications are used to control the safety of the environments where people live, as well as to improve their life style. As a matter of fact, the diffusion of home automation services and smart devices like smart locks, smart meters, and smart lights is growing.
Due to the personal and sensitive nature of the handled information, security and privacy of these systems have become a major concern. Therefore, in the recent years, several research efforts have been devoted to security and privacy of IoT applications, and a variety of access control models have been proposed (see, for instance, Ouaddah et al. (2017) for a compendium).
CapBAC distinguishes from other models in the literature as it allows externalizing and distributing the management of access authorizations. However, it does not take context awareness into account, and for this reason it has been criticized (Ouaddah et al. 2015).
For instance, in Zhang and Tian (2010), with the aim of fitting IoT dynamicity, an extended version of RBAC supporting contextual constraints has been introduced. However, the resulting enhanced model has been criticized (e.g., see Rajpoot et al. (2015)) as it is affected by shortcomings, like role explosion, which also characterize RBAC. A few approaches have been based on the ABAC model. For instance, Kaiwen and Lihua (2014) propose an ABAC model that extends RBAC with the dynamic assignment of roles to users. However, the proposed model only partially exploits ABAC features, as it only supports subject attributes. Another ABAC model operating with a predefined set of attributes has been proposed in Hemdi and Deters (2016). The proposed enforcement monitor has been designed for IoT ecosystems that use CoAP as communication protocol. Unfortunately, the focus of Hemdi and Deters (2016) is on implementation aspects, and neither the enforcement mechanism nor the supported access control policies are formally specified.
In Marra et al. (2017) and La Marra et al. (2017; 2018), a framework is proposed that supports the enforcement of Usage Control (UCON) (Zhang et al. 2005) within IoT ecosystems. The approach is illustrated by discussing the policy enforcement mechanism within a Smart Home environment. However, the generality of the proposed mechanism is limited by constraining assumptions, such as the use of ad-hoc defined brokers.
A general enforcement mechanism has been proposed in Colombo and Ferrari (2018), which allows enforcing policies of different access control models within MQTT-based IoT ecosystems. The proposed framework provides a monitor that enforces access control by regulating the flow of the exchanged MQTT control packets. The framework is illustrated using ABAC, but other models are also supported.
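To make the idea of a monitor that regulates the flow of control packets more concrete, the following sketch filters simplified PUBLISH and SUBSCRIBE packets with attribute-based rules. It is not the framework of Colombo and Ferrari (2018) and it does not use any MQTT library: `ControlPacket`, `Monitor`, the two rules, the attribute names and the topic layout are all hypothetical, and a packet is forwarded only if at least one rule permits it (an assumed deny-by-default configuration).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Attributes = Dict[str, str]


@dataclass(frozen=True)
class ControlPacket:
    """Simplified stand-in for an MQTT control packet (hypothetical structure)."""
    kind: str       # "PUBLISH" or "SUBSCRIBE"
    client_id: str  # identifier of the client that sent the packet
    topic: str      # topic the packet refers to


Rule = Callable[[Attributes, ControlPacket], bool]


def residents_may_subscribe(attrs: Attributes, packet: ControlPacket) -> bool:
    """Hypothetical rule: residents may subscribe to any topic under home/."""
    return (packet.kind == "SUBSCRIBE"
            and attrs.get("role") == "resident"
            and packet.topic.startswith("home/"))


def sensors_may_publish(attrs: Attributes, packet: ControlPacket) -> bool:
    """Hypothetical rule: sensors may publish only under home/sensors/."""
    return (packet.kind == "PUBLISH"
            and attrs.get("type") == "sensor"
            and packet.topic.startswith("home/sensors/"))


class Monitor:
    """Forwards a packet only if some rule permits it (deny by default)."""

    def __init__(self, subject_attrs: Dict[str, Attributes], rules: List[Rule]) -> None:
        self.subject_attrs = subject_attrs
        self.rules = rules

    def permits(self, packet: ControlPacket) -> bool:
        # Unknown clients have no attributes, so no rule can match them.
        attrs = self.subject_attrs.get(packet.client_id, {})
        return any(rule(attrs, packet) for rule in self.rules)

    def forward(self, packets: Iterable[ControlPacket]) -> List[ControlPacket]:
        # Packets that are not permitted are dropped before reaching the broker.
        return [p for p in packets if self.permits(p)]


monitor = Monitor(
    subject_attrs={
        "alice": {"role": "resident"},
        "thermo-1": {"type": "sensor"},
        "guest-7": {"role": "guest"},
    },
    rules=[residents_may_subscribe, sensors_may_publish],
)

traffic = [
    ControlPacket("PUBLISH", "thermo-1", "home/sensors/temperature"),
    ControlPacket("SUBSCRIBE", "alice", "home/sensors/temperature"),
    ControlPacket("SUBSCRIBE", "guest-7", "home/sensors/temperature"),
    ControlPacket("PUBLISH", "thermo-1", "home/locks/front-door"),
]

forwarded = monitor.forward(traffic)
print([(p.kind, p.client_id) for p in forwarded])
```

Because rules are plain predicates over subject attributes and packet fields, a different access control model could in principle be plugged in by supplying a different rule set, which mirrors the model independence claimed for the surveyed framework.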
A recent research line targets the study of access control enforcement for cloud-enabled IoT (see e.g., Alshehri and Sandhu (2016; 2017); Bhatt et al. (2017; 2018); Ahmad et al. (2018)). Alshehri and Sandhu propose an access control oriented (ACO) architecture (Alshehri and Sandhu 2016; 2017), which supports the definition of access control models for cloud-based IoT services. ACO has been used to define enforcement mechanisms tailored for specific IoT platforms (Bhatt et al. 2017), and applications (Bhatt et al. 2018).
Access control enforcement for cloud-enabled IoT systems has also been investigated by Ahmad et al. (2018), who, starting from a case study related to a smart home environment, have identified a set of key requirements for the enforcement of access control within IoT ecosystems. The authors have also proposed an approach to handle access control as a service, outsourcing policy management to a trusted third party, while relying on the native mechanisms of state-of-the-art IoT platforms for policy enforcement. The feasibility of the approach has been assessed with respect to the satisfiability of the identified requirements.
Finally, some proposals target intelligent transportation systems. Recent research efforts in this field have been devoted to enabling advanced forms of communication among vehicles, road infrastructures, drivers, as well as intra-vehicle devices. The envisaged services rely on a variety of technologies, which range from dedicated hardware and software components to the enabling communication infrastructures, possibly cloud or fog based. In this complex scenario vehicular security represents a major concern, and the US Department of Transportation has already outlined the strategic goals of an Intelligent Transportation System Program (Barbaresso et al. 2014). Initial research results in this field have been described in Gupta and Sandhu (2018), where an extended version of the ACO architecture presented in Alshehri and Sandhu (2016), called E-ACO, is discussed. The paper also discusses enforcement mechanisms tailored for various E-ACO layers. However, the topic remains an open research field, with room for investigations in manifold directions (see Gupta and Sandhu (2018)).
In what follows, we discuss some open research issues in the field of Big Data access control.
Unifying access control models and mechanisms
The state of the art review carried out in the “Platform specific approaches” and “Platform independent approaches” sections has highlighted that, although research in the area of access control for Big Data platforms is progressing, no solution has been proposed so far for a unifying access control framework which can combine generality and efficiency of access control. The heterogeneous schemaless nature of the managed data significantly complicates the definition of this framework, and so far this has led mainly to ad-hoc platform specific solutions (see “Platform specific approaches” section). In contrast, language centric approaches still suffer from limited applicability (see “Platform independent approaches” section). For instance, although the popularity of the SQL++ (Ong et al. 2014) initiative is growing, the support provided to this language is still limited to a small number of platforms.
One key element that may be instrumental to fill this void is the definition of a unifying data model capable of representing data resources of the different data models currently adopted by Big Data platforms. The ability to represent data resources is a fundamental requirement for binding access control policies to the protected data, as well as for the specification of policies regulating the access on the basis of the protected objects’ attributes. Indeed, in the literature on access control, multiple models allow enforcing content-based access constraints (e.g., Kulkarni (2013); Colombo and Ferrari (2016)), as well as access control rules that refer to various security meta-data related to the protected data resources (e.g., Colombo and Ferrari (2015a)).
The key-value, wide column, and document-oriented models adopt different data modeling criteria. However, in all these models data are hierarchically organized as tree structures, where nodes at different heights of the tree represent resources at different granularity levels of the related data model (e.g., database, table, row, and cell). Data models differ in the height of the tree with which data resources can be represented. This may range from 2 within key-value datastores (since all key-value pairs, the leaf nodes, belong to a key-space, the root node) to a variable height n (n>2) for document-oriented datastores, where a database (root node) groups a variable number of collections (level 2 nodes), which in turn include a variable number of documents (level 3 nodes), each composed of a variable number of fields, which in turn are possibly hierarchically organized into a tree structure (levels 4 to n). A data resource of a data model corresponds to a node of the tree representing all the resources handled by a platform, and it can be accessed by traversing the path from the root of the tree to that node. Therefore, we believe that a unifying representation of data resources of multiple data models should take into account the identification of proper modeling strategies for the nodes of the above mentioned resource tree. In particular, nodes should be specified in such a way as to keep track of: i) any structural property related to the modeled resource, ii) hierarchical relations with other nodes (e.g., a parent-of relationship), iii) possible meta-data, and iv) access control policies specified for the modeled resource. The considered policies may refer to different access control models, specifying context aware access control rules as well as content-based constraints.
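As an illustration only, a node of such a resource tree could be sketched as follows. The class `ResourceNode`, its field names, and the use of plain strings as policy placeholders are assumptions introduced here, not a data model proposed in the literature; the four groups of fields mirror points i) to iv) above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)  # identity-based equality avoids recursing through parent links
class ResourceNode:
    name: str
    kind: str                                                 # i) structural property
    parent: Optional["ResourceNode"] = field(default=None, repr=False)  # ii) parent-of
    children: Dict[str, "ResourceNode"] = field(default_factory=dict)   # ii) child nodes
    metadata: Dict[str, str] = field(default_factory=dict)    # iii) meta-data
    policies: List[str] = field(default_factory=list)         # iv) attached policies

    def add_child(self, child: "ResourceNode") -> "ResourceNode":
        """Attach a child node and record the hierarchical relation."""
        child.parent = self
        self.children[child.name] = child
        return child

    def resolve(self, path: List[str]) -> "ResourceNode":
        """Reach a resource by traversing the path from this node downwards."""
        node = self
        for name in path:
            node = node.children[name]
        return node

    def path(self) -> str:
        """Path from the root of the tree to this node."""
        parts, node = [], self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))

    def height(self) -> int:
        """Number of levels in the subtree rooted here (a leaf has height 1)."""
        return 1 + max((c.height() for c in self.children.values()), default=0)


# Key-value datastore: a key-space (root) and its key-value pairs (leaves).
keyspace = ResourceNode("sessions", "key-space")
keyspace.add_child(ResourceNode("user:42", "key-value pair"))

# Document-oriented datastore: database, collection, document, nested fields.
database = ResourceNode("shop", "database")
orders = database.add_child(ResourceNode("orders", "collection"))
order = orders.add_child(ResourceNode("o1", "document", metadata={"owner": "alice"}))
customer = order.add_child(ResourceNode("customer", "field"))
city = customer.add_child(ResourceNode("city", "field", policies=["permit-analysts"]))

print(keyspace.height())  # 2
print(database.height())  # 5
print(city.path())        # shop/orders/o1/customer/city
```

The same node type covers both datastores, and only the height of the resulting tree differs, which is the property a unifying representation would need to accommodate.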
The enforcement overhead of the above discussed technique is expected to depend on the platform hosting the data to be protected, as different behaviors are expected to be observed. For instance, Apache Spark integrates a highly efficient computation engine, which promises to be significantly faster than Hadoop 3 (up to 100 times faster). The overhead is expected to be reasonably contained in all those platforms supporting in-memory MapReduce computations, as well as data streams.
Policy analysis tools
The availability of a unifying data model on which access control policies can be specified would also make it possible to support policy analysis and reasoning at an abstract layer independent from any specific platform. As a matter of fact, the variety of data models, access control models, and related configuration options, such as policy propagation and conflict resolution criteria, adopted by Big Data platforms, can make it really hard for security administrators to understand the effect of a set of access control policies on the data resources which are managed by their systems, as well as to assess the quality of the specified policies. Most of the research efforts in this field have been devoted to correctness verification, detection of inconsistencies and redundancies, as well as reasoning on the completeness of policy sets. A variety of approaches have been adopted to achieve the analysis, which range from the use of formal methods to machine learning and data mining techniques. For instance, Datalog-based approaches have been proposed in Pasarella and Lobo (2017) and Tsankov et al. (2014), which respectively target Relationship-based Access Control (ReBAC) policies and decentralized composite access control systems. Approaches based on Answer Set Programming (ASP), such as the ones proposed in Ahn et al. (2010) and Kencana Ramli et al. (2013), allow the derivation of ASP programs from XACML policies, and the analysis of the specified policies by means of ASP solvers. Model checking approaches have been proposed in Guelev et al. (2004) and Zhang et al. (2005), whereas techniques based on SAT solvers and Multi-Terminal Binary Decision Diagrams have been used in Lin et al. (2010) as a basis for reasoning on the permissions granted by access control policies. Graph-based analysis approaches for category based access control policies have been proposed in Alves and Fernández (2015), with the aim of easing the verification tasks of security administrators.
Finally, data mining techniques have been primarily used for detecting policy anomalies (e.g., Hu et al. (2013)).
In Bertino et al. (2017) provenance techniques have been proposed to check the quality of the specified access control policies for a scenario where collaborations are carried out by autonomous cognitive devices. However, to the best of our knowledge, no proposal has so far targeted Big Data platforms. The model centric approach previously discussed may be exploited as a basis for the definition of such a policy analysis framework. For instance, it may be used to generate views of the protected resources that show the authorized and unauthorized contents when different policies and configuration options are used, as well as to quantify policy coverage for a requesting subject with respect to an execution context.
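A toy example may help to picture what such views and coverage measures could look like. The sketch below is an assumption-laden illustration, not an existing tool: resources are identified by their paths, `policies` is taken to be the set of decisions already selected for one requesting subject and one execution context, the only configuration option is whether decisions propagate from ancestors to descendants, and resources without any applicable policy are denied.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, ...]


def decision(path: Path, policies: Dict[Path, str], propagate: bool) -> str:
    """Decision for one resource: nearest explicit policy wins, else deny.

    With propagate=False only a policy attached to the resource itself counts.
    """
    if propagate:
        # The resource itself first, then its ancestors from nearest to root.
        candidates = [path[:i] for i in range(len(path), 0, -1)]
    else:
        candidates = [path]
    for candidate in candidates:
        if candidate in policies:
            return policies[candidate]
    return "deny"


def view(resources: List[Path], policies: Dict[Path, str],
         propagate: bool) -> Tuple[List[Path], List[Path]]:
    """Split resources into authorized and unauthorized ones."""
    authorized = [r for r in resources if decision(r, policies, propagate) == "permit"]
    unauthorized = [r for r in resources if decision(r, policies, propagate) != "permit"]
    return authorized, unauthorized


def coverage(resources: List[Path], policies: Dict[Path, str], propagate: bool) -> float:
    """Fraction of resources the subject is authorized to access."""
    if not resources:
        return 0.0
    authorized, _ = view(resources, policies, propagate)
    return len(authorized) / len(resources)


resources = [
    ("db", "orders", "o1", "total"),
    ("db", "orders", "o1", "card"),
    ("db", "users", "u1", "email"),
]
policies = {
    ("db", "orders"): "permit",               # collection-level grant
    ("db", "orders", "o1", "card"): "deny",   # field-level exception
}

authorized, unauthorized = view(resources, policies, propagate=True)
print(authorized)                                      # [('db', 'orders', 'o1', 'total')]
print(coverage(resources, policies, propagate=True))   # 0.3333333333333333
print(coverage(resources, policies, propagate=False))  # 0.0
```

Comparing the two coverage values shows how a single configuration option changes which contents are accessible under the same policy set, which is the kind of effect such an analysis framework would expose to administrators.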
The definition of a policy reasoning tool is also instrumental to complying with the new EU General Data Protection Regulation (GDPR), which is intended to strengthen data protection for all individuals within the European Union. GDPR applies regardless of where a company is located, provided that the company manages data of EU residents. GDPR introduces a set of very important principles for Big Data management, such as privacy by design and by default. The new regulation also emphasizes the accountability of data controllers, who must demonstrate compliance with GDPR, whereas Article 35 requires controllers to carry out Data Protection Impact Assessments in case of potentially high-risk processing activities. All such principles require tools to clearly assess the effect of access control policies on the managed data.
Finally, a policy analysis framework is also required for community centered collaborative systems, such as online social networks and collaborative editing platforms, which may be seen as federated applications that handle Big Data. Recent surveys pointed out that these systems typically provide rudimentary forms of access control (Paci et al. 2018). A key requirement for access control models tailored for collaborative systems is to allow users to understand collaborative decisions, as well as to inspect users' access preferences, and to evaluate their effects (Paci et al. 2018). Paci et al. (2018) claim that, although a few works exist which explain the effect of access decisions (Hu et al. 2013), and the reasons for which certain decisions have been taken (den Hartog and Zannone 2016), the above mentioned requirements are still largely understudied. Therefore, the definition of a reasoning framework capable of operating within such federated environments with multiparty access control models appears as a research challenge of paramount importance.
Overall, so far research on policy analysis has primarily focused on different properties of policy sets abstracting from the effects of policy enforcement on the protected resources. In contrast, we believe that frameworks capable of evaluating the effect of policy sets on resource accessibility within different Big Data platforms are required, which may provide support to multiple access control models and configuration options.
Issues related to domain specific Big Data systems
Let us now consider open challenges related to access control enforcement within domain specific Big Data systems. A selection of approaches targeting the enforcement of access control policies within traditional DSMSs and CEP platforms has been briefly presented in “Big Data streaming analytics” section. A possible strategy to integrate similar enforcement approaches into Big Data analytics platforms may consist in designing the mechanism on top of one of the existing streaming frameworks. However, similar to the platform specific approaches presented in “Platform specific approaches” section, such a solution would suffer from a limited applicability. Moreover, existing solutions operate at tuple level (e.g., Nehme et al. (2010)) and schema level (e.g., Carminati et al. (2016)), whereas cell/field level granularity may be necessary in the Big Data scenario (see “Platform specific approaches” and “Platform independent approaches” sections), requiring a data filtering approach that operates at a finer granularity level. The development of enforcement mechanisms based on language centric approaches still seems impracticable, as no standard continuous query language exists. In contrast, since some of these platforms can implement MapReduce tasks (e.g., Apache Spark, Apache Storm), a model centric approach may be a possible strategy. However, thorough investigations are required to support this intuition.
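The difference between tuple level and cell/field level filtering can be sketched in a few lines. The functions `filter_tuples` and `filter_cells`, the sample stream and the predicates are hypothetical and are not taken from any of the cited systems; the stream is modeled as a finite list of dictionaries for simplicity.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def filter_tuples(stream: List[Record],
                  tuple_allowed: Callable[[Record], bool]) -> List[Record]:
    """Tuple-level enforcement: a tuple is either released whole or dropped."""
    return [t for t in stream if tuple_allowed(t)]


def filter_cells(stream: List[Record],
                 cell_allowed: Callable[[Record, str], bool]) -> List[Record]:
    """Cell-level enforcement: each field of each tuple is checked on its own."""
    return [{name: value for name, value in t.items() if cell_allowed(t, name)}
            for t in stream]


stream = [
    {"patient": "p1", "ward": "A", "diagnosis": "flu"},
    {"patient": "p2", "ward": "B", "diagnosis": "asthma"},
]

# Hypothetical policy: the requester works in ward A.
tuple_level = filter_tuples(stream, lambda t: t["ward"] == "A")
cell_level = filter_cells(
    stream, lambda t, name: name != "diagnosis" or t["ward"] == "A")

print(tuple_level)
print(cell_level)
```

With tuple-level filtering the second record disappears entirely, whereas cell-level filtering releases it without the protected field, so downstream analytics can still count or join on the remaining attributes.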
As far as IoT ecosystems are concerned, the initial efforts briefly summarized in “Internet of Things” section have mainly produced models adopting centralized enforcement mechanisms (e.g., see Colombo and Ferrari (2018)). However, multiple IoT ecosystems may be connected to each other exchanging data, and federated systems where multiple IoT applications cooperate cannot be handled with centralized enforcement mechanisms. Multiparty access control solutions for IoT ecosystems are thus needed, and they must be suited to operate at Big Data scale. To the best of our knowledge, the definition of such access control frameworks still represents a big open research challenge.
Security services for Big Data represent a key feature instrumental to foster trust on how data are managed and analyzed by Big Data platforms. This paper has focused on one of the key security services, that is, access control, by discussing the requirements that an access control solution for Big Data platforms should address, also with reference to specific key application scenarios (i.e., IoT and data streams). Moreover, the paper has provided a review of the state of the art in view of the devised requirements, and it has also discussed future research challenges in the area.
Details are omitted due to the blind submission requirements.
MapReduce-based analytics platforms are hereafter denoted MapReduce systems for the sake of brevity.
Dresden OCL Toolkit, http://st.inf.tu-dresden.de/oclportal
HBase is a popular wide-column store, https://hbase.apache.org/
SQL++ can be used with datastores adopting different data models, thus, the term data unit is used to denote a table row, or a document.
eXtensible Access Control Markup Language (XACML) Version 3.0 http://docs.oasis-open.org/xacml/3.0/xacml-3.0-core-spec-os-en.html
Availability of data and materials
The authors declare that they have equally contributed to the preparation of the article. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- Agrawal, R, Kiernan J, Srikant R, Xu Y (2002) Hippocratic Databases. In: Proceedings of the 28th International Conference on Very Large Data Bases, VLDB ’02, 143–154.
- Ahn, G, Hu H, Lee J, Meng Y (2010) Representing and Reasoning about Web Access Control Policies. In: 34th Annual Computer Software and Applications Conference, 137–146. IEEE, Seoul. https://doi.org/10.1109/COMPSAC.2010.20.
- Alshehri, A, Sandhu R (2016) Access Control Models for Cloud-Enabled Internet of Things: A Proposed Architecture and Research Agenda. In: 2016 IEEE 2nd International Conference on Collaboration and Internet Computing (CIC), 530–538.
- Alshehri, A, Sandhu R (2017) Access Control Models for Virtual Object Communication in Cloud-Enabled IoT. In: 2017 IEEE International Conference on Information Reuse and Integration, 16–25.
- Barbaresso, J, et al. (2014) USDOT’s Intelligent Transportation Systems (ITS) Strategic Plan 2015-2019.
- Bertino, E, Jabal AA, Calo SB, Makaya C, Touma M, Verma DC, Williams C (2017) Provenance-Based Analytics Services for Access Control Policies. In: 2017 IEEE World Congress on Services, SERVICES 2017, Honolulu, HI, USA, June 25-30, 2017, 94–101.
- Bhatt, S, Patwa F, Sandhu R (2018) An Access Control Framework for Cloud-Enabled Wearable Internet of Things. In: 2017 IEEE 3rd International Conference on Collaboration and Internet Computing (CIC), 328–338.
- Carminati, B, Colombo P, Ferrari E, Sagirlar G (2016) Enhancing User Control on Personal Data Usage in Internet of Things Ecosystems. In: 2016 IEEE International Conference on Services Computing (SCC), 291–298.
- Colombo, P, Ferrari E (2016) Towards Virtual Private NoSQL datastores. In: 32nd IEEE International Conference on Data Engineering, ICDE 2016, Helsinki, Finland, May 16-20, 2016, 193–204.
- Colombo, P, Ferrari E (2017b) Towards a unifying attribute based access control approach for NoSQL datastores. In: 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19-22, 2017, 709–720.
- Colombo, P, Ferrari E (2018) Access Control Enforcement Within MQTT-based Internet of Things Ecosystems. In: 23rd ACM Symposium on Access Control Models and Technologies, SACMAT ’18, 223–234. ACM, New York (USA).
- Cugola, G, Margara A (2015) The Complex Event Processing Paradigm (Colace F, De Santo M, Moscato V, Picariello A, Schreiber FA, Tanca L, eds.). Springer, Cham.
- Dean, J, Ghemawat S (2004) MapReduce: Simplified Data Processing on Large Clusters. In: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, OSDI’04, 10–10. USENIX Association, Berkeley.
- Ferrari, E (2010) Access Control in Data Management Systems. Synthesis Lectures on Data Management. Morgan & Claypool Publishers. ISBN: 1608453758 9781608453757.
- Gupta, M, Sandhu RS (2018) Authorization framework for secure cloud assisted connected cars and vehicular internet of things. In: Proceedings of the 23rd ACM Symposium on Access Control Models and Technologies, SACMAT 2018, Indianapolis, IN, USA, June 13-15, 2018, 193–204.
- Gusmeroli, S, Piccione S, Rotondi D (2013) A capability-based security approach to manage access control in the Internet of Things. Math Comput Model 58(5):1189–1205. The Measurement of Undesirable Outputs: Models Development and Empirical Analyses and Advances in mobile, ubiquitous and cognitive computing.
- Hemdi, M, Deters R (2016) Using REST based protocol to enable ABAC within IoT systems. In: 2016 IEEE 7th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), 1–7.
- Hernández-Ramos, JL, Jara AJ, Marin L, Skarmeta AF (2013) Distributed capability-based access control for the internet of things. J Internet Serv Inf Secur (JISIS) 3(3/4):1–16.
- Hu, VC, Cogdell MM (2013). Guide to Attribute Based Access Control (ABAC) Definition and Considerations, National Institute of Standards and Technology, Jan. 2014, [online] Available: http://nvlpubs.nist.gov/nistpubs/specialpublications/NIST.sp.800-162.pdf.
- Kaiwen, S, Lihua Y (2014) Attribute-Role-Based Hybrid Access Control in the Internet of Things. In: Han W, Huang Z, Hu C, Zhang H, Guo L (eds) Web Technologies and Applications, 333–343. Springer, Cham.
- Kulkarni, D (2013) A fine-grained access control model for key-value systems. In: Proceedings of the Third ACM Conference on Data and Application Security and Privacy (CODASPY ’13), 161–164. ACM, New York. https://doi.org/10.1145/2435349.2435370.
- LeFevre, K, Agrawal R, Ercegovac V, Ramakrishnan R, Xu Y, DeWitt D (2004) Limiting disclosure in Hippocratic databases. In: Nascimento MA, Özsu MT, Kossmann D, Miller RJ, Blakeley JA, Schiefer KB (eds) Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto (Canada), Volume 30 (VLDB ’04), 108–119. VLDB Endowment.
- Longstaff, JJ, Noble J (2016) Attribute based access control for big data applications by query modification. In: Second IEEE International Conference on Big Data Computing Service and Applications, BigDataService 2016, Oxford, United Kingdom, March 29 - April 1, 2016, 58–65.
- Marra, AL, Martinelli F, Mori P, Saracino A (2017) Implementing Usage Control in Internet of Things: A Smart Home Use Case. In: 2017 IEEE Trustcom/BigDataSE/ICESS, 1056–1063.
- Migliavacca, M, Papagiannis I, Eyers DM, Shand B, Bacon J, Pietzuch P (2010) DEFCON: High-performance Event Processing with Information Security. In: Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference, USENIXATC’10, 1–1. USENIX Association, Berkeley, CA, USA.
- Nehme, RV, Lim HS, Bertino E (2010) FENCE: Continuous access control enforcement in dynamic data stream environments. In: 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), 940–943.
- Ong, KW, Papakonstantinou Y, Vernoux R (2014) The SQL++ unifying semi-structured query language, and an expressiveness benchmark of SQL-on-Hadoop, NoSQL and NewSQL databases. CoRR abs/1405.3631. https://arxiv.org/abs/1405.3631.
- Ouaddah, A, Bouij-Pasquier I, Elkalam AA, Ouahman AA (2015) Security analysis and proposal of new access control model in the Internet of Thing. In: 2015 International Conference on Electrical and Information Technologies (ICEIT), 30–35.
- Pasarella, E, Lobo J (2017) A Datalog Framework for Modeling Relationship-based Access Control Policies. In: Proceedings of the 22nd ACM on Symposium on Access Control Models and Technologies (SACMAT ’17 Abstracts), 91–102. ACM, New York. https://doi.org/10.1145/3078861.3078871.
- Puthal, D, Nepal S, Ranjan R, Chen J (2015) Dpbsv – an efficient and secure scheme for big sensing data stream. In: 2015 IEEE Trustcom/BigDataSE/ISPA, vol. 1, 246–253.
- Rizvi, S, Mendelzon A, Sudarshan S, Roy P (2004) Extending query rewriting techniques for fine-grained access control. In: ACM SIGMOD 2004, 551–562.
- Ulusoy, H, Colombo P, Ferrari E, Kantarcioglu M, Pattuk E (2015) GuardMR: Fine-grained Security Policy Enforcement for MapReduce Systems. In: Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security, ASIA CCS ’15, 285–296. ACM, New York.
- Ulusoy, H, Kantarcioglu M, Pattuk E, Hamlen K (2014) Vigiles: Fine-Grained Access Control for MapReduce Systems. In: 2014 IEEE International Congress on Big Data, 40–47.
- Warmer, JB, Kleppe AG (1998) The Object Constraint Language: Precise Modeling with UML (Addison-Wesley Object Technology Series).
- Zhang, G, Tian J (2010) An extended role based access control model for the Internet of Things. In: 2010 International Conference on Information, Networking and Automation (ICINA), vol. 1, V1-319–V1-323.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.