
1 Introduction

During the course of his doctoral study, Roy Fielding generalized the architectural principles behind the Web conceived by Tim Berners-Lee in the early 1990s and presented them as an architectural style underpinned by a framework of constraints. This framework was named Representational State Transfer (REST) [1], and systems that adhere to it are called “RESTful” systems or services. Because of its ease of use and deployment, REST has since been adopted in a variety of other development contexts, including web services and application programming interface (API) development, and has become a serious rival to the earlier Simple Object Access Protocol (SOAP) [2], itself a successor to the Remote Procedure Call (RPC) programming style. SOAP- and REST-based APIs have been used to communicate data and information in many fields, most recently Open Data.

Open Data is data that can be freely used, shared and built on by anyone, anywhere, for any purpose [3]. The recent global trend towards Open Data has organizations and governments relying on these APIs to communicate Open Data to a greater extent than before.

One of the key advantages of Open Data is that it increases the availability of data for consumers to use in decision making, while also offering the potential for substantial cost reduction through the implicit outsourcing of information system development. At the regional government level, a combination of Open Data with appropriate interrogation programs could conceivably replace the physical publication of documents such as guidebooks and listings. The active interest and support of various national and international non-governmental organizations, as well as state and federal government sponsored ‘hackathons’ such as GovHack [4] and HealthHack [5] that aim to develop new applications using Open Data, could also lead to a strong upsurge in the use of Open Data across many fields.

As the capacity for acquiring and storing data increases from year to year, and with data analytics exerting greater influence on decision making than in years past, trust must be placed not only in the processes and algorithms used to analyse data but in the authenticity and integrity of the data as well. There is now a fast-developing trend for enterprises, both public and private, to combine Open Data with proprietary and private data collections in order to support better decision making and reporting. However, how can users trust conclusions or decisions based on results obtained from largely untrusted data?

An adversary, wishing to use any means necessary to cause disruption, may seek to misuse the Open Data movement to achieve this disruption within society by:

  1. Diverting Open Data requests to a fraudulent site containing fraudulent datasets.

  2. Insertion of a fraudulent dataset into a legitimate site.

  3. Deliberate modification of a legitimate dataset.

  4. Denial of service, should Open Data become an integral and essential part of a community service.

It is therefore of vital importance to ensure the authenticity and integrity of Open Data so that users, businesses, industry and government alike can place trust in decisions made on the basis of that data.

2 Paper Scope

Because of the nature of Open Data, methods such as encryption, which address the confidentiality aspect of data, may not be entirely relevant to the broad philosophy of Open Data; they are therefore only briefly discussed.

This paper addresses the development of trust in the usage or adaptation of Open Data through the incorporation, by developers, of appropriate authentication and integrity parameters for data included in end-user Open Data applications. The principle here is that the average person would not normally access raw Open Data collections but would view them through the lens of an appropriate application. The user therefore needs to be able to trust both the authenticity and the integrity of the data supplied by that application.

The proposed architecture makes use of the Domain Name System Security Extensions (DNSSEC) for host/server verification. A full description of DNSSEC functionality lies outside the scope of this paper; it is discussed only briefly, in relation to how it fits into the proposed architecture as a whole. It should be noted that DNSSEC does not use digital certificates but rather a public key hierarchical registry.

The paper also discusses the proposed architecture’s use of a public key hierarchical registry and digital signatures, instead of the traditional certificate authority, for authenticating data publishers and preserving the integrity of datasets.

3 Related Work

3.1 Trustworthy Open Data

Open Data, which at its core is the idea that certain types of data should be freely accessible to the public, has its roots in the concept of open access to scientific data in the 1950s [6]. In more recent years, researchers have indicated the potential benefits of using Open Data analytics to positive effect in various fields [7–11], and coupled with the rise of Open Government [12–14], the Open Data movement stands to make an even greater impact in the near future. However, there are certain challenges that Open Data faces, not least of which are ensuring data quality and fostering trust in Open Data [15]. Strong [16] and later Mazon [17] recognize “fitness for use” as the definition of data quality and an important criterion for data analytics and business intelligence, and security plays an important role in ensuring both data quality and trust.

A database search returned few works on methods for Open Data security. Telikicherla and Chopella [18] propose a library for secure web application development as a means of preventing “frame busting” and other attacks in HTML5-based Open Data mashups. Eckert et al. [19] present a workflow model which preserves provenance for Linked Data and can be applied to Open Data. These approaches have their merits in the areas of browser security and provenance preservation. However, Open Data depends on the ability to transmit data that can be verified as coming from an authenticated source while retaining its integrity through the transmission process, and is therefore trustworthy to the user; this issue still needs to be addressed.

3.2 Security for REST

The United States, United Kingdom, Japan, and Australia, among others, use the Comprehensive Knowledge Archive Network (CKAN), an open source data management platform, to manage and publish their Open Data [20]. CKAN’s Action API is based on the RPC programming style. Another well-known Open Data management suite is Socrata; the Socrata Open Data API (SODA) provides an open, standards-based REST API [21]. In the case of the United States, the wide variety of government services and organizations has led to the use of both CKAN [22] and SODA [23] at the state and federal government levels.

As Open Data APIs are mostly exposed through the platform of the World Wide Web (WWW), and the WWW is based on the REST principles codified by Fielding [1], securing Open Data requires securing REST. In Fielding’s thesis, REST was designed for simplicity of implementation and scalability, but it has no pre-defined security protection mechanisms [24], in contrast to SOAP, which uses the WS-Security standard [25]. In response, several authors have recently suggested mechanisms for providing security for REST:

Forsberg [26] proposed an approach in which content protection is based on keys delivered to clients via a secure session, eliminating the need for repeated SSL/TLS encryption of cached content; Forsberg notes that this solution is tuned to work well with caching for data that requires confidentiality protection. Dunlu et al. [27] proposed an approach for user authentication that uses extended HTTP headers to carry extended username tokens, with a secondary password for the username token required to avoid leakage of the user’s password. The approach proposed by Serme et al. [28] is similar to that of Dunlu et al. in its use of extended HTTP headers, except that the headers convey digital certificate and encryption information.

Lee and Mehta [29], investigating some of the security threats to REST-based web services, concluded that although message encryption by HTTPS is a costly protection method, HTTPS-based data transfer was the best method to ensure data confidentiality. Backere et al. [30] state that the best solution to the RESTful security problem, or the one most conforming to RESTful principles, is to differentiate between messages that need to be encrypted and those that do not, to ensure that messages are non-modifiable, and to prevent replayed messages; they propose a login and REST resource access mechanism that leverages these concepts.

There is a general consensus that appropriate security mechanisms need to be employed in REST web services; however, the means of accomplishing this, as well as the security properties to be protected, vary from approach to approach. Most of the solutions presented by these authors focus on authenticating the user or protecting the confidentiality of information held in RESTful systems while using certificate-based Secure Sockets Layer/Transport Layer Security (SSL/TLS). This begs the question: is it necessary to protect the confidentiality of publicly available Open Data, or even to authenticate users of Open Data?

The answer is that Open Data is by its very nature public data; it should be viewable by the public and not restricted by confidentiality mechanisms, and authentication of the end user for read-only access is therefore not strictly necessary. That said, access to the methods that can alter Open Data resources must still be restricted to authenticated entities, such as the original publisher of the data or other authorized parties. An important quality raised by Backere et al. [30] that relates to Open Data is that messages and content should be unmodifiable, which essentially refers to the integrity of Open Data resources and collections even as they are transferred and cached over the Web.

3.3 Public Key Infrastructures

There are two aspects to Open Data: (1) the management of Open Data collections, which involves actions by an authoritative source such as adding, modifying or deleting datasets, and (2) the usage or modification of datasets after they have been added. Both aspects require assurance that: (1) the host of the data is genuine; (2) the data has been published by an authentic source; and (3) the data being transmitted is itself genuine and remains unchanged through the communication process. This assurance can be provided through the use of message digests and digital signatures.
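As a minimal illustration of the digest side of this mechanism (the file paths below are illustrative assumptions, not part of any cited system), a recipient can detect modification by comparing message digests; the digital signature then binds the published digest to the publisher’s identity:

    # Minimal sketch: detecting modification of a dataset via a message digest.
    # File names are illustrative assumptions only.
    import hashlib

    def sha256_digest(path: str) -> str:
        """Return the SHA-256 digest of a file as a hex string."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    published = sha256_digest("dataset_as_published.csv")
    received = sha256_digest("dataset_as_received.csv")
    print("integrity preserved" if published == received else "dataset was modified")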

Diffie and Hellman [31] proposed a “public file” which could be used as a central authority for binding a specific identity to a particular public key for authentication purposes, thereby establishing a hierarchy of trust. Because of the technological limitations of the time, this approach proved infeasible compared with the public key certificate concept proposed by Kohnfelder [32]. The public key certificate authentication scheme based on Kohnfelder’s thesis is now ubiquitous, even though the technological barrier to the “public file” no longer applies and despite the issues with public key certificate management highlighted by Clarke [33], most notably complexity and the problems with certificate revocation mechanisms.

Rivest attempted to address the issue of certificate revocation by proposing that proper organization of the certificate infrastructure can allow a signer to present a collection of certificates as evidence of authenticity [34]. A few years later, McDaniels and Rubin [35] stated that addressing PKI requirements in large, loosely coupled environments using certificate revocation lists is difficult; a Web environment based on REST, as envisioned by Fielding, is designed to be precisely such a large, loosely coupled environment.

There have been recent attempts to address the problems of public key certificate based authentication and trust through initiatives that leverage multiple certificate notaries, such as the Perspectives Project [36] and Convergence [37]. These initiatives provide trust agility, giving users the ability to decide which party they wish to trust for the authentication of certificates; in general, they establish that a certificate is genuine by consulting multiple certificate notaries to obtain additional trust perspectives on the certificate in question.

However, in the emerging landscape of Open Data, a distributed trust model may not be the answer, as usage of Open Data depends on whether or not a user trusts the owner or original creator of the data. If that owner or creator is not a reputable source, or if its identity cannot be verified, then any data coming from that source is untrustworthy. In other words, an instance of a public key must be verified as belonging to a genuine entity, taking into account the threat of the addition of false identities and/or public keys, before data obtained from that source can be considered authentic, intact and trustworthy.

4 Proposed Solution

This paper proposes a Trustworthy Architecture for Open Data Systems (Fig. 1) which serves as a precaution against tampering, enabling users to know whether a particular resource is genuine or has been tampered with, thus augmenting the REST framework. As mentioned in the previous section, SSL/TLS encryption of request-response messages represents an expenditure of processing resources that is not critical for communicating Open Data; however, measures to protect message integrity and authenticity, and thereby ensure trustworthiness, are still required.

Fig. 1. Trustworthy architecture for open data systems

4.1 Key Components

The proposed architecture hinges on several key components: a Key Generation component, a Public Key File, DNSSEC, and a Verifier Module, which interface with regular REST framework activity as needed.

Certification Authority vs Public Key Registry. Effective key management is essential for the smooth operation of cryptographic systems. Ordinarily, a trusted Certification Authority is responsible for issuing digital certificates and maintaining certificate revocation lists, and is also responsible for generating cryptographic key pairs and digital signatures. If a certificate model were to be used, this would be the normal procedure; however, given that Open Data is by its very essence “open”, a complex certificate architecture including certificate revocation should be unnecessary. Moreover, access to confidentiality/privacy services and mechanisms is not required. It therefore seems reasonable that each Open Data publishing entity could maintain its own public key, relevant to verification of data integrity alone, in an appropriately managed and controlled public key registry (PKR) file similar to the Diffie-Hellman concept of the “public file”, in line with the design philosophy of DNSSEC.

Public Key File. The Public Key File is a central location from which public keys associated with recognized identities may be retrieved. It is proposed that any entity wishing to publish Open Data generate a cryptographic key pair and submit the public key to the Open Data Public Key File maintained by the Registry (using the Key Generation and Signing Module). Users of Open Data will then be able to retrieve from this central location the appropriate public key to verify a signature.
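The paper does not prescribe a storage format for the Public Key File; one possible sketch of a record and lookup is shown below, where the field names and in-memory layout are assumptions and a real registry would be an authenticated, managed service:

    # Sketch of one possible Public Key File record layout and lookup.
    # Field names and storage format are assumptions, not specified by the paper.
    from typing import Optional

    PUBLIC_KEY_FILE = {
        "records": [
            {
                "publisher_id": "example-city-council",   # identity proven at registration
                "public_key_pem": "-----BEGIN PUBLIC KEY-----\n...\n-----END PUBLIC KEY-----\n",
                "registered": "2016-01-01",
            }
        ]
    }

    def lookup_public_key(publisher_id: str) -> Optional[str]:
        """Return the PEM-encoded public key registered for a publisher, if any."""
        for record in PUBLIC_KEY_FILE["records"]:
            if record["publisher_id"] == publisher_id:
                return record["public_key_pem"]
        return None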

Key Generation and Signing Module (KENNEL). For ease of explanation the Key Generation and Signing Module will be referred to as KENNEL. To become an Open Data Publisher, an owner of Open Data first uses KENNEL to generate a cryptographic key pair. Generating the key pair locally eliminates the need to transmit the private key over network channels and exposes only the public key, which is designed to be shared. The module then submits the generated public key to the Registry, along with sufficient proof of identity, over a secure channel (Fig. 2). This information forms the basis of a record in the Public Key File (Fig. 3).
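A minimal sketch of this key-generation step, assuming the Python cryptography library and RSA keys (the paper does not mandate a particular library or algorithm), is as follows; the private key never leaves the publisher and only the serialized public key is submitted to the Registry:

    # Sketch of KENNEL's key-generation step (library and algorithm choices are
    # assumptions; the paper only requires a locally generated key pair).
    from cryptography.hazmat.primitives import serialization
    from cryptography.hazmat.primitives.asymmetric import rsa

    # Generate the key pair locally; the private key is never transmitted.
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

    # Serialize only the public key for submission to the Public Key File.
    public_key_pem = private_key.public_key().public_bytes(
        encoding=serialization.Encoding.PEM,
        format=serialization.PublicFormat.SubjectPublicKeyInfo,
    )

    # Keep the private key on the publisher's side, encrypted at rest.
    private_key_pem = private_key.private_bytes(
        encoding=serialization.Encoding.PEM,
        format=serialization.PrivateFormat.PKCS8,
        encryption_algorithm=serialization.BestAvailableEncryption(b"change-me"),
    )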

Fig. 2. Open data publisher process using KENNEL

Fig. 3. Role of the open data certificate authority in relation to the open data publisher

Fig. 4. Open data user case illustration

The Open Data Publisher is responsible for maintaining the secrecy of its private key, and uses this key in conjunction with a cryptographic one-way hashing function to generate a digital signature (Table 1).
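Continuing the key-generation sketch above under the same assumptions (reusing its private_key; RSA with SHA-256, and an illustrative detached .sig file), the signing step might look like this:

    # Sketch of generating a detached digital signature for a dataset resource
    # (algorithm, padding and file paths are illustrative assumptions).
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding

    def sign_resource(private_key, resource_bytes: bytes) -> bytes:
        """Hash the resource with SHA-256 and sign the digest with the private key."""
        return private_key.sign(
            resource_bytes,
            padding.PKCS1v15(),
            hashes.SHA256(),
        )

    with open("sampledataset/resource1.csv", "rb") as f:   # illustrative path
        signature = sign_resource(private_key, f.read())

    with open("sampledataset/resource1.csv.sig", "wb") as f:
        f.write(signature)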

Table 1. Process of Fig. 2. Open Data Publisher Process using KENNEL

Verifier Module (HOUND). For ease of reference, the Verifier Module will be referred to as HOUND. After address resolution with DNSSEC has been accomplished, a request for resources reaches the server, and the server responds with the appropriate resource, which is digitally signed. On the client side, HOUND reads the server response, extracts the signer’s identity, retrieves the matching public key from the Public Key Record, and verifies the digital signature.

A message digest is recovered by decrypting the digital signature with a valid public key and is compared with a message digest computed at the recipient’s end of the communication. If the two digests are identical, the response is considered authentic and its integrity intact; the verifier terminates and the content is passed to the Renderer to be displayed in whichever format is applicable. If the digests are not identical, the recipient is alerted that the resource may be fraudulent or may have been altered in transit.
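A sketch of this check under the same assumed library and algorithm is shown below; verify() recomputes the digest over the received content and compares it with the digest recovered from the signature, raising InvalidSignature on mismatch:

    # Sketch of HOUND's core signature check (assumptions as in earlier sketches).
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import padding

    def verify_resource(public_key_pem: bytes, content: bytes, signature: bytes) -> bool:
        """Return True if the signature over the content verifies against the key."""
        public_key = serialization.load_pem_public_key(public_key_pem)
        try:
            public_key.verify(signature, content, padding.PKCS1v15(), hashes.SHA256())
            return True
        except InvalidSignature:
            return False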

DNSSEC. The Internet Engineering Task Force developed the Domain Name System Security Extensions to counter various threats to the DNS identified in RFC 3833 [38], using public-key cryptography to establish a chain of trust: in practice, each zone has a public key that is authenticated by a corresponding record deposited in the parent zone. DNSSEC therefore does not make use of digital certificates but rather a public key hierarchical registry.

In order for DNSSEC to be incorporated, the domain host first needs to have DNSSEC enabled on the server-side and the client needs to install a DNSSEC validator which can read verification information from the server.
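One possible client-side sketch is shown below, using the dnspython library (an assumption; real deployments would typically rely on the operating system’s validating resolver): the DO bit is set on queries and the AD (Authenticated Data) flag is checked in responses from a DNSSEC-validating resolver.

    # Sketch: asking a validating resolver for DNSSEC-authenticated answers
    # (dnspython is an assumed dependency; the domain name is illustrative).
    import dns.flags
    import dns.resolver

    resolver = dns.resolver.Resolver()
    resolver.use_edns(0, dns.flags.DO, 1232)            # request DNSSEC records
    answer = resolver.resolve("data.example.org", "A")  # illustrative Open Data host

    if answer.response.flags & dns.flags.AD:
        print("resolver reports the answer as DNSSEC-authenticated")
    else:
        print("no DNSSEC authentication; fall back to HTTPS as a minimum")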

4.2 Use Case Scenario

Communication without the Proposed Architecture. A request for data to a server follows the standard request-response paradigm. The following is an example of a standard HTTP request for a resource without the use of the proposed architecture:
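A typical exchange might look as follows (the host name and resource path are illustrative, not taken from any particular Open Data portal):

    GET /datasets/sampledataset/resource1 HTTP/1.1
    Host: opendata.example.org
    Accept: application/json

    HTTP/1.1 200 OK
    Content-Type: application/json

    ...resource content...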

Address resolution is performed by the Domain Name System after the initial request is made. DNS uses root and authoritative name servers, heavily supplemented by DNS caches as a workaround to reduce DNS traffic and increase efficiency. Caching DNS request-response records reduces the load on individual servers but is vulnerable to DNS cache poisoning and other interception attacks.

After the 200 OK response, there is no further communication from the hosting server and there is no provision for integrity verification; the content is simply displayed. It is difficult to place confidence in the data because, as mentioned previously, without any security or integrity-checking mechanism in place the file remains vulnerable to accidental corruption and interception attacks.

Communication using the Proposed Architecture. The following example illustrates the use of the proposed architecture from the perspective of the end user:
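One way the augmented exchange might look is sketched below; the header names used to carry the signer’s identity and the detached signature are hypothetical, since the paper fixes the verification process but not the exact conveyance mechanism:

    GET /datasets/sampledataset/resource1 HTTP/1.1
    Host: opendata.example.org
    Accept: application/json

    HTTP/1.1 200 OK
    Content-Type: application/json
    X-OpenData-Signer: example-city-council            (hypothetical header)
    X-OpenData-Signature: <base64-encoded signature>   (hypothetical header)

    ...resource content...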

In communication with the end user, the proposed architecture does not change the request-response format of HTTP communication, but rather augments it with the previously mentioned security components (Table 2).

Table 2. Process of Fig. 4. Open Data User Case

At this stage, as before, address resolution is performed, incorporating DNSSEC to authenticate the DNS server and provide a defence against cache poisoning and man-in-the-middle attacks. As mentioned in a previous section, a DNSSEC validator installed on the client side validates the incoming content.

After address resolution has been performed successfully, communication proceeds as normal and a response from the hosting server is received. As per RFC 2616 [39], a successful request should return a 2xx or 3xx status code, depending on the HTTP method used. If an error on the client side is perceived, the server should return a 4xx status code, while an error on the server side should result in a 5xx status code being returned to the client.

In this case, the response contains a status code and the actual message, which is digitally signed and accompanied by the signer’s identity. The verifier module is then called to verify the integrity and authenticity of the received message; the message or content is displayed only if the signature passes verification. The following is pseudocode describing the functioning of the verifier module:
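A Python-flavoured sketch of this flow is given below, reusing lookup_public_key and verify_resource from the earlier sketches and treating dnssec_validated, render and alert_user as assumed hooks; the header names are the hypothetical ones used in the example above, not a prescribed interface.

    # Illustrative control flow for HOUND; helper names are assumptions that
    # stand in for the components described in Sect. 4.1.
    import base64

    def hound_process(response) -> None:
        """Verify a signed Open Data response and render it only if it checks out."""
        if not dnssec_validated(response.url):
            # Fallback: retrieve the public key from the Registry over HTTPS
            # as a minimum integrity measure (see the following paragraph).
            pass

        signer_id = response.headers.get("X-OpenData-Signer")          # hypothetical header
        signature_b64 = response.headers.get("X-OpenData-Signature")   # hypothetical header
        public_key_pem = lookup_public_key(signer_id)                   # query the Public Key File

        if (
            public_key_pem is not None
            and signature_b64 is not None
            and verify_resource(
                public_key_pem.encode(), response.content, base64.b64decode(signature_b64)
            )
        ):
            render(response.content)        # authentic and integrity intact
        else:
            alert_user("resource may be fraudulent or altered in transit")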

This procedure assumes DNSSEC is enabled. Should DNSSEC not be available, HOUND should use HTTPS as a minimum integrity measure to retrieve the appropriate public key from the Public Key File at the Registry for signature verification.

REST URI Interface with the Proposed Architecture. Each dataset may contain one or more resource files, each of which is linked to a particular digital signature file and is associated with the identity and public key of the dataset owner/creator at the Registry.

When a GET request is made for a dataset, e.g. GET /datasets/sampledataset, the server returns a listing of resource files and their URIs. From the information in that list, a GET request may then be sent for an individual resource, e.g. GET /datasets/sampledataset/resource1, which should retrieve both the resource file and the associated digital signature. This information is then used to verify the signature value using the process described in Sect. 4.2.
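A client-side sketch of this two-step retrieval is shown below, using the requests library; the host name and the use of a companion .sig URI for the detached signature are assumptions, since the paper requires only that the resource and its signature both be retrievable.

    # Sketch: list a dataset's resources, then fetch a resource and its signature.
    # URLs and the .sig convention are illustrative assumptions.
    import requests

    BASE = "https://opendata.example.org"

    listing = requests.get(f"{BASE}/datasets/sampledataset").json()   # resource files and URIs
    resource = requests.get(f"{BASE}/datasets/sampledataset/resource1")
    signature = requests.get(f"{BASE}/datasets/sampledataset/resource1.sig")

    # resource.content, signature.content and the publisher's registered public
    # key are then fed to the verification step described in Sect. 4.2.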

Table 3. URI interface

4.3 Conclusion and Future Work

This paper proposes an architecture that uses digital signatures in conjunction with an associated Public Key File, with the main goal of protecting the integrity of Open Data communicated over the Web. This is a simplification of current public key certificate structures, which rely on large revocation lists to keep certificates current and have demonstrated problems in scalability and in trust in certificate authorities.

A key element of this architecture is that it is based on DNSSEC, which is more appropriate to the new world of IPv6. The advantage of this scheme is that it is far more scalable: it is not another certificate authority hierarchy with the massive dispersion of key certificates that has of late become too widespread and unmanageable. With the use of a Public Key File, if a key is compromised it is a simple matter for the single responsible entity to replace the key pair with a new one and re-perform the data file signing process.

Responsibility for authenticating Open Data is separated from any other certificate authority that might be used by the publishing entity; under the proposed architecture, the Open Data environment does not interfere with the internal security schemes that the entity may employ. However, the architecture incorporates, when needed, parameters from the entity, e.g. the person who authorized publication as Open Data, at the time datasets are created or added.

Future work will include building a proof-of-concept system using the architecture described in this paper, benchmarking it against regular systems, and performing penetration testing. An interesting philosophical question that may be studied in further papers is: what responsibility does an entity, whether private or public, take on when it makes Open Data available? Further study of Open Data and its governance requirements is also required.