1 Introduction

The more one knows about people and their condition(s), actions, needs, preferences, beliefs and the like, the better one can provide services to them – if one is in a purely benevolent, altruistic, caring and diligent society. However, there are problems and ethical issues with such data, with the processes for capturing them, and with their use. Even then, interpretation of the data might be invalid.

As a concept, publicly-available data usually refers to data that can be found readily (such as on the Internet) and accessed (downloaded) easily and free of charge. Many of these data sets are created and distributed by public organisations. Publicly-available data includes open data (freely usable, reusable and redistributable without restrictions), data available on request, public-domain data (without copyright), copyrighted data, commercially-sold data and data with limited availability (eg: for a limited time or for only specified uses). Clearly, any data could fall into more than one of these types.

This paper draws on PhD research and a project that investigated using publicly-available mobile data to derive commuting and travel patterns. Clearly, there are ethical issues over using such data, particularly the invasion of privacy. We review and analyse the literature to explore the ethics of using publicly-available data. We do not attempt to pick a framework of normative ethics for using such data. Rather, we explore some of the issues, focusing on surveillance-type data and hence on privacy.

1.1 Characteristics of Publicly Available Data

The following are some characteristics of publicly-available data.

  • Surrogates: digital data are not the real world, but merely represent it; often, because of costs, availability, laws and so on, the data actually recorded are merely a surrogate (or proxy, or approximation) for the real-world phenomena that are meant to be measured or assessed.

  • Big data: it is easy to capture vast quantities of data, but often difficult to extract meaning from the overwhelming volume or to use it for forewarning. Even worse, some assume the sheer volume provides greater objectivity, neutrality and accuracy, but “data are always the result of conscious, subjective decisions on the part of researchers, and are the result of inherently social processes” [1]. Perceived authority or effectiveness is lost when one is overwhelmed by all the automatically-collected data, mistaking omniscience for omnipotence or intelligence: “the more you know about the secret lives of others, the less powerful you turn out to be!” [2].

  • False precision: digital geographical coordinates are often given to some arbitrary precision based on a data storage decision (eg: single vs double precision), rather than the accuracy of the recording method, giving incorrect perceptions of coordinate accuracy. For example, a point geocoded from a toponym (eg: Tshwane) could have a precision of a second (about 30 m), though the toponym encompasses many square kilometres. False precision also applies to other types of data (see the sketch after this list).

  • Quality: many factors can degrade the quality of data, but these factors are often not well understood by the end users.

  • Metadata: documenting data, their quality and their characteristics is essential for being able to use the data meaningfully, but metadata are often not well understood by the end users and are often not provided adequately.
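As a minimal sketch of false precision (the coordinate and extent below are invented for illustration), consider a toponym-based geocode stored to six decimal places:

```python
# Illustration of false precision: a coordinate stored to many decimals
# implies far more positional accuracy than a toponym-based geocode can
# deliver. The coordinate and extent below are invented for the example.

lat = -25.746111  # point geocoded from a toponym, stored to six decimals

# One degree of latitude is about 111 km, so the sixth decimal place
# implies a positional accuracy of roughly 0.11 m ...
implied_accuracy_m = 111_000 * 10**-6

# ... yet the toponym itself denotes an area tens of kilometres across.
toponym_extent_m = 60_000  # nominal extent of the named area

print(f"Implied accuracy: ~{implied_accuracy_m:.2f} m")
print(f"Extent of the named area: ~{toponym_extent_m:,} m")
# The stored precision overstates the real accuracy by several orders
# of magnitude; metadata should state the true accuracy explicitly.
```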

1.2 Potential Ethical Issues with Publicly-Available Data

Whatever the nature of the data, the following are some issues (which have ethical aspects) with the creation, distribution and use of publicly-available data.

  • Privacy: there are moral and legal concerns over the invasion of the privacy (or surveillance) of individuals, groups and organisations, and these are the main focus of this paper.

  • Bias: because of the above and subjectivity in deciding what attributes to collect and how, any data set is invariably a biased representation of the population. While this can be ameliorated through other data, local knowledge and insights, and careful statistical analysis, it is of particular concern when those using the data are blissfully unaware of the bias. Bias also occurs in training sets for models, such as in machine learning. Error is ubiquitous [3].

  • Liability: this could be for incorrect data, which then compromises someone’s rights, endangers safety and security, wastes money and other resources, and so on.

  • Right to exploit content: on the other hand, for some (such as entertainers and artists) it is important to be able to exploit their content publicly, which might be inhibited by corporations controlling the content – analogous to censorship.

  • Censorship: this can be disguised and rationalized as prudent selection: due to the limited budget of a public library, to suppress hate speech, to maintain literary excellence, to ensure balance and/or to meet the audience’s requirements [4, 5].

Invading privacy (or surveillance), censorship and liability are often used as excuses for one another. For example, content could be denied or restricted (censored) to “protect” privacy or because of “concern” over liability. Further, claims over content ownership are used to censor content or restrict use, frustrating creators: Toya Delazy released her album online as her record label was limiting stock availability [6].

On the other hand, privacy could be compromised over “concerns” over liability, such as when a company monitors staff emails. The issues are not well understood either, such as when “poor” data (eg: low-resolution remotely-sensed imagery) are considered to be censored data because they cover, in inadequate detail, an area of interest to a conspiracy theorist or the like. However, “privacy and security do not have to contradict each other; indeed, secure online interactions, enabled by a secure online identity, is a precondition for full internet freedom” [7].

These issues also apply to private or restricted data, as inappropriate surveillance or data exploitation can be done within limited or closed groups.

2 Ethics

Ethics concerns the nature of ultimate value and the standards by which human actions can be judged right or wrong. Ethical judgement is influenced by the values of a person or group: their convictions of what is good or desirable [8]. Values are determined by different factors, including culture, religion, social and economic status, personal experiences, age, gender and profession. Data are often shared globally, and the ethics of using such data is subject to significantly diverse value systems. Normative ethics aims at establishing the norms or standards for appropriate conduct, and applied ethics is how these are used to deal with practical moral problems. There are three major approaches in normative ethics, which, in practice, are often mixed:

  1. Virtue ethics emphasises virtues or moral character as a way of assessing or justifying each and every action or non-action;

  2. Consequentialism emphasises the consequences of actions, which can be interpreted as the end justifying the means; and

  3. Deontological ethics emphasises duties or rules, which can be reduced to a check list of what to do in different situations [8, 9].

Deontological ethics is perhaps the easiest to adhere to in practice, because in each situation, one can look up the appropriate thing to do. Essentially, legislation is a form of deontological ethics. However, problems with deontological ethics are:

  • Someone can use them without having any moral understanding of exactly what they are doing (or not doing) and the implications thereof;

  • If there is no obviously applicable rule in a particular situation, the person has no meta-framework or set of values to use to decide on the best course of action;

  • Reciprocity can be difficult as the values or rules one person uses for determining how to behave towards another might be incompatible with those used by the second person, causing conflicting understanding of the actions and reactions; and

  • Without a meta-framework, people will tend towards the softest option and/or try to push the boundaries of what is acceptable [8, 9].

Virtue ethics focuses on moral character and the need to educate and develop such a character. Considering what a ‘virtuous person’ would do can guide ethical decision-making [8]. There are various forms of consequentialism, such as utilitarianism (good conduct has consequences that achieve the greatest good for the greatest number of people) and situational ethics (which considers the context in which conduct takes place and the consequences within that context).

Artificial intelligence and other sophisticated tools can be used to identify ethical and unethical behaviour, such as on social media, and assess the veracity of news stories and images [10]. Such tools can also be used unethically and to create and disseminate fake news. One needs to consider how these tools function and their outputs, to embed robust ethical analysis and decision making in the tools [11].

When conducting research that collects private data, one obtains informed consent to invade someone’s privacy and to publish the research results, as part of a research ethics process. What constitutes informed consent is in itself an interesting problem in ethics, due to language, literacy, education, coercion, rewards, etc.

A problem with informed consent is that the research subject needs to remember what they have agreed to and when. Unfortunately, this is not always the case, as we found in a project tracking participants to and from an event [9, 12]. If users have to opt in to the tracking, there is likely to be a high loss rate; if they have to opt out, they might forget to stop the tracking. Such issues of informed consent apply to private data obtained for government, commercial and other purposes, which are often obtained without a formal ethical review and might have the informed part buried in fine print and the consent part implicit rather than explicit.
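One mitigation is to make consent lapse automatically. The following is a minimal, hypothetical sketch (the class, field names and values are invented): consent is recorded with an explicit expiry, so tracking stops by default rather than relying on the participant remembering to opt out.

```python
# Hypothetical consent record with an explicit expiry, so tracking stops
# by default instead of relying on the participant's memory.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Consent:
    participant_id: str
    granted_at: datetime
    duration: timedelta          # consent lapses automatically after this
    purpose: str                 # what the participant actually agreed to

    def active(self, now: datetime) -> bool:
        return self.granted_at <= now < self.granted_at + self.duration

consent = Consent("P-017", datetime(2015, 3, 1, 8, 0),
                  timedelta(hours=12), "tracking to and from the event")

# The tracking loop checks consent before recording each location fix.
if not consent.active(datetime(2015, 3, 1, 21, 0)):
    print("Consent lapsed; stop tracking and discard new fixes")
```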

3 Privacy and Protecting Privacy

“The right to life has come to mean the right to enjoy life, the right to be let alone; the right to liberty secures the exercise of extensive civil privileges” [13]. The authors’ primary concern was over making private details public: “each crop of unseemly gossip, thus harvested, becomes the seed of more, and, in direct proportion to its circulation, results in the lowering of social standards and of morality” [13]. Further, “it is also immaterial that the intrusion was in aid of law enforcement” [14].

Perhaps the antithesis of data democratization and freedom of information is making too much available, compromising the privacy of individuals especially, but also of organisations. Privacy is complex to define, being perceived differently by different cultures and treated differently in legislation. Privacy is perceived as being about protecting people’s personal information, but it also includes territorial (or location) privacy, physical (or bodily or health) privacy and privacy of communications. Privacy is not the same as confidentiality or secrecy, though they can overlap [15].

Many sacrifice their privacy voluntarily, especially when using social media, but they could be doing so through ignorance, deception, coercion or peer-pressure. Unfortunately, social media sites are notorious for changing privacy settings (sometimes through “errors”) and/or for making them complex. Even when personal data are secured in a private area, they could still be exposed through changes in legislation, decisions by courts (eg: search warrants) and company buy-outs.

Many governments have introduced legislation to protect privacy to varying extents. Perhaps the best known and most significant because of its wide applicability is the European General Data Protection Regulation (GDPR), which came into effect on 25 May 2018 [16]. The principles of the GDPR are lawfulness, fairness and transparency; purpose limitation; data minimisation; accuracy; storage limitation; integrity and confidentiality; and accountability. The South African equivalent to the GDPR is the Protection of Personal Information Act (POPI) [17].

4 Invasion of Privacy

“Privacy is mostly an illusion. A useful illusion, no question about it, one that allows us to live without being paralyzed by self-consciousness. The illusion of privacy gives us room to be fully human, sharing intimacies and risking mistake” [18].

4.1 Covert Surveillance

Covert surveillance is possibly what many consider surveillance to be: monitoring behaviour and communications surreptitiously, for detecting, investigating and monitoring threats (criminal, terrorist, social unrest, etc), influencing and controlling society, and, hopefully, protecting citizens. For example, “brain fingerprinting” is claimed to detect the presence or absence of information in someone’s brain, using electroencephalography (EEG) [19], though there are concerns over the studies [20].

4.2 Trans-Jurisdiction Surveillance

One feature of the designed-in robustness of a packet-switching network such as the Internet is that one cannot guarantee the routing of individual data packets. Even with a high-speed, high-bandwidth Internet connection directly between two countries, parts of the connection might be routed through other countries – which might capture and/or study the data traffic en route [21]. Such trans-jurisdiction surveillance might be accidental, though those doing the surveillance should realise it happens. For example, Internet traffic to and from the United Nations in New York is presumably routed through the USA and hence likely to be recorded by the NSA. It appears that Internet traffic can be misdirected deliberately and surreptitiously, particularly across national boundaries, to inspect and/or modify the transmitted data [22].

Another example concerns virtual private networks (VPNs). They are used to ensure that anyone intercepting the (usually encrypted) traffic cannot read what is being transmitted (or perhaps even determine the source and destination), but the traffic is routed through the VPN provider’s servers, which lends itself to surveillance by the server owners.

Another form of trans-jurisdiction surveillance is remote sensing, with an early use of LANDSAT satellites being to monitor wheat crops in the Soviet Union [23].

4.3 Overt Surveillance

Not all surveillance is covert, with overt forms including those visible and well identified (such as CCTV surveillance cameras in public, or disclaimers of a call being recorded) or to which one agrees explicitly (such as the small print for using a web site). However, in some jurisdictions, such supposed agreements might be unenforceable, being excessively long or changed arbitrarily and without notice [24].

Further, it is easy to forget one’s actions are being observed, even when giving explicit consent [9, 12]. Clearly, this leads to complacency and the risk of becoming accustomed to the surveillance society. It is also easier to accept surveillance when under the influence of someone one trusts, such as parents recommending their children enable mobile phone location disclosure services [25].

4.4 Overloaded Surveillance

Apparently, the American NSA “intercepts and stores nearly two billion separate e-mails, phone calls, and other communications every day”, making the system too complex to determine if it actually works [26]. Rather than wisdom, the sheer volume creates information entropy – so information becomes noise as it “is routinely distorted, buried in noise, or otherwise impossible to interpret” [26].

Consequently, such agencies probably create their own filter bubbles, due to, not in spite of, the sheer volumes they harvest. Much of the content (facts, opinions, allegations, imagery, comments, conversations, etc) will be contradictory, so the selection, rating and analysis will be biased by preconceived notions and the desire to “simply want to believe something that feels right” [26]. It is easy to be so enamoured with sophisticated and expensive technology that the basics get forgotten, with tragic consequences, such as the Boston Marathon bombing [27] and Navy Yard shootings [28].

Being able to conduct surveillance over the Internet, or use it to interfere with the rights of others, or conduct information warfare over the Internet are all quite different from being able to control the Internet! The genie is out of the bottle and cannot be put back. The Internet was designed to be robust (distributed, with data sent in small packets) and self-healing if any node broke [29]. As the Internet pioneer John Gilmore put it, “the Net interprets censorship as damage and routes around it” [30].

4.5 Becoming Accustomed to the Surveillance Society

It is easy to forget one is being observed. This can result in acting carelessly whilst being observed and/or accepting the lack of privacy by becoming used to it, or even by expecting it. Americans have been accustomed to limits on their privacy for many years [18], realizing Bentham’s idea of the Panopticon [31]. The Panopticon is a circular building with an inspection house in the middle, from which a custodian could secretly observe the inmates (around the perimeter), who could not communicate with anyone. Foucault [32] invoked the Panopticon concept as a metaphor for the tendency of modern “disciplinary” societies to observe and attempt to “normalise” their citizens. “The panopticon induces a sense of permanent visibility that ensures the functioning of power” [34]. Unsurprisingly, this can lead to limited, or even curtailed, political and personal freedoms, and the loss of self-reliance [35]. Dobson and Fisher [36] took Foucault’s metaphor further, identifying three “post-panoptic” models:

  1. Bentham’s original concept, which they consider to be the one Foucault used;

  2. Panopticism II, in the form of the “Big Brother” type of surveillance of [37]; and

  3. Panopticism III, technology tracking humans and their activities, such as cell-phone tracking [9, 12, 38], GNSS receivers, RFID and geo-fences. Crucially, the technology for Panopticism III is relatively cheap, effective and widely available to anyone, and not just to well-resourced national security agencies.

The 1844 British postal espionage crisis concerned the Post Office opening letters at the behest of a foreign power. As the Law Magazine observed, “the post-office must not only be CHEAP AND RAPID, but SECURE AND INVIOLABLE” [39]. However, even though widely known and causing a ‘paroxysm of national anger’, it did not affect the popularity of the Penny Post, which increased rapidly thereafter [39]. “Snowden’s revelations will have demonstrated that in practice, the web-surfing, texting and emailing public are indifferent to the risks they run to their privacy” [39]. Similarly, Lanier [40] was concerned that 2013 would be the year of digital passivity, when the cool gadgets (such as tablets running only applications approved by a central commercial authority) made us accept the commercial and government surveillance economy. Carr [35] fears privacy could be perceived as an outdated and unimportant concept inhibiting efficient transactions, such as socializing or shopping.

4.6 Mutual Surveillance

The psychological and social effects of prevalent surveillance result in people being so intimidated by authority and/or so used to surveillance that they police themselves, and they can be forced or encouraged to spy on one another, extending easily, cheaply and significantly the surveillance reach of the authority, be it a government, the military, a corporation or any other type of organisation [32, 41].

4.7 Making Data Already in the Public Domain More Visible

A common claim is that it is fine to put data online that are already in the public domain but otherwise difficult to access, such as documents and photographs in archives. However, that facilitates data matching. Such online content can also be accessed readily by anyone without revealing their interests, for example, using Google Street View to examine a neighbourhood, be it to find security weaknesses for targeting burglaries, to stalk a resident, or out of mere curiosity. Similarly, much personal data are published, often unwittingly, in online genealogies.

This could apply to archives themselves, though they have established procedures (file plans) for what can be archived, how, where, when, why and by whom. Archiving is complicated by legal issues such as copyright and technical issues such as accessing the deep Web, volatile communities, broken links and dynamic content [42].

Some assume naïvely that content made publicly available on the Web can be expunged permanently at a whim. The European Court of Justice decided that anyone has “the right to be forgotten” and can require search engines to remove pages from search results for specified terms [43], going against the advice of its own Advocate General [44]. This has obviously been used by the unscrupulous to hide their activities. Such pages are not deleted; they are just removed from searches.

As a result, legitimate reporting by respectable organisations such as the BBC gets proscribed, contravening the public interest [45, 46]. Essentially, this defames the article’s author by declaring their work illegitimate. The search engine’s operator has to decide what is a valid removal request, but that is inappropriate [47, 48].

Some applications use ephemeral data to (hopefully) protect privacy, that is, content deleted permanently after a specified time. Examples are SnapChat for photographs and Silent Circle for two-way transmissions of voice, email, video, etc. However, there is doubt that ephemerality can be enforced securely [49].

Web scraping or harvesting takes content from Web sites. Collecting can be targeted and pre-arranged, such as harvesting metadata and data from members of a collaborative system, for instance, data providers in a spatial data infrastructure (SDI). Collecting can use well-behaved bots (as search engines do for indexing the Web) or simulated human access. This raises issues of copyright, such as the “Google Defense” case concerning thumbnails of images [50].
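As a minimal sketch of what “well-behaved” can mean in practice (the URLs and bot name are placeholders, and throttling and error handling are omitted), a bot can honour the site’s robots.txt exclusions and identify itself before fetching anything:

```python
# Minimal sketch of a well-behaved bot: honour robots.txt and identify
# itself via the User-Agent header. URLs and bot name are placeholders.
from urllib import robotparser, request

BOT_NAME = "example-research-bot"  # hypothetical user agent
rp = robotparser.RobotFileParser("https://example.org/robots.txt")
rp.read()  # fetch and parse the site's crawling rules

url = "https://example.org/some/page"
if rp.can_fetch(BOT_NAME, url):
    req = request.Request(url, headers={"User-Agent": BOT_NAME})
    with request.urlopen(req) as resp:
        html = resp.read()
    # ... process html, and pause (eg: time.sleep) between fetches
else:
    print(f"robots.txt disallows fetching {url}; skipping")
```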

A search engine obviously does some form of Web scraping to locate the content first, before being able to provide the rapid search responses users expect. To return results as quickly as they do, search engines are not always accurate (particularly the results count) and there is much of the Web they cannot access [51].

4.8 Combining and Processing Available Data

It requires much skill, intelligence and persistence to link together analogue data from diverse sources to find common threads, as good detectives do [52, 53]. Now, it is far easier to combine data from different sources using pattern recognition, artificial intelligence or other sophisticated tools (data matching, behavioural tracking, text analysis, data mining, linkage analysis, statistical analysis, spatial analysis and machine translation), exploiting fast hardware and huge, persistent digital databases.

Most ‘big data’ analysis is not done to invade privacy, but to examine questions otherwise unexplorable, to understand human, physical and environmental behaviours in different contexts, and (hopefully) benefit society [54]. Unfortunately, an individual can be identified uniquely with very few data points, even coarse ones, such as with cellular telephone use [55], power consumption of a mobile device [56] or renting public bicycles [57]. Personal traits can be gleaned from the digital footprints people leave on social media, which some exploit for trust and resilience modelling [58]. There are also services available for a fee to track a mobile telephone [59]. Hence, “there is no such thing as anonymous online tracking” [60].
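A toy simulation (randomly generated data; not the method or data of the cited studies) illustrates why so few points suffice: when traces are spread over many possible (place, hour) cells, two or three known points usually match exactly one person.

```python
# Toy illustration of unicity: a handful of observed (place, hour) points
# usually matches exactly one person. Randomly generated data; not the
# method or data of the cited studies.
import random

random.seed(0)
PLACES, HOURS, PEOPLE, TRACE_LEN = 100, 24, 1_000, 20

# Each person's "trace" is a set of (place, hour) observations.
traces = [{(random.randrange(PLACES), random.randrange(HOURS))
           for _ in range(TRACE_LEN)} for _ in range(PEOPLE)]

for k in (1, 2, 3):
    unique = 0
    for trace in traces:
        known = set(random.sample(sorted(trace), k))  # k leaked points
        matches = sum(1 for other in traces if known <= other)
        unique += (matches == 1)  # only this person fits the leaked points
    print(f"{k} known point(s): {unique / PEOPLE:.0%} identified uniquely")
```

With these (arbitrary) parameters, a single leaked point rarely isolates anyone, while three nearly always do.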

4.9 Opting in vs Opting Out

To varying extents in different jurisdictions, one has limited control over how much of one’s personal information is known, retained by others and/or shared. Sharing one’s information (opting in) can provide access to services, opportunities or prizes, such as loyalty programmes (sharing personal and behavioural data for discounts or loyalty points), subscriptions to paid content, exposing one’s resumé to potential (and hopefully desirable) employers, security services such as vehicle tracking, research collaboration or even friendships. Further, for some, the right of publicity [61] is key for their profession and income, through exploiting their names, photographs, likenesses, recordings and the like – but only if they have consented and are remunerated appropriately. In many jurisdictions, one can nominally opt out of divulging one’s private information, but even that explicit declaration gets ignored [62].

User-generated geographical data are known as volunteered geographical information (VGI). Some object to the term because data so collected might not be volunteered, but rather contributed, collected or harvested, irrespective of whether the subject opted in, opted out, was even aware they were contributing their personal details, or had forgotten they were doing so. Harvey [62] suggests differentiating between volunteered (VGI) and contributed (CGI) geographical (or locational) information. Further, truth in labelling in the metadata, following pragmatic ethics, would explain the provenance of the information, allowing assessment of its fitness for use and of whether the quality of the data has been compromised by lax standards or even malfeasance [62].
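As a hypothetical sketch of such truth in labelling (all field names and values are invented), a provenance record might state explicitly how the information was obtained, under what consent, and with which known biases, so that a user can judge its fitness for use:

```python
# Hypothetical provenance record for "truth in labelling" of geographical
# information; all field names and values are invented for the example.
provenance = {
    "source": "mobile application X",     # where the data came from
    "collection_mode": "contributed",     # volunteered | contributed | harvested
    "consent": "opt-out, buried in terms of service",
    "subject_aware": False,               # was the subject aware of collection?
    "positional_accuracy_m": 50,          # claimed, not verified
    "known_biases": ["smartphone owners only", "urban areas over-represented"],
}

# A user of the data can then judge fitness for use, eg: reject data whose
# subjects never knowingly opted in.
usable = provenance["collection_mode"] == "volunteered" or provenance["subject_aware"]
print("Fit for this use?", usable)
```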

4.10 Assuming One Has Nothing to Hide

For anyone who lived through Apartheid (or communism, fascism, etc), it should be obvious that everyone has something to hide from a repressive government. Even in a reasonably open and stable democracy such as the USA, an innocent person has the right to remain silent [63], and keep their matters private. “The skeptics no doubt have noticed that governments are made up of people and that people are prone to misuse information when driven by greed or curiosity or a will to power” [18].

Examples of ripostes to those justifying surveillance are: show me your credit card details; show me yours first; none of your business; and those with nothing to hide don’t have a life [64]. The person wanting to protect their privacy does not have to justify their position: the person wanting to invade someone’s privacy needs to justify it first [64]. The metadata of one’s communications can also reveal personality traits, religion, politics, habits, movements, condition, relationship issues, etc [65]. It is not only about keeping ‘facts’ about oneself private, but also about the assumptions made about us from the available data [66]. Further, there is the problem of identity theft.

4.11 Legal Complexities

Human beings need space where they are guaranteed to be free from surveillance or interference by anyone, such as to establish and preserve intimate human relationships and develop intellectual faculties through reading, private conversation or writing privately [67]. It is very difficult to grow intellectually if one cannot experiment with ideas without fear of surveillance and resulting misinterpretation. “Experience should teach us to be most on our guard to protect liberty when the government’s purposes are beneficent. Men born to freedom are naturally alert to repel invasion of their liberty by evil-minded rulers. The greatest dangers to liberty lurk in insidious encroachment by men of zeal, well-meaning but without understanding” [14].

5 Conclusions and Discussion

This paper presents a review and analysis of the literature on the ethics of using publicly-available data, particularly concerning privacy. It presents the characteristics of publicly-available data and explores potential ethical issues, such as surveillance, becoming accustomed to the surveillance society, increasing access to data, combining and processing data, and assuming one has nothing to hide. There is clearly much research that still needs to be done on these issues, particularly given the different perspectives on values and ethics due to culture, religion, politics, experiences, age, gender, social status and so on.

This research comes out of the CSIR’s Mobile Data Platform for Urban Mobility (MDP) work package of the Spatial Urban Dynamics 2014/2015 project and the PhD research [68] of the first author. We would like to thank Quintin van Heerden, Peter Schmitz and Derrick Kourie for their contributions to developing this research.