Keywords: Machine learning; Privacy; Security; Automated systems; Social media
Machine learning (ML) describes a wide array of algorithms that analyze data and enable a computer to make predictions. Differing from traditional statistical analysis, which makes various assumptions about the data, algorithms identified as machine learning typically let the data do the talking. ML at this point has come to mean just about any automated system that learns from data, or that was created by learning from a data set (a so-called training set) and is then used to make predictions based on data not seen previously. Unlike a static program that takes data in and outputs an answer, a program using ML techniques takes data as input and, based on that data, can modify itself to become more effective at making predictions and achieving its objective. Increasingly, ML techniques are used in some part of the automated systems with which users interact; a natural language interface, for example, relies on ML techniques. Many devices already utilize ML algorithms and operate independently, self-driving vehicles being one example.
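The distinction above, between a static program and one that adjusts itself from a training set, can be sketched in a few lines. The data and the linear model below are hypothetical toy examples, chosen only to illustrate the learn-then-predict pattern.

```python
# A minimal sketch contrasting a static rule with a program that adjusts
# itself from training data: here, fitting the slope w of the model
# y = w * x by least squares, then predicting on an input not seen before.

def fit_slope(xs, ys):
    """Learn the slope w minimizing squared error for y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Hypothetical training set: noisy observations of roughly y = 2x.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [2.1, 3.9, 6.2, 7.8]

w = fit_slope(train_x, train_y)   # the program "modifies itself" from data
prediction = w * 10.0             # a prediction on data not seen previously
```

A static program would hard-code the slope; the learned version recovers it from whatever data it is given.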
Users are exposed to ML algorithms every day, for example, when they access any web site that makes recommendations or when they receive targeted ads. Indeed, many of the algorithms today classified as ML algorithms, including search engines, have been around for quite some time. More recent applications include facial recognition, medical diagnosis, fraud detection, and spam detection. ML plays a key role in robotics, for example, by allowing computers to identify objects. Classification, based on labeled training data (supervised learning) or without prior labels (unsupervised learning), is an important subfield of ML. Such algorithms often play a prominent role in determining who is a credit risk, who gets paroled, or who gets the job.
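Supervised classification of the kind described above can be illustrated with a 1-nearest-neighbor rule, one of the simplest classifiers: a new point receives the label of the closest labeled example. The "credit risk" features and labels below are hypothetical toy data, not a real scoring model.

```python
# A minimal sketch of supervised classification: a 1-nearest-neighbor
# rule labels a new point with the label of the closest labeled example.
# The "credit risk" feature vectors and labels are hypothetical.

def nearest_neighbor(point, examples):
    """Return the label of the labeled example closest to point."""
    closest = min(examples,
                  key=lambda e: sum((a - b) ** 2 for a, b in zip(point, e[0])))
    return closest[1]

labeled = [((0.1, 0.2), "low risk"), ((0.9, 0.8), "high risk"),
           ((0.2, 0.1), "low risk"), ((0.8, 0.9), "high risk")]

label = nearest_neighbor((0.15, 0.15), labeled)  # classify an unseen applicant
```

Unsupervised methods such as clustering work the same way but without the labels, grouping points purely by similarity.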
Several advances in computing have allowed the ML community to flourish in recent years. Certainly, increased computational power and storage, especially in small devices, have played a key role. The ability to design an algorithm that learns from data while keeping the code largely fixed in size has been very important as well. Expert systems had been around for years; however, these systems, which tried to encode expert knowledge explicitly, required code that grew ever larger, more complex, and unwieldy.
Many experts in the field point out that ML techniques are often heuristic. In many problem areas, designers know how methods perform on test data but cannot guarantee they will always perform as expected on data not seen previously. Despite impressive successes, many in the field still describe ML as part science and part alchemy.
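The gap between test performance and guaranteed behavior can be made concrete. In the hypothetical sketch below, a spam filter with a single hand-picked threshold scores perfectly on a small held-out test set, yet nothing prevents an unseen input near the threshold from being misclassified; the measured accuracy is an estimate, not a guarantee.

```python
# A minimal sketch of why ML evaluation is heuristic: accuracy on a
# held-out test set estimates, but does not guarantee, future behavior.
# The threshold rule and the toy test set are hypothetical.

def classify(score, threshold=0.5):
    """Label an input by a learned spam score (hypothetical)."""
    return "spam" if score > threshold else "ham"

test_set = [(0.9, "spam"), (0.2, "ham"), (0.7, "spam"), (0.4, "ham")]
correct = sum(classify(score) == label for score, label in test_set)
accuracy = correct / len(test_set)  # perfect here...
# ...yet an unseen input scoring 0.51 could still be mislabeled.
```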
There are two key reasons for the widespread rise in interest in ML. First, it has enabled the development of smart devices that can respond to a variety of situations, even situations that have never occurred before. Second, in almost any field, too much data is being created too rapidly for humans to process it without some type of intelligent digital assistance. In the medical field, without such assistance, it is impossible for medical researchers to be aware of all the publications, clinical trials, and drug interactions that might impact their work.
As ML-based systems proliferate and reliance upon them increases, these systems will present significant privacy and security challenges. The remainder of this entry describes some of the privacy and security concerns surrounding ML-based systems. In many instances, ML-based systems also will provide the solutions.
Although privacy in the Internet age has long been an issue, no one has determined how to reconcile privacy risks with the current business models that make Internet services available. Simply providing more user control of privacy settings is not working. A recent review of the privacy literature concludes “users are often unaware of the information they are sharing, unaware of how it can be used, and even in rare situations when they have full knowledge of the consequences of sharing, uncertain of their own preferences” (Acquisti et al. 2015, p. 513). The research examined also shows user privacy decisions are affected by context; for example, users are more likely to release sensitive information to well-known sites such as Facebook than to lesser-known sites. In addition, users’ privacy attitudes are easily swayed by small rewards. Users are quick to release data to gain a clear short-term benefit, such as the use of an Internet-based service, but prone to ignore the longer-term, less clearly defined risks of providing data or even just using the service.
On the other hand, when it comes to data collection, social media sites such as Facebook have a clearly defined purpose: collect as much data as possible and make every effort to monetize it. ML provides numerous opportunities to combine data from many sources into detailed profiles of people, which are then made available for sale. Not only data but full psychological profiles are now possible. Even the detection and analysis of emotions is being attempted with ML techniques (Gayo-Avello 2018). Increasingly, users cannot know the consequences of releasing data.
In his highly popular book, The Master Algorithm, Pedro Domingos describes a model for the future where users interact with the digital world through a “digital half,” essentially an ML-based program that presents users to the digital world and presents the digital world to them (Domingos 2015, p. 269). In many ways, this is already occurring, although not to the level anticipated in the book. The key question is who will control this technology and how it will be used. Right now, the motivation for providing such a service is advertising revenue. This front end, which people will rely on to access the digital world, will make decisions for them and know them intimately. It would likely incorporate current research that allows an ML algorithm to gauge the mood of the human using the system. One concern is that users may not even be able to participate effectively in twenty-first-century economic and social activities without such a front end or digital half.
Although privacy and security are often portrayed as conflicting goals, without privacy there is no security. A fundamental tenet of security is the “need to know.” It is usually applied to classified information but is widely used by organizations and individuals to limit the damage done by the release of information, particularly by insiders. Aleksandr Kogan is the Cambridge University researcher hired by Cambridge Analytica to harvest data from tens of millions of Facebook profiles, data which was later used in the 2016 election without user consent. In an interview on April 23, 2018, Kogan, who had a long-standing relationship with Facebook, commented that a Facebook app could transfer and sell user data for up to a year and a half without any notification from Facebook. He went on to say social media platforms cannot be relied upon to police their sites, even if developers violate their policies. The basic business models of Internet companies create strong incentives to share data with developers and advertisers, not to limit who has access to data or to police privacy policies. Certainly, current business models, where users give away rights to their data and how it is used, would be intolerable in a world where a digital half is needed to access the digital world and interact with it on a user’s behalf.
As would be expected, hackers are also taking advantage of ML. An important Internet security challenge is the social botnet (Ford 2016). Social bots are computers running algorithms that automatically produce content and are able to interact with humans or computers and imitate human behavior, the classic idea of what artificial intelligence means. Such algorithms frequently use a range of machine learning techniques to interpret natural language, respond to queries, and infiltrate other systems. By commandeering a large collection of social bots, attackers can have a significant presence in social media, which they can then use for a variety of possibly malicious or at least disingenuous purposes (Ferrara et al. 2016). Such bots may be used to artificially inflate support for a political candidate or to denigrate a candidate. Misinformation gains credibility by the mere fact that the collection of social bots appears to be individuals expressing an opinion or shared belief. Such bots have also played a role in click fraud and classic pump-and-dump schemes to inflate the value of stocks. ML algorithms make it much harder to determine whether users are bots or humans.
On the other hand, ML techniques are a key technology used to detect social bots. A bot that can change its behavior to adapt to its environment is hard to detect with standard algorithms. ML algorithms focus on general behavioral patterns to detect these bots. Rather than have an expert system try to predict all types of behavior, these programs are able to learn the behavior of a wide range of bots and then determine which account is a bot and which is a legitimate user (Davis et al. 2016). Stanford University now offers a course in ML-based bot detection. On the other hand, at Black Hat conferences, speakers discuss techniques on how to avoid ML-based bot detection. The age where security depends on an ML arms race may have arrived.
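The behavioral-pattern approach described above can be sketched as a score over a few account features. The features, weights, and values below are hypothetical illustrations, not the actual model used by any real detector; a deployed system would learn its weights from labeled accounts and use far richer features.

```python
# A minimal sketch of behavior-based bot scoring: combine a few
# behavioral features into a logistic score between 0 and 1.
# Features, weights, and the example accounts are all hypothetical.

import math

def bot_probability(posts_per_hour, duplicate_ratio, follower_ratio):
    """Hand-weighted logistic score over behavioral features (illustrative)."""
    z = (0.4 * posts_per_hour      # bots tend to post at high, steady rates
         + 3.0 * duplicate_ratio   # bots often repeat the same content
         - 2.0 * follower_ratio    # humans tend to have organic followings
         - 2.0)                    # bias term
    return 1.0 / (1.0 + math.exp(-z))

likely_bot = bot_probability(posts_per_hour=12, duplicate_ratio=0.9,
                             follower_ratio=0.1)
likely_human = bot_probability(posts_per_hour=0.5, duplicate_ratio=0.05,
                               follower_ratio=1.2)
```

The arms-race aspect is visible even here: a bot that slows its posting rate and varies its content lowers its score, which is exactly the evasion behavior discussed at security conferences.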
Increasingly, Americans are getting news from social media sites. A recent Pew Research Center report indicates that about 67% of US adults get some news from social media sites, up from about 62% in 2016 (Pew Research Center 2017). The report also states that Facebook is by far the largest social media news source with over 45% of US adults getting news from this site.
Social media sites rely heavily on machine learning algorithms to determine user preferences and interests. The sites then provide news feeds and other news items that reflect those preferences. Many people claim this results in an increasing level of polarization and isolation of users, as they predominantly receive news items that fit their profile and current notions (Lynch 2016). Those trying to influence political and social opinions have a receptive audience for stories that support or harden a political position or social view. Some authors refer to this as a “filter bubble,” since users are shielded from information they do not like even if it is relevant to the discussion (Kelly 2016, p. 170).
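The filter-bubble mechanism can be illustrated with the simplest form of content-based ranking: score each story by its similarity to a user's interest profile and show the closest matches first. The two-dimensional interest vectors below are hypothetical; real feed-ranking systems use far higher-dimensional learned representations, but the feedback loop is the same.

```python
# A minimal sketch of preference-driven feed ranking that produces a
# "filter bubble": stories similar to the user's profile rise to the
# top, so opposing viewpoints rarely surface. All vectors are hypothetical.

import math

def cosine(u, v):
    """Cosine similarity between two interest vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

user_profile = [0.9, 0.1]  # heavily weighted toward viewpoint A
stories = {"story_pro_A": [1.0, 0.0], "story_pro_B": [0.0, 1.0]}

# Rank the feed by similarity to the user's profile, best match first.
feed = sorted(stories, key=lambda s: cosine(user_profile, stories[s]),
              reverse=True)
```

Because clicks on the top-ranked stories feed back into the profile, the weighting toward viewpoint A tends to strengthen over time.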
Despite widespread belief in the value of disintermediation (i.e., no editors and anyone can post anything), social media sites now find themselves in the unenviable position of having to monitor content to filter out, among other things, violent extremist views, hate speech, pornographic material, and fake news. Again, machine learning algorithms are the way to examine the massive amount of content that is posted daily on social media sites. Blocking and filtering, however, reflect the biases of those who choose the data to train the ML algorithms used. According to a New York Times report, YouTube removed over 8 million videos in the first quarter of 2018, 80% of which were removed with no human intervention (Wakabayashi 2018). Using ML techniques, social media sites, which for most Internet users provide a window to the information on the Web, are becoming content censors for a large portion of the world’s population.
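Both halves of the point above, automated filtering at scale and the dependence of that filtering on human-chosen training labels, show up in even the crudest trained filter. The sketch below is a hypothetical word-count filter, far simpler than any production moderation model; its entire notion of what to block comes from which examples the trainers labeled "remove."

```python
# A minimal sketch of trained content filtering, and of how the labels
# chosen by humans determine what gets blocked. Training examples and
# the threshold are hypothetical.

from collections import Counter

def train(labeled_posts):
    """Count how often each word appears in posts labeled 'remove'."""
    flagged = Counter()
    for text, label in labeled_posts:
        if label == "remove":
            flagged.update(text.lower().split())
    return flagged

def should_remove(text, flagged, threshold=1):
    """Remove a post if it contains enough words seen in removed posts."""
    hits = sum(flagged[word] > 0 for word in text.lower().split())
    return hits >= threshold

training = [("buy cheap pills now", "remove"),
            ("great game last night", "keep")]
model = train(training)
decision = should_remove("cheap pills for sale", model)
```

Change the labels in the training set and the same code blocks entirely different content, which is the bias concern in miniature.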
In the foreseeable future, human lives will be dominated by automated systems that use a variety of ML algorithms. Users most likely will need some type of digital half to access and interact with the digital world. Users, however, cannot know the risks to privacy and security in the current state of the digital world. What happens in an ML-driven world where users interact with algorithms that can practically imitate human behavior? ML offers incredible capabilities to take advantage of data, improve services, and gain knowledge. For Internet users, the benefits and risks will depend largely upon who controls these ML-based systems and how they are used.
- Domingos, P. (2015). The master algorithm. New York: Basic Books.
- Ford, M. (2016). Rise of the robots. New York: Basic Books.
- Gayo-Avello, D. (2018, April 1). Social media won’t free us. Computing Edge, pp. 42–51.
- Kelly, K. (2016). The inevitable: Understanding the 12 technological forces that will shape our future. New York: Penguin Books.
- Lynch, M. P. (2016). The internet of us: Knowing more and understanding less in the age of big data. New York: Liveright Publishing Corporation.
- Pew Research Center. (2017). News use across social media platforms. Retrieved 10 March 2018, from http://www.journalism.org/2017/09/07/news-use-across-social-media-platforms-2017/
- Wakabayashi, D. (2018, April 24). YouTube says computers are catching video pitfalls. The New York Times, p. B3.
- Hof, R. D. (2018). Deep learning: With massive amounts of computational power, machines can now recognize objects in real time. MIT Technology Review. Retrieved 5 April 2018, from https://www.technologyreview.com/s/513696/deep-learning/
- Kleinberg, J., Ludwig, J., & Mullainathan, S. (2016, December 8). A guide to solving social problems with machine learning. Harvard Business Review. Retrieved 1 March 2018, from https://hbr.org/2016/12/a-guide-to-solving-social-problems-with-machine-learning