1 Introduction

According to a popular, yet highly controversial saying, data – especially Big Data – is the new oil. While the analogy has its merits in terms of the value potential of Big Data for contemporary businesses, a deeper scrutiny reveals certain discrepancies that can be used to explore the value chain and challenges of Big Data. First, unlike oil, Big Data streams from seemingly unlimited sources and is, to quite an extent, continuously renewable. Second, unlike raw oil, raw data has no consistent constitution that would always yield value when refined [1] – a big portion of the raw data is simply useless. Third, raw data has not emerged as a result of evolutionary processes guided by the immutable laws of nature, but is a creation of intentional and unintentional human agency, guided by exactly the same haphazardness that accompanies all human activities.

These simple insights lead towards the focal discussion of this article. We map out the value chain of Big Data and identify the key challenges associated with each stage of the value chain. We specifically focus on challenges that are particularly difficult to overcome with solely technological means. For example, the increasing sophistication of data processing algorithms notwithstanding, datafication of entities from the physical, and particularly from the subjective, realms is prone to various errors and inaccuracies. Addressing these challenges essentially requires human judgement along the process of reaping the benefits of Big Data.

While ‘Big Data’ is far from a unanimously defined concept [2], there is nevertheless an understanding of what types of content the label covers. The constitution of Big Data includes not only the traditional type of alphanumerical and relatively homogenous pre-categorized data found in institutional databases, but also transsemiotic (images, sounds, scents, movements, digital action tokens, temperature, humidity, to name a few) and highly heterogeneous data that is not categorized prior to its harvesting [3,4,5,6,7]. This latter type of data is sourced through sensor technology, from intentional and unintentional interactions between humans and machines, from surveillance systems, and from automated digital transaction traces [2, 8,9,10,11,12,13,14]. As a result of the increasing prowess of sourcing the data (i.e. datafication) and the equally increasing and cheapening computational capacity, the accumulation of data is extremely rapid, also resulting in continuous change in its constitution. In short, Big Data is a constantly growing and changing nebulous and amorphous mass.

However, while the mass of existing Big Data is impossible to delineate, viewing the phenomenon from the perspective of its anticipated utility reveals a value chain spanning from datafication to potential business value creation. In this article, we identify eight stages of the value chain, discussed in more detail later, but named here as datafication, digitizing, connectivity, storage, categorizing, patterning, cross-analysis and personalization. There are other proposals for the Big Data value chain [15]; however, most of the existing proposals delineate the diverse technological stages required in rendering data useful. In this paper, we delineate the stages according to the increase in the value of the data, meaning that some of the stages include several technologies, and some technologies span more than one stage. In short, we decouple the technological requirements of processing data from the value adding activities in refining data.

The research question of this paper is twofold: (i) what types of challenges exist in the different stages of the Big Data value chain, and (ii) which of these challenges are particularly difficult to mitigate without human-based intelligence and judgement? In order to explore these questions, the remainder of the article first delves into the eight stages, clusters them into three main phases to identify the relevant accompanying challenges, and concludes by listing the contributions, limitations and future research possibilities.

2 Big Data Value Chain

The process of obtaining insights from Big Data can be divided into three main stages, namely sourcing, warehousing, and analyzing data [16,17,18]. These stages have been rearranged and complemented, for example resulting in the stages of data acquisition, data analysis, data curation, data storage, and data usage [15]. These typologies are delineated from the perspective of technology, through clustering the stages along the technological requirements, a useful approach when the focus is on the side of technological developments needed for realizing the potential value of Big Data.

However, if we shift the focus beyond technology and zoom in to the actual value adding processes, the technology-driven boundaries do not match the boundaries between the stages of value add. Therefore, we propose another conceptualization of the Big Data value chain, where each of the stages differs from its neighbors in terms of its value accumulation potential. Our approach further divides the established three main stages and consists of the stages of datafication, digitizing, connectivity, storage, categorizing, patterning, cross-analysis and personalization, introduced next.

Datafication.

The emergence of the phenomenon of Big Data is underpinned by the developments in technologies that enable datafying different types of entities. For example, the developments in the sensor technology have enabled producing data about movements, humidity, location, sounds, composition or smell to name a few [1]. On the other hand, the diffusion of digital devices and the accompanying increase in human-computer interaction is making it possible to deduce and produce data about the subjective preferences of the individuals, based on the traces of these interactions [2, 19,20,21,22]. These developments in the technologies enabling myriad forms of datafication are the core source of Big Data, and the first fundamental building block of the Big Data value chain.

Digitizing.

As data is created from entities of a wide variety of ontological natures, capturing the data through analog technologies resulted in various data types, each requiring its own processing technologies. When all data is digitized, made into binary digits of zeroes and ones, in theory any machine capable of processing bits could process the data [23, 24], though in practice this is not yet the case. In terms of Big Data, the major value creating step is the homogenization of data from diverse sources, because it creates the foundation for cross-analyzing and cross-referencing data sets originating from very diverse sources.
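
As a minimal illustrative sketch (not part of the original argument), the Python snippet below shows how readings of very different ontological natures can be rendered into one uniform binary representation; the field names and values are hypothetical.

```python
import json
import struct

# Hypothetical readings of different ontological natures.
temperature_c = 21.7                      # a sensor measurement
caption = "storefront camera, aisle 3"    # a human-readable annotation
motion_vector = (0.3, -1.2, 0.0)          # an accelerometer sample

# Each value is digitized into bytes: once everything is zeroes and ones,
# any bit-processing machine can, in principle, handle it uniformly.
as_bytes = {
    "temperature": struct.pack("!f", temperature_c),
    "caption": caption.encode("utf-8"),
    "motion": struct.pack("!3f", *motion_vector),
}

# A common, homogeneous container (here JSON with hex-encoded payloads)
# is what later allows cross-referencing data from diverse sources.
record = {name: payload.hex() for name, payload in as_bytes.items()}
print(json.dumps(record, indent=2))
```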

Connectivity.

Even if we had homogenous data from a variety of phenomena, without the capability to connect that data, each individual parcel of data would be relatively useless. However, with the emergence of the TCP/IP protocol and the internet, mandating how data should be packaged, addressed, transmitted, routed and received, and with the developments in communications technologies, uniform data from a diversity of sources can be transmitted to wherever it is pooled and accessed. The technologies enabling this are multiple and continuously developing. However, the value add transcends the technologies: while there are imperfections in the complex communications technologies and even in the very design of the internet [25], the idea of connectivity is a major value add in itself.
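
As a rough sketch of this stage (again, not from the original text), the snippet below moves a uniform byte payload between two endpoints over TCP; the loopback host, port and payload are hypothetical, and the protocol handles the packaging, addressing and delivery.

```python
import socket
import threading

HOST, PORT = "127.0.0.1", 50007
payload = b'{"sensor": "hum-42", "value": 0.63}'

# The receiving end listens first, so the sender has somewhere to connect to.
server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_socket.bind((HOST, PORT))
server_socket.listen(1)

def receive_once():
    conn, _addr = server_socket.accept()
    with conn:
        print("received:", conn.recv(1024))
    server_socket.close()

receiver = threading.Thread(target=receive_once)
receiver.start()

# The sending end only needs the address; routing is handled by the protocol.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sender:
    sender.connect((HOST, PORT))
    sender.sendall(payload)

receiver.join()
```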

Storage.

Even though digital data exists only in the form of zeroes and ones, it is not without a physical representation – quite the contrary, as storing the masses of data requires hardware that enables accessing and processing the pooled homogenous data. The developments in computational power and the cheapening of storage capacity are critical for this value adding stage. However, the value itself emerges from the existence of these pools of data, which enables processing together data from a variety of sources. The technological solutions of data centers, data warehouses and data lakes are complex; however, as with the stage of connectivity, the value add emerges from the mere possibility of having the data mass stored in pools fed and accessed from diverse points of entry [15].
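
A minimal sketch of this idea, assuming hypothetical source and table names: several sources feed one pool, and any consumer can query it afterwards.

```python
import sqlite3

# One shared pool (here an in-memory SQLite database for illustration).
pool = sqlite3.connect(":memory:")
pool.execute("CREATE TABLE observations (source TEXT, entity TEXT, value REAL)")

# Diverse points of entry writing into the same pool.
pool.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [
        ("humidity-sensor", "warehouse-A", 0.61),
        ("crm-export", "customer-1092", 4.0),
        ("web-clickstream", "customer-1092", 17.0),
    ],
)

# Any consumer can now access the pooled, homogenous data.
for row in pool.execute("SELECT source, entity, value FROM observations"):
    print(row)
pool.close()
```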

Categorizing.

Unlike data in the traditional data sources, one of the defining features of Big Data is its automated and autonomous accumulation, which in other words means that the data is not categorized on the way in [7]. Instead, any sense making of the vast data masses must begin – or at least be guided by – designing mechanisms and principles based on which the data can be categorized. This value adding stage of categorizing is the stage where the end use of the data needs to be accounted for, because of the generative nature of data [26]: not only can the data be categorized in many ways, but the same data may yield different utility depending on the context of its use [27]. This is also the stage where algorithms become essential: due to the volume, variety, and velocity of the data, human computational capabilities are insufficient for processing it, and algorithmic processing capabilities are therefore needed [3, 10, 28,29,30].
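
As a hedged illustration of algorithmic categorizing, the sketch below groups uncategorized text records by similarity; the records and the number of categories are hypothetical choices that a human analyst would still have to make.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Records that arrive without any pre-assigned categories.
records = [
    "delivery was late and the package was damaged",
    "battery drains quickly on the new firmware",
    "late shipment, box arrived crushed",
    "firmware update made the battery last shorter",
]

# Vectorize the raw text and let a clustering algorithm assign categories.
features = TfidfVectorizer().fit_transform(records)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

for record, label in zip(records, labels):
    print(label, record)
```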

Patterning.

The importance of algorithms increases towards the end of the value chain. At the stage of patterning, the task is to identify patterns from the categorized masses of data. The patterns constitute the first stage in the value chain with identifiable business value potential. Due to the volume and variety of the data, it is possible to identify patterns that may be invisible in smaller (in scope or quantity) data sets [31]. As an example, the customer behavior recorded in CRM systems can be patterned to better understand the behavior of a certain customer group.
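
As a minimal sketch of patterning, assuming hypothetical CRM transaction baskets, simple co-occurrence counts already surface recurring behavioral patterns in a customer group.

```python
from collections import Counter
from itertools import combinations

# Hypothetical categorized CRM transactions (one basket per purchase).
baskets = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"laptop", "mouse"},
    {"phone", "charger"},
]

# Count how often each pair of products is bought together.
pair_counts = Counter()
for basket in baskets:
    pair_counts.update(combinations(sorted(basket), 2))

# The most frequent pairs are candidate behavioral patterns.
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```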

Cross-analysis.

Even more valuable than identifying novel patterns is the ability to cross-analyze diverse patterns to seek correlations – for example through cross-analyzing the data patterns of customer behavior against the patterns from marketing campaigns. Most of the current data use cases are grounded on this stage of data utility [13], which excels in creating generalized knowledge about a wide variety of phenomena. For example, through cross-analyzing data from traffic accidents and driver demography, it is possible to find correlations between the age and gender of drivers and accidents. This value adding stage also enables increasingly efficient customer segmentation, for example in social media marketing: if certain preferences and demographic features are correlated, an offering can be marketed to the exact demography.
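
A hedged sketch of the traffic example, with entirely hypothetical figures: two independently sourced data sets are joined on a shared key and checked for correlation.

```python
import pandas as pd

# Hypothetical accident records and driver demography.
accidents = pd.DataFrame(
    {"driver_id": [1, 2, 3, 4, 5], "accidents_per_year": [0, 2, 1, 3, 0]}
)
demography = pd.DataFrame(
    {"driver_id": [1, 2, 3, 4, 5], "age": [45, 19, 33, 21, 58]}
)

# Cross-analysis: join the sets and look for a correlation between patterns.
merged = accidents.merge(demography, on="driver_id")
print(merged["age"].corr(merged["accidents_per_year"]))  # Pearson correlation
```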

Personalization.

The greatest value potential of Big Data is embedded in the final stage, namely personalization, which means that the data can be used for behavioral predictions [32, 33]. This value adding capability is built on first having such cross-analyzed patterns that reveal correlations, and then analyzing the datafied behavioral history of an individual against those correlations [2, 20]. Continuing the example from the previous stage, here the increase in value emerges from the possibility of harvesting information on the driving behavior of an individual, and cross-analyzing that personal history with such generalized driving behavior patterns that correlate with an increase in accidents. Also, in terms of targeted marketing, at this stage it is possible to deduce preferences on the level of the individual, based on the traces left in human-device interactions, and to personalize the offerings accordingly [19].
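
A minimal sketch of this stage under hypothetical data: a model of generalized accident correlates (here age and night-driving share) is applied to one individual's datafied driving history to produce a personalized risk estimate.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical population-level patterns: [age, share of km driven at night].
population_features = [
    [19, 0.7], [23, 0.6], [35, 0.2], [47, 0.1], [52, 0.3], [61, 0.2],
]
had_accident = [1, 1, 0, 0, 0, 0]

# Learn the generalized correlations from the cross-analyzed patterns.
model = LogisticRegression().fit(population_features, had_accident)

# Score one individual's aggregated behavioral history against them.
individual_history = [[22, 0.65]]
print(model.predict_proba(individual_history)[0][1])  # personalized risk
```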

3 Challenges in the Big Data Value Chain

As our focus is on identifying challenges that are particularly difficult to overcome through technological developments alone, we will not offer a comprehensive view of the current state-of-the-art in any of the underlying technologies. It should also be noted that the boundary between what can and cannot be solved through technological developments is blurred and bound to a specific point in time.

The next subsections cluster the aforementioned value chain into three phases familiar from other data value chain approaches: the first, which we refer to as sourcing, encompasses datafication and digitizing; it is followed by warehousing, consisting of connectivity and storage, and finally analyzing, covering the stages from categorization and patterning to cross-analyzing and personalization. Table 1 summarizes the discussion.

Table 1. Challenges in the Big Data value chain

3.1 Challenges in Sourcing: Veracity

The validity of the end results of any data refining process is dependent on the validity of the source data, and unlike with oil, the quality of the raw data varies vastly [1]. First, in digitizing existing data sets, the imperfections, biases, unintended gaps and intentional results of human curation of data also end up in the mass of Big Data [34]. In other words, for data that originates from times preceding the digital age, the acts of datafying and storing it required a lot of work, which means that only a small part of all relevant data ended up in a form that lends itself to digitizing.

On a more limited scale, this also applies to data that does exist in traditional databases. Due to the costs embedded in storing data with pre-digital means, the databases hold pre-prioritized, pre-categorized data that someone has chosen at some stage to store. This means that queries that require accounting for historical data can never be quite accurate, because most of what has happened has gone without a retrospectively datafiable trace, and the rest is already curated and thus subjected to human biases and heuristics [9]. This problem is referred to as veracity [35], which means coping with the biases, doubts, imprecision, fabrications, messiness and misplaced evidence in the data. Accordingly, the aim of measuring veracity is to evaluate the accuracy of data and its potential use for analysis [36].

However, the older databases are only one source of data, and the issue of veracity is not limited to them. Another source, the interactions between humans and digital devices (for example in using mobile phones, browsing the internet or engaging in social media), is a vast torrent of data. Facebook alone generates more than 500 terabytes of data per day. The sheer volume and variability of such data presents its own problems in terms of technological requirements; however, the veracity of that data is even more problematic. As illustrated by Sivarajah, Irani, and Weerakkody [37], data from human-to-human online social interaction is essentially heterogeneous and unclear in nature. Furthermore, malicious tools or code can be used to continuously click on performance-based ads, creating fake data, and there are bots that create traces mimicking human behavior. In addition, individuals vary in the level of truthfulness of their traceable activities. Part of the veracity problem thus stems from intentional human actions, which can be biased, misleading, and overall random. Furthermore, the sample of the population participating in online interactions is skewed – not to mention the geographical discrepancies emerging from the varying levels of technological penetration around the globe [1].

In turn, the data from sensor technologies, surveillance systems and digital transaction traces does not suffer from similar biases stemming from intentional human actions. However, these sources have their own veracity issues. As the processes are automated, and not prioritized a priori, a portion of that data is irrelevant and useless. This challenge relates to cleaning data, i.e. extracting useful data from a collected pool of unstructured data. Proponents of Big Data analytics highlight that developing more efficient and sophisticated approaches to mine and clean data can significantly contribute to the potential impact and ultimately the value that can be created through utilizing Big Data [3]. At the same time, however, developing the tools and methods to extract meaningful data is considered an ongoing challenge [27].
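
As a hedged sketch of such cleaning, assuming hypothetical column names and thresholds, automated duplicates, malformed traces and implausibly bot-like click rates are filtered out of a raw clickstream before any analysis.

```python
import pandas as pd

# Hypothetical raw clickstream traces, including noise typical of automated
# collection: a duplicate row, an unidentifiable trace and a bot-like rate.
raw = pd.DataFrame(
    {
        "user_id": ["a", "a", "b", None, "c"],
        "clicks_per_minute": [3, 3, 250, 5, 7],
    }
)

cleaned = (
    raw.drop_duplicates()                 # automated duplicates
       .dropna(subset=["user_id"])        # malformed, unidentifiable traces
       .query("clicks_per_minute < 120")  # implausible, bot-like click rates
)
print(cleaned)
```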

The problem of veracity is partially mitigable through technological developments. With the development of data cleaning and mining technologies, it becomes easier to filter the vast data masses to extract the valuable nuggets. However, the inaccuracies resulting from the imperfect older data sources and the haphazardness of human action are fundamentally immitigable.

3.2 Challenges in Warehousing: Ownership and Power

The primary problems in connectivity and storage relate to the technologies enabling the transmission, storing and accessing of the vast data masses. However, even in these stages, not all problems are solvable through technological progress. Scrutinizing data security highlights the issue, as it is only partially a technological question.

As argued by Krishnamurthy and Desouza [38], companies and organizations are facing challenges in managing privacy issues, hindering organizations in moving forward in their efforts towards leveraging Big Data. Consider, for example, smart cities, where data collected from sensors about people’s activities can be accessed by various governmental and non-governmental actors [39]. Furthermore, the distributed nature of Big Data leads to specific challenges in terms of intrusion [40] and may thus expose organizations to various threats such as attacks [41] and malware [42].

However, underpinning these data security capabilities is the question of data ownership. As Zuboff (2015) notes, one of the predominant features of digitalization is the lack of a possibility to opt out of being a data source – as the everyday life of an individual is embedded in invisible digital infrastructures [23, 43, 44], individuals have little control over the data exhaust being created about them.

The boundaries of ownership and control rights are a serious problem that surpasses the technological problems of ensuring the data security of specific data sets or applications. The issue of ownership is ultimately a question of data driven power distribution: the agents possessing not only the data sourcing capabilities but also the data sharing resources are harnessing power not only to create business value but also to exert socio-political influence [45].

3.3 Challenges in Analyzing: Black Boxes, Standards of Desirability and Tradeoffs

There are two major aspects of algorithms that pave the way for discussing the challenges, here dubbed standards of desirability and black boxes. Firstly, the algorithms cannot come up with priorities or questions by themselves; humans are needed to provide them. Secondly, precisely because we need algorithms where human computational capacity cannot deal with the vast masses of digital data, that same human capacity cannot follow the algorithmic processes dealing with those masses of data.

To begin with the problem of standards of desirability [46, 47], any question or task given to an algorithm must be underpinned by a set of goals that are to be reached. However, at any given moment of designing a goal there is no way of knowing whether that goal is still relevant or preferable at the time of reaching it: both the environmental circumstances and the internal preferences can have undergone changes. Especially considering the generative nature of digital data [26] and digital affordances [48], the data itself does not mandate a specific use or specific questions; instead, the utility of the versatile data is dependent on the contextual fit and quality of the questions guiding the analyzing processes at any given time. However, as we know from both history and human-focused research, we humans are far from infallible – the quality of the questions, or their relevance, is never guaranteed [49,50,51,52,53], which means that this challenge is immitigable through technological advances.

Secondly, as we cannot follow the algorithmic computational processes, we cannot detect whether there are errors in the processes [34, 54] – ultimately the processes are black boxes. Furthermore, the black-boxed nature does not end at the revelation of the outcomes of algorithmic processes: the outcomes of patterning and cross-analysis reveal correlations, and due to the sheer volume of the data masses there is a high likelihood of finding significant correlations between any pair of variables. This means that the identified correlations can be mere noise unless supported by theoretical mechanisms [1]: contrary to what Anderson [55] claims, McAfee et al. [9] argue that scientists are not rendered obsolete; rather, the focal usefulness of scholars shifts from hunting correlations to understanding the underpinning theoretical causalities. The black boxes of algorithms reveal the black boxes of correlations, leaving it to humans to assess the relevance and validity of the outcomes.
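
A small numerical illustration of this point, using purely synthetic data: among many completely unrelated variables, a non-trivial number of pairs will look "significantly" correlated by chance alone.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 40))  # 40 purely random variables, 200 rows

# Count pairs whose correlation is nominally significant at p < 0.05.
spurious = sum(
    1
    for i in range(40)
    for j in range(i + 1, 40)
    if pearsonr(data[:, i], data[:, j])[1] < 0.05
)
print(f"{spurious} of {40 * 39 // 2} random pairs appear significant at p<0.05")
```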

Thirdly, the algorithmic analysis of Big Data creates opportunities that require scrutinizing the accompanying tradeoffs. Newell and Marabelli [28] identify three such tradeoffs; here we introduce five: privacy vs personalization, convenience vs independence, collective safety vs individual freedom, data security vs machine learning optimization, and ease of outcomes vs validity of process.

Privacy vs Personalization.

Reaping the benefits from the ultimate stage of the Big Data value chain through providing personalized offerings means that the agent making the offerings has to have access to personal data – in other words, has to breach the privacy of the targeted customer to an extent [56, 57]. The ethical valence of this tradeoff needs to be considered contextually, meaning that there are both cases where the loss of privacy is easily offset by the benefits resulting from accessing the personalized service, and cases where the value of the personalized offering does not justify the breach of privacy [13, 58,59,60].

Convenience vs Independence.

The more convenient it is to rely on a specific technology, for example the navigation devices and systems in cars and vessels, the more dependent on that technology one typically becomes. In corporate use of Big Data and analytics, the widespread utilization of game analytics has even led to a situation where not using the technologies in decision-making is referred to as “flying blind” and the use of analytics is considered a necessity by game developers [61]. Taken together, there is an evident tradeoff between convenience and independence that requires acknowledging.

Collective Safety vs Individual Freedom.

Newell and Marabelli [28] share a case where Facebook was accused of not reacting to a threatening post by a person who then carried out the threat and shot an individual. The example illustrates this tradeoff aptly: with the help of Big Data and behavioural prediction it is possible to anticipate threats to collective safety; however, acting on those threats prior to any incident limits the freedom of the individual, who is detained before having committed anything.

Data Security vs Machine Learning Optimization.

The differing policies of the EU and China in terms of access to data in developing artificial intelligence highlight this issue nicely. In Europe the priority is to protect privacy and data security, which means that there is less data available for developing machine learning [62]. This results, on the one hand, in improved rights for European citizens and, on the other hand, in slower progress in developing artificial intelligence. In turn, China is investing heavily in developing artificial intelligence, for example facial recognition technologies, through utilizing all available data from the ubiquitous mobile applications that more than a billion Chinese people use daily [63,64,65], resulting in less individual-level data privacy but competitiveness in the AI race.

Ease of Outcomes vs Validity of Process.

Traditionally, the value of accounting information has resided in the transparent and accessible processes through which the financial information has been gathered and processed. The credibility of the ensuing financial figures has been built on the validity of these processes. However, with the increasing use of automated accounting systems and algorithms, the outcomes are achieved faster, yet through the black boxes of algorithms – the validity of the process of creating the end results is no longer visible [54]. This is again a choice for humans: when and why does the swift outcome have more value, and when and why is it mandatory to be able to observe the processes?

4 Conclusion

This study set out to explore (i) what types of challenges exist in the different stages of the Big Data value chain, and (ii) which of these challenges are particularly difficult to mitigate without human-based intelligence and judgement.

By addressing this twofold research question, our paper adds to the literature on the roles of and interplay between algorithmic and human-based intelligence in reaping the benefits and business value from Big Data [66], with two specific contributions. First, we present a Big Data value chain decoupled from its technological underpinnings and grounded on the stages of value accumulation, and second, we highlight a set of challenges in utilizing Big Data that cannot be mitigated through technological developments.

We advance the understanding of the value and utility of data by putting forward a Big Data value chain that consists of eight stages in three clustered phases: sourcing (datafication and digitizing), warehousing (connectivity and storage) and analyzing (categorizing, patterning, cross-analyzing and personalization). In the first cluster, the immitigable challenges reflect the problems of veracity whereas in the second one, the problems relate to ownership and power distribution. Finally, in the third cluster, the issues include the black-boxed nature of the algorithms and implications thereof, the need for standards of desirability, and five tradeoffs that require acknowledging.

By elaborating on these tradeoffs related to the Big Data value chain, our study adds to the prior discussions related to data and privacy [2, 56, 57, 59, 60], strategic utility of data [3, 4, 10, 12], reliability of data [1, 9] and data ownership [45].

Like any other piece of research, this study suffers from a number of limitations that in turn call for additional research. First, due to its conceptual nature, empirical scrutiny of our Big Data value chain is a self-evident area for future research. Second, since the value chain can manifest itself differently across different contexts and under different contingencies, future research focusing on contextual aspects of the Big Data value chain would be highly insightful [67]. Third, since the stages of the Big Data value chain often consist of activities undertaken by various actors, it is relevant to consider how trust manifests itself among these actors [68], what kind of business ecosystems and networks emerge for the utilization of Big Data, and how different actors strategize their utilization of Big Data [69].