
1 Introduction

Today's organizations are harvesting more and more data using technologies such as mobile computing, social networks, cloud computing, and the internet of things (IoT) (Akerkar 2013). This data deluge can be used to create a competitive advantage and significant benefits (LaValle et al. 2013), such as a better understanding of customer behavior, more effective and efficient marketing, more precise market forecasting, and more manageable asset risks (Beattie and Meara 2013; PricewaterhouseCoopers 2013). Manyika et al. (2011) argue that finance and insurance organizations have among the highest potential to benefit from big data.

However, creating value from big data is a daunting task. The study by Reid et al. (2015) revealed that two thirds of businesses across Europe and North America failed to extract value from their data. A number of challenges impede the creation of value from data by financial service organizations (The Economist Intelligence Unit 2012). Data quality is one of the most frequently mentioned challenges in the literature impeding value creation from big data (Chen et al. 2014; Fan et al. 2014; Janssen et al. 2016; Leavitt 2013; Marx 2013; Zhou et al. 2014; Zicari 2014).

Data quality is a multi-dimensional construct (Eppler 2001; Fox et al. 1994; Miller 1996; Tayi and Ballou 1998; Wang and Strong 1996). In data quality, the data custodian plays a key role in the relationship between collecting data and creating value from it. Data custodians process data from data producers/providers and generate information for data consumers. Wang and Strong's (1996) definition of data quality embraces the data custodian's perspective: "data quality is data that is fit for use by data custodian" (p. 6). To be fit for the data custodian's task, the data should not only be intrinsically good, but should also be properly represented, properly accessible and retrievable from the source, and appropriate for contextual use.

Insufficient data quality hinders value creation from data (Verhoef et al. 2015). Redman (1998) found that a lack of data quality results in disadvantages at the operational, tactical, and strategic levels, including:

  • Operational level: lower customer satisfaction, an increase in costs, and lower employee satisfaction;

  • Tactical level: poorer decision making, longer decision-making time, greater difficulty implementing data warehouses, greater difficulty reengineering, and increased organizational mistrust;

  • Strategic level: greater difficulty setting strategy, greater difficulty executing strategy, contribution to data ownership issues, compromised ability to align the organization, and diverted management attention.

Moreover, poor data quality is associated with substantial costs. According to Eckerson (2002), poor data quality costs US businesses $600 billion annually (3.5% of GDP).

Our objective is to understand the relationship between big data and data quality in financial service organizations. This research is among the first to study this relationship. For this purpose, we formulated a research approach, which is presented in Sect. 2. We then discuss key concepts and theories on the basis of state-of-the-art literature in Sect. 3. Big data is measured by looking at its defining characteristics (the V's) and data quality is measured using the dimensions commonly found in the literature. Next, the case studies and the corresponding findings are presented in Sect. 4, resulting in a mapping between the big data characteristics and data quality dimensions. Finally, conclusions are drawn in Sect. 5.

2 Research Approach

To attain our objective, i.e. investigating the correlation between big data and data quality, three main steps were taken:

  1. Literature review to further detail big data and data quality. This resulted in a big data construct represented by its characteristics (the V's) and a data quality construct represented by its dimensions. These constructs are employed as the basis for investigating the case studies.

  2. Online case studies of financial service organizations, analyzed through content analysis to extract data quality issues and the corresponding big data characteristics. The result is a list of data quality issues arising as a consequence of big data characteristics. These cases did not enable us to understand the causal relation.

  3. In-depth case studies at financial service organizations to cross-reference and further refine the findings from the online case studies. The refined list of data quality issues is then mapped to the corresponding data quality dimensions.

First, literature on big data characteristics and data quality dimensions was investigated. To review big data characteristics, we searched Scopus for literature from 2011–2016 containing the terms 'big data' or 'data-intensive', which returned 22,362 documents. After carefully checking the contents, we focused on nine papers strongly relevant to big data characteristics. The same approach was used to study data quality concepts: using the terms 'data quality' or 'information quality', we found 7,468 documents in Scopus and concentrated on 13 articles that comprehensively discuss data quality and its dimensions.

The aim of the desk research was to find relevant cases. To explore the relationship between big data characteristics and data quality in the financial industry, a systematic desk research of online articles and corresponding white papers was conducted. To keep the research focused, the search was narrowed to the 10 biggest banks in Europe based on Banks Daily's ranking (Footnote 1) and the 10 biggest insurance companies in Europe based on Relbanks's ranking (Footnote 2). The search was conducted through Google Search with the query "big data" <institution name> (e.g. "big data" Barclays). From the 2,000 search results (10 Google Search pages of 10 results per page for each institution), two of the authors independently selected relevant articles, resulting in a list of 32 articles relevant to big data quality and published within a five-year timeframe (2011–2016). After further analysis, seven online cases were selected that provided sufficient detail to analyze (e.g. mentioning data input, information output, and problematic big data quality issues), as described in Table 1. The cases were analyzed for their big data characteristics and data quality dimensions through content analysis of the case study documents and interview transcripts using NVivo software. Content analysis has been widely used in qualitative research to analyze and extract information from text, web pages, and various documents (Hsieh and Shannon 2005).

Table 1. Online cases that are used in this study

In addition, we conducted three in-depth case studies to confirm and refine our findings from the previous step. It is important to see how the findings manifest in real-life practice, as well as to uncover possibly missing challenges. The case selection criteria were defined as follows: (1) the organization must be an information-intensive financial service organization; (2) the organization should make use of big data; (3) the organization is willing to cooperate and share the information required to conduct this study. The three case studies were built by conducting interviews and investigating documents. A summary of the offline case studies is presented in Table 2.

Table 2. In-depth cases that are used in the study

3 Literature Background: Key Concepts

3.1 Big Data Concept

Big data is used in various ways and has no uniform definition (Chemitiganti 2016; Ward and Barker 2013). Big data is often described in white papers, reports, and articles about emerging trends and technology. A lack of formal definition may lead research into multiple and inconsistent paths. Nevertheless, there is consensus about what constitutes the characteristics of big data, although these characteristics have changed over time. The initial three V's of Volume, Velocity, and Variety were introduced by Douglas (2001). Later, IBM (2012) added a fourth V, Veracity, which addresses the uncertainty and trustworthiness of data and data sources. The V's continued to evolve into 5 V's (Leboeuf 2016), 8 V's (m-Brain, n.d.), and 9 V's (Fernández et al. 2014). Our literature review shows that 11 different V's are mentioned in the literature and reports. As our objective is to take a comprehensive view, we take all V's into account and define each of them to avoid confusion about overlap between these characteristics. The characteristics and their definitions are presented in Table 3 and will be used to analyze the big data in the case studies.

Table 3. Big data characteristics

3.2 Data Quality (DQ) Concept

Data is the lifeblood of the financial industry, and DQ is key to the success of any financial organization (Zahay et al. 2012). Financial players such as analysts, risk managers, and traders rely on data in their value chain. Poor DQ, such as inaccurate or biased data, may lead to misleading insights and even wrong conclusions. The financial industry was reported to lose $10 billion annually due to poor DQ (Klaus 2011). In addition, as a highly regulated industry, financial service organizations must conform to several regulations which require high DQ (Glowalla and Sunyaev 2012).

Quality is a rather subjective term, i.e. the interpretation of 'high quality' may differ from person to person. Moreover, the notion may change depending on the circumstances. Various definitions of DQ are found in the literature (Eppler 2001; Huang et al. 1998; Kahn and Strong 1998; Miller 1996; Mouzhi and Helfert 2007; Tayi and Ballou 1998; Wang 1998; Wang et al. 1993). Overall, DQ depends not only on the intrinsic quality of the data (conformance to specification), but also on the actual use of the data (conformance to customer expectations) (Wang and Strong 1996). Knowing the customers and their business needs is a precursor to understanding how DQ will be perceived (Fig. 1).

Fig. 1. DQ categories and dimensions (adapted from Wang and Strong 1996)

DQ is a multidimensional concept (Eppler 2001; Fox et al. 1994; Miller 1996; Tayi and Ballou 1998; Wang and Strong 1996). However, there is neither a consensus on what constitutes the dimensions of DQ, nor on the exact meaning of each dimension (Nelson et al. 2005), and the proposed dimensions vary among scholars (Bovee et al. 2001; Fox et al. 1994; Miller 1996; Naumann 2002; Wang and Strong 1996). The most cited DQ dimensions are those of Wang and Strong (1996), who list sixteen DQ dimensions grouped into four thematic categories, namely intrinsic, accessibility, contextual, and representational quality, as shown in Fig. 2.

Fig. 2. Relating big data characteristics to DQ dimensions

Intrinsic quality refers to internal properties of the data, e.g. accuracy, objectivity, believability, and reputation. Accessibility quality emphasizes the importance of computer systems that store and provide access to data. Representational quality consists of understandability, interpretability, concise representation, and consistent representation. Contextual quality, which highlights the requirement that DQ must be considered within the context of the task at hand, consists of value-added, relevance, timeliness, completeness, and appropriate amount.
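As an aside, this categorization can be written down as a simple coding scheme. The following is a minimal sketch, assuming Python, that lists only the dimensions named in this section; the accessibility entries are the commonly cited pair and are illustrative rather than exhaustive.

```python
# Wang and Strong's (1996) DQ categories as a simple coding scheme.
# Only dimensions named in the text above are listed; others are omitted.
DQ_DIMENSIONS = {
    "intrinsic": ["accuracy", "objectivity", "believability", "reputation"],
    "accessibility": ["accessibility", "access security"],  # commonly cited pair
    "representational": ["understandability", "interpretability",
                         "concise representation", "consistent representation"],
    "contextual": ["value-added", "relevance", "timeliness",
                   "completeness", "appropriate amount"],
}

def category_of(dimension: str) -> str:
    """Return the DQ category a dimension belongs to, or 'unknown'."""
    for category, dimensions in DQ_DIMENSIONS.items():
        if dimension in dimensions:
            return category
    return "unknown"

print(category_of("timeliness"))  # contextual
```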

4 Correlation Between Big Data and Data Quality in Financial Service Organizations

Our aim was to investigate the relationship between big data characteristics and DQ dimensions as depicted in Fig. 2. The big data characteristics and DQ dimensions were used to investigate the case studies; using content analysis they were mapped and the relationship explored. There are eleven V's that represent big data (their definitions were given in Sect. 3) and four categories of DQ comprising 16 dimensions (see Sect. 3 for their definitions). We studied seven cases that were carefully selected, as explained in Sect. 2, to examine the correlation. Three more in-depth case studies were performed to confirm and refine the findings and to investigate the relationship in detail. The DQ issues that emerged from the big data characteristics mentioned in the case studies are explained below. Although big data characteristics and DQ dimensions are different concepts, we found that both use 'value' with the same definition. Therefore we kept only one 'value' in the matrix, i.e. 'value' as a DQ dimension.

4.1 Volume

Volume was not frequently mentioned as affecting DQ in the cases. A huge volume of data can increase the chance of discovering hidden patterns, such as detecting suspected fraud. In addition, larger volume most likely leads to higher representativeness. However, bigger size can also bring trouble. In cases 3 and 7, information overload was caused by the volume of the data, affecting the appropriate amount of data needed for the task at hand. For example, UBS Bank found in several situations that the transaction data for risk identification was too large for pre-processing.
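To make the pre-processing problem concrete, the following is a minimal sketch, assuming Python with pandas and a hypothetical transactions.csv with counterparty_id and amount columns; it is not the pipeline of any case organization. It aggregates a large transaction file in chunks rather than loading it all at once.

```python
import pandas as pd

# Aggregate exposure per counterparty without loading the full file in memory.
totals = {}
for chunk in pd.read_csv("transactions.csv", chunksize=100_000):
    grouped = chunk.groupby("counterparty_id")["amount"].sum()
    for counterparty, amount in grouped.items():
        totals[counterparty] = totals.get(counterparty, 0.0) + amount

# 'totals' now holds per-counterparty sums computed chunk by chunk.
print(len(totals), "counterparties aggregated")
```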

4.2 Velocity

Many financial service organizations need real-time data for activities such as fraud detection, complaint monitoring, and customer retention. Therefore, they are very concerned with the timeliness of the data. Outdated data is mentioned as an important issue in most cases (cases 1, 2, 4, 5, 6, and 7). For example, credit card transaction data is useful for fraud detection, and preventing fraud can have a huge impact, but the data becomes useless if it is not processed in real time to predict and prevent subsequent fraud.
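To make the timeliness concern concrete, here is a minimal sketch, assuming Python, hypothetical field names, and a placeholder staleness threshold (none of which are taken from the cases), of a check that routes a transaction to real-time fraud scoring only while it is still fresh.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(seconds=30)  # hypothetical freshness threshold

def handle_transaction(txn: dict) -> str:
    """Score a transaction only while it is still timely enough to act on."""
    age = datetime.now(timezone.utc) - txn["timestamp"]
    if age > MAX_AGE:
        return "stale: logged for batch analysis, too late to block fraud"
    return "fresh: sent to real-time fraud scoring"

txn = {"card": "****1234", "amount": 250.0,
       "timestamp": datetime.now(timezone.utc) - timedelta(seconds=5)}
print(handle_transaction(txn))  # fresh: sent to real-time fraud scoring
```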

4.3 Variety

Most cases mentioned the necessity of combining data from multiple sources in order to reveal more insightful value. However, incorporating many data sources results in a number of DQ issues, such as the following:

  (1) Different values reported for the same field across multiple data sources (cases 3 and 6). An example is having a different zip code for the same person in different data sources;

  (2) Inconsistent field accuracy across multiple data sources (cases 3 and 6), e.g. which of the multiple zip codes for the same person is the accurate one?;

  (3) Varied population representativeness across multiple data sources (cases 3 and 6), e.g. some data are truly objective, while others, like social media data, tend to be biased and represent only certain groups of the population (e.g. youth, people with good internet connections);

  (4) Inconsistent field formats across multiple data sources (cases 3 and 6, also confirmed in in-depth case 3). A simple example is that the content of the field 'name' varies across sources (e.g. John Clarke Doe, J. Doe, J. C. Doe); a minimal normalization sketch is given after this list;

  (5) Inconsistent field content across multiple data sources (cases 3 and 6). An example is having both 'male' and 'man' in the 'sex' field;

  (6) Different terminologies/semantics/definitions across multiple data sources (cases 2, 4, and 5). For example, the meaning of the term 'risk' differs across data sources from various domains, especially data from outside the finance domain;

  (7) Varying access requirements from multiple data producers/providers (cases 1, 5, and 7). Some data providers offer a secure API, whereas others may prefer an insecure API or even plain data transfer to ensure high speed;

  (8) Complex structure of the data (cases 1, 2, 4, 5, 6, and 7). An example is unstructured content from social media that contains lexical complexity;

  (9) Duplicate and redundant data sources (cases 1 and 6, confirmed in in-depth cases 1, 2, and 3). In offline case 1, there are two legacy systems for mortgages, one for private banking and one for the company, which keep different records of information but refer to the same mortgage;

  (10) Incomplete field content in the data (cases 2 and 6, confirmed in in-depth case 1). In in-depth case 1, customers could previously use a post box as an address, but under new regulation they must now use a postal code. Because the postal code was not required previously, its absence makes the mortgage information be considered incomplete;

  (11) Differing timeliness across multiple data sources (cases 3, 4, and 7) causes difficulties in combining those data within the same timeframe, e.g. statistics from Eurostat or the World Bank are collected at different points in time and cannot simply be combined to infer useful insights;

  (12) Complex relationships among data (cases 1, 2, 4, 5, 6, and 7). The more varied and numerous the data fed into the system, the more complex the relationships residing in those data and the more complex it is to combine them. In these cases we found that the data could not be combined because the data analysts were not able to unravel the complexity.
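The normalization sketch referenced in issue (4) is given below. It assumes Python and invented field values, and merely illustrates how inconsistent name formats (issue 4) and field contents (issue 5) might be reconciled before combining sources; it is not the approach used by the case organizations.

```python
import re

# Hypothetical canonical values for the 'sex' field (issue 5).
SEX_SYNONYMS = {"male": "M", "man": "M", "m": "M",
                "female": "F", "woman": "F", "f": "F"}

def normalize_sex(value: str) -> str:
    """Map free-text gender codings from different sources to one code."""
    return SEX_SYNONYMS.get(value.strip().lower(), "UNKNOWN")

def normalize_name(value: str) -> str:
    """Reduce name variants (e.g. 'John Clarke Doe', 'J. C. Doe') to a
    comparable key: last name plus first initial."""
    parts = [p for p in re.split(r"[\s.]+", value.strip()) if p]
    if not parts:
        return ""
    last, first = parts[-1], parts[0]
    return f"{last.upper()},{first[0].upper()}"

# Records from two hypothetical sources describing the same person.
record_a = {"name": "John Clarke Doe", "sex": "male"}
record_b = {"name": "J. C. Doe", "sex": "man"}

key_a = (normalize_name(record_a["name"]), normalize_sex(record_a["sex"]))
key_b = (normalize_name(record_b["name"]), normalize_sex(record_b["sex"]))
print(key_a == key_b)  # True: both normalize to ('DOE,J', 'M')
```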

4.4 Variability

Variability of the data is rarely mentioned in the cases. The DQ issues that do arise originate from the use of social media data. In case 3, different contextual meanings and sentiments occur for the same content in the data, e.g. 'happy' and 'happy?'. Real sentiments are hard to capture, which makes it difficult to operate on the data if the organization uses a traditional approach (e.g. a static algorithm) to process the content. Moreover, the meaning of words changes depending on context and time, which creates the need to interpret sentiment dynamically. A word can shift from a positive to a neutral or even negative sentiment through contextual use by communities over time. For example, the word 'advertisement', which formerly carried a neutral sentiment, has shifted towards a negative sentiment because people are nowadays annoyed by the many digital ads on web pages. Conversely, some words may shift from neutral or negative to positive sentiment, such as 'vegetarian', which used to be neutral but is becoming more positive due to people's growing awareness of nature conservation and personal health.
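To illustrate why a static algorithm struggles here, the following minimal sketch (assuming Python and a small, invented time-indexed lexicon; the sentiment values are placeholders, not measured results) looks up a word's sentiment as of a given year rather than from a fixed list.

```python
from bisect import bisect_right

# Hypothetical time-indexed lexicon: (year_from, sentiment) entries per word,
# sorted by year. Values are illustrative only.
LEXICON = {
    "advertisement": [(2000, "neutral"), (2014, "negative")],
    "vegetarian": [(2000, "neutral"), (2015, "positive")],
}

def sentiment(word: str, year: int, default: str = "neutral") -> str:
    """Return the sentiment of a word as of a given year."""
    entries = LEXICON.get(word.lower())
    if not entries:
        return default
    years = [y for y, _ in entries]
    idx = bisect_right(years, year) - 1
    return entries[idx][1] if idx >= 0 else default

print(sentiment("advertisement", 2010))  # neutral
print(sentiment("advertisement", 2016))  # negative
```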

4.5 Veracity

Since many organizations incorporate numerous data sources into their data processing, they may face trust issues concerning the authenticity, origin/reputation, availability, and accountability of the data, especially when the data is freely available on the Internet. The following DQ issues were found:

  (1) Inaccurate content, often found in self-reported data like social media (case 2), for example complaints coming from smear campaigners or fake accounts;

  (2) Unclear reliability and credibility of data providers (case 3, confirmed in in-depth case 2), e.g. blogs or untrusted media;

  (3) Unclear ownership of the data (case 2, confirmed in in-depth case 2) may discourage organizations from using the data, because they might not be able to access it if a dispute regarding commercial use of the data arises in the future;

  (4) Unclear responsibility for maintaining the content of the data (case 2) might hinder long-term use of the data, because the data could be complete and timely at the moment but useless in the future if its content and updates are not managed properly. In addition, data from untrusted sources such as social media tends to have low objectivity, i.e. it represents only a portion of the population (cases 2, 3, 6, and 7).

4.6 Validity

Validity represents the compliance of data generation with procedures and regulations. Financial service organizations are among the institutions mandated to strictly comply with external regulations, such as privacy laws and confidentiality agreements, as well as internal regulations and procedures, such as SOPs for data entry and service level agreements with partners and among internal units. Hence, the validity of the data should be carefully assessed beforehand, because invalid data may bring trouble in the future.

Validity impacts the following DQ issues:

  (1) Inaccurate field content due to manual entry (raised in offline cases 1 and 3) creates difficulties in understanding the data, e.g. a wrong address, wrong postal code, or wrong spelling in mortgage data caused by non-compliance with DQ control procedures;

  (2) Wrong coding or tagging in the data (case 3);

  (3) Uncertainty about the right to use the data, for example lack of knowledge about licenses or the impact of privacy regulation (cases 1, 2, and 3, confirmed in in-depth case 1), might limit or even remove organizations' access to personal data;

  (4) Difficulty extracting value from anonymized data (cases 1, 2, and 3) as a consequence of privacy compliance, because person-related fields (e.g. name, phone number, email address) are the primary keys of the multiple data sources to be combined (see the sketch after this list);

  (5) Anonymized fields make the data incomplete for the task at hand (cases 1, 2, and 3).
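One commonly discussed mitigation for issue (4), added here only as a hedged illustration and not drawn from the cases, is to replace the person-related key with a consistent keyed hash (pseudonymization) so that sources can still be joined without storing the raw identifier. Whether this satisfies a given privacy regime is a legal question beyond the scope of this sketch.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical key

def pseudonymize(identifier: str) -> str:
    """Derive a stable pseudonymous join key from a personal identifier."""
    normalized = identifier.strip().lower()
    return hmac.new(SECRET_KEY, normalized.encode(), hashlib.sha256).hexdigest()

# Two hypothetical sources keyed by e-mail address.
transactions = {"jane.doe@example.com": {"spend": 1200}}
complaints = {"Jane.Doe@example.com ": {"tickets": 2}}

joined = {}
for source in (transactions, complaints):
    for email, attributes in source.items():
        joined.setdefault(pseudonymize(email), {}).update(attributes)

print(joined)  # one pseudonymous record combining both sources
```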

4.7 Visibility

Almost all cases mentioned that it is difficult to discover the relationships among variables within the data. For example, it is difficult to reveal which age groups show increasing internet banking usage over time in a certain country by only viewing the raw data. Moreover, the more sources are combined in the process, the more variables are added and the more complex the relationships among the variables become. Unless organizations build the capability to visualize big data, such relationships are difficult to discover (cases 1, 2, 3, 4, 5, 6, and 7).
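To make the age-group example concrete, the following minimal sketch (assuming Python with pandas, with invented column names and values rather than data from the cases) shows how aggregation makes a trend per age group visible that is hard to see in row-level data.

```python
import pandas as pd

# Hypothetical row-level usage records: one row per customer segment per year.
records = pd.DataFrame({
    "year":      [2014, 2014, 2015, 2015, 2016, 2016],
    "age_group": ["18-29", "50+", "18-29", "50+", "18-29", "50+"],
    "logins":    [120, 35, 150, 36, 190, 34],
})

# Average logins per age group per year; the upward trend for '18-29'
# only becomes apparent after aggregation/visualization.
trend = records.pivot_table(index="year", columns="age_group",
                            values="logins", aggfunc="mean")
print(trend)
# trend.plot() would chart the trend if a plotting backend is available.
```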

4.8 Vast Resource

Some cases mentioned that vast resources are required to retrieve and process the data (cases 2 and 5). Retrieving data that is huge in size, generated very rapidly, and highly varied requires sufficient network bandwidth (especially if the organization decides to put the data analytics platform in the cloud), computing power, and storage. Moreover, data engineering skills are required to retrieve and operate on the data. Beyond that, to discover the relationships among variables in the data and ultimately obtain insights from it, organizations require data science skills (cases 1, 2, 4, 5, 6, and 7).

4.9 Volatility, Viability, Value

No case mentioned that the volatility or viability characteristics of big data influence DQ. An explanation for this is that these factors are less essential for financial service organizations. Meanwhile, value was not coded in the investigated cases because it conflicts with the value-added dimension of DQ and 'value' is not big data specific.

5 Mapping Big Data and Data Quality

Each of the aforementioned DQ issues resulting from big data characteristics was then mapped to a DQ dimension, as shown in Fig. 3. The corresponding case numbers, either online or offline, are placed near the arrows.

Fig. 3. Impact of big data characteristics on DQ dimensions ([x]: online case number, (x): offline case number)

The findings indicate no relationship between the viability and volatility characteristics of big data and DQ in the investigated financial service organizations. The most dominant correlation is Velocity-Timeliness, which was found in all online cases. This relationship reflects that financial service organizations perceive the rapid generation and real-time use of data, such as credit card transaction data or insurance holders' claims, as playing an important role in creating timely value from data, for example for fraud detection. The next dominant correlation is Variety-Ease of operations, interpreted as: including data from multiple sources that may come with inconsistent formats and conflicting contents makes it difficult for organizations to process the data. Variety-Value added follows, indicating that value creation is strongly influenced by the number of data sources and by the complexity of the (unstructured) content residing in the data. Another dominant pair is Visibility-Value added, which reflects the need for visualization to quickly discover the relationships among variables in the data. Vast resources-Value added is next, indicating the need for vast resources (hardware, software, data engineers, and data scientists) to retrieve, exploit, visualize, and analyze the data so that value can be derived from it.

Table 4 summarizes Fig. 3 in a matrix that matches big data characteristics to DQ dimensions. The number for each pair represents the number of cases that mentioned the correlation.

Table 4. Number of cases from correlation pair between big data characteristics and DQ dimension
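As a minimal sketch of how such a matrix can be tallied from coded case excerpts (assuming Python; the coded tuples below are placeholders and not the study's actual coding):

```python
from collections import Counter

# Hypothetical coded excerpts: (case id, big data characteristic, DQ dimension).
coded = [
    ("online-1", "Velocity", "Timeliness"),
    ("online-2", "Velocity", "Timeliness"),
    ("online-3", "Variety", "Ease of operations"),
    ("online-6", "Variety", "Ease of operations"),
    ("online-3", "Variety", "Ease of operations"),  # duplicate within a case
]

# Count distinct cases per (characteristic, dimension) pair, as in Table 4.
matrix = Counter()
seen = set()
for case, characteristic, dimension in coded:
    if (case, characteristic, dimension) not in seen:
        seen.add((case, characteristic, dimension))
        matrix[(characteristic, dimension)] += 1

for (characteristic, dimension), n in matrix.items():
    print(f"{characteristic} -> {dimension}: {n} case(s)")
```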

Among the big data characteristics, variety is the most dominant in our cases of financial service organizations. It influences all categories of DQ, i.e. intrinsic, representational, accessibility, and contextual DQ. The reason is that organizations nowadays utilize multiple data sources, for example sources that were formerly ignored, the so-called "long tail" of big data, as well as newly generated ones (Bean 2016). The next most influential big data characteristic is validity, which reflects an organization's compliance with regulations and procedures, for example regarding the use of personal data (e.g. privacy law, untraceable requests, and confidentiality agreements). Compliance with privacy is vital for service organizations (Yu et al. 2015), especially banks and insurance companies (Breaux et al. 2006; Karagiannis et al. 2007). Moreover, validity affects access to customer data in the long run, meaning that an organization may one day lose its right to access personal data if the customer or regulator requests its disclosure or removal. As a result, the completeness of the data drops and the value creation process (e.g. analyzing data) becomes more complex if anonymized data is the only data the organization can use. Another dominant big data characteristic is veracity. Veracity, or trustworthiness of the data, is an inevitable concern when multiple data sources are utilized to discover more insights (Leboeuf 2016). Since veracity includes the authenticity, origin/reputation, availability, and accountability of the data (Tee 2013), it is unsurprising that intrinsic quality, which embodies these issues, is the category most influenced by this characteristic.

As depicted in Table 4, the most correlated category of DQ dimensions is contextual quality. This is unsurprising, because every organization tries its best to extract contextual value from big data. Two dimensions of contextual quality are dominant in the findings, i.e. value-added and timeliness. Since today's organizations struggle to create business value from data (Reid et al. 2015), the value derived from the use of data deserves ample research. Another dominant correlated DQ dimension is accessibility, which signals the awareness of financial service organizations of compliance.

6 Conclusion

The objective of this paper was to investigate the relation between big data and data quality. This study is among the first to investigate this complex relationship. To attain the objective, we conducted a literature review and online and offline case studies in financial service organizations. Seven online case studies were initially performed to reveal the correlations, followed by three offline studies for cross-referencing and refining the findings. DQ issues raised in the case studies were then coded and mapped to the corresponding pairs of big data characteristic and DQ dimension using content analysis. This provided detailed insight into the relationships between the V's of big data and the dimensions of DQ. The V's take a black-box perspective on the data: they characterize the data from the outside. DQ, in contrast, is about the actual data and can only be determined by investigating the data and opening the black box. The V's and DQ are similar in the sense that they both provide insight about the data. They are complementary in that the V's look from the outside and at the possible usage, whereas DQ looks at the actual datasets.

The most related pair is Velocity-Timeliness, which indicates that the more rapidly the data is generated and processed, the more timely it is to use. This is followed by Variety-Ease of operations (the more data sources and the more varied the structure of the data, the more complex it is to retrieve, exploit, analyze, and visualize the data), Variety-Value added (more data sources and more varied data structures make it more difficult to create value from the data), Visibility-Value added (the more hidden the relationships within the data, the more difficult it is to create value from the data), and Vast resources-Value added (the more resources needed to process the data, the more difficult it is to create value from the data). Except for viability and volatility, all V's of big data influence DQ. Concise representation and access security were not found to be DQ issues in the cases. Variety is the most dominant factor, impacting all categories of DQ, followed by validity and veracity. This suggests that the term 'big data' is misleading, as in our research we found that most of the time volume ('big') was not an issue, while variety, validity, and veracity are much more important.

Our findings suggest that organizations should take care to manage the variety of data and also ensure the validity and veracity of big data. The most correlated category of DQ dimensions is contextual quality, which includes value-added and timeliness as the most dominant correlated dimensions, followed by accessibility. These findings suggest that more effort should be spent on improving the contextual use of the data as well as ensuring long-term accessibility to the data.

A recommendation for further research is to cross-reference the findings with big data implementations in other information-intensive domains, such as telecommunications, government, and retail, for generalization. These findings also open an avenue to develop tools to improve and manage big data quality.