
Iran Journal of Computer Science, Volume 1, Issue 4, pp 237–259

Construing the big data based on taxonomy, analytics and approaches

  • Ajeet Ram Pathak
  • Manjusha Pandey
  • Siddharth Rautaray
Original Article

Abstract

Big data have become an important asset due to the immense power hidden in analytics. Every organization is inundated with a colossal amount of data generated at high speed, requiring high-performance resources for storage and processing, and special skills and technologies to derive value from it. Sources of big data may be either internal or external to the organization, and big data may reside in structured, semi-structured or unstructured form. Artificial intelligence, the Internet of Things, and social media are contributing to the growth of big data. Analytics is the use of statistics, mathematics, and machine learning to derive meaningful insights from data in order to make timely decisions and enable the data-driven organization of the future. This paper sheds light upon big data, the taxonomy of data, and the hierarchical journey of data from their original form to high-level understanding in terms of wisdom. The paper also focuses on the key characteristics of big data and the challenges of handling big data. In addition, big data storage systems are briefly covered to give an idea of how storage systems help to accommodate the requirements of big data. This paper scrupulously articulates the eras of evolution of analytics, ranging from descriptive to predictive and prescriptive analytics. Process models used for inferring information from data are compared, and their applicability to analyzing big data is examined. Finally, recent developments carried out in the domain of big data and analytics are compared based on state-of-the-art approaches.

Keywords

Big data · Data analytics · Data science · Deep learning

1 Introduction

Being in the era of analytics 3.0, business industries and academia are overwhelmed with a deluge of data. Network data, Internet of Things (IoT)-enabled sensor data, event logs, user clicks on websites, transaction data, and social media data including audio tracks and videos are some examples of big data. Deriving key insights from such data in a timely and efficient manner has become the need of the hour. Many technical communities and business firms have come up with hardware and software solutions in the form of tools, products, proprietary packages and open-source application programming interfaces to address major challenges in big data. To keep pace with cutting-edge technology and enable faster decision-making, researchers, business firms and analysts are analyzing big data using advanced techniques based on machine learning, deep learning, statistics, predictive analytics, and natural language processing.

Big data analytics is about extracting valuable insights from data and empowering decision-makers with analytics that eradicate gut-feel decision-making and enable the data-driven organization of the future. Data science is the interdisciplinary field consisting of scientific methods and processes for deriving key insights from data based on domain expertise, mathematical and statistical knowledge, and computing skills.

1.1 Background and motivation

The worldwide popularity of big data, analytics and data science over the period January 2012–March 2018, obtained from Google Trends, is shown in Fig. 1. Computer Sciences Corporation predicted that the rate of data production will be 44 times greater than that in 2009 [1].
Fig. 1

Popularity of big data, analytics and data science over time period from January 2012 to March 2018 (created using Google trends)

Recent developments in blockchain would add an immutable data layer to the big data analytics process. Big data is expected to generate $203 billion in revenue by 2020. Data generated by the blockchain ledger would account for 20% of the global big data market and would generate $100 billion in annual income [86]. Data generated by blockchain are both secure (they cannot be falsified) and valuable (they are structured, abundant and complete). These two qualities make such data a good candidate for further analysis. For example, fraud prevention could be done proactively, because blockchain technology allows financial firms to access and check every transaction in real time. Instead of verifying fraud records after the fact, banks would be able to focus on identifying risky or forged transactions on the fly and prevent such frauds proactively. Therefore, it is crucial to assess the impact of blockchain technology on big data.

1.2 Significance

Big data analytics is being widely used in many use cases in academia and business and is reaching every nook and corner of business activities to enable timely decisions and predictions. It has been applied in many domains such as IoT, bioinformatics, smart homes and health-care systems, to mention a few. Due to advancements in sensor and communication technology, IoT-based devices are widely deployed in business firms and smart home systems. A big data analytics approach is used for energy management of smart homes in [83]. Data analytics tools have been widely used in health-care systems [84, 85]. Triguero et al. [85] put forth an award-winning algorithm for imbalanced class distribution based on a MapReduce approach for solving a bioinformatics problem. Applications of big data analytics are also found in railway transportation and engineering in the domains of operations, safety and maintenance [87].

In agriculture, weed control and management are critical issues. A big data and machine learning-based approach for crop protection is put forth in [88]. The Robot for Intelligent Perception and Precision Application (RIPPA) has been deployed at farms on a trial basis for detecting weeds and highlighting foreign objects over the crops [89].

The availability of large datasets and powerful GPUs has resulted in the proliferation of deep learning techniques across a multitude of domains such as image classification [90], object detection [91, 92, 93], video analytics [94], natural language processing [95, 96, 97] and speech recognition [98]. More recently, a deep neural network model based on the Gaussian–Bernoulli Deep Boltzmann Machine has been used for real-time prediction in the smart manufacturing industry [99]. By and large, deep learning has been applied to address significant problems in big data analytics such as semantic analysis, topic modeling, classification tasks, fast querying, real-time analysis, and extraction of complex patterns of interesting events from data. It is also found to be applicable to multimedia data.

It can be observed from the numerous applications of big data analytics that the requirements of analytics vary as the application varies. Considering the wide adoption of big data analytics in almost every domain, this paper is an attempt to provide an overall understanding of big data and analytics based on various aspects such as the challenges of handling big data, big data storage systems, kinds of analytics, the role of process models and technology-wise approaches to big data analytics.

The contributions of this work are as follows.
  • The graphical taxonomy of data has been put forth.

  • Aspects of big data such as characteristics, types, the DIKW hierarchy, challenges of handling big data, and big data storage systems are scrupulously discussed.

  • The evolution of data analytics and the applicability of different process models for big data analytics have been examined.

  • A thorough survey of state-of-the-art approaches for big data analytics has been carried out based on the issues addressed, technique applied, datasets used and the phase of knowledge discovery in databases (KDD) to which each approach belongs.

  • Current trends in big data analytics and possible future directions are also put forth.

The contents of the paper are depicted in Fig. 2. Section 2 deals with the taxonomy of data, kinds of big data, and the characteristics and challenges of big data. Big data storage systems are briefly enunciated in Sect. 3. Section 4 portrays the evolution of analytics and the applicability of process models for big data analytics. State-of-the-art approaches for big data analytics are compared in Sect. 5. Current trends in big data analytics and future directions are put forth in Sect. 6. The paper is concluded in Sect. 7.
Fig. 2

Roadmap of the paper (figure to be read from left in clockwise manner)

2 Big data

This section deals with what data are, knowledge pyramid of data, big data, their characteristics and key challenges in big data analytics.

2.1 Data

Data is the plural form of the Latin word datum. Data can be defined as discrete, boundless entities in unorganized or unprocessed form used for describing an object, idea, event or fact. The taxonomy of data depicted in Fig. 3 is based on the method of data collection, accessibility pattern, source of data generation, and statistical approach. According to the data collection method applied, data can be categorized into two parts, viz. raw data and secondary data. Raw data are also called primary data. They are directly collected from the source where they are generated and constitute machine readings, alphabets, numbers, etc. Measuring the height of every student and listing it in a spreadsheet constitutes "primary data". Primary data are generated and gathered by the investigator who is actually conducting the experiment or research. In the case of secondary data, the user of the data is different from the one who actually collected them.
Fig. 3

Taxonomy of data

For example, data collected by a government agency for census purposes can be used by a researcher for a data science project. The ready availability of secondary data alleviates the need for data collection by the user. Depending on the accessibility pattern, data can be categorized into three types: open, shared and closed [2]. The concept of open data became popular due to the wide adoption of the world-wide web (WWW) and government agencies promoting open data. Open data are freely available to anyone for unrestricted commercial or non-commercial use.

It is permissible to share and access open data without any restriction of legal copyrights, patents or means of access. The major sources of open data are scientific data [3], linked open data (LOD) and government data [4]. Linked data deal with interlinking structured data available on the Internet and accessing them via semantic queries. DBpedia [5] and Freebase [6] are some examples of LOD. Shared data are available to a group of users who fulfill the access criterion. They can also be made available for public use under certain terms and conditions. Researchers sharing the datasets they created and synthesized through shareable media, with the prior condition of acknowledging the source of data wherever they are used, provide an example of shared data.

Closed data cannot be shared with any third party, and their access is restricted to the data owner or a special group within an organization due to security constraints and policies. Security and privacy measures are required to maintain the confidentiality and secrecy of closed data. Organizational-level data are kept confidential and "closed" by implementing firewalls or access control strategies. Business data, government data related to defense and security, etc. are examples of closed data.

Depending on the source of generation, data can also be categorized into internal data and external data. Internal data are data procured from reports, spreadsheets and analytical results internal to the organization. These data are taken into consideration for strategic decision-making for the successful working of the organization. They can be sales data, clients' data, financial data, production data or human resources data and are confined to internal matters of the organization. On the other hand, external data are data collected from sources external to the organization. Census data, traffic data, etc. are examples of external data. According to the statistical way of defining and measuring them, data can be classified into qualitative data and quantitative data. Quantitative data are measurable data which can be expressed with the help of numbers. Marks obtained in an exam, the weight of a person, etc. are examples of quantitative data. Qualitative data, on the other hand, deal with categorical variables, i.e., characteristics or qualities of the thing under consideration. They also depend on subjectivity or judgment. Social and financial status, smell, taste and color are some examples of qualitative data.

2.2 DIKW hierarchy

The popular model to establish the link between data, information, knowledge and wisdom is the DIKW hierarchy or knowledge pyramid [7], as shown in Fig. 4. From the viewpoint of the DIKW hierarchy, data can be conceived as raw information constituting numbers or signals. Information carries the meaning and purpose of data; it is a subject-oriented collection of facts and data. Knowledge is used for decision-making by interpreting available information; it is acquired through understanding, study and past experience. Data, information and knowledge relate to past and ongoing events. Wisdom, on the other hand, is used for predictions. It combines knowledge and accumulated experience to offer cognitive judgments and decisions. The complexity level of understanding increases from data to wisdom, as shown in the DIKW hierarchy. However, Frické criticized the DIKW hierarchy and demonstrated why it is an infeasible methodology for operations [8].
Fig. 4

DIKW hierarchy

2.3 Big data and their kinds

The NIST definition of big data [9] reads: "Big data consist of extensive datasets—primarily in the characteristics of volume, variety, velocity, and/or variability—that require a scalable architecture for efficient storage, manipulation, and analysis." Big data can be categorized into structured data, unstructured data and semi-structured data. Table 1 compares the kinds of big data.
Table 1

Comparison between kinds of big data

Parameter of comparison | Structured data | Unstructured data | Semi-structured data
Support for data model | Yes | No | No
Support for database schema | Yes | No | Yes
Level of interaction a computer can establish with the data | Easy | Difficult | Medium

  • Structured data possess a highly organized schema, standard format and layout. They are stored, accessed and processed in an organized and precise manner. They follow a predefined set of rules for modeling data types and the relationships between data. As elements in structured data are easily accessible and addressable, data processing algorithms work efficiently on structured data due to the ease of interaction. The epitome of structured data is the relational database, in which rows and columns are used to store data in a structured way. Structured query language (SQL) is the most sought-after language to seamlessly search and query a database holding structured data. To help websites store data in a proper format, structured data markup can be used, which describes the way of embedding structured data in websites so that they can be efficiently crawled by search engines. Examples of structured data markup are resource description framework in attributes (RDFa) [10], Schema.org [11], microformats [12] and microdata [13].

  • Unstructured data do not abide by a predefined format, structure or data model. Therefore, it is difficult to convert or map unstructured data into the format required for efficient processing. Machine log data, CCTV footage, X-ray images and videos are some examples of unstructured data. It is claimed that 80% of business information is generated in unstructured textual format [14]. According to a forecast by HP, there will be around 1 trillion sensors installed by 2030, and IoT data will be the most important part of big data [15]. From this, it can be seen that IoT will be the largest contributor of unstructured data.

  • Semi-structured data are data that may be irregular or incomplete, and have a structure that may change rapidly or unpredictably [16]. The difference between structured data and semi-structured data is subtle. Semi-structured data do not conform to a standard data model but possess tags or markers to describe the semantic elements embedded in them and to express ranking among records in the data. Due to this, they are also called self-describing data. They are characterized by partial, irregular and implicit structure. Their structure is irregular in the sense that there is no ordering among elements, not all elements are necessarily present, and elements with the same name may exhibit different forms. Integration of several sources of structured data may result in the generation of semi-structured data [17]. Semi-structured data are generally managed using XML and HTML. Web pages, documents of type standard generalized markup language (SGML), JavaScript Object Notation (JSON) documents and BibTeX entries are a few examples of semi-structured data. A short sketch contrasting structured and semi-structured representations of the same record is given after this list.
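The small Python sketch below illustrates the contrast drawn above: the same record stored as a structured relational row (queried with SQL) and as a self-describing JSON document. The table name, field names and values are purely hypothetical.

```python
import sqlite3
import json

# Structured: the record must fit a predefined schema and is queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, height_cm REAL)")
conn.execute("INSERT INTO student VALUES (?, ?, ?)", (1, "Asha", 162.5))
print(conn.execute("SELECT name, height_cm FROM student WHERE height_cm > 160").fetchall())

# Semi-structured: a self-describing JSON document; tags name the elements,
# optional fields may be missing, and nesting may vary from record to record.
doc = {"id": 1, "name": "Asha", "height_cm": 162.5,
       "hobbies": ["chess", "sketching"],            # element absent from the schema above
       "address": {"city": "Pune", "pin": "411001"}}
print(json.loads(json.dumps(doc))["address"]["city"])
```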

2.4 Characteristics of big data

Many research groups, firms and business organizations explained big data characteristics in terms of V’s as shown in Table 2.
Table 2

Characteristics of big data

Dimension | Definition
Volume | Generation of large amounts of data
Velocity | Speed of data generation
Variety | Emanation of data from different sources, having varied data formats and types
Variability | Change in the meaning of data according to the context
Veracity | Credibility or trustworthiness of big data
Value | Useful and refined insights obtained from data
Visualization | Representation of data and analytical results
Volatility | Quick change in data; deals with the amount of time for which data are valid and entitled to be stored
Validity | Correctness and accuracy of data
Viscosity | Time gap between the occurrence of an event and the time required to describe the event in terms of data
Virality | Speed at which data spread among the community
Venue | Heterogeneous data sources
Vocabulary | Content- and context-dependent metadata (schema, ontology, data model) describing the structure of data
Vagueness | Ambiguous meaning of data

Table 3 shows how many of these V's appear in the big data definitions put forth by various firms. The firm Elder Research has even proposed 42 V's for big data and data science [18].
Table 3

Comparison of big data characteristics in terms of V's

Firm | No. of V's in definition
Gartner [19] | 3 V's
IDC [20] | 3 V's
NIST [21] | 4 V's
IBM [22] | 4 V's
Enterprise Architects [23] | 5 V's
Impact Radius [24] | 7 V's
Data Science Central [25] | 8 V's
MapR Data Technologies [26] | 10 V's

2.5 Challenges associated with big data

Key challenges associated with big data are briefly discussed in this section.
  • High-dimensional data The main issue related to structured big data is their dimensionality. It becomes infeasible to visualize useful results from multi-dimensional data when the dimensionality increases by hundreds or thousands of folds. This challenge can be addressed by applying clustering methods or binning techniques to make the visualization process easier.

  • Volume The digital universe is expected to hold 40 zettabytes of data by 2020, which is 50 times greater than in 2010 [27]. Infrastructures are needed to store and analyze such mammoth data.

  • Security and privacy The security and privacy aspects of data should be considered while accessing and analyzing them.

  • Heterogeneity Unstructured data do not have any fixed structure. They constitute information obtained from different sources in varied data formats. The challenge is, thus, how to support analytics over heterogeneous data.

  • Speed The speed at which data are generated demands timely and accurate analysis. One solution to this challenge is to enhance the capacity of the processing units through either vertical or horizontal scalability. Vertical scalability (scale up/down) deals with adding more resources to existing servers in the form of memory, CPU or storage, thus making the system more powerful. Horizontal scalability (scale out/in), on the other hand, deals with adding commodity nodes (servers) to the system so that it works as a single logical unit. Figure 5 depicts the distinction between vertical and horizontal scalability. Another solution to address the speed of big data processing is to apply in-memory analytics, a method of querying data residing in the server's main memory. As the data are queried from main memory, there is no need to maintain indexing or store aggregated views in Online Analytical Processing (OLAP) cubes or tables. This approach enables faster query processing and helps in timely decision-making. Due to the reduction in RAM costs, in-memory analytics has become a popular option for deployment in big data environments. The pros of in-memory analytics are parallel processing, support for large datasets with a broad range of schemas, and analytical processing.
    Fig. 5

    Scale architectures designs—horizontal scalability and vertical scalability

  • Usability Extracting useful insights and value from large data and applying them in business processes is the biggest challenge of unstructured data. For this, new tools and APIs are required to access, search and process the data.

  • Data understanding Data understanding deals with the contextual aspects of big data, viz. where the data come from, which stakeholders are involved in the data processing cycle, which audience the analytical results are targeted at, etc. To address this challenge, strong domain expertise is required to avoid pitfalls in processing.

  • Data quality ISO/IEC 25012 defined the characteristics of data quality as inherent data quality and system-dependent data quality, and put forth 15 characteristics of data that conform to quality [28]. "Inherent data quality refers to the degree to which quality characteristics of data have the intrinsic potential to satisfy stated and implied needs when data is used under specified conditions." System-dependent data quality is referred to as "the degree to which data quality is reached and preserved within a computer system when data are used under specified conditions". There is an immense need to apply these data quality models to big data [29]. The issue of garbage in, garbage out (GIGO) should be tackled while applying a quality model to big data.

  • Dearth of big data expertise McKinsey forecast that, by 2018, the US will require 140,000–190,000 personnel with proficiency in analytical skills and around 1.5 million personnel to perform analytical and managerial functions related to big data [30]. From this, it is evident that a myriad of professionals is needed to fully leverage the power of big data, ranging from data engineers, data scientists, graphic designers (for visualization analytics), hardware and software experts, statisticians and mathematicians to analytical architects. To keep pace with state-of-the-art technology, teams consisting of people from various business groups, innovative technology groups and academia should be united to collaboratively face big data problems.

  • Data access, connectivity and synchronization among data sources According to a survey by McKinsey, not all data points are mutually connected. This hinders the process of accessing data from various sources. Therefore, the need arises to manage and aggregate data from multiple sources using a common platform. Another challenge, once a unified platform has been allotted for big data, is whether the data procured from different sources with different timeframes and different speeds are synchronized and consistent with the big data system. The data collected from one source should be up to date with the data obtained from other sources; this is termed time-based synchronization of data [31].

  • Data visualization Data visualization becomes very challenging when a large number of data points have to be plotted on a graph. For example, comparing 10 billion rows of sales data is difficult. Clustering techniques can help to visualize data at different levels of complexity. With the advent of high-performance techniques and in-memory technology, the process of data visualization can be strengthened.

  • Structural schema heterogeneity This is also referred to as the data exchange problem, in which the structural schemas used by the communicating systems S_source and S_destination for data exchange are different (i.e., the source and destination systems have different structural schemas). For communication between two systems with different structural schemas, a mapping is required. When several heterogeneous systems are federated, middleware performs the mapping, and these mappings are used for the data dissemination process. When more than two systems are involved, one-to-many or many-to-many mappings come into the picture. A minimal mapping sketch is given after this list.

  • Semantic interoperability This not only supports data exchange among systems having different structural schemas but also allows automatic interpretation of the exchanged data in a meaningful and correct way, as defined by the user.
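As a minimal sketch of the schema mapping mentioned above, the snippet below translates a record from a hypothetical source schema into a hypothetical destination schema; all field names, the conversion rules and the currency annotation are assumptions made purely for illustration.

```python
# Hypothetical middleware-style mapping between two structural schemas.
source_record = {"cust_name": "R. Mehta", "amt_inr": 1250.0, "ts": "2018-03-01T10:15:00"}

# Mapping table relating source fields to destination fields, with a conversion
# step that also attaches semantic annotations (here, the currency of an amount).
field_map = {
    "cust_name": ("customer",       lambda v: v),
    "amt_inr":   ("amount",         lambda v: {"value": v, "currency": "INR"}),
    "ts":        ("event_time_utc", lambda v: v),
}

def translate(record, mapping):
    """Rewrite a source record into the destination schema using the mapping."""
    return {dst: convert(record[src]) for src, (dst, convert) in mapping.items()}

print(translate(source_record, field_map))
```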

3 Big data storage systems

Big data storage systems are maintained in such a way that massive amounts of data can be easily stored, accessed and retrieved by applications and services providing analysis facilities. Big data storage systems are required to support fast input/output operations. The infrastructure of big data storage can be maintained using direct-attached storage or network-attached storage with redundancy and scale-out storage capability. Compute/processing nodes can be attached to storage nodes to process and retrieve large chunks of data quickly.

3.1 Relational versus NoSQL databases

Big data storage systems can be based on relational databases or NoSQL databases. Relational databases follow the ACID properties (atomicity, consistency, isolation, and durability), whereas NoSQL databases are governed by the CAP (consistency, availability, and partition tolerance) theorem. The CAP theorem states that it is not possible for any distributed system to simultaneously provide more than two of the following three guarantees: consistency, availability and partition tolerance. NoSQL databases can be classified into key-value stores, document-based stores, column-based stores, and graph-based databases.
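To make the classification above concrete, the sketch below shows the same (hypothetical) order record in three shapes: a relational row, a key-value entry and a document-store document. The identifiers and values are illustrative only and do not correspond to any particular database product.

```python
# The same order represented three ways; names and values are illustrative only.

# Relational (tabular) view: fixed columns, one row per fact.
relational_row = ("ORD-17", "2018-02-11", "C-042", 2, 799.00)

# Key-value store view: an opaque value addressed by a single key.
key_value = {"order:ORD-17": '{"date":"2018-02-11","customer":"C-042","qty":2,"total":799.0}'}

# Document store view: a nested, self-describing record that can vary per document.
document = {
    "_id": "ORD-17",
    "date": "2018-02-11",
    "customer": {"id": "C-042", "segment": "retail"},
    "lines": [{"sku": "B-881", "qty": 2, "price": 399.50}],
}
print(relational_row[0], list(key_value)[0], document["customer"]["segment"])
```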

3.2 In-memory versus on-disk databases

In-memory database systems use the computer's main memory to store data in order to provide faster response times. Due to advancements in multi-core processors and the availability of low-cost RAM, in-memory databases (IMDBs) have been widely used in business analytics and intelligence applications. IMDBs are among the top ten technologies affecting the IT market. The market for IMDBs is expected to rise to $6.58 billion in 2021 compared to $2.72 billion in 2016, i.e., almost 19% compound annual growth, highlighting their capability to provide real-time analytics on active and live data [32]. Turning to the history of IMDBs, IBM scientists developed the first in-memory engine, namely IMS/VS FastPath, in 1978 [33]. TimesTen (acquired by Oracle) was the first commercial in-memory database [34]. Generally, IMDBs are used in time-critical, big data and high-performance computing environments to obtain faster query responses. They differ from traditional databases, which use on-disk storage for storing data. Memory access time in IMDBs is less than disk access time in traditional databases. The implicit use of RAM for storage not only enables faster query retrieval but also alleviates the need for data indexing and for storing pre-computed views in aggregate tables or OLAP cubes, thus mitigating the cost of IT resources. To survive disruptions occurring in hardware and software, different durability measures are applied in IMDBs. Transaction logging records snapshots of the in-memory database periodically and saves them on stable/non-volatile storage. Depending on the implementation of transaction logging in an IMDB, a recovery procedure is followed to "redo" committed transactions or "undo" ongoing transactions when a system failure occurs. Data availability is ensured by database replication, in which a standby database is maintained for automatic failover. Master and replica databases may reside on a single machine or on different machines connected by a high-speed communication link. To provide durability for IMDB systems with volatile memory, non-volatile RAM (NVRAM)—battery-backed RAM—can be used, which has the capacity to retain the database in case of power loss. Some variants of NVRAM are ferroelectric RAM, magnetoresistive RAM, and phase-change RAM.
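A minimal sketch, not tied to any particular IMDB product, of how an append-only redo log plus periodic snapshots can provide durability for an in-memory key-value store; the file names, recovery policy and API are assumptions made for illustration.

```python
import json, os

class TinyIMDB:
    """In-memory key-value store with an append-only redo log and periodic snapshots."""

    def __init__(self, snapshot="snap.json", log="redo.log"):
        self.snapshot, self.log, self.data = snapshot, log, {}
        self._recover()

    def put(self, key, value):
        # Write-ahead: append the redo record to stable storage before applying it.
        with open(self.log, "a") as f:
            f.write(json.dumps({"k": key, "v": value}) + "\n")
        self.data[key] = value

    def checkpoint(self):
        # Persist a snapshot of the whole in-memory state, then truncate the log.
        with open(self.snapshot, "w") as f:
            json.dump(self.data, f)
        open(self.log, "w").close()

    def _recover(self):
        # After a restart: load the last snapshot, then redo logged operations.
        if os.path.exists(self.snapshot):
            self.data = json.load(open(self.snapshot))
        if os.path.exists(self.log):
            for line in open(self.log):
                rec = json.loads(line)
                self.data[rec["k"]] = rec["v"]

db = TinyIMDB()
db.put("sensor:42", {"temp": 21.7})
db.checkpoint()
```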

The upsurge of hybrid databases has blurred the distinction between IMDBs and traditional databases, since hybrid databases use both in-memory and disk-based storage to gain the advantages of high performance and reliability simultaneously. Table 4 compares the striking features of in-memory databases with traditional on-disk databases. On-disk databases suffer from an I/O bottleneck, so query optimization is more focused on reducing I/O costs. As the whole dataset resides in RAM, query optimization in IMDBs focuses on factors other than I/O cost. IMDBs use the T-tree data structure for indexing. A T-tree is a balanced index tree structure which directly stores pointers to data fields residing in memory rather than storing copies of the indexed data in the index nodes. On-disk databases use the B-tree as the primary data structure for indexing. Explicit durability features are incorporated in IMDBs, whereas on-disk databases support implicit durability. IMDBs can be accessed via a shared memory interface or a JDBC/ODBC interface; traditional databases, on the other hand, are accessed using a client–server socket interface. As IMDBs support faster processing, a coarse level of locking is used, whereas on-disk databases support fine-granular concurrency and enforce various concurrency mechanisms. In-memory databases are used in time-critical applications where an immediate response is expected, such as IP network routing, telecom switching and industrial control systems. They are also used in real-time operating systems running on embedded devices. Apart from this, IMDBs are deployed in financial services and stock markets for quick manipulation of data. Social networking sites and e-commerce sites also use IMDBs for caching part of the on-disk database for faster processing of query requests. On-disk databases can be used in OLTP systems, inventory management, enterprise data warehousing, data marts, health sectors and many more. Slow recovery time is one of the biggest challenges of IMDBs from the storage perspective. The speed of data transfer from disk to the in-memory database is limited by the bandwidth of the communication channel, so the recovery process is somewhat slower in IMDBs. To improve recovery time, high-bandwidth recovery mechanisms can be used.
Table 4

Difference between in-memory databases and traditional on-disk databases

Parameter | In-memory database | Traditional on-disk database
Nature of data | Data are either persistent or volatile, depending on the kind of memory (volatile or non-volatile) and the durability measures implemented in the database product | Data are stored on persistent disks
Bottleneck due to disk I/O | Free from the disk I/O bottleneck | Disk I/O is the bottleneck
Data size | Data size is bounded by the amount of main memory available | Database size is virtually unlimited
Indexing | T-tree data structure | B-tree data structure
Durability | Explicit durability achieved using transaction logging, savepoints, checkpoints, data replication, non-volatile RAM and its variants | Implicit durability
Access method | Shared memory; interfaces offering Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) | Client–server-based socket
Concurrency level | Coarse level of locking | Granular level of locking
Usage | Optimized for special applications such as real-time systems, telecommunication switching, network routing, industrial control systems, financial systems, e-commerce systems, social networking sites | Used in a wide range of applications such as OLTP, data marts, data warehousing, inventory management

4 Big data analytics

This section discusses evolutionary phases of analysis, process models used for analytics and applicability of models for big data analytics.

4.1 Evolution of analytics

The evolution of analytics, as shown in Fig. 6, can be perceived through the following phases: analytics 1.0, analytics 2.0 and analytics 3.0. Technically, analytics 1.0 marked the era of traditional analytics involving business intelligence, enterprise data warehousing and data marts. This phase spanned the mid-1950s until the mid-2000s. In this phase, analysis was restricted to an organization's internal data, usually of small magnitude and in structured format.
Fig. 6

Evolution of analytics

Examples of data sources could be sales and production data, customer interaction data and financial data. The complex task of feeding data into the warehouse (extract, load and transform) by the analyst meant more time was spent on data preparation than on actual analysis of the data. The main focus of analytics 1.0 was to perform descriptive analytics using batch processing over historical data, addressing the 'What happened?' aspect, and it used to take several weeks or months to obtain analytical results. No emphasis was given to predictions or future insights for decision-making. The characteristics of analytics 1.0 are support for predefined queries, small and structured data, and usage of data restricted to tasks internal to the organization. Decisions were taken based on experience and intuition due to a lack of technical communication between analysts and decision-making executives.

Analytics 2.0 emerged when Internet-based service companies like Google and e-commerce companies like eBay started to analyze data at scale. It was mainly concerned with data of unstructured format and streamed nature. Such big data constituted data from external sources like social media data, scientific project data such as human genome data, transportation data, retail data, etc. Due to the large scale of the data, traditional processing using a centralized server was not a feasible option. This need was satisfied by the Hadoop distributed processing system [35] for performing batch analytics on distributed clusters in parallel. To alleviate the problem of heterogeneous data, NoSQL databases came into the picture. Therefore, analytics 2.0 is marked by the proliferation of big data, Hadoop and NoSQL databases. The emergence of the cloud computing paradigm helped to provide storage and processing capabilities for big data. Complex, unstructured, massive and external data are the characteristics of data pertaining to the analytics 2.0 phase. The technologies that emanated in this era are 'in-memory analytics' and 'in-database analytics', which process data locally in memory rather than accessing them from disk. Predictive analytics is more emphasized in analytics 2.0 than descriptive analytics; the focus is on the 'What will happen?' aspect. Predictive analytics may take weeks to provide insight. Here, data scientists play the major role in performing analytics, and decision-making is based on experience and predicted insights.

We are currently in the phase of analytics 3.0, which focuses on prescriptive analytics, and the near future will be governed by the same. It is hybrid analytics amalgamating the benefits of the analytics 1.0 and 2.0 eras [36]. The current era is marked by the proliferation of the Internet of Things, big data and cloud computing. The aim of this era is to automate the process of decision-making, incorporate analytics into working processes and bring optimization; collectively, this can be called smart analytics. Analytics 3.0 is expected to process and render large amounts of insight within seconds. Table 5 compares the evolutionary phases of analytics on the basis of goals, emerged technologies, techniques, and the basis for decision-making.
Table 5

Comparison of evolutionary phases of analytics

Parameter | Analytics 1.0 | Analytics 2.0 | Analytics 3.0
Era of | Data warehousing, data mart (traditional analytics) | Big data | Fast impact for data economy
Analytics performed | Batch processing and descriptive analytics | Descriptive analytics and predictive analytics | Descriptive, predictive and prescriptive analytics (highly focused)
Goal | Descriptive results over historical data | Hardware and software products performing data analytics and future insights | Automatic decision-making
Questions addressed | What happened? Why did it happen? | What will happen? Why will it happen? | When and why will it happen? Which action should be recommended, taking into account what will happen?
Processing time | Weeks/months (slow) | Days/weeks (medium) | Seconds (fast—near real time)
Kind of data | Structured | Unstructured | Structured and unstructured
Stakeholder involved | Data analyst | Data scientist | Collaborative teams, chief analytics expert
Emerged technologies | Relational databases, mostly SQL | Big data, Hadoop, NoSQL, cloud computing | Cyber-physical systems, IoT, big data, cloud computing
Technique | Business intelligence, centralized processing | Distributed processing, visual analytics | State-of-the-art techniques for governing decision science, agile development, embedded analytics
Decision-making | Intuition and experience | Experience and predictions | Automated decision-making

4.2 Process models for data analytics

For extracting value from data, a systematic procedure is followed. Such a procedure is generally expressed using a process model for analysis. Different types of process models for performing data analysis are discussed here. The applicability of these process models in big data analytics environments is compared in Table 6.

4.2.1 KDD model

Knowledge discovery in databases (KDD) deals with extracting useful information from large-scale data. KDD enhances the value of data. It is a process model capable of transforming raw data into valuable information nuggets. Being pioneers in stating the phases of knowledge discovery from data, Fayyad et al. [37] stated that the KDD process model mainly constitutes five phases which work in an interactive and iterative manner involving user decisions. The KDD process model may contain loops between any two phases.

Figure 7 depicts the phases and methodologies applied in the KDD process model. First, according to prior knowledge, the application domain from which useful data are to be extracted is selected and the target dataset is built. The pre-processing phase deals with data cleaning, handling of missing data, and removal of noise and outliers. DBMS issues like database schema, type of database and mapping of missing data are addressed in this phase. Dimension reduction techniques, transformation and projection methods are applied to the pre-processed data to obtain the transformed data. Considering the purpose of the data model, an appropriate data mining algorithm is chosen. Data mining is the crucial step in KDD; it deals with the application of data analysis and discovery algorithms to find enumerated patterns in data. Data mining tasks include classification, regression, clustering, anomaly detection, association rule mining, summarization, etc. The patterns obtained from the data mining phase are translated into an understandable format using visualization tools. The extracted knowledge is used for making decisions, or it is documented or disseminated to the intended end users.
Fig. 7

Knowledge discovery in databases (KDD) [37]
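A compact, hedged walk through the KDD phases on a synthetic dataset is sketched below; it assumes scikit-learn is available, and the data, thresholds and algorithm choices (PCA, k-means) are arbitrary illustrations rather than prescribed by the KDD model itself.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
raw = rng.normal(size=(300, 10))
raw[:150, :3] += 4.0                       # two latent groups hidden in the raw data

# Selection + pre-processing: keep relevant columns, drop gross outliers, scale.
selected = raw[:, :6]
clean = selected[np.all(np.abs(selected) < 6, axis=1)]
scaled = StandardScaler().fit_transform(clean)

# Transformation: project to a lower-dimensional space.
reduced = PCA(n_components=2).fit_transform(scaled)

# Data mining: discover patterns (here, clusters).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)

# Interpretation/evaluation: summarize the discovered pattern for the end user.
print("cluster sizes:", np.bincount(labels))
```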

4.2.2 CRISP-DM model

CRISP-DM is an acronym for the cross-industry standard process for data mining [38]. It is a data mining process model adopted by data experts to solve data mining problems. It consists of six stages, as shown in Fig. 8. A detailed discussion of these stages follows.
Fig. 8

CRISP-DM process model [38]

  • Business understanding This initial phase of the CRISP-DM model deals with setting project goals according to the business context and formulating the problem definition in data mining terms. This phase also involves framing the project plan and a standard decision model.

  • Data understanding Data collection and activities to get acquainted with the data are performed as prerequisite tasks for data understanding. Identifying data quality problems associated with data mining, inferring initial insights from the data and accordingly devising hypotheses for interestingness measures in the data are carried out.

  • Data preparation Data pre-processing and cleaning tasks are iteratively performed in this phase. It involves applying attribute and schema selection techniques and transformation methods to the raw data obtained in the previous phase to prepare the final dataset, which is given as input to the modelling tool.

  • Modelling In this phase, various data models are selected (conceptual, logical, physical and enterprise models) and applied. The aim is to optimize the data values for various parameters defining the model.

  • Evaluation A qualitative data model is built from the data analysis perspective. The processes used for devising the model, and the actual model itself, are rigorously evaluated. This phase checks whether all business issues have been thoroughly addressed and verifies the business objectives. It is expected to finalize how the data mining results will be used in the business process.

  • Deployment Once the model is created, the task is to organize the useful results and make them available for further action by end users. This phase may output an analytics report, perform data scoring or involve a complex data mining process. To understand the know-how of the model and the upfront inputs required for it, the deployment phase is expected to be carried out by the customer rather than the analyst.

4.2.3 SEMMA model

The Sample, Explore, Modify, Model, Assess (SEMMA) model is a logical, functional tool set for performing data mining tasks put forth by SAS [39] for Enterprise Miner. Figure 9 depicts the stages followed in the SEMMA process model.
Fig. 9

SEMMA process model [39]

  • Sample Sample data are extracted from a large dataset in such a way that they are not only representative and reliable, possessing the significant features of the large data, but also small enough in size to be handled with ease. Sampling reduces processing time and allows timely retrieval of business insights. This phase partitions the data into training, validation and testing sets.

  • Explore To understand and explore the data, obtain insights and interesting patterns, and fine-tune the data discovery process, this phase uses visual exploratory techniques and statistical techniques like factor analysis and clustering.

  • Modify Due to the dynamic and iterative nature of data mining, relevant parameters and variables are selected and transformed while selecting the model for analysis. Depending on the insights obtained and the patterns detected by exploring the data in the previous phase, variables are grouped or reduced to fit the given model. The values of parameters and variables change as the data to be mined change.

  • Model For getting reliable predictions, data modeling is performed based on the type of data. Modeling techniques include neural networks, statistical models, tree-based models, etc.

  • Assess The reliability and validity of the analytical results obtained by the data mining process are checked by data assessment. The efficiency of the data model can be checked against (a) the dataset used for constructing the model, (b) a dataset which was not used for training the model and (c) known data, to measure how accurately the model estimates predictions; a small sketch of these partitioning and assessment steps follows this list.
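The following sketch illustrates the Sample, Model and Assess steps on a synthetic dataset; scikit-learn is assumed to be available, and the model choice and split ratios are arbitrary, not part of the SEMMA specification.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Sample: partition into training, validation and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Model: fit a tree-based model on the training partition.
model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

# Assess: check performance on the data used for fitting, on held-out validation
# data, and on the untouched test data.
for name, (Xp, yp) in {"train": (X_train, y_train),
                       "validation": (X_val, y_val),
                       "test": (X_test, y_test)}.items():
    print(name, round(accuracy_score(yp, model.predict(Xp)), 3))
```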

5 State-of-the-art big data analytics approaches

A substantial amount of work has been carried out in the domain of big data analytics. Table 7 compares selected state-of-the-art approaches for big data analytics based on the issues addressed, features provided, technology, approach, dataset and the phase of KDD on which the contribution of each paper is focused.
Table 6

Applicability of process models in big data environments

Process model | Focus | Applicability to big data
Snail shell process model [40] | Iterative phases of knowledge discovery via data analytics (KDDA) for big data, and end-to-end decision-making in big data environments | Strong
CRISP-DM [38] | Data mining approaches used by data mining experts to solve problems | Weak
SEMMA [39] | Set of sequential steps for carrying out data mining tasks | Weak
KDD [37] | Extract useful information from raw data | Medium

Table 7

Comparative study of state-of-the-art approaches for big data analytics

System [Ref.] | Addressed issues and features | Technology | Approach | Dataset | Phase of KDD
Analysis farm [41] | Storage scalability, computation scalability, query agility | Cloud computing (OpenStack), NoSQL (MongoDB) | Scalable aggregation | Network log | Analysis
RCFile [42] | Data storage and loading, query processing, dynamic workload | Hadoop | Data placement policies, column-wise compression | Facebook data | Analysis, query processing
YSmart [43] | Intra-query correlation, query optimization, redundant computations, I/O operations and network transfers, scalability | Hadoop, cloud computing (Amazon EC2) | Correlation-aware SQL-to-MapReduce translator | DSS workloads, click-stream analysis workloads | Query optimization
Perf Pred [44] | Concurrent join queries, reduction in contention and thrashing, scalability, star schema, benchmarking | RDBMS, PostgreSQL | CJOIN operator for concurrent star-schema queries, adaptive query processing pipeline | Star Schema Benchmark query templates | Query processing
P-OLAP [45] | Graph analysis, scalability, data heterogeneity | RDF query language (SPARQL) | OLAP, MapReduce-based graph processing | DBLP and Amazon online rating DB | Query processing
RFID-Cuboid [46] | Logistics decision-making, visualization | C++, MATLAB | OLAP, RFID-Cuboid-based warehousing, trajectory pattern mining and interpretation | RFID logistics dataset | Data warehousing, mining
HaoLap [47] | High-dimensional data, data loading efficiency | MapReduce | OLAP, shared-nothing architecture, integer coding method, traversing, partitioning and linearization | Ocean dataset | Data analysis
Data Partitioning [48] | High-dimensional data | MapReduce, HBase | OLAP, indexing and partitioning, full source scan algorithm | TPC-H benchmark dataset | Data analysis
Pipeline61 [49] | Data pipelining, version control, dependency management | Hadoop, Spark, Bash scripts | DAG scheduling | CSV, text and JSON data | Data storage, data integration
TraceAnalyzer [50] | Batch processing, stream processing, scalability, availability | Cloud computing (SolrCloud), Hadoop, Spark | Layered architecture, REST API | Trace data from Google trace, time series data | Data analysis
CIM-based visual data mining framework [51] | Variety, interoperability, volume, velocity, utilization and analytics | Model Query Language (MQL) | Layered architecture, data mining, visual and query-driven data mining | Utility data | Data integration, analysis, data visualization
AlgorithmSeer [52] | Indexing, ranking, algorithm search engine, identification, document element extraction | Solr/Lucene open-source indexing and search system | Hybrid machine learning (ensemble ML + rule-based ML), TF-IDF-based cosine similarity | Scholarly documents from the CiteSeerX repository | Data indexing, data extraction, data analysis
UTIM [53] | Unlicensed taxi identification, efficiency and accuracy | Cloud computing, data-driven transportation, HBase, HDFS | Statistics, machine learning, feature extraction, SVM-based training | Real-time vehicle trajectory data | Data acquisition, data analysis, identification
DiploCloud [54] | Distributed RDF data management, scalability, efficiency | Cloud computing (Amazon EC2), RDF triple store | Standard graph and adaptive partitioning, lexicographical tree-based parsing, declarative storage patterns | RDF data, LUBM 1600 dataset | Data storage, query processing
MBDA [55] | Variety and value aspects of data; volume, velocity and volatility aspects of data | Apache Spark, Hadoop | Deep learning, iterative MapReduce computing, load sharing platform | Mobile big data, accelerometer data from action tracker dataset | Data analysis
IoT-RFID [56] | Data heterogeneity, query speed, even data distribution | IoT, MongoDB, EPC information services, Fosstrak RFID software platform | Horizontal data partitioning, compound shard key | Real RFID/sensor big data generated from a supply chain | Data integration, data storage, query processing
MR-BDA [57] | Network traffic reduction | Hadoop | Decomposition-based distributed algorithm, online algorithm | Dump files of Wikimedia | Data pre-processing
FastBootstrap [58] | Data volume, statistical robustness, scalability, efficiency | Distributed computing | Statistical inference, bag of little fast and robust bootstraps method | Simulated data, Million Song dataset (audio) from UCI ML | Data analysis
SWFT [59] | Curse of dimensionality, concept drifting, fault detection | Statistics, machine learning | Stream data mining, angle-based subspace anomaly detection, sliding window method, unsupervised online subspace learning | Stream data | Data analysis, anomaly detection
SparseComp [60] | Data volume, classification handling massive computations, scalability | MATLAB | Similarity-based ML, approximate PCA, supervised normalized cut, k-NN, SVM | UCI ML, LIBSVM and ACM SIGKDD datasets | Data analysis
FlightPred [61] | Recommendation, balancing of dataset, prediction, scalability | Cloud computing (Microsoft Azure), Hadoop, HDInsight | MapReduce-based repartition join for data transformation, random under-sampling algorithm for balanced dataset, parallel random forest, classification | Airline flights dataset, weather dataset | Data pre-processing, data mining
Social Inference [62] | Data sparseness, handling missing data, scalability | Cloud computing (Amazon EC2), MapReduce | Entropy-based model, variant of k-d tree (rectangular parallelepiped) | Social network datasets (Gowalla, Brightkite, Foursquare) | Data analysis
Simba [63] | Expensive query evaluation plans, fault tolerance, high throughput, low latency, scalability | Apache Spark SQL | Spatial indexing over RDDs, cost-based optimization for spatial query plans, SQL context module to execute queries in parallel for high throughput | OSM and GDELT real datasets, random clusters as synthetic dataset | Query processing, query optimization
Rheem [64] | Platform independence, multi-platform task execution | Data cleaning tool | Data processing abstractions on top of various data processing platforms, multi-platform optimization | UCI ML, sensor data from oil wells, traffic data | Data processing, abstraction, query optimization
Quegel [65] | Interactive and batch querying, cluster-based resource management, straggler problem, real-time response | Apache Giraph, HDFS | Client–server architecture, superstep-sharing execution model to utilize cluster resources | Graph datasets (Twitter, LiveJ, BTC, DBLP, XMark) | Query processing
DataLab [66] | Version management, code revision, separate data and metadata management | GitLab API, HDFS, MongoDB, Python and R, Spark, SPArse Modelling tool | Data workflow, ML-based non-negative matrix factorization algorithm, superpixel approach for image segmentation | Semi-structured data table, 120,000 images from the BDGP project | Big data workflow management
Crime Inference [67] | Community-centric inference | Statistics, machine learning | Statistics and ML, feature extraction and construction, normalization, Pearson correlation, linear regression, negative binomial regression, leave-one-out evaluation scheme | 112,000 POIs from Foursquare, taxi flow data | Data mining, outlier detection
BAD Asterix [68] | Actionable notification, context-based data subscription, parallel data analytics, scalability | AsterixDB | Continuous channel semantics, data feed adapter for streaming, DDL for repetitive and continuous channels | Subscription datasets | Notification management and distribution
BigDatalog [69] | Complex analytics, declarative queries | Apache Spark | Parallel evaluation technique, Spark-based physical planning, scheduler optimizations for recursive queries, distributed monotonic aggregation and evaluation | Synthetic and real-world graphs; graph data from LiveJournal, Orkut, Twitter and Arabic | Query processing and optimization
LargeVis [70] | 2D and 3D data visualization, scalable to millions of data points | Statistics, machine learning | Graph construction by approximate k-NN, asynchronous stochastic gradient descent | Text, image and network data | Data visualization
Hybrid ICT [71] | Energy consumption management, anomaly detection, forecasting | BigETL, Spark, Hive, PostgreSQL with MADlib | In-database data analytics, forecasting algorithms (PARX, ARIMA, Holt–Winters), multiple regression, k-NN | Smart meter data, electricity consumption data | Data analysis
D and C for M2M [72] | Real-time and offline processing, memory management, feature engineering | Hadoop, Java | Divide-and-conquer approach, block-wise vertical data representation, data aggregation, data fusion algorithm | Real-time and offline data, earth observatory data images | Data analysis
CQSIM-R [73] | Failure handling of scalable storage components, reliability, analytical and simulation modelling, scalability, soundness, performance | Simulation tools | Exponentially distributed failure modelling, event-based Monte Carlo simulation | 4 TB of storage drives for experiments with 3 years of mean time to failure | Maintenance (post-deployment phase)
BigDebug [74] | Debugging distributed and scalable data-intensive applications, interactive and dynamic debugging, dynamic error fixing and resuming | Apache Spark, HDFS | Combination of breakpoints and watchpoints, record-level error tracing, latency profiling, crash identification | Programs of WordCount, Grep and PigMix queries; 1 TB of data for debugging | Maintenance (post-deployment phase)
Big data Query [75] | Workload characterization, data locality-aware query processing | Cloud computing (Amazon EC2), GT-ITM tool | Locality-aware online query evaluation, query evaluation plans for single-source and distributed data | 512 GB of data distributed over data centers | Query processing
Predictive analytics [76] | Near real-time predictions | Apache Kafka, Python, Esper CEP | Complex event processing, adaptive prediction algorithm, regression modelling | Real-time transportation data | Data analysis and predictions

5.1 OLAP based approaches

Much research work has focused on extending the use of OLAP for big data analytics [45, 46, 47, 48]. Beheshti et al. [45] presented multi-dimensional and multi-view graph data using MapReduce-based graph processing. They integrated the newly designed Process OLAP (P-OLAP) framework with the existing ProcessAtlas framework.

To extract trajectory knowledge from the RFID-enabled data of logistics management, RFID-Cuboids are designed [46]. The approach encompasses cuboid-based warehousing, cleaning, compression, classification, spatio-temporal pattern recognition and visualization. Map tables are used for linking cuboids to achieve information-level granularity and abstraction. This method is used for decision-making and excavating trajectory knowledge, and can be extended to planning of material equipment and scheduling. The HaoLap system is a fusion of Hadoop-based MapReduce and multi-dimensional online analytical processing [47]. It exploits the power of MapReduce for job execution and also supports OLAP processing based on multi-dimensional cubes.

Analyzing the feasibility of OLAP queries (SQL aggregations like sum, min, max, etc.) on big data is termed small analytics in [48]. In this work, HBase is used to store data and MapReduce is used for running algorithmic jobs. The data to be analyzed are stored in a denormalized fact table. To build multidimensional data cubes over Hadoop, two algorithms have been implemented, viz. the index random access (IRA) and full source scan algorithms. These algorithms allow secondary indexing and partitioning as tuning features.
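The snippet below is not the IRA/full-source-scan implementation of [48]; it is only a small pandas illustration (an assumption of tooling) of the kind of OLAP-style aggregation over a denormalized fact table that such "small analytics" performs, with fabricated dimensions and values.

```python
import pandas as pd

fact = pd.DataFrame({
    "region":  ["east", "east", "west", "west", "west"],
    "product": ["A", "B", "A", "A", "B"],
    "year":    [2016, 2016, 2016, 2017, 2017],
    "sales":   [120.0, 90.0, 200.0, 150.0, 60.0],
})

# A two-dimensional cube slice: total/min/max sales per (region, year).
cube = fact.groupby(["region", "year"])["sales"].agg(["sum", "min", "max"])
print(cube)

# Roll-up along the year dimension.
print(cube["sum"].groupby(level="region").sum())
```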

5.2 Big data quality monitoring approaches

It is crucial to verify the quality aspects of a big data model. Hall et al. [73] put forth the CQSIM-R tool for predicting the failure of storage components deployed in big data environments. Similarly, to ease the process of debugging large-scale big data models, the real-time and interactive debugging primitive BigDebug is proposed in [74]. For failure handling and achieving reliability, it is a general trend to manage big data in the cloud environment, in which multiple data centers are distributed, geographically separated and connected via the Internet. Querying big data in such scenarios is challenging, because data are spread over multiple data centers and the resource requirements of a query may not be handled by a single data center. Xia et al. [75] put forth an online query evaluation framework based on locality awareness of data. For this, they defined a metric for analyzing the workload capacities of multiple data centers and the resource requirements for query evaluation. This metric is used in online algorithms for query evaluation on both single and multiple data centers.

To justify the significance of workload metrics, a workload characterization method named Metric Importance Analysis is proposed in [80]. Balliu et al. [81] also consider workload characterization, analyzing large trace logs generated by clusters and servers; their analyzer supports R, SQL and Hadoop MapReduce for user queries and uses HDFS and SQLite as backend storage.

In-memory cluster computing platforms are widely used for executing large sets of big data workloads in parallel. Owing to the complexity of such platforms and the lack of tools for understanding and optimizing the workloads, cluster resources may remain underutilized or applications may fail. By finding machine learning-based statistical correlations among metrics collected at the system and application levels, optimal parallelism strategies can be derived [100]; this characterizes the performance of big data workloads running in parallel environments. A toy illustration of the correlation step is sketched below.
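
The sketch below conveys the spirit of [100] only: per-run metrics and their values are fabricated, and a simple correlation with runtime stands in for the full machine-learning pipeline used in that work.

```python
# Toy sketch of workload characterization via metric correlation: given
# per-run system/application metrics and the observed job runtime, find which
# metrics are most strongly correlated with performance. All values are
# fabricated for illustration.
import pandas as pd

runs = pd.DataFrame({
    "parallelism":      [16,   32,   64,   128,  256],
    "cpu_utilization":  [0.35, 0.55, 0.80, 0.92, 0.95],
    "gc_time_fraction": [0.02, 0.03, 0.05, 0.11, 0.24],
    "shuffle_bytes_gb": [12,   14,   18,   30,   55],
    "runtime_seconds":  [900,  520,  340,  310,  420],
})

# Correlation of each metric with runtime; strongly correlated metrics are
# candidates for tuning (here GC pressure and shuffle volume grow when the
# job is over-parallelized).
corr = runs.drop(columns="runtime_seconds").corrwith(runs["runtime_seconds"])
print(corr.sort_values())

# Pick the parallelism level with the lowest observed runtime as a baseline.
best = runs.loc[runs["runtime_seconds"].idxmin()]
print("Best observed parallelism:", int(best["parallelism"]))
```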

5.3 Complex event processing-based approaches

IoT applications based on complex event processing (CEP) need to analyze heterogeneous datasets and predict complex events in near real time. Although CEP performs analytics at near real time, it does not exploit historical data. Akbar et al. [76] therefore proposed a prediction algorithm that combines CEP with ML-based analytics so that both historical and real-time data inform the prediction. A hedged sketch of this pattern is given below.
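
The following sketch shows the general pattern only: plain Python stands in for the Kafka/Esper stack used in [76], the congestion threshold is an illustrative assumption, and a simple linear trend replaces the adaptive prediction algorithm.

```python
# Hedged sketch of combining CEP-style windowing with a learned predictor.
# Plain Python stands in for Kafka/Esper; the "congestion" threshold and the
# linear model are illustrative assumptions, not the algorithm of [76].
from collections import deque
import numpy as np

WINDOW = 5
THRESHOLD = 80.0          # e.g. average vehicle count that counts as congestion

history_x, history_y = [], []   # (time index, observed value) pairs
window = deque(maxlen=WINDOW)

def on_event(t: int, value: float):
    """Called for every incoming measurement from the stream."""
    window.append(value)
    history_x.append(t)
    history_y.append(value)

    # Complex event: sustained high load over the whole window.
    if len(window) == WINDOW and sum(window) / WINDOW > THRESHOLD:
        print(f"t={t}: complex event detected (sustained congestion)")

    # Prediction: fit a linear trend on the historical data and extrapolate
    # one step ahead (a stand-in for the adaptive model of [76]).
    if len(history_x) >= 3:
        slope, intercept = np.polyfit(history_x, history_y, 1)
        print(f"t={t}: predicted next value = {slope * (t + 1) + intercept:.1f}")

for t, v in enumerate([40, 55, 70, 85, 90, 95, 88]):
    on_event(t, v)
```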

5.4 Approaches providing BDA as a service

Depending on application requirements, big data analytics spans the infrastructure layer (machines for analytics, e.g., high-performance computing servers, public or private clouds), the platform layer (Hadoop, Spark) and the software layer (analytics algorithms). The BigProvision framework [77] lets the user select the computing infrastructure and analytics platform, thereby providing a comprehensive system as a service. BigProvision evaluates candidate analytics approaches against the data and analytics requirements set by the user; after evaluation, a provisioning configuration is selected and the whole analytics environment is set up automatically. Big data are generated at different locations and by different sources. To use such heterogeneous data for complex event processing, a knowledge-based platform is developed in [78]. This platform collects data in a distributed manner using publish/subscribe messaging and applies schema mapping to address data heterogeneity; it also uses ontology extraction and semantic analysis for advanced big data processing. A minimal sketch of the publish/subscribe collection and schema-mapping idea follows.
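
Topics, source schemas and the unified target schema in the sketch below are illustrative assumptions; they show the general publish/subscribe plus schema-mapping pattern, not the actual platform of [78].

```python
# Hedged sketch of publish/subscribe collection with schema mapping.
# Topics, source schemas and the target schema are illustrative assumptions.
from collections import defaultdict

subscribers = defaultdict(list)

def subscribe(topic, handler):
    subscribers[topic].append(handler)

def publish(topic, message):
    for handler in subscribers[topic]:
        handler(message)

# Per-source mapping from heterogeneous field names to a common schema.
SCHEMA_MAPS = {
    "sensor-a": {"temp_c": "temperature", "ts": "timestamp"},
    "sensor-b": {"temperatureCelsius": "temperature", "time": "timestamp"},
}

def normalize(source, record):
    mapping = SCHEMA_MAPS[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

def analytics_handler(message):
    unified = normalize(message["source"], message["payload"])
    print("unified record:", unified)

subscribe("iot-readings", analytics_handler)
publish("iot-readings", {"source": "sensor-a", "payload": {"temp_c": 21.5, "ts": 1}})
publish("iot-readings", {"source": "sensor-b", "payload": {"temperatureCelsius": 22.0, "time": 2}})
```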

5.5 Hadoop MapReduce-based approaches

HDFS distributes datasets over nodes for parallel execution, but because sub-datasets tend to cluster, some nodes carry a heavier workload than others. The DataNet method alleviates this by distributing datasets according to the storage distribution of sub-datasets [79]. To place data on nodes, a metadata structure is designed based on HashMaps and Bloom filters, and distribution-centric algorithms spread sub-datasets uniformly across the nodes, achieving balanced and efficient parallel execution. Because big data analytics leverages high-performance clusters, identifying important metrics and characterizing the workload remain essential for efficient execution. A minimal Bloom-filter sketch of such per-node metadata is given below.
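
The Bloom filter below is a self-contained toy showing how a node can cheaply record which sub-dataset keys it stores; the filter size, hash count and key names are illustrative and do not reproduce DataNet's actual metadata layout [79].

```python
# Minimal Bloom-filter sketch of per-node sub-dataset metadata. Sizes and
# hash choices are illustrative; a real deployment would size the filter from
# the expected key count and the target false-positive rate.
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)

    def _positions(self, key: str):
        # Derive several positions by salting the key before hashing.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))

# Per-node metadata: which sub-dataset records does this node hold?
node_filter = BloomFilter()
for record_key in ["user-17", "user-42", "user-99"]:
    node_filter.add(record_key)

print(node_filter.might_contain("user-42"))   # True
print(node_filter.might_contain("user-7"))    # False (with high probability)
```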

Many firms perform analytics on event logs for business intelligence. The logs are ordered temporally and then grouped by user ID to ease further analysis, a method known as relative order preserving-based grouping (Re-Org). Re-Org tasks are typically deployed on MapReduce, where they suffer from the sort-merge operations of the shuffle phase. The GOM-Hadoop framework [82] uses a group-order-merge approach that exploits the ordering property of the data and proposes an efficient shuffling strategy for faster execution of Re-Org tasks. The grouping step itself is sketched below.
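
This plain-Python sketch shows only the Re-Org task itself (grouping temporally ordered log events by user ID while preserving each user's relative order); field names are illustrative, and the sketch does not reproduce GOM-Hadoop's distributed group-order-merge shuffle [82].

```python
# Plain-Python sketch of a Re-Org task: group temporally ordered log events
# by user ID while preserving each user's relative event order.
from collections import OrderedDict

log = [  # already ordered by timestamp
    {"ts": 1, "user": "u2", "action": "login"},
    {"ts": 2, "user": "u1", "action": "login"},
    {"ts": 3, "user": "u2", "action": "search"},
    {"ts": 4, "user": "u1", "action": "purchase"},
    {"ts": 5, "user": "u2", "action": "logout"},
]

grouped = OrderedDict()
for event in log:                      # a single ordered pass keeps temporal order
    grouped.setdefault(event["user"], []).append(event["action"])

for user, actions in grouped.items():
    print(user, "->", actions)         # u2 -> ['login', 'search', 'logout'] ...
```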

5.6 Deep learning-based approaches

Deep learning algorithms help to automate feature engineering; they extract complex data representations from unlabeled data in an unsupervised manner. Neural network- and memory-based learning have previously been applied to data mining tasks [101], where a hybrid strategy combining the two performs online feature weighting and selection for regression and classification. Yan et al. [102] put forth a hierarchical convolutional neural network-based semantic indexing method for biomedical documents. A minimal autoencoder sketch of unsupervised feature extraction follows.
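
The autoencoder below is a minimal tf.keras illustration of unsupervised representation learning for automated feature engineering; the layer sizes and synthetic data are illustrative assumptions, not the models of [101] or [102].

```python
# Minimal autoencoder sketch (tf.keras) illustrating unsupervised feature
# extraction from unlabeled data. Layer sizes and the synthetic data are
# illustrative assumptions.
import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 64).astype("float32")   # unlabeled records, 64 raw features

inputs = tf.keras.Input(shape=(64,))
encoded = tf.keras.layers.Dense(16, activation="relu")(inputs)     # compressed representation
decoded = tf.keras.layers.Dense(64, activation="sigmoid")(encoded)

autoencoder = tf.keras.Model(inputs, decoded)
encoder = tf.keras.Model(inputs, encoded)

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x, x, epochs=5, batch_size=32, verbose=0)

# The 16-dimensional codes can now serve as automatically learned features
# for downstream classification or regression.
features = encoder.predict(x, verbose=0)
print(features.shape)    # (1000, 16)
```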

6 Current trends and future directions

With reference to the state-of-the-art literature on big data analytics, this section discusses current trends and future directions, considering the applicability of big data analytics to each kind of multimedia data, i.e., text, images and videos.

6.1 Current trends in big data analytics

  • Query languages Recent developments in frameworks such as Spark and Hadoop-based paradigms adopt declarative, optimized query languages based on SQL dialects.

  • IoT and edge analytics The spontaneous generation of IoT data from sensor devices has contributed to big data. Current IoT data analytic platforms follow the Lambda architecture [103] or the Kappa architecture [104]. The Lambda architecture supports both batch and real-time analytics but suffers from a code maintenance problem [105]; the Kappa architecture helps to alleviate this problem. Another trend is to move computation to where the data are generated or reside, known as edge analytics, which alleviates the challenges of central data collection and management and of slow network bandwidth. Edge analytics is particularly popular for IoT devices: since they generate continuous streams of data, on-the-spot processing at the edge provides responses as soon as events occur.

  • Domain adaptation As deep neural network models require large amounts of domain-specific data for training, the current trend is to apply domain adaptation, in which training data and test data are sampled from different distributions [106]. For instance, AlexNet, originally developed for image classification, is used as a pre-trained model in other computer vision tasks such as semantic segmentation [107] and anomaly detection [108] (a minimal transfer-learning sketch is given at the end of this list).

  • Focus of analysis Traditional data mining by analysts mostly relied on classical machine learning algorithms and descriptive analytics. The focus has shifted to large-scale data analytics by data scientists, combining many technologies (mathematics and statistics, machine learning, deep learning, visualization) with predictive and prescriptive analytics.

  • Analytics coupled with visualization Data visualization, in addition to data discovery, has become a significant trend. Captivating and well-founded visualization tools such as Tableau [109], Qlikview [110], Highcharts [111], Datawrapper [112], FusionCharts [113], Plotly [114] and Sisense [115] are widely used for visualizing big data, and libraries from languages such as Python and R serve the same purpose for analytical results. With the rapid proliferation of deep learning, TensorFlow [116] is widely used for a multitude of tasks, and TensorBoard is used to visualize complex computation graphs, TensorFlow programs and histograms of tensors at different points in time.

  • Cloud computing for big data analytics Cloud computing is now widely used for data management and analysis [117, 118, 119, 120]. With the release of new data platforms for data science, especially open-source frameworks, current trends in big data analytics are moving towards hybrid data management, data visualization and hybrid cloud.

  • Training Initially, models were trained with fully supervised, unsupervised or semi-supervised learning. Recent practice also adopts self-supervised learning [121], weakly supervised learning [122] and reinforcement learning [123].

  • Social media analytics Owing to the proliferation of user-generated content on widely used social media platforms such as Twitter, Facebook and Tumblr, interest in analyzing the sentiments and opinions people express in text has risen rapidly in both academia and business. Most traditional approaches performed sentiment analysis at the document level [124] or phrase level [125], irrespective of the entities and attributes mentioned. The currently trending aspect-based sentiment analysis aims to identify aspects (attributes) of entities and the sentiment (positive, negative or neutral) towards each aspect [126, 127]. Researchers are also focusing on detecting rumors, fake news, and fake likes, videos, text and audio recordings on social media platforms [128, 129].
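
As promised in the domain-adaptation item above, the following is a hedged tf.keras sketch of reusing a pre-trained network as a feature extractor for a new task. MobileNetV2 is used as a stand-in for AlexNet (which tf.keras does not ship), and the four-class head and random data are illustrative assumptions.

```python
# Hedged sketch of transfer learning: reuse a pre-trained network as a frozen
# feature extractor and train only a new task-specific head.
import numpy as np
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(224, 224, 3))
base.trainable = False                      # keep pre-trained features frozen

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(4, activation="softmax"),   # new task: 4 target classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Fine-tune only the new head on (small) target-domain data.
x = np.random.rand(8, 224, 224, 3).astype("float32")
y = np.random.randint(0, 4, size=(8,))
model.fit(x, y, epochs=1, verbose=0)
```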

6.2 Future directions

  • Data privacy and security issues arise when data mining and machine learning techniques are applied to data to draw accurate inferences. For instance, many commercial companies collect customers' personal data from multiple sources or devices and apply data mining to uncover interesting behavioral patterns that ease marketing and sales, which hampers the privacy of customers. Techniques should therefore be developed that yield accurate analytical results while still protecting privacy. Possible solutions include multi-disciplinary approaches such as cryptosystems coupled with biometric systems [130] and data obfuscation techniques [131].

  • Diagnostic decision support systems based on medical image analysis for diagnosis and treatment of diseases require sufficiently accurate labels, provided by field experts, for images and their anomalous regions. One possible solution is to perform image labelling via crowd-sourcing tools such as Amazon Mechanical Turk [132]. However, crowd-sourcing raises open problems, such as when to stop asking for more labels for a given instance and how to check labelling accuracy; some work on crowd-sourced labelling is discussed in [133]. To satisfy the growing need of data scientists and developers for labelled datasets, significant growth is expected in start-ups specializing in synthetic labelled data for model training [134].

  • The near future will witness the development of low-power, high-performance artificial intelligence chipsets for real-time inference, encompassing a new generation of deep learning/machine learning compilers and data science tools with optimization models deployed in the hardware.

  • Big data analytics has been widely used for maintenance in transportation systems, for example, visual inspection of railway infrastructure [135]. Few researchers have focused on utilizing the live and open data available on the web for transportation operations [136]. Inspection of vehicles using drone cameras with onboard fast image processing is an emerging topic, and detecting driver drowsiness using cameras installed on roadways could help reduce accidents. Existing BDA approaches are mainly descriptive; predictive and prescriptive analytics are needed to take corrective actions in transportation.

  • There is a need to improve the infrastructure that supports reusability of data for proper data management and data stewardship. As mentioned in [137], all data discovery and management processes, datasets and design procedures should be findable, accessible, interoperable and reusable (FAIR), and these FAIR factors should be applied to both human- and machine-oriented activities.

  • Although Hadoop systems are schema-less and able to accommodate any kind of data (structured, unstructured or semi-structured), it is important to maintain metadata such as version histories of data and code, data schemas and features of data blocks, to ensure that analytical results are obtained on valid data only. Owing to the lack of a standard approach to metadata management, scaling a small analysis model to a large-scale analysis environment becomes time consuming, because such scaling involves massive sharing of code and data. Following a principled approach to metadata makes this scaling much easier [138]; it is therefore very important to incorporate standard metadata management in big data ecosystems.

  • Roughly 90% of unstructured data are never analyzed and can be termed dark data [139]. Such data may reside in analog form and therefore cannot be exploited for business analytics. The new wave of data analytics will focus on digitizing this dark data, and organizations are expected to develop big data solutions that move dark data from mainframes directly into cloud-based Hadoop environments for analysis.

  • Blockchain has the potential to change the way the world perceives big data [140]. Because a blockchain maintains a database record for each transaction, it would allow real-time transfers at significantly lower cost. The per-transaction records would make it possible to mine interesting patterns from consumer spending and to detect suspicious transactions in real time, as the application requires. Real-time big data analytics over data generated by blockchain transactions is therefore expected to be in great demand in the near future.

  • Although deep learning is applicable to many machine learning tasks, deep models are poor at replicating many functions of the human brain. Deep models with explicit memorization can be used to characterize novel patterns in unseen data [141]; as future research, this paradigm could be used to find novel patterns in multimedia data.

  • As deep models require powerful devices for training, future research will target training deep neural networks on low-powered devices such as mobile phones. Quantized neural networks have recently been put forth [142] to allow training on low-power devices, but an open research issue is how to obtain results on par with the state of the art from models trained on such devices (a toy sketch of weight quantization follows this list).
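
The numpy sketch below illustrates only the storage/precision trade-off that makes quantized models attractive for low-power devices; Hubara et al. [142] quantize weights and activations during training, whereas this is a post-hoc toy, and the matrix size and bit width are illustrative assumptions.

```python
# Toy numpy sketch of symmetric uniform 8-bit weight quantization.
import numpy as np

weights = np.random.randn(256, 128).astype("float32")   # a float32 weight matrix

# Quantize to signed 8-bit integers with a single per-tensor scale.
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype("int8")

# Dequantize to approximate the original weights at inference time.
deq = q_weights.astype("float32") * scale

print("storage: %.1f KB -> %.1f KB" % (weights.nbytes / 1024, q_weights.nbytes / 1024))
print("max abs error: %.4f" % np.abs(weights - deq).max())
```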

7 Conclusion

The recent lustrum has witnessed great research momentum in big data analytics, and this paper overviews the progress made in the domain. As a contribution, a graphical taxonomy of data has been put forth, big data storage systems have been described, and the evolution of analytics to date has been traced. The applicability of process models for inferring insights from big data has also been elaborated. State-of-the-art approaches for big data analytics are thoroughly compared based on the issues they address, the techniques applied, the datasets used and the phase of knowledge discovery in databases (KDD) to which they belong. Current trends in big data analytics have been identified and possible future directions put forth.

It can be observed that big data analytics can be coupled with multi-disciplinary domains to actuate complex event processing involving decisions based on multimedia data such as images, videos and text. To improve performance, big data systems are being augmented with high-performance computing clusters, and in the near future emphasis will be given to devising big data architectures for exascale computing.

References

  1.
  2. Closed, shared, open data. https://theodi.org/blog/closed-shared-open-data-whats-in-a-name. Accessed 5 Mar 2018
  3. Data and services. http://www.icsu-wds.org/services/data-portal. Accessed 5 Mar 2018
  4. Archives. https://www.archives.gov/open. Accessed 5 Mar 2018
  5. DBPedia. http://wiki.dbpedia.org/. Accessed 5 Mar 2018
  6. Freebase. http://www.freebase.com/. Accessed 5 Mar 2018
  7. Hey, J.: The data, information, knowledge, wisdom chain: the metaphorical link. Intergov. Oceanogr. Comm. 26, 1–18 (2004)
  8. Frické, M.: The knowledge pyramid: a critique of the DIKW hierarchy. J. Inf. Sci. 35, 131–142 (2009)
  9. NIST big data interoperability framework. https://bigdatawg.nist.gov/_uploadfiles/NIST.SP.1500-1.pdf. Accessed 5 Mar 2018
  10. Resource description framework. https://www.w3.org/TR/rdfa-primer/. Accessed 5 Mar 2018
  11. Schema. http://schema.org/. Accessed 5 Mar 2018
  12. Microformats. http://microformats.org/. Accessed 5 Mar 2018
  13. Microdata. https://www.w3.org/TR/microdata/. Accessed 5 Mar 2018
  14. Unstructured data and the 80 percent rule. https://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/. Accessed 5 Mar 2018
  15. Chen, M., Mao, S., Liu, Y.: Big data: a survey. Mob. Netw. Appl. 19, 171–209 (2014)
  16. Connolly, T.M., Begg, C.E.: Database Systems: A Practical Approach to Design, Implementation, and Management. Pearson Education (2005)
  17. Abiteboul, S.: Querying semi-structured data. In: Proceedings of the 6th International Conference on Database Theory, pp. 1–18. Springer, Berlin (1997)
  18.
  19. Gartner IT glossary. http://www.gartner.com/it-glossary/big-data/. Accessed 15 Mar 2018
  20.
  21. NIST. http://dx.doi.org/10.6028/NIST.SP.1500-1. Accessed 15 Mar 2018
  22.
  23. Enterprise architects. http://enterprisearchitects.com/the-5v-s-of-big-data/. Accessed 15 Mar 2018
  24. Impact radius. https://www.impactradius.com/blog/7-vs-big-data/. Accessed 15 Mar 2018
  25.
  26.
  27.
  28. ISO: ISO/IEC 25012: Software engineering - Software product quality requirements and evaluation (SQuaRE) data quality model. ISO/IEC 25012, 1–13 (2008)
  29. Merino, J., Caballero, I., Rivas, B., Serrano, M., Piattini, M.: A data quality in use model for big data. Future Gener. Comput. Syst. 63, 123–130 (2016)
  30. Manyika, J., et al.: Big data: the next frontier for innovation, competition, and productivity (2011)
  31.
  32.
  33.
  34.
  35. Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The Hadoop distributed file system. In: Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pp. 1–10 (2010)
  36.
  37. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R.: Advances in Knowledge Discovery and Data Mining, vol. 21. AAAI Press, Menlo Park (1996)
  38. Wirth, R., Hipp, J.: CRISP-DM: towards a standard process model for data mining. In: Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, pp. 29–39 (2000)
  39. Olson, D.L., Delen, D.: Data mining process. In: Advanced Data Mining Techniques, pp. 9–35. Springer, Berlin Heidelberg (2008)
  40. Li, Y., Thomas, M.A., Osei-Bryson, K.-M.: A snail shell process model for knowledge discovery via data analytics. Decis. Support Syst. 91, 1–12 (2016)
  41. Wei, J., Zhao, Y., Jiang, K., Xie, R., Jin, Y.: Analysis farm: a cloud-based scalable aggregation and query platform for network log analysis. In: 2011 International Conference on Cloud and Service Computing, pp. 354–359 (2011)
  42. He, Y., et al.: RCFile: a fast and space-efficient data placement structure in MapReduce-based warehouse systems. In: 2011 IEEE 27th International Conference on Data Engineering, pp. 1199–1208 (2011)
  43. Lee, R., et al.: YSmart: yet another SQL-to-MapReduce translator. In: 2011 31st International Conference on Distributed Computing Systems, pp. 25–36 (2011)
  44. Candea, G., Polyzotis, N., Vingralek, R.: Predictable performance and high query concurrency for data analytics. VLDB J. 20, 227–248 (2011)
  45. Beheshti, S.-M.-R., Benatallah, B., Motahari-Nezhad, H.R.: Scalable graph-based OLAP analytics over process execution data. Distrib. Parallel Databases 34, 379–423 (2016)
  46. Zhong, R.Y., et al.: A big data approach for logistics trajectory discovery from RFID-enabled production data. Int. J. Prod. Econ. 165, 260–272 (2015)
  47. Song, J., et al.: HaoLap: a Hadoop based OLAP system for big data. J. Syst. Softw. 102, 167–181 (2015)
  48. Romero, O., Herrero, V., Abelló, A., Ferrarons, J.: Tuning small analytics on big data: data partitioning and secondary indexes in the Hadoop ecosystem. Inf. Syst. 54, 336–356 (2015)
  49. Wu, D., et al.: A pipeline framework for heterogeneous execution environment of big data processing. IEEE Softw. (2018). https://doi.org/10.1109/MS.2016.62
  50. Singh, S., Liu, Y.: A cloud service architecture for analyzing big monitoring data. Tsinghua Sci. Technol. 21, 55–70 (2016)
  51. Zhu, J., et al.: A framework-based approach to utility big data analytics. IEEE Trans. Power Syst. 31, 2455–2462 (2016)
  52. Tuarob, S., Bhatia, S., Mitra, P., Giles, C.L.: AlgorithmSeer: a system for extracting and searching for algorithms in scholarly big data. IEEE Trans. Big Data 2, 3–17 (2016)
  53. Yuan, W., Deng, P., Taleb, T., Wan, J., Bi, C.: An unlicensed taxi identification model based on big data analysis. IEEE Trans. Intell. Trans. Syst. 17, 1703–1713 (2016)
  54. Wylot, M., Cudré-Mauroux, P.: DiploCloud: efficient and scalable management of RDF data in the cloud. IEEE Trans. Knowl. Data Eng. 28, 659–674 (2016)
  55. Alsheikh, M.A., Niyato, D., Lin, S., Tan, H.-P., Han, Z.: Mobile big data analytics using deep learning and Apache Spark. IEEE Netw. 30, 22–29 (2016)
  56. Kang, Y.-S., Park, I.-H., Rhee, J., Lee, Y.-H.: MongoDB-based repository design for IoT-generated RFID/sensor big data. IEEE Sens. J. 16, 485–497 (2016)
  57. Ke, H., Li, P., Guo, S., Guo, M.: On traffic-aware partition and aggregation in MapReduce for big data applications. IEEE Trans. Parallel Distrib. Syst. 27, 818–828 (2016)
  58. Basiri, S., Ollila, E., Koivunen, V.: Robust, scalable, and fast bootstrap method for analyzing large scale data. IEEE Trans. Signal Process. 64, 1007–1017 (2016)
  59. Zhang, L., Lin, J., Karim, R.: Sliding window-based fault detection from high-dimensional data streams. IEEE Trans. Syst. Man Cybern. Syst. 47, 289–303 (2017)
  60. Hochbaum, D.S., Baumann, P.: Sparse computation for large-scale data mining. IEEE Trans. Big Data 2, 151–174 (2016)
  61. Belcastro, L., Marozzo, F., Talia, D., Trunfio, P.: Using scalable data mining for predicting flight delays. ACM Trans. Intell. Syst. Technol. 8, 5 (2016)
  62. Pham, H., Shahabi, C., Liu, Y.: Inferring social strength from spatiotemporal data. ACM Trans. Database Syst. 41, 7 (2016)
  63. Xie, D., et al.: Simba: efficient in-memory spatial analytics. In: Proceedings of the 2016 International Conference on Management of Data, pp. 1071–1085 (2016)
  64. Agrawal, D., et al.: Rheem: enabling multi-platform task execution. In: Proceedings of the 2016 International Conference on Management of Data, pp. 2069–2072 (2016)
  65. Zhang, Q., Yan, D., Cheng, J.: Quegel: a general-purpose system for querying big graphs. In: Proceedings of the 2016 International Conference on Management of Data, pp. 2189–2192 (2016)
  66. Zhang, Y., et al.: DataLab: a version data management and analytics system. In: Proceedings of the 2nd International Workshop on BIG Data Software Engineering, pp. 12–18 (2016)
  67. Wang, H., Kifer, D., Graif, C., Li, Z.: Crime rate inference with big data. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 635–644 (2016)
  68. Carey, M.J., Jacobs, S., Tsotras, V.J.: Breaking BAD: a data serving vision for big active data. In: Proceedings of the 10th ACM International Conference on Distributed and Event-based Systems, pp. 181–186 (2016)
  69. Shkapsky, A., et al.: Big data analytics with Datalog queries on Spark. In: Proceedings of the 2016 International Conference on Management of Data, pp. 1135–1149 (2016)
  70. Tang, J., Liu, J., Zhang, M., Mei, Q.: Visualizing large-scale and high-dimensional data. In: Proceedings of the 25th International Conference on World Wide Web, pp. 287–297 (2016)
  71. Liu, X., Nielsen, P.S.: A hybrid ICT-solution for smart meter data analytics. Energy 115, 1710–1722 (2016)
  72. Ahmad, A., Paul, A., Rathore, M.M.: An efficient divide-and-conquer approach for big data analytics in machine-to-machine communication. Neurocomputing 174, 439–453 (2016)
  73. Hall, R.J.: Tools for predicting the reliability of large-scale storage systems. ACM Trans. Storage 12, 24:1–24:30 (2016)
  74. Gulzar, M.A., et al.: BigDebug: debugging primitives for interactive big data processing in Spark. In: 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pp. 784–795 (2016)
  75. Xia, Q., Liang, W., Xu, Z.: Data locality-aware big data query evaluation in distributed clouds. Comput. J. 60, 791–809 (2017)
  76. Akbar, A., Khan, A., Carrez, F., Moessner, K.: Predictive analytics for complex IoT data streams. IEEE Internet Things J. 4, 1571–1582 (2017)
  77. Li, H., Lu, K., Meng, S.: BigProvision: a provisioning framework for big data analytics. IEEE Netw. 29, 50–56 (2015)
  78. Esposito, C., Ficco, M., Palmieri, F., Castiglione, A.: A knowledge-based platform for big data analytics based on publish/subscribe services and stream processing. Knowl. Based Syst. 79, 3–17 (2015)
  79. Wang, J., Zhang, X., Yin, J., Wu, H., Han, D.: Speed up big data analytics by unveiling the storage distribution of sub-datasets. IEEE Trans. Big Data (2017)
  80. Yu, Z., et al.: MIA: metric importance analysis for big data workload characterization. IEEE Trans. Parallel Distrib. Syst. (2017)
  81. Balliu, A., Olivetti, D., Babaoglu, O., Marzolla, M., Sîrbu, A.: A big data analyzer for large trace logs. Computing 98, 1225–1249 (2016)
  82. Yin, J., Liao, Y., Baldi, M., Gao, L., Nucci, A.: GOM-Hadoop: a distributed framework for efficient analytics on ordered datasets. J. Parallel Distrib. Comput. 83, 58–69 (2015)
  83. Al-Ali, A.R., Zualkernan, I.A., Rashid, M., Gupta, R., Alikarar, M.: A smart home energy management system using IoT and big data analytics approach. IEEE Trans. Consum. Electron. 63, 426–434 (2017)
  84. Wu, P.Y., et al.: Omic and electronic health record big data analytics for precision medicine. IEEE Trans. Biomed. Eng. 64, 263–273 (2017)
  85. Triguero, I., et al.: ROSEFW-RF: the winner algorithm for the ECBDL'14 big data competition: an extremely imbalanced big data bioinformatics problem. Knowl. Based Syst. 87, 69–79 (2015)
  86.
  87. Ghofrani, F., He, Q., Goverde, R.M.P., Liu, X.: Recent applications of big data analytics in railway transportation systems: a survey. Transp. Res. Part C Emerg. Technol. 90, 226–246 (2018)
  88. Ip, R.H.L., Ang, L.-M., Seng, K.P., Broster, J.C., Pratley, J.E.: Big data and machine learning for crop protection. Comput. Electron. Agric. 151, 376–383 (2018)
  89.
  90. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
  91. Pathak, A.R., Pandey, M., Rautaray, S.: Application of deep learning for object detection. Procedia Comput. Sci. 132, 1706–1717 (2018)
  92. Pathak, A.R., Pandey, M., Rautaray, S.: Deep learning approaches for detecting objects from images: a review. In: Progress in Computing, Analytics and Networking, pp. 491–499 (2018)
  93. Pathak, A.R., Pandey, M., Rautaray, S., Pawar, K.: Assessment of object detection using deep convolutional neural networks. Intell. Comput. Inf. Commun. 693, 457–466 (2018)
  94. Pawar, K., Attar, V.: Deep learning approaches for video-based anomalous activity detection. World Wide Web (2018). https://doi.org/10.1007/s11280-018-0582-1
  95. Socher, R., Huang, E.H., Pennin, J., Manning, C.D., Ng, A.Y.: Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In: NIPS'11 Proceedings of the 24th International Conference on Neural Information Processing Systems, pp. 801–809. Curran Associates Inc., Granada, Spain (2011)
  96. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NIPS'14 Proceedings of the 27th International Conference on Neural Information Processing Systems, vol. 2, pp. 3104–3221. MIT Press, Montreal, Canada (2014)
  97. Bordes, A., Glorot, X., Weston, J., Bengio, Y.: Joint learning of words and meaning representations for open-text semantic parsing. Proc. Fifteenth Int. Conf. Artif. Intell. Stat. 22, 127–135 (2012)
  98. Graves, A., Mohamed, A., Hinton, G.: Speech recognition with deep recurrent neural networks. In: Acoustics, Speech and Signal Processing, 2013 IEEE International Conference on, pp. 6645–6649 (2013)
  99. Wang, J., Wang, K., Wang, Y., Huang, Z., Xue, R.: Deep Boltzmann machine based condition prediction for smart manufacturing. J. Ambient Intell. Humaniz. Comput. (2018). https://doi.org/10.1007/s12652-018-0794-3
  100. Hernández, Á.B., Perez, M.S., Gupta, S., Muntés-Mulero, V.: Using machine learning to optimize parallelism in big data applications. Future Gener. Comput. Syst. 86, 1076–1092 (2018)
  101. Shin, C.-K., Yun, U.T., Kim, H.K., Park, S.C.: A hybrid approach of neural network and memory-based learning to data mining. IEEE Trans. Neural Netw. 11, 637–646 (2000)
  102. Yan, Y., Yin, X.-C., Zhang, B.-W., Yang, C., Hao, H.-W.: Semantic indexing with deep learning: a case study. Big Data Anal. 1(1), 7 (2016)
  103. Marz, N., Warren, J.: A new paradigm for big data. In: Big Data: Principles and Best Practices of Scalable Real-Time Data Systems. Manning Publications, Shelter Island (2014)
  104. Questioning the lambda architecture. http://radar.oreilly.com/2014/07/questioning-the-lambdaarchitecture.html. Accessed 14 May 2018
  105. Pawar, K., Attar, V.: A survey on data analytic platforms for Internet of Things. In: Computing, Analytics and Security Trends (CAST), International Conference on, pp. 605–610 (2016)
  106. Glorot, X., Bordes, A., Bengio, Y.: Domain adaptation for large-scale sentiment classification: a deep learning approach. In: Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 513–520 (2011)
  107. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
  108. Sabokrou, M., Fayyaz, M., Fathy, M., Moayed, Z., Klette, R.: Deep-anomaly: fully convolutional neural network for fast anomaly detection in crowded scenes. Comput. Vis. Image Underst. (2018). https://doi.org/10.1109/TIP.2017.2670780
  109. Tableau. https://www.tableau.com. Accessed 14 Apr 2018
  110. Qlikview. https://www.qlik.com/us/products/qlikview. Accessed 14 Apr 2018
  111. Highcharts. https://www.highcharts.com. Accessed 14 Apr 2018
  112. Datawrapper. https://www.datawrapper.de. Accessed 14 Apr 2018
  113. FusionCharts. https://www.fusioncharts.com. Accessed 14 Apr 2018
  114. Plotly. https://plot.ly. Accessed 14 Apr 2018
  115. Sisense. https://www.sisense.com. Accessed 14 Apr 2018
  116. TensorFlow. https://www.tensorflow.org. Accessed 14 Apr 2018
  117. Alipourfard, O., et al.: CherryPick: adaptively unearthing the best cloud configurations for big data analytics. NSDI 2, 2–4 (2017)
  118. Sinnott, R.O., Voorsluys, W.: A scalable cloud-based system for data-intensive spatial analysis. Int. J. Softw. Tools Technol. Trans. 18, 587–605 (2016)
  119. Zhang, P., Yu, K., Yu, J.J., Khan, S.U.: QuantCloud: big data infrastructure for quantitative finance on the cloud. IEEE Trans. Big Data 4, 368–380 (2018)
  120. Hashem, I.A.T., et al.: The rise of 'big data' on cloud computing: review and open research issues. Inf. Syst. 47, 98–115 (2015)
  121. Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430 (2015)
  122. Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Is object localization for free? Weakly-supervised learning with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 685–694 (2015)
  123. Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning, vol. 135. MIT Press, Cambridge (1998)
  124. Pang, B., Lee, L.: A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, p. 271 (2004)
  125. Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing contextual polarity in phrase-level sentiment analysis. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 347–354 (2005)
  126. Pontiki, M., et al.: SemEval-2016 task 5: aspect based sentiment analysis. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 19–30 (2016)
  127. Schouten, K., Frasincar, F.: Survey on aspect-level sentiment analysis. IEEE Trans. Knowl. Data Eng. 28, 813–830 (2016)
  128. Chen, W., Zhang, Y., Yeo, C.K., Lau, C.T., Lee, B.S.: Unsupervised rumor detection based on users' behaviors using neural networks. Pattern Recognit. Lett. 105, 226–233 (2018)
  129. Sen, I., et al.: Worth its weight in likes: towards detecting fake likes on Instagram. In: Proceedings of the 10th ACM Conference on Web Science, pp. 205–209 (2018)
  130. Upmanyu, M., Namboodiri, A.M., Srinathan, K., Jawahar, C.V.: Blind authentication: a secure crypto-biometric verification protocol. IEEE Trans. Inf. Forensics Secur. 5, 255–268 (2010)
  131. Upmanyu, M., Namboodiri, A.M., Srinathan, K., Jawahar, C.V.: Efficient privacy preserving video surveillance. In: Computer Vision, 2009 IEEE 12th International Conference on, pp. 1639–1646 (2009)
  132. Amazon Mechanical Turk. https://www.mturk.com/. Accessed 20 Apr 2018
  133. Raykar, V., Agrawal, P.: Sequential crowdsourced labeling as an epsilon-greedy exploration in a Markov decision process. In: Kaski, S., Corander, J. (eds.) Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, vol. 33, pp. 832–840. PMLR (2014)
  134. Deep learning with synthetic data will democratize the tech industry. https://techcrunch.com/2018/05/11/deep-learning-with-synthetic-data-will-democratize-the-tech-industry/. Accessed 20 Apr 2018
  135. Distante, A., Marino, F., Mazzeo, P.L., Nitti, M., Stella, E.: Automatic method and system for visual inspection of railway infrastructure (2009)
  136. Wei, S., et al.: Exploring the potential of open big data from ticketing websites to characterize travel patterns within the Chinese high-speed rail system. PLoS ONE 12, 1–13 (2017)
  137. Wilkinson, M.D., et al.: The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 9 (2016)
  138. Smith, K., et al.: 'Big Metadata': the need for principled metadata management in big data ecosystems. In: Proceedings of Workshop on Data Analytics in the Cloud, pp. 13:1–13:4. ACM (2014)
  139.
  140. Rodrigues, B., Bocek, T., Stiller, B.: The use of blockchains: application-driven analysis of applicability. In: Advances in Computers. Elsevier (2018). https://doi.org/10.1016/bs.adcom.2018.03.011
  141. Brahma, P.P., Huang, Q., Wu, D.: Structured memory based deep model to detect as well as characterize novel inputs (2018). arXiv:1801.09859
  142. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Quantized neural networks: training neural networks with low precision weights and activations. J. Mach. Learn. Res. 18, 6869–6898 (2017)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. School of Computer Engineering, Kalinga Institute of Industrial Technology (KIIT) University, Bhubaneswar, India
