Definition

Non-conventional data are any kind of data that are useful for business intelligence (BI) but that cannot be directly managed with traditional data warehousing (DW) technology. Non-conventional data cover a great variety of user-generated content such as domain knowledge, corporate documents like contracts and e-mails, news feeds, messages posted on social media, and so on. Non-conventional data sources share a semi-structured, dynamic, and text-rich nature, which makes it difficult to integrate them into traditional corporate information systems (including data warehouses). Nowadays, non-conventional data mainly reside on the Web and adopt the standard formats proposed by the World Wide Web Consortium (W3C), like HTML, XML, RSS, RDF, etc. This entry focuses on approaches that aim either to integrate non-conventional data with traditional DW/OLAP or to perform ad hoc DW on these data sources. It does not cover approaches that adopt Web languages like XML and RDF for data interchange and integration in traditional BI systems.

Historical Background

Until the late 1990s, data warehousing was mainly restricted to summarizing corporate data coming from transactional databases. As databases were thoroughly dominated by the relational model, DW technology was devised to integrate and summarize relational data, so that they could be analyzed with online analytical processing (OLAP) tools.

The enormous popularity of the Web made huge volumes of data of great value available to BI analysts. These data are considered "external data" because they are neither generated nor controlled by the company that performs the analysis tasks. In particular, Web data are challenging for traditional DW technology because they are semi-structured, highly heterogeneous, dynamic, and text-rich. These data do not naturally fit into traditional corporate data models, which mainly rely on well-structured relational data. As a consequence, analyzing non-conventional data requires both new ad hoc DW methods and proper mechanisms to blend these data with the corporate DW.

During the 2000s, several approaches to DW for non-conventional data were proposed, mainly for storing and retrieving documents such as e-mails, technical reports, patents, and so on. Later on, the popularity of semi-structured formats for publishing Web data (e.g., XML and RDF) demanded new proposals for their analysis through DW/OLAP technology. Nowadays, the analysis of social networks on the Web is becoming crucial for companies in order to achieve a complete view of their business environment (e.g., voice of the customer and voice of the market).

The next section describes the main scientific aspects involved in state-of-the-art approaches to DW on non-conventional data.

Scientific Fundamentals

From a BI point of view, non-conventional data consist of user-generated content that is highly valuable to decision-makers for improving their business [5]. Most limitations of traditional DW in managing these data stem from the summarizability properties, which require closed, complete, and well-structured data so that aggregates computed along a dimension hierarchy remain correct. As previously mentioned, non-conventional data are characterized by the opposite properties: they are semi-structured (i.e., open and usually incomplete), heterogeneous, and text-rich [12].
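The effect of violating summarizability can be illustrated with a minimal sketch. The products, amounts, and category mapping below are hypothetical and are not taken from any cited approach; they only show how an incomplete mapping (a product without a category) and a non-strict mapping (a product in two categories) make a roll-up disagree with the grand total.

```python
from collections import defaultdict

# Hypothetical facts: (product, sales amount).
facts = [("p1", 100), ("p2", 50), ("p3", 30)]

# Hypothetical product -> categories mapping.
# p2 belongs to two categories (non-strict); p3 has none (incomplete).
categories = {"p1": ["books"], "p2": ["books", "gifts"], "p3": []}

def roll_up(facts, mapping):
    """Aggregate sales from the product level to the category level."""
    totals = defaultdict(int)
    for product, amount in facts:
        for category in mapping.get(product, []):
            totals[category] += amount
    return dict(totals)

by_category = roll_up(facts, categories)
grand_total = sum(amount for _, amount in facts)
rolled_total = sum(by_category.values())

# p3 is lost and p2 is counted twice, so the two totals differ.
print(by_category, grand_total, rolled_total)
```

In this sketch the category-level total no longer matches the total computed directly over the facts, which is the kind of inconsistency that the summarizability conditions are meant to rule out.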

Two main trends in DW on non-conventional data can be identified in the literature: (1) methods that define a loose integration between the corporate DW and non-conventional data and (2) methods that transform/process non-conventional data so that they can be analyzed with traditional DW/OLAP.

Regarding the first trend, non-conventional data are stored in native repositories (e.g., document stores, XML/RDF stores, graph databases, etc.) and then linked to the corporate DW facts [2, 3, 4, 7, 11]. For this purpose, information retrieval (IR) and information extraction techniques play a relevant role, as they aim at automatically indexing and extracting data from textual contents. Recently, the integration of opinion data with corporate DW is also demanding text mining (TM) techniques to perform semantic analysis, so that opinions and their polarities can be linked to DW facts [6]. Other approaches to performing a loose integration between non-conventional and corporate data consist in applying semantic-based techniques, which aim at semantically annotating both kinds of data through some reference ontology (see [1] for a review).

Once the integration is defined, analytical queries are performed over the corporate data warehouse with OLAP operators, allowing the inspection of non-conventional data via drill-through operations. Some approaches like [3, 6, 11] also allow reflecting measures derived from non-conventional data into the corporate cubes (e.g., relevance criteria, opinion polarities, number of documents, etc.), which are automatically summarized when applying OLAP operators.
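The loose integration described above can be sketched as follows. This is an illustration only, not the method of any cited approach: the fact and document records, the shared (product, month) linking key, and the function names `build_link_index`, `enrich`, `roll_up`, and `drill_through` are all assumptions. Documents stay in their own store, document-derived measures are attached to the facts, and a drill-through returns the documents behind a cell.

```python
from collections import defaultdict

# Hypothetical corporate facts at the (product, month) grain.
facts = [
    {"product": "phone", "month": "2015-01", "sales": 120},
    {"product": "phone", "month": "2015-02", "sales": 90},
    {"product": "tablet", "month": "2015-01", "sales": 40},
]

# Hypothetical documents kept in a separate store, already annotated
# (e.g., by IR/TM tools) with dimension values and an opinion polarity.
documents = [
    {"id": "d1", "product": "phone", "month": "2015-01", "polarity": 1.0},
    {"id": "d2", "product": "phone", "month": "2015-01", "polarity": -1.0},
    {"id": "d3", "product": "phone", "month": "2015-02", "polarity": -1.0},
]

def build_link_index(documents):
    """Index documents by the dimension values they share with the facts."""
    index = defaultdict(list)
    for doc in documents:
        index[(doc["product"], doc["month"])].append(doc)
    return index

def enrich(facts, index):
    """Attach document-derived, additive measures to each fact."""
    enriched = []
    for fact in facts:
        docs = index.get((fact["product"], fact["month"]), [])
        enriched.append({
            **fact,
            "doc_count": len(docs),
            "polarity_sum": sum(d["polarity"] for d in docs),
        })
    return enriched

def roll_up(rows, dimension, measures):
    """Sum the given measures for each value of one dimension."""
    cube = defaultdict(lambda: defaultdict(float))
    for row in rows:
        for measure in measures:
            cube[row[dimension]][measure] += row[measure]
    return {key: dict(values) for key, values in cube.items()}

def drill_through(index, product, month):
    """Return the identifiers of the documents linked to one cube cell."""
    return [doc["id"] for doc in index.get((product, month), [])]

index = build_link_index(documents)
by_product = roll_up(enrich(facts, index), "product",
                     ["sales", "doc_count", "polarity_sum"])
phone_jan_docs = drill_through(index, "phone", "2015-01")
print(by_product)
print(phone_jan_docs)
```

The polarity is stored as a sum rather than an average so that it remains additive under roll-up; an average can be derived afterwards from `polarity_sum` and `doc_count`.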

Approaches of the second trend aim at processing non-conventional data in order to extract well-structured data to be analyzed with traditional DW/OLAP techniques (see [10] for a review). For example, DW on documents extracts metadata like authors, title, publication date, and content keywords in order to represent a collection of documents as a multidimensional model [14]. The extraction of dimensions that characterize textual contents has been one of the main issues of DW on documents [8]. Automatic indexing techniques from IR and automatic topic discovery from TM [13, 14] have been proposed for addressing this issue.
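As a rough illustration of this second trend, the sketch below turns a tiny document collection into a multidimensional structure. The documents, the stopword list, and the frequency-based keyword selection are hypothetical simplifications; the cited works rely on proper IR indexing and topic discovery rather than this naive term count.

```python
import re
from collections import Counter

# Hypothetical stopword list; real systems use much larger ones.
STOPWORDS = {"the", "of", "a", "and", "for", "in", "on", "to"}

# Hypothetical document collection with basic metadata.
documents = [
    {"author": "Smith", "date": "2014-03-10",
     "text": "Battery patents and battery safety"},
    {"author": "Smith", "date": "2015-06-01",
     "text": "Safety report on battery recalls"},
    {"author": "Jones", "date": "2015-07-15",
     "text": "Market report for tablets"},
]

def keywords(text, top_n=2):
    """Pick the top_n most frequent non-stopword terms (ties sorted by term)."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    ranked = sorted(Counter(tokens).items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]

def document_facts(documents):
    """Emit one fact per (author, year, keyword) occurrence in a document."""
    for doc in documents:
        year = doc["date"][:4]
        for term in keywords(doc["text"]):
            yield (doc["author"], year, term)

# Cube cells count keyword assignments for each (author, year, keyword).
cube = Counter(document_facts(documents))

# Roll-up: drop the author and year dimensions, keep the keyword dimension.
by_keyword = Counter()
for (_, _, term), count in cube.items():
    by_keyword[term] += count

print(dict(cube))
print(dict(by_keyword))
```

Because a document can carry several keywords, the keyword dimension is non-strict: the counts are keyword assignments, not distinct documents, which is one reason dimension extraction is a delicate step in DW on documents.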

In the second trend, we can also find approaches to DW on semi-structured data expressed in XML/RDF languages (see [12] for a review). In this case, data cubes for analysis are automatically generated from the underlying tree/graph data structures of these languages. More recently, a method for generating data cubes from OWL-DL semi-structured data was proposed in [9]. In this method, data are aggregated according to the inferred relations derived from the ontology axioms.
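A minimal sketch of deriving a cube from a tree-structured source is given below. The XML fragment, its element names, and the helper `facts_from_xml` are hypothetical, and the mapping from elements to dimensions and measures is supplied by hand here, whereas the reviewed approaches aim to derive it from the data structure. The last `sale` element deliberately lacks a `region`, which is typical of semi-structured data.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical XML source; the third sale has no <region> element.
XML = """
<sales>
  <sale><region>EU</region><product>phone</product><amount>100</amount></sale>
  <sale><region>EU</region><product>tablet</product><amount>40</amount></sale>
  <sale><product>phone</product><amount>60</amount></sale>
</sales>
"""

def facts_from_xml(xml_text, fact_tag, dimension_tags, measure_tag):
    """Flatten each fact element into (dimension values, measure value)."""
    root = ET.fromstring(xml_text)
    for node in root.iter(fact_tag):
        dims = tuple(node.findtext(tag, default="unknown")
                     for tag in dimension_tags)
        yield dims, float(node.findtext(measure_tag, default="0"))

def build_cube(facts):
    """Sum the measure for each combination of dimension values."""
    cube = defaultdict(float)
    for dims, value in facts:
        cube[dims] += value
    return dict(cube)

cube = build_cube(facts_from_xml(XML, "sale", ["region", "product"], "amount"))

# Roll-up to the region dimension only.
by_region = defaultdict(float)
for (region, _), value in cube.items():
    by_region[region] += value
by_region = dict(by_region)

print(cube)
print(by_region)
```

Mapping missing elements to an explicit "unknown" member is one simple way to keep totals consistent; it is a design choice of this sketch, not something prescribed by the approaches discussed.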

Key Applications

To have a complete view of BI, it is necessary to combine information sources that provide complementary perspectives to analysts. For example, along with the traditional sales figures, analysts would also like to capture indicators about customers, markets, competitors, and so on. These indicators are usually derived from non-conventional data, mainly gathered from document and Web sources. More recently, the emergence of social business as a new paradigm for doing business through social networks has spurred new methods for social sentiment analysis, which combine text and graph mining techniques. However, there is no consensus yet within the BI community about how these and future proposals will successfully tackle the new scenarios that are emerging from the hybridization of e-commerce and social business.

Future Directions

Although DW on non-conventional data is a relatively young field, it has been subject to the rapid changes in the nature of the Web over the last decade. We have witnessed a major change in how the Web is used: it is now a democratic medium for creating and exchanging information, data, knowledge, and opinions. The big data scenario is the main result of this change, and it demands new scalable methods to manage and analyze the huge amounts of data generated by popular Web services (e.g., social networks, sensor networks, etc.). For example, the quick adoption of NoSQL databases is massively populating the Web with semi-structured data in formats like JSON and data graphs. These databases are supported by massively parallel processing methods (e.g., map-reduce frameworks) for performing both queries and basic analysis tasks. However, there is little research on adapting existing DW/OLAP technology to, or integrating it with, this new scenario. The combination of semantic Web technologies (e.g., linked open data) with highly scalable analysis methods is one promising direction to explore in the coming years [1].
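To make the map-reduce style of analysis over JSON data concrete, the following single-process sketch imitates the map, shuffle, and reduce phases. The JSON records and the function names are hypothetical, and no real map-reduce framework is used; an actual framework would distribute these phases across many machines.

```python
import json
from collections import defaultdict

# Hypothetical JSON records; the last one has no "tags" field.
lines = [
    '{"user": "a", "tags": ["bi", "olap"]}',
    '{"user": "b", "tags": ["olap"]}',
    '{"user": "c"}',
]

def map_phase(line):
    """Emit a (tag, 1) pair for every tag found in one JSON record."""
    record = json.loads(line)
    for tag in record.get("tags", []):
        yield tag, 1

def shuffle(pairs):
    """Group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values of one key into a single count."""
    return key, sum(values)

pairs = (pair for line in lines for pair in map_phase(line))
tag_counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(tag_counts)
```

The result is a simple one-dimensional aggregate; richer OLAP-style analysis over such data would need the dimension hierarchies and summarizability guarantees discussed earlier in this entry.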

Cross-References