Interoperability in Data Warehouses
KeywordsResource Description Framework Data Cube Object Management Group Link Open Data Structural Conflict
The term refers to the ability of combining the content of two or more heterogeneous data warehouses, for the purpose of cross-analysis. This need emerges in a variety of practical situations. For instance, when different designers of a large company develop their data marts independently, or when different organizations involved in the same project need to integrate their data warehouses.
Data Warehouse interoperability is a special case of the general problem of database integration, but it can be tackled in a more systematic way because data warehouses are structured in a rather uniform way, along the widely accepted concepts of dimension and fact. As it happens in the general case, different degrees of interoperability can be pursued by adopting standards and/or by applying reconciliation techniques, likely specific for this context.
The problem is becoming increasingly relevant with the spreading of federated architectures. Nevertheless, it has been the focus of a few systematic works and numerous open problems remain to be solved.
In spite of its relevance, the problem of data warehouse integration has received little attention until recent years. Conversely, the general problem of database integration has been studied in the literature extensively and several aspects, both at schema and instance level, have been deeply investigated, such as the automatic matching of terms and the resolution of structural conflicts (see [15, 21] for surveys on these topics).
In the specific context of data warehouses, Kimball  has identified the problem for the first time: he has investigated the integration of heterogeneous dimensions in a scenario of data warehouse design and has introduced the informal notions of dimension conformity. Intuitively, two dimensions are conformed if they share some information in a consistent way. This is an important requirement in drill-across queries, which are basically joins of different facts over common dimensions. The notion of conformity has been formalized and extended by Cabibbo and Torlone in the context of data mart integration  under the name of dimension compatibility: they have demonstrated that this property gives the ability to perform correct drill-across queries over heterogeneous data marts.
An issue related to the integration of data warehouses, which has been studied in the context of statistical databases, is the derivability of summary data. This notion has been defined by Sato  as the problem of deciding whether summary data (which is, in a statistical database, the counterpart of a fact table) can be inferred from other summary data aggregated in a different way. The concept has been extended by Malvestuto , by considering the case in which the source is composed of several heterogeneous data sets: he proposes an algebraic approach to this problem and provides some necessary and sufficient conditions of derivability. Unfortunately, statistical databases have some similarity with multidimensional databases, but also some important diversities: this makes the application of these approaches to data warehouses difficult.
Only recently, the specific problem of the interoperability between heterogeneous data warehouses has been addressed by some authors. These works will be discussed in the following section.
While current commercial tools do not provide a complete support for data warehouse interoperability, they offer facilities that can be very useful in this framework, such as metadata import/export (using XML) and standardized ways to represent data (using a multidimensional model).
Since data warehouse interoperability can be considered a special case of the problem of database integration, general data reconciliation techniques can be often used. For instance, methods for the automatic matching of terms or for the resolution of structural conflicts. In addition, it is possible to take advantage on the fact that, in this context, the data sources always have a multidimensional structure. Therefore, the problem can be addressed by focusing on the reconciliation of heterogeneous dimensions and facts. Following this observation, the section discusses: standards that can be adopted to support data warehouse interoperability, conflicts that can arise in this context, and methodologies that can be used to perform the integration.
An important support for interoperability can be provided by the adoption of standards. Initially, two industry standards have been proposed by multi-vendor organizations for data warehouses: the Open Information Model (OIM) developed by the Meta Data Coalition (MDC), and the Common Warehouse Metamodel (CWM) developed by the Object Management Group (OMG) . Later, MDC and OMG joined their efforts and proposed a new version of the CWM as the standard metadata model. The Common Warehouse Metamodel is a platform-independent specification for exchanging multidimensional data between different platforms and tools. It is based on the standards UML, XMI, and MOF, and provides a set of generic, external representations of metadata, called metamodels, that provide a comprehensive framework for data exchange. These metamodels can be used to describe the various components of the data warehouse architecture: data sources, ETL processes, multidimensional cubes, relational tables, and so on. However, it has been observed that their expressivity is not sufficient to capture all the complex semantics of conceptual multidimensional models, so they hardly can be used for effective integration of different data warehouses .
Recently, the W3C Linked Data Working Group has proposed the Data Cube vocabulary (QB) as a mean to publish statistical data on the Web using the RDF (Resource Description Framework) standard. The QB4OLAP vocabulary is an extension to the Data Cube vocabulary that allows the representation of OLAP cubes in RDF and the implementation of OLAP operators in SPARQL. Using QB4OLAP, Etcheverry et al. have recently shown the importance of using semantic technologies for the integration of multidimensional data available on the Web .
- Dimension conflicts:
Schema: conflicts can arise on entity names (e.g., different names for the same dimensions and/or different names for similar levels of two dimensions) and on dimension hierarchies (similar dimensions organized over different levels of aggregation and/or inconsistencies on the roll-up relationships between levels).
Instance: conflicts can arise on member names (different names for the same members of different dimensions) and on the members of dimensions (similar dimensions populated by different members).
- Fact conflicts:
Schema: still, conflicts can arise on names (different names for the same measures) and on dimensions that differ in number and/or in the levels of aggregation.
Instance: conflicts can arise on measures (inconsistent values for the same measures and/or differences in scales).
As mentioned in the previous section, Cabibbo and Torlone  have identified a fundamental property that should be enforced while solving conflicts between heterogeneous data warehouses: dimension and fact compatibility. Two different dimensions d 1 and d 2 are compatible when their common information is consistent, that is, when aggregations computed over d 1 and d 2 and aggregations computed over the dimension obtained by merging d 1 and d 2 produce the same results. Having compatible dimensions and facts is important because it gives the ability to look consistently at data across data marts and to combine and correlate such data by means of drill across queries. Building on this notion, they have also identified a number of desirable properties that a matching between dimensions (that is, a correspondence between their levels) should satisfy: (i) the coherence of the hierarchies on levels, (ii) the soundness of the levels in correspondence, according to the members associated with them, and (iii) the consistency of the roll-up functions that relate members of different levels within the matched dimensions.
Identification of the facts that can be integrated and the dimensions of these facts that can be combined to perform the integration
Resolution of conflicts between common dimensions
Resolution of conflicts between facts to be integrated
Reconciliation and integration of dimensions and facts according to the desired level of interoperability
While this process can be supported by general reconciliation techniques based, for instance, on domain ontologies , it is possible to rely on specific techniques that take into account the rather standard structure of dimensions and facts. As usual, the level of interoperability can range from a scenario of loosely coupled integration, in which there is just the need to identify the common information between sources while preserving their autonomy, to a scenario of tightly coupled integration, in which the goal is rather merging the sources. In the former approach, queries are performed over a virtual view defined on the original sources, in the latter, queries are performed against a materialized view built from the sources.
Banek et al.  have addressed the problem of matching schema structures specific to data warehouses, the initial step of the above methodology. Their approach consists of two basic tasks. First, similarity matches between multidimensional structures are identified by comparing their names, data types and substructures (e.g., matches cannot violate the partial order in hierarchies). Then, heuristic rules, based on graph similarity, are used to choose the actual mappings, among the possible matches.
A methodology for the resolution of conflicts that guides the designers through the combination of independent data cubes has been proposed by Berger and Schrefl . They also propose a specific language called SQL-MDi (SQL for multi-dimensional integration), supporting the methodology. In their approach, the goal is the generation of a tightly coupled architecture that combine heterogeneous multidimensional data sources into a materialized warehouse. The same authors have recently proposed a visual design tool for federations of autonomous ROLAP data marts .
Cabibbo and Torlone have proposed  and implemented  two practical approaches to the integration of autonomous data warehouses that try to enforce matchings satisfying the properties discussed in the previous section and refer to the scenarios of loosely and tightly coupled integration, respectively. As a preliminary tool, they introduce a powerful technique, the chase of dimensions, that can be used in both approaches to test for consistency and combine the content of the dimensions to integrate. This technique operates over a tableau populated by the members of the dimensions to be integrated, and makes use of the roll-up functions defined over such dimensions. Two integration algorithms are then proposed. The first algorithm provides the operations, expressed in an abstract algebra, that applied to the original dimensions, allow the specification of correct drill-across joins between the heterogeneous sources. The second algorithm generates new dimensions and facts, obtained by merging the original data sources, that constitute the reconciled data warehouse. This approach has been later extended and experimented successfully in real-world scenarios . A refinement of the chase technique mentioned above has been recently proposed by Olaru .
From a practical point of view, a general federated architecture supporting the interoperability of distributed and autonomous data warehouses has been proposed by Mangisengi et al.  and by Berger and Schrefl . Tseng and Chen  have proposed a framework in which, after a resolution of conflicts, autonomous data cubes are first transformed into XML documents, then conflicts are solved by means of XQuery operations, and finally the access to integrated data is achieved through queries posed over an XML global view.
A common practice for building a data warehouse is to implement a series of data marts, each of which provides a dimensional view of a single business process . These data marts should be based on common dimensions but what happens in practice is that, very often, different departments of the same company develop their data marts independently. It turns out that methods and tools for data warehouse reconciliation are very useful in such common situation.
Indeed, the need for combining autonomous data warehouse arises in other common scenarios. For instance, when different companies merge or get involved in a federated project or when there is the need to combine a proprietary data warehouse with data available elsewhere, for instance, in external and likely heterogeneous information sources, or in multidimensional data wrapped from the Web.
Furthermore, methods supporting data warehouse interoperability can be useful when there is the need to migrate a data mart from one implementation platform to another.
The area of data warehouse interoperability needs further investigation and there is still a compelling need of effective solutions and practical tools. From a conceptual point of view, the problem needs a deeper investigation that takes into account, for instance, cases in which the structure of the data warehouses to be combined is non standard (e.g., for the presence of non-strict hierarchies or many-to-many relationships between facts and dimensions). In particular, the presence of irregular hierarchies makes the problem of dimension compatibility much harder since it requires complex tests at instance level. Semantic Web and Linked Open Data are also important directions for integrating external data with Data Warehouses [1, 2]. From a practical point of view, there is still a lack of effective tools supporting specifically the integration of autonomous and heterogeneous data warehouses.
- 3.Banek M, Vrdoljak B, Min Tjoa A, Skocir Z. Automating the schema matching process for heterogeneous data warehouses. In: Proceedings of 9th International Conference on Data Warehousing and Knowledge Discovery; 2007. p. 45–54.Google Scholar
- 4.Bergamaschi S, Olaru M. O., Sorrentino S, Vincini M. Dimension matching in Peer-to-Peer Data Warehousing. In: Proceedings of 8th International Conference on Decision Support Systems; 2012. p. 149–60.Google Scholar
- 5.Berger S, Schrefl M. Analysing multi-dimensional data across autonomous data warehouses. In: Proceedings of 8th International Conference on Data Warehousing and Knowledge Discovery; 2006. p. 120–33.Google Scholar
- 6.Berger S, Schrefl M. FedDW global schema architect: UML-based design tool for the integration of data mart schemas. In: Proceedings of 15th International Workshop on Data warehousing and OLAP; 2012. p. 33–40.Google Scholar
- 7.Cabibbo L, Torlone R. On the integration of autonomous data marts. In: Proceedings of 16th International Conference on Scientific and Statistical Database Management; 2004. p. 223–34.Google Scholar
- 8.Cabibbo L, Torlone R. Integrating heterogeneous multidimensional databases. In: Proceedings of 17th International Conference on Scientific and Statistical Database Management; 2005. p. 205–14.Google Scholar
- 9.Cabibbo L, Panella I. Torlone R. DaWaII: a tool for the integration of autonomous data marts. In: Proceedings of 22nd International Conference on Data Engineering, Demo session; 2006.Google Scholar
- 10.Etcheverry L, Vaisman A, Zimányi E. Modeling and querying data warehouses on the semantic web using QB4OLAP. In: Proceedings of 16th International Conference on Data Warehousing and Knowledge Discovery; 2014. p. 45–56.Google Scholar
- 12.Kämpgen B, Stadtmüller S, Harth A. Querying the global cube: integration of multidimensional datasets from the web. In: Proceedings of 19th International Conference on Knowledge Engineering and Knowledge Management; 2014. p. 250–65.Google Scholar
- 13.Kimball R, Ross M. The data warehouse toolkit: the complete guide to dimensional modeling. 2nd ed. Wiley; 2002.Google Scholar
- 15.Lenzerini M. Data integration: a theoretical perspective. In: Proceedings of 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems; 2002. p. 233–46.Google Scholar
- 16.Malvestuto FM. The classification problem with semantically heterogeneous data. In: Proceedings of ACM SIGMOD International Conference on Management of Data; 1988. p. 157–76.Google Scholar
- 17.Mangisengi O, Huber J, Hawel C, Eßmayr W. A framework for supporting interoperability of data warehouse islands using XML. In: Proceedings of 3rd International Conference on Data Warehousing and Knowledge Discovery; 2001. p. 328–38.Google Scholar
- 19.Olaru MO. Partial multi-dimensional schema merging in heterogeneous data warehouses. In: Proceedings of 31st International Conference on Conceptual Modeling; 2012. p. 563–71.Google Scholar
- 20.Pedersen TB, Shoshani A, Gu J, Jensen C8.S. Extending OLAP querying to external object databases. In: Proceedings of International Conference on Information and Knowledge Management; 2000. p. 405–13.Google Scholar
- 22.Rizzi S, Abelló A, Lechtenbörger J, Trujillo J. Research in data warehouse modeling and design: dead or alive? In: Proceedings of ACM 9th International Workshop on Data Warehousing and OLAP; 2006. p. 3–10.Google Scholar
- 23.Sato H. Handling summary information in a database: derivability. In: Proceedings of ACM SIGMOD International Conference on Management of Data; 1981. p. 98–107.Google Scholar