Quarry: A User-centered Big Data Integration Platform

Abstract

Obtaining valuable insights and actionable knowledge from data requires cross-analysis of domain data that typically come from various sources. Doing so inevitably imposes burdensome processes of unifying different data formats and discovering integration paths, all driven by the specific analytical needs of a data analyst. Along with large volumes of data, the variety of formats, data models, and semantics drastically contributes to the complexity of such processes. Although there have been many attempts to automate various processes along the Big Data pipeline, no unified platform accessible to users without technical skills (such as statisticians or business analysts) has been proposed. In this paper, we present a Big Data integration platform (Quarry) that uses hypergraph-based metadata to facilitate (and largely automate) the integration of domain data coming from a variety of sources, and provides an intuitive interface to assist end users in both (1) data exploration, with the goal of discovering potentially relevant analysis facets, and (2) consolidation and deployment of data flows that integrate the data and prepare them for further analysis (descriptive or predictive), visualization, and/or publishing. We validate Quarry’s functionalities with the use case of World Health Organization (WHO) epidemiologists and data analysts in their fight against Neglected Tropical Diseases (NTDs).

Notes

  1. https://www.who.int/neglected_diseases/disease_management/wiscentds

  2. https://www.who-umc.org

  3. https://www.promedmail.org

  4. https://www.who.int/chagas/en

  5. http://mss4ntd.essi.upc.edu/wiki/index.php?title=WHO_Integrated_Data_Platform_(WIDP)

  6. https://www.dhis2.org

  7. http://mss4ntd.essi.upc.edu/wiki/index.php?title=WHO_Integrated_Medical_Supplies_System_(WIMEDS)

  8. https://www.bonitasoft.com

  9. http://data.un.org

  10. https://spark.apache.org

  11. https://flink.apache.org

  12. https://hadoop.apache.org

  13. https://neo4j.com

  14. https://www.postgresql.org

  15. WebVOWL: Web-based Visualization of Ontologies, http://vowl.visualdataweb.org/webvowl.html

Acknowledgements

We thank Dr. Lise Grout and Dr. Pedro Albajar-Viñas from the Neglected Tropical Diseases (NTD) department at WHO for providing the use case. This work is partially supported by the GENESIS project, funded by the Spanish Ministerio de Ciencia, Innovación y Universidades under project TIN2016-79269-R.

Author information

Corresponding author

Correspondence to Petar Jovanovic.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jovanovic, P., Nadal, S., Romero, O. et al. Quarry: A User-centered Big Data Integration Platform. Inf Syst Front 23, 9–33 (2021). https://doi.org/10.1007/s10796-020-10001-y

Keywords

  • Data Integration
  • Big Data
  • Data-Intensive Flows
  • Metadata