An Approach to Extracting Topic-guided Views from the Sources of a Data Lake

Abstract

In recent years, data lakes have emerged as an effective and efficient support for extracting information and knowledge from huge amounts of highly heterogeneous and quickly changing data sources. Data lake management requires the definition of new techniques, very different from the ones adopted for data warehouses in the past. In this scenario, one of the most challenging issues is the extraction of topic-guided (i.e., thematic) views from the (very heterogeneous and often unstructured) sources of a data lake. In this paper, we propose a new network-based model to uniformly represent the structured, semi-structured and unstructured sources of a data lake. Then, we present a new approach to “structuring”, at least partially, unstructured data. Finally, we define a technique to extract topic-guided views from the sources of a data lake, based on similarity and other semantic relationships among source metadata.

Notes

  1. http://dbpedia.org/

  2. https://www.zaloni.com/

  3. Recall that, in the database context, a view is the result of a query, or of a more complex extraction process, that users can exploit for further computations.

  4. http://www.opencalais.com

  5. Here and in the following, to make the presentation smoother, we use the term “source” (resp., “keyword”) to denote both the source (resp., a keyword) and the corresponding node associated with it.

  6. In this paper, we use the term “lemma” according to the meaning it has in BabelNet (Navigli and Ponzetto 2012). Here, given a term, its lemmas are other objects (terms, emoticons, etc.) that contribute to specifying its meaning.

  7. Note that Phases 2 and 4 could be merged into a single one, which would make it unnecessary to define arcs with label “lemmaOf”. Here, we maintain these arcs and both phases to keep the information about similarity between nodes for future uses.

  8. Whenever this does not happen, the mapping can be automatically provided by the DBpedia Lookup Service (http://wiki.dbpedia.org/projects/dbpedia-lookup).

  9. Here, two nodes are assumed to be equal if the corresponding names coincide.

  10. In Figs. 3 and 4, we do not show the arc labels for the sources C, W and E because all of them are “contains” and their presence would have complicated the layout unnecessarily.

  11. Hereafter, we use the notation S.o to indicate the object o of the source S.

  12. In this figure, for layout reasons, we do not show the arc labels because they are the same as the ones of the corresponding arcs of Figs. 3, 4 and 5.

  13. Prefixes dbo and dbr stand for http://dbpedia.org/ontology/ and http://dbpedia.org/resource/, respectively.

  14. Consider that, since we have 20 real sources in the data lakes adopted in our experimental campaign, the value of Hj can range in the real interval [0.05, 20].

  15. As a matter of fact, a topic set with 8 keywords would encompass a great number of different concepts and, as such, it would generally not be able to capture a clear and specific desire of a user.

References

  1. Abiteboul, S., & Duschka, O. (1998). Complexity of answering queries using materialized views. In Proc. of the International Symposium on Principles of Database Systems (SIGMOD/PODS’98) (pp. 254–263). Seattle: ACM.

  2. Aversano, L., Intonti, R., Quattrocchi, C., & Tortorella, M. (2010). Building a virtual view of heterogeneous data source views. In Proc. of the International Conference on Software and Data Technologies (ICSOFT’10) (pp. 266–275). Athens: INSTICC Press.

  3. Bachtarzi, C., & Bachtarzi, F. (2015). A model-driven approach for materialized views definition over heterogeneous databases. In Proc. of the International Conference on New Technologies of Information and Communication (NTIC’15) (pp. 1–5). Mila: IEEE.

  4. Bergamaschi, S., Castano, S., Vincini, M., & Beneventano, D. (2001). Semantic integration and query of heterogeneous information sources. Data & Knowledge Engineering, 36(3), 215–249.

  5. Bidoit, N., Colazzo, D., Malla, N., & Sartiani, C. (2018). Evaluating queries and updates on big xml documents. Information Systems Frontiers, 20(1), 63–90.

  6. Bilalli, B., Abelló, A., Aluja-Banet, T., & Wrembel, R. (2016). Towards intelligent data analysis: the metadata challenge. In Proc. of the International Conference on Internet of Things and Big Data (IoTBD’16) (pp. 331–338). Rome, Italy.

  7. Biskup, J., & Embley, D. (2003). Extracting information from heterogeneous information sources using ontologically specified target views. Information Systems, 28(3), 169–212. Elsevier.

  8. Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022. Microtone Publishing.

  9. Bouadjenek, M.R., Hacid, H., & Bouzeghoub, M. (2016). Social networks and information retrieval, how are they converging? A survey, a taxonomy and an analysis of social information retrieval approaches and platforms. Information Systems, 56, 1–18.

  10. Bougouin, A., Boudin, F., & Daille, B. (2013). Topicrank: Graph-based topic ranking for keyphrase extraction. In Proc. of the International Joint Conference on Natural Language Processing (IJCNLP’13) (pp. 543–551). Nagoya: Asian Federation of Natural Language Processing.

  11. Brackenbury, W., Liu, R., Mondal, M., Elmore, A., Ur, B., Chard, K., & Franklin, M. (2018). Draining the data swamp: A similarity-based approach. In Proc. of the International Workshop on Human-in-the-loop Data Analytics (HILDA’18) (p. 13). Houston: ACM.

  12. Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C., & Jatowt, A. (2020). YAKE! Keyword extraction from single documents using multiple local features. Information Sciences, 509, 257–289. Elsevier.

  13. Castano, S., & Antonellis, V.D. (1999). Building views over semistructured data sources. In Proc. of the International Conference on Conceptual Modeling (ER’99) (pp. 146–160). Paris: Springer.

  14. Chen, C., Shyu, M.-L., & Chen, S.-C. (2016). Weighted subspace modeling for semantic concept retrieval using gaussian mixture models. Information Systems Frontiers, 18(5), 877–889.

  15. Corbellini, A., Mateos, C., Zunino, A., Godoy, D., & Schiaffino, S. (2017). Persisting big-data: The NoSQL landscape. Information Systems, 63, 1–23. Elsevier.

  16. De Meo, P., Quattrone, G., Terracina, G., & Ursino, D. (2006). Integration of XML Schemas at various “severity” levels. Information Systems, 31(6), 397–434.

  17. Debattista, J., Lange, C., & Auer, S. (2014). Representing dataset quality metadata using multi-dimensional views. In Proc. of the International Conference on Semantic Systems (SEM’14) (pp. 92–99). Leipzig: ACM.

  18. Dessi, A., & Atzori, M. (2016). A machine-learning approach to ranking rdf properties. Future Generation Computer Systems, 54, 366–377.

  19. Dublin Core Metadata Initiative. (2012). DCMI Metadata Terms. Technical report.

  20. Fan, W., Wang, X., & Wu, Y. (2016). Answering pattern queries using views. IEEE Transactions on Knowledge and Data Engineering, 28(2), 326–341. IEEE.

  21. Fang, H. (2015). Managing data lakes in big data era: What’s a data lake and why has it became popular in data management ecosystem. In Proc. of the International Conference on Cyber Technology in Automation (CYBER’15) (pp. 820–824). Shenyang: IEEE.

  22. Farid, M., Roatis, A., Ilyas, I., Hoffmann, H., & Chu, X. (2016). CLAMS: bringing quality to data lakes. In Proc. of the International Conference on Management of Data (SIGMOD/PODS’16) (pp. 2089–2092). San Francisco: ACM.

  23. García-Moya, L., Kudama, S., Aramburu, M., & Berlanga, R. (2013). Storing and analysing voice of the market data in the corporate data warehouse. Information Systems Frontiers, 15(3), 331–349.

  24. Hai, R., Geisler, S., & Quix, C. (2016). Constance: an intelligent data lake system. In Proc. of the International Conference on Management of Data (SIGMOD 2016) (pp. 2097–2100). San Francisco: ACM.

  25. Hai, R., Quix, C., & Zhou, C. (2018). Query rewriting for heterogeneous data lakes. In Proc. of the European Conference on Advances in Databases and Information Systems (ADBIS’18) (pp. 35–49). Budapest: Springer.

  26. Halevy, A. (2001). Answering queries using views: A survey. The VLDB Journal, 10(4), 270–294. Springer.

  27. Hamadou, H., & Ghozzi, F. (2018). Querying heterogeneous document stores. In Proc. of the International Conference on Enterprise Information Systems (ICEIS’18) (pp. 58–68). Madeira, Portugal.

  28. Heath, T., & Bizer, C. (2011). Linked data: Evolving the web into a global data space. Synthesis lectures on the semantic web: theory and technology, 1(1), 1–136.

  29. Hirschman, A. (1964). The paternity of an index. The American Economic Review, 54(5), 761–762.

  30. Hitzler, P., & Janowicz, K. (2013). Linked data, big data, and the 4th paradigm. Semantic Web, 4(3), 233–235.

  31. Janjua, N., Hussain, F., & Hussain, O. (2013). Semantic information and knowledge integration through argumentative reasoning to support intelligent decision making. Information Systems Frontiers, 15(2), 167–192.

  32. Keith, A., Cyganiak, R., Hausenblas, M., & Zhao, J. (2011). Describing linked datasets with the void vocabulary. Technical report.

  33. Klettke, M., Awolin, H., Storl, U., Muller, D., & Scherzinger, S. (2017). Uncovering the evolution history of data lakes. In Proc. of the International Conference on Big data (IEEE bigdata 2017) (pp. 2462–2471). Boston: IEEE.

  34. Kondrak, G. (2005). N-gram similarity and distance. In String processing and Information Retrieval (pp. 115–126): Springer.

  35. Konstantinou, N., Koehler, M., Abel, E., Civili, C., Neumayr, B., Sallinger, E., Fernandes, A., Gottlob, G., Keane, J., & Libkin, L. (2017). The VADA architecture for cost-effective data wrangling. In Proc. of the International Conference on Management of Data (SIGMOD’17) (pp. 1599–1602). Chicago: ACM.

  36. Lassila, O., Swick, R.R., et al. (1998). Resource description framework (RDF) model and syntax specification.

  37. Maccioni, A., & Torlone, R. (2018). KAYAK: a framework for just-in-time data preparation in a data lake. In Proc. of the International Conference on Advanced Information Systems Engineering (CAiSE’18) (pp. 474–489). Tallinn: Springer.

  38. Madhavan, J., Bernstein, P., & Rahm, E. (2001). Generic schema matching with Cupid. In Proc. of the International Conference on Very Large Data Bases (VLDB 2001) (pp. 49–58). Rome: Morgan Kaufmann.

  39. McPherson, M., Smith-Lovin, L., & Cook, J. (2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27, 415–444. JSTOR.

  40. Mouttham, A., Kuziemsky, C., Langayan, D., Peyton, L., & Pereira, J. (2012). Interoperable support for collaborative, mobile, and accessible health care. Information Systems Frontiers, 14(1), 73–85.

  41. Mouzakitis, S., Papaspyros, D., Petychakis, M., Koussouris, S., Zafeiropoulos, A., Fotopoulou, E., Farid, L., Orlandi, F., Attard, J., & Psarras, J. (2017). Challenges and opportunities in renovating public sector information by enabling linked data and analytics. Information Systems Frontiers, 19(2), 321–336.

  42. Tsvetovat, M., & Kouznetsov, A. (2011). Social Network Analysis for startups: Finding connections on the social web. O’Reilly Media Inc.

  43. Navigli, R., & Ponzetto, S. (2012). BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193, 217–250. Elsevier.

  44. Oram, A. (2015). Managing the Data Lake. Sebastopol: O’Reilly.

  45. Palopoli, L., Pontieri, L., Terracina, G., & Ursino, D. (2000). Intensional and extensional integration and abstraction of heterogeneous databases. Data & Knowledge Engineering, 35(3), 201–237.

  46. Palopoli, L., Saccà, D., Terracina, G., & Ursino, D. (2003a). Uniform techniques for deriving similarities of objects and subschemes in heterogeneous databases. IEEE Transactions on Knowledge and Data Engineering, 15(2), 271–294.

  47. Palopoli, L., Terracina, G., & Ursino, D. (2001). A graph-based approach for extracting terminological properties of elements of XML documents. In Proc. of the International Conference on Data Engineering (ICDE 2001) (pp. 330–337). Heidelberg: IEEE Computer Society.

  48. Palopoli, L., Terracina, G., & Ursino, D. (2003b). DIKE: A system supporting the semi-automatic construction of Cooperative Information Systems from heterogeneous databases. Software Practice & Experience, 33(9), 847–884.

  49. Palopoli, L., Terracina, G., & Ursino, D. (2003c). Experiences using DIKE, a system for supporting cooperative information system and data warehouse design. Information Systems, 28(7), 835–865.

  50. Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1, 1–20. Wiley, New York.

  51. Singh, K., & Singh, V. (2016). Answering graph pattern query using incremental views. In Proc. of the International Conference on Computing (ICCCA’16) (pp. 54–59). Greater Noida: IEEE.

  52. Spink, A., Wolfram, D., Jansen, M.B.J., & Saracevic, T. (2001). Searching the web: the public and their queries. Journal of the American Society for Information Science and Technology, 52(3), 226–234.

  53. Wang, J., Li, J., & Yu, J. (2011). Answering tree pattern queries using views: a revisit. In Proc. of the International Conference on Extending Database Technology (EDBT/ICDT’11) (pp. 153–164). Uppsala: ACM.

  54. Wang, J., & Yu, J. (2012). Revisiting answering tree pattern queries using views. ACM Transactions on Database Systems, 37(3), 18. ACM.

  55. Wu, X., Theodoratos, D., & Wang, W. (2009). Answering XML queries using materialized views revisited. In Proc. of the International Conference on Information and Knowledge Management (CIKM ’09) (pp. 475–484). Hong Kong: ACM.

  56. Yi, J., Maghoul, F., & Pedersen, J. (2008). Deciphering mobile search patterns: a study of yahoo! mobile search queries. In Proceedings of the 17th International Conference on World Wide Web, WWW ’08 (pp. 257–266). New York: ACM.

Author information

Corresponding author

Correspondence to Claudia Diamantini.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Diamantini, C., Lo Giudice, P., Potena, D. et al. An Approach to Extracting Topic-guided Views from the Sources of a Data Lake. Inf Syst Front 23, 243–262 (2021). https://doi.org/10.1007/s10796-020-10010-x

Keywords

  • Data lakes
  • Unstructured data sources
  • Metadata management
  • Thematic views
  • Semantic similarities
  • DBpedia