Abstract
Functional dependencies are important for the definition of constraints and relationships that have to be satisfied by every database instance. Relaxed functional dependencies (RFDs) can be used for data exploration and profiling in datasets with lower data quality. In this work, we present an approach for RFD discovery in heterogeneous data lakes. More specifically, the goal of this work is to find RFDs from structured, semi-structured, and graph data. Our solution brings novelty to this problem in the following aspects: (1) We introduce a generic metamodel to the problem of RFD discovery, which allows us to define and detect RFDs for data stored in heterogeneous sources in an integrated manner. (2) We apply clustering techniques during RFD discovery for partitioning and pruning. (3) We performed an intensive evaluation with nine datasets, which shows that our approach is effective for discovering meaningful RFDs, reducing redundancy, and detecting inconsistent data.
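To make the idea of a relaxed dependency concrete, the sketch below checks one common relaxation: a dependency that is allowed to fail on a small fraction of tuples, measured with the g3 error known from approximate FD discovery (as in TANE). This is an illustration only, not the approach presented in this paper. The generic metamodel and the clustering-based partitioning and pruning are not reproduced, and the function names, the sample rows, and the thresholds are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def g3_error(rows, lhs, rhs):
    """Fraction of rows that must be removed so that lhs -> rhs holds exactly.

    rows: list of dicts (one dict per tuple); lhs, rhs: lists of attribute names.
    Illustrative helper, not taken from the paper.
    """
    if not rows:
        return 0.0
    # Group rows by their LHS values and count each RHS value per group.
    groups = defaultdict(Counter)
    for row in rows:
        lhs_key = tuple(row[a] for a in lhs)
        rhs_key = tuple(row[a] for a in rhs)
        groups[lhs_key][rhs_key] += 1
    # In each group, keep only the rows carrying the most frequent RHS value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return (len(rows) - kept) / len(rows)


def holds_relaxed(rows, lhs, rhs, max_error):
    """True if lhs -> rhs holds once at most max_error of the rows are ignored."""
    return g3_error(rows, lhs, rhs) <= max_error


# Hypothetical sample data: one of five rows violates zip -> city.
rows = [
    {"zip": "52062", "city": "Aachen"},
    {"zip": "52062", "city": "Aachen"},
    {"zip": "52062", "city": "Aix-la-Chapelle"},
    {"zip": "50667", "city": "Köln"},
    {"zip": "50667", "city": "Köln"},
]

print(g3_error(rows, ["zip"], ["city"]))             # 0.2
print(holds_relaxed(rows, ["zip"], ["city"], 0.25))  # True
print(holds_relaxed(rows, ["zip"], ["city"], 0.1))   # False
```

The exact dependency zip -> city fails on this data, but the relaxed version holds at a 25% tolerance. The violating row is also a candidate for an inconsistency worth inspecting, which is the kind of data-quality signal the abstract refers to.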
Notes
- 3. Library used in implementation: https://github.com/deepcharles/ruptures.
- 5. Full results: http://dbis.rwth-aachen.de/cms/staff/hai/RFDDiscovery/res.
Acknowledgements
The authors would like to thank the German Research Foundation DFG for the kind support within the Cluster of Excellence “Internet of Production” (Project ID: EXC 2023/390621612).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Hai, R., Quix, C., Wang, D. (2019). Relaxed Functional Dependency Discovery in Heterogeneous Data Lakes. In: Laender, A., Pernici, B., Lim, EP., de Oliveira, J. (eds) Conceptual Modeling. ER 2019. Lecture Notes in Computer Science, vol 11788. Springer, Cham. https://doi.org/10.1007/978-3-030-33223-5_19
DOI: https://doi.org/10.1007/978-3-030-33223-5_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33222-8
Online ISBN: 978-3-030-33223-5
eBook Packages: Computer Science, Computer Science (R0)