Relaxed Functional Dependency Discovery in Heterogeneous Data Lakes

Hai, Rihan; Quix, Christoph; Wang, Dan

doi:10.1007/978-3-030-33223-5_19

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11788)

Included in the following conference series:

International Conference on Conceptual Modeling

Abstract

Functional dependencies are important for the definition of constraints and relationships that have to be satisfied by every database instance. Relaxed functional dependencies (RFDs) can be used for data exploration and profiling in datasets with lower data quality. In this work, we present an approach for RFD discovery in heterogeneous data lakes. More specifically, the goal of this work is to find RFDs from structured, semi-structured, and graph data. Our solution brings novelty to this problem in the following aspects: (1) We introduce a generic metamodel to the problem of RFD discovery, which allows us to define and detect RFDs for data stored in heterogeneous sources in an integrated manner. (2) We apply clustering techniques during RFD discovery for partitioning and pruning. (3) We perform an extensive evaluation on nine datasets, which shows that our approach is effective for discovering meaningful RFDs, reducing redundancy, and detecting inconsistent data.
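
To make the notion of a relaxed dependency concrete, the following minimal sketch checks one common relaxation: an approximate FD that tolerates a bounded fraction of violating rows (the g3 error measure known from TANE). It is an illustration only and does not reproduce the metamodel-based, clustering-driven discovery approach of the paper; the function names, the sample rows, and the thresholds are hypothetical.

```python
from collections import Counter, defaultdict


def g3_error(rows, lhs, rhs):
    """Fraction of rows that must be removed so that lhs -> rhs holds exactly."""
    if not rows:
        return 0.0
    # Group rows by their left-hand-side values and count right-hand-side values.
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key][row[rhs]] += 1
    # Within each group, keep the most frequent rhs value; all other rows violate the FD.
    kept = sum(max(counts.values()) for counts in groups.values())
    return 1 - kept / len(rows)


def holds_relaxed(rows, lhs, rhs, threshold):
    """True if lhs -> rhs holds once at most `threshold` of the rows are ignored."""
    return g3_error(rows, lhs, rhs) <= threshold


# Hypothetical sample data: one inconsistent city spelling for the same zip code.
rows = [
    {"zip": "52062", "city": "Aachen"},
    {"zip": "52062", "city": "Aachen"},
    {"zip": "52062", "city": "Aix-la-Chapelle"},
    {"zip": "47805", "city": "Krefeld"},
]

error = g3_error(rows, ["zip"], "city")
exact = holds_relaxed(rows, ["zip"], "city", threshold=0.0)
relaxed = holds_relaxed(rows, ["zip"], "city", threshold=0.3)
```

With a threshold of zero the check degenerates to an exact FD test, which fails on the sample because of the single inconsistent row; a tolerance above the measured error accepts the dependency and, at the same time, flags that row as a candidate for inconsistent data.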

Notes

1. https://spark.apache.org/
2. https://neo4j-contrib.github.io/neo4j-apoc-procedures/
3. Library used in implementation: https://github.com/deepcharles/ruptures
4. Links: http://dbis.rwth-aachen.de/cms/staff/hai/RFDDiscovery/datasets
5. Full results: http://dbis.rwth-aachen.de/cms/staff/hai/RFDDiscovery/res

Acknowledgements

The authors would like to thank the German Research Foundation DFG for the kind support within the Cluster of Excellence “Internet of Production” (Project ID: EXC 2023/390621612).

Author information

Authors and Affiliations

RWTH Aachen University, Aachen, Germany
Rihan Hai & Dan Wang
Hochschule Niederrhein, University of Applied Sciences, Krefeld, Germany
Christoph Quix
Fraunhofer Institute for Applied Information Technology FIT, Sankt Augustin, Germany
Christoph Quix

Corresponding author

Correspondence to Rihan Hai.

Editor information

Editors and Affiliations

Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Alberto H. F. Laender
Politecnico di Milano, Milan, Italy
Barbara Pernici
Singapore Management University, Singapore, Singapore
Ee-Peng Lim
Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
José Palazzo M. de Oliveira

About this paper

Cite this paper

Hai, R., Quix, C., Wang, D. (2019). Relaxed Functional Dependency Discovery in Heterogeneous Data Lakes. In: Laender, A., Pernici, B., Lim, EP., de Oliveira, J. (eds) Conceptual Modeling. ER 2019. Lecture Notes in Computer Science, vol 11788. Springer, Cham. https://doi.org/10.1007/978-3-030-33223-5_19

DOI: https://doi.org/10.1007/978-3-030-33223-5_19
Published: 15 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33222-8
Online ISBN: 978-3-030-33223-5
eBook Packages: Computer Science, Computer Science (R0)
