DS-Prox: Dataset Proximity Mining for Governing the Data Lake

Alserafi, Ayman; Calders, Toon; Abelló, Alberto; Romero, Oscar

doi:10.1007/978-3-319-68474-1_20

Ayman Alserafi, Toon Calders, Alberto Abelló & Oscar Romero

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10609)

Included in the following conference series:

International Conference on Similarity Search and Applications

Abstract

With the arrival of Data Lakes (DL), there is an increasing need for efficient dataset classification to support data analysis and information retrieval. Our goal is to use meta-features describing datasets to detect whether they are similar. We utilise a novel proximity mining approach to assess the similarity of datasets. The proximity scores serve as an efficient first step: only pairs of datasets with high proximity are selected for further, time-consuming schema matching and deduplication. The proposed approach prunes unnecessary computations early, thus improving the efficiency of similar-schema search. We evaluate our approach in experiments on the OpenML online DL, which show significant efficiency gains above 25% compared to matching without early pruning, and recall rates above 90% under certain scenarios.
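
The early-pruning idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' proximity model, which the abstract does not specify. The meta-feature names, the example values, the relative-difference measure in `proximity`, and the 0.8 threshold are all assumptions chosen only for illustration.

```python
from itertools import combinations

# Hypothetical meta-features per dataset (names and values are illustrative only).
META = {
    "census_a": {"n_attributes": 14, "pct_numeric": 0.43, "pct_missing": 0.07},
    "census_b": {"n_attributes": 15, "pct_numeric": 0.40, "pct_missing": 0.05},
    "images": {"n_attributes": 785, "pct_numeric": 1.00, "pct_missing": 0.00},
}


def proximity(a, b):
    """Return a score in [0, 1]; 1 means identical meta-features.

    Assumed measure: one minus the mean relative difference per meta-feature.
    """
    diffs = []
    for key in a:
        largest = max(abs(a[key]), abs(b[key]))
        diffs.append(0.0 if largest == 0 else abs(a[key] - b[key]) / largest)
    return 1.0 - sum(diffs) / len(diffs)


def candidate_pairs(meta, threshold=0.8):
    """Keep only the pairs whose proximity reaches the (assumed) threshold."""
    return [
        (x, y)
        for x, y in combinations(sorted(meta), 2)
        if proximity(meta[x], meta[y]) >= threshold
    ]


# Only the surviving pairs would be passed on to expensive schema matching.
pairs = candidate_pairs(META)
print(pairs)
```

In this toy setting, only one of the three possible pairs survives the filter, so the costly matching step would run once instead of three times. Raising the threshold prunes more pairs at the risk of missing truly similar datasets, which is the efficiency versus recall trade-off the abstract reports.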

Acknowledgement

This research has been funded by the European Commission through the Erasmus Mundus Joint Doctorate (IT4BI-DC).

Author information

Authors and Affiliations

Universitat Politècnica de Catalunya - BarcelonaTech, Barcelona, Catalunya, Spain
Ayman Alserafi, Alberto Abelló & Oscar Romero
Université Libre de Bruxelles (ULB), Brussels, Belgium
Ayman Alserafi & Toon Calders
Universiteit Antwerpen (UAntwerp), Antwerp, Belgium
Toon Calders

Corresponding author

Correspondence to Ayman Alserafi.

Editor information

Editors and Affiliations

Fraunhofer Institute for Applied Information Technology, Sankt Augustin, Germany
Christian Beecks
Ludwig-Maximilians-Universität München, Munich, Germany
Felix Borutta
Ludwig-Maximilians-Universität München, Munich, Germany
Peer Kröger
Ludwig-Maximilians-Universität München, Munich, Germany
Thomas Seidl

Cite this paper

Alserafi, A., Calders, T., Abelló, A., Romero, O. (2017). DS-Prox: Dataset Proximity Mining for Governing the Data Lake. In: Beecks, C., Borutta, F., Kröger, P., Seidl, T. (eds) Similarity Search and Applications. SISAP 2017. Lecture Notes in Computer Science, vol 10609. Springer, Cham. https://doi.org/10.1007/978-3-319-68474-1_20

DOI: https://doi.org/10.1007/978-3-319-68474-1_20
Published: 28 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68473-4
Online ISBN: 978-3-319-68474-1
eBook Packages: Computer Science, Computer Science (R0)
