Abstract
With the arrival of Data Lakes (DL), there is an increasing need for efficient dataset classification to support data analysis and information retrieval. Our goal is to use meta-features describing datasets to detect whether they are similar. We use a novel proximity mining approach to assess the similarity of datasets. The proximity scores serve as an efficient first step: only pairs of datasets with high proximity are selected for the more time-consuming schema matching and deduplication. The proposed approach prunes unnecessary computations early, thus improving the efficiency of similar-schema search. We evaluate our approach in experiments on the OpenML online DL; the results show significant efficiency gains above 25% compared to matching without early pruning, and recall rates above 90% under certain scenarios.
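The early-pruning idea can be illustrated with a minimal sketch. It is not the method evaluated in the chapter: the three meta-features, the relative-difference proximity score, the 0.8 threshold, and the function names `meta_features`, `proximity`, and `candidate_pairs` are all assumptions chosen for illustration. It only shows how a cheap score over dataset-level summaries can decide which pairs are passed on to expensive schema matching.

```python
from itertools import combinations


def meta_features(rows):
    """Summarise a dataset (a non-empty list of dict rows) as a few numbers.

    These three meta-features are illustrative assumptions only.
    """
    columns = list(rows[0].keys())
    numeric = [
        c for c in columns
        if all(isinstance(r[c], (int, float)) for r in rows)
    ]
    return {
        "n_attributes": float(len(columns)),
        "n_instances": float(len(rows)),
        "numeric_ratio": len(numeric) / len(columns),
    }


def proximity(a, b):
    """Return 1 minus the mean relative difference of the meta-features.

    All meta-features are non-negative, so the score lies in [0, 1],
    with 1.0 meaning identical profiles. This scoring rule is a
    hypothetical stand-in, not the chapter's proximity model.
    """
    diffs = []
    for key in a:
        largest = max(abs(a[key]), abs(b[key]))
        diffs.append(0.0 if largest == 0 else abs(a[key] - b[key]) / largest)
    return 1.0 - sum(diffs) / len(diffs)


def candidate_pairs(datasets, threshold=0.8):
    """Keep only dataset pairs whose proximity reaches the threshold.

    Pairs below the threshold are pruned and never reach the costly
    schema matching and deduplication step.
    """
    profiles = {name: meta_features(rows) for name, rows in datasets.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(profiles), 2)
        if proximity(profiles[x], profiles[y]) >= threshold
    ]
```

In this sketch each dataset is profiled once, so the pairwise comparison only touches small summary vectors rather than the data itself. The threshold trades efficiency against recall: a higher value prunes more pairs but risks discarding genuinely similar datasets.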
Acknowledgement
This research has been funded by the European Commission through the Erasmus Mundus Joint Doctorate (IT4BI-DC).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Alserafi, A., Calders, T., Abelló, A., Romero, O. (2017). DS-Prox: Dataset Proximity Mining for Governing the Data Lake. In: Beecks, C., Borutta, F., Kröger, P., Seidl, T. (eds) Similarity Search and Applications. SISAP 2017. Lecture Notes in Computer Science, vol 10609. Springer, Cham. https://doi.org/10.1007/978-3-319-68474-1_20
DOI: https://doi.org/10.1007/978-3-319-68474-1_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68473-4
Online ISBN: 978-3-319-68474-1
eBook Packages: Computer Science, Computer Science (R0)