Abstract
Record linkage is an essential process in data integration. It is used for merging, matching, and removing duplicates across several databases that refer to the same entities. De-duplication is the process of removing duplicate records within a single database. Because of the complexity of today's databases, matching records within a single database is an important task. Indexing techniques are used to implement record linkage and de-duplication efficiently. Our additional grouping technique, based on the Jaro-Winkler similarity measure, exploits the ordering used by the index to merge similar blocks at negligible extra cost, yielding much higher accuracy while retaining the high scalability of the base suffix array method. We carry out an in-depth analysis of our method and present results from experiments on the Cora, restaurant, and real identity data sets, which highlight the importance of efficient indexing and blocking in real-world applications where data sets contain a large number of records. This paper presents suffix array blocking for efficient record linkage and de-duplication in a sliding window fashion.
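The abstract describes the method only in outline, so the following Python sketch is an illustration under stated assumptions, not the author's implementation. It indexes every suffix of each blocking key, sorts the suffixes, and slides a window of two over the sorted list, merging neighbouring blocks whose suffixes have a Jaro-Winkler similarity at or above a threshold. The function names, the parameters `min_len`, `max_block`, and `threshold`, their default values, and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaro(s, t):
    """Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        # Look for an unused matching character within the match window.
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_matched = [c for c, hit in zip(s, s_hit) if hit]
    t_matched = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_matched, t_matched)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, p=0.1):
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def candidate_pairs(records, min_len=4, max_block=5, threshold=0.9):
    """records maps record id -> blocking key; returns a set of id pairs."""
    index = defaultdict(set)
    for rec_id, key in records.items():
        # Assumption: keys shorter than min_len are indexed whole.
        sufs = [key[i:] for i in range(len(key) - min_len + 1)] or [key]
        for suf in sufs:
            index[suf].add(rec_id)
    # Drop suffixes shared by too many records (too common to be useful).
    ordered = sorted(s for s, ids in index.items() if len(ids) <= max_block)
    blocks, current, prev = [], set(), None
    for suf in ordered:
        # Window of two: compare each suffix with its sorted predecessor.
        if prev is not None and jaro_winkler(prev, suf) >= threshold:
            current |= index[suf]
        else:
            if current:
                blocks.append(current)
            current = set(index[suf])
        prev = suf
    if current:
        blocks.append(current)
    pairs = set()
    for block in blocks:
        pairs.update(combinations(sorted(block), 2))
    return pairs


records = {1: "catherine", 2: "katherine", 3: "catherina", 4: "smith"}
pairs = candidate_pairs(records)
```

In this sketch the merge step reuses the sort order that the index already needs, so grouping similar suffixes costs one extra similarity comparison per suffix. Record 3 shares no exact suffix with records 1 and 2, yet it still ends up in their block because "atherina" and "atherine" are adjacent and similar.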
Acknowledgments
The author would like to thank colleagues, friends, fellow researchers, and everyone who supported or was associated with this research work.
Copyright information
© 2017 Springer Science+Business Media Singapore
About this paper
Cite this paper
Yamini Warke (2017). Suffix Array Blocking for Efficient Record Linkage and De-duplication in Sliding Window Fashion. In: Satapathy, S., Bhateja, V., Joshi, A. (eds) Proceedings of the International Conference on Data Engineering and Communication Technology. Advances in Intelligent Systems and Computing, vol 468. Springer, Singapore. https://doi.org/10.1007/978-981-10-1675-2_7
DOI: https://doi.org/10.1007/978-981-10-1675-2_7
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-1674-5
Online ISBN: 978-981-10-1675-2
eBook Packages: Engineering, Engineering (R0)