Suffix Array Blocking for Efficient Record Linkage and De-duplication in Sliding Window Fashion

Yamini Warke

doi:10.1007/978-981-10-1675-2_7

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 468)

Abstract

Record linkage is an essential process in data integration. It is used for merging, matching and duplicate removal across several databases that refer to the same entities. De-duplication is the process of removing duplicate records within a single database. Because of the complexity of today's databases, matching records within a single database is an important task. Indexing techniques are used to implement record linkage and de-duplication efficiently. Our additional grouping strategy with the Jaro-Winkler similarity measure exploits the ordering used by the index to merge similar blocks at negligible additional cost, resulting in much higher accuracy while retaining the high scalability of the base suffix array method. We carry out an in-depth analysis of our method and show results from experiments on Cora, restaurant and real identity data, which highlight the importance of efficient indexing and blocking in real-world applications where data sets contain a large number of records. This paper presents suffix array blocking for effective record linkage and de-duplication in sliding window fashion.
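
The following is a minimal Python sketch of the general idea, not the chapter's implementation. It assumes each record has already been reduced to a single blocking key string. Every suffix of a key, down to a minimum length, is indexed. The sorted suffix list is then scanned with a small sliding window, and the blocks of neighbouring suffixes are merged when their Jaro-Winkler similarity reaches a threshold. The function names, the parameters (min_len, max_block, threshold, window) and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def jaro(s, t):
    """Jaro similarity of two strings, in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    reach = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - reach), min(len(t), i + reach + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, p=0.1):
    """Jaro similarity boosted by the common prefix (at most 4 characters)."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def suffix_blocks(records, min_len=4, max_block=10):
    """Map each suffix (length >= min_len) to the ids of records containing it."""
    index = defaultdict(set)
    for rid, key in records.items():
        for start in range(max(len(key) - min_len, 0) + 1):
            index[key[start:]].add(rid)
    # Assumed pruning rule: very common suffixes make oversized blocks, so drop them.
    return {s: ids for s, ids in index.items() if len(ids) <= max_block}


def merge_similar(index, threshold=0.9, window=2):
    """Slide a window over the sorted suffixes and merge blocks of similar ones."""
    suffixes = sorted(index)
    blocks = {s: set(index[s]) for s in suffixes}
    for i, s in enumerate(suffixes):
        for t in suffixes[i + 1:i + window]:
            if jaro_winkler(s, t) >= threshold:
                blocks[s] = blocks[t] = blocks[s] | blocks[t]
    return blocks


def candidate_pairs(blocks):
    """All record-id pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical blocking keys, for illustration only.
records = {1: "catherine", 2: "katherine", 3: "catherina", 4: "smith"}
index = suffix_blocks(records)
print(sorted(candidate_pairs(index)))                 # [(1, 2)]
print(sorted(candidate_pairs(merge_similar(index))))  # [(1, 2), (1, 3), (2, 3)]
```

In this toy data, exact suffix matching pairs only records 1 and 2. Record 3 shares no exact suffix with them, but the neighbouring suffixes "atherina" and "atherine" are similar enough for their blocks to be merged, so record 3 becomes a candidate as well.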

Acknowledgments

The author would like to thank colleagues, friends, fellow researchers and everyone who supported or was associated with this research work.

Author information

Authors and Affiliations

Dr. D.Y. Patil School of Engineering and Technology, Savitribai Phule Pune University, Pune, India
Yamini Warke

Corresponding author

Correspondence to Yamini Warke.

Editor information

Editors and Affiliations

Department of Computer Science & Engineering, ANITS, Visakhapatnam, India
Suresh Chandra Satapathy
Dept. of ECE, Shri Ramswaroop Mem. Group of Prof. Clg, Lucknow, Uttar Pradesh, India
Vikrant Bhateja
Sabar Institute of Technology, Tajpur, Sabarkantha, Gujarat, India
Amit Joshi

About this paper

Cite this paper

Yamini Warke (2017). Suffix Array Blocking for Efficient Record Linkage and De-duplication in Sliding Window Fashion. In: Satapathy, S., Bhateja, V., Joshi, A. (eds) Proceedings of the International Conference on Data Engineering and Communication Technology. Advances in Intelligent Systems and Computing, vol 468. Springer, Singapore. https://doi.org/10.1007/978-981-10-1675-2_7

DOI: https://doi.org/10.1007/978-981-10-1675-2_7
Published: 24 August 2016
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-1674-5
Online ISBN: 978-981-10-1675-2
eBook Packages: Engineering, Engineering (R0)
