Suffix Array Blocking for Efficient Record Linkage and De-duplication in Sliding Window Fashion

  • Conference paper
Proceedings of the International Conference on Data Engineering and Communication Technology

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 468)

Abstract

Record linkage is an essential process in data integration. It is used to merge and match records, and to remove duplicates, across several databases that refer to the same entities. De-duplication is the process of removing duplicate records within a single database. Because of the complexity of today's databases, matching records within a single database is an important task in its own right. Indexing techniques are used to implement record linkage and de-duplication efficiently. Our additional grouping technique, based on the Jaro-Winkler similarity measure, exploits the ordering used by the index to merge similar blocks at minimal extra cost, resulting in much higher accuracy while retaining the high scalability of the base suffix array method. We carry out an in-depth analysis of our method and show results from experiments using Cora, restaurant and real identity data, which highlight the importance of efficient indexing and blocking in real-world applications where data sets contain a large number of records. This paper presents suffix array blocking for efficient record linkage and de-duplication in sliding window fashion.
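
The idea summarized in the abstract (index the suffixes of each blocking key, sort them, then slide a window over neighbouring suffixes and merge blocks whose suffixes are similar under Jaro-Winkler) can be illustrated with a minimal Python sketch. This is not the author's implementation. The function names, the parameters `min_len`, `max_block`, `window` and `threshold`, the choice to apply the block size limit after merging, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def jaro(s, t):
    """Jaro similarity between two strings, in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    reach = max(max(len(s), len(t)) // 2 - 1, 0)  # matching distance
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    for i, ch in enumerate(s):
        for j in range(max(0, i - reach), min(len(t), i + reach + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                break
    m = sum(s_hit)
    if m == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3


def jaro_winkler(s, t, p=0.1):
    """Jaro-Winkler: Jaro boosted by a shared prefix of up to 4 characters."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)


def suffixes(key, min_len):
    """All suffixes of key with length >= min_len (the key itself if shorter)."""
    return [key[i:] for i in range(max(len(key) - min_len, 0) + 1)]


def build_blocks(records, min_len=4, max_block=10, window=2, threshold=0.9):
    """records maps record id -> blocking key value (hypothetical layout)."""
    index = defaultdict(set)  # suffix -> ids of records containing it
    for rid, key in records.items():
        for suf in suffixes(key, min_len):
            index[suf].add(rid)
    ordered = sorted(index)  # alphabetical order of the suffix index
    merged = {suf: set(index[suf]) for suf in ordered}
    for i, suf in enumerate(ordered):
        # Compare each suffix with the next (window - 1) suffixes in order.
        for other in ordered[i + 1:i + window]:
            if jaro_winkler(suf, other) >= threshold:
                merged[suf] |= index[other]
    # Assumption: the size limit is applied after merging.
    return [ids for ids in merged.values() if 1 < len(ids) <= max_block]


def candidate_pairs(blocks):
    """Unique record pairs that share at least one block."""
    return {pair for ids in blocks for pair in combinations(sorted(ids), 2)}


records = {1: "catherine", 2: "katherine", 3: "catherina", 4: "smith"}
print(sorted(candidate_pairs(build_blocks(records))))
```

In this toy run, records 1 and 2 share exact suffixes such as "atherine", while record 3 joins them only because neighbouring suffixes such as "atherina" and "atherine" are merged by the similarity check; record 4 shares no block with the others. A larger `window` compares more neighbours per suffix at higher cost.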

Acknowledgments

The author would like to thank colleagues, friends, fellow researchers, and everyone who supported or was associated with this research work.

Author information

Corresponding author

Correspondence to Yamini Warke.

Copyright information

© 2017 Springer Science+Business Media Singapore

About this paper

Cite this paper

Yamini Warke (2017). Suffix Array Blocking for Efficient Record Linkage and De-duplication in Sliding Window Fashion. In: Satapathy, S., Bhateja, V., Joshi, A. (eds) Proceedings of the International Conference on Data Engineering and Communication Technology. Advances in Intelligent Systems and Computing, vol 468. Springer, Singapore. https://doi.org/10.1007/978-981-10-1675-2_7

  • DOI: https://doi.org/10.1007/978-981-10-1675-2_7

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-1674-5

  • Online ISBN: 978-981-10-1675-2

  • eBook Packages: Engineering, Engineering (R0)