An Efficient Algorithm for De-duplication of Demographic Data

  • Vandana Dixit Kaushik
  • Amit Bendale
  • Aditya Nigam
  • Phalguni Gupta
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7389)


This paper proposes an efficient algorithm for de-duplicating records based on demographic information consisting of two name strings, viz. the GivenName and Surname of an individual. The algorithm has two stages: enrolment and de-duplication. In both stages, every name string is reduced to a generic name string using phonetic reduction rules. Several name strings may therefore share the same generic name, and many individuals may share the same name. A generic name, together with all the name strings that reduce to it and their Ids, forms a bin. At the enrolment stage, a database of demographic information is built efficiently as an array of bins, where each bin is a singly linked list. At the de-duplication stage, the query name strings are reduced, and all bins neighbouring the reduced name strings are searched to determine the top k best matches. To evaluate the performance of the proposed algorithm, we have considered a large demographic database of 4,85,136 individuals. It has been observed that the phonetic reduction rules reduce both name strings by more than 90%. Experimental results reveal a very high hit rate at a low penetration rate.
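
The two stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the phonetic reduction rules, the bin key, or how neighbouring bins are found, so everything specific here is an assumption. The rules in `reduce_name` are made up for illustration, a dict of Python lists stands in for the array of singly linked lists, bins are keyed on the pair of generic names, and a bin counts as neighbouring when both generic names are within a small Levenshtein distance (`radius`) of the query. The names `reduce_name`, `NameIndex`, `enrol`, `query`, and `radius` are hypothetical.

```python
import heapq
import re
from collections import defaultdict


def reduce_name(name: str) -> str:
    """Reduce a name to a generic form. These rules are illustrative only."""
    s = re.sub(r"[^a-z]", "", name.lower())               # keep letters only
    for old, new in (("ph", "f"), ("w", "v"), ("ee", "i"), ("oo", "u")):
        s = s.replace(old, new)                            # assumed sound-alike rules
    s = s[:1] + s[1:].replace("h", "")                     # drop non-initial 'h'
    return re.sub(r"(.)\1+", r"\1", s)                     # collapse repeated letters


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution
        prev = cur
    return prev[-1]


class NameIndex:
    """Bins keyed by (generic given name, generic surname); an assumption."""

    def __init__(self):
        self.bins = defaultdict(list)

    def enrol(self, person_id, given, surname):
        # Enrolment: reduce both names and append the record to its bin.
        key = (reduce_name(given), reduce_name(surname))
        self.bins[key].append((person_id, given, surname))

    def query(self, given, surname, k=3, radius=1):
        # De-duplication: search neighbouring bins, rank by edit distance
        # between the original (unreduced) name strings.
        g, s = reduce_name(given), reduce_name(surname)
        scored = []
        for (bg, bs), records in self.bins.items():
            if levenshtein(g, bg) <= radius and levenshtein(s, bs) <= radius:
                for pid, rg, rs in records:
                    d = (levenshtein(given.lower(), rg.lower())
                         + levenshtein(surname.lower(), rs.lower()))
                    scored.append((d, pid))
        return heapq.nsmallest(k, scored)                  # top k best matches


index = NameIndex()
index.enrol(1, "Vandana", "Gupta")
index.enrol(2, "Vandhana", "Guptha")
index.enrol(3, "Amit", "Nigam")
matches = index.query("Vandanaa", "Goopta", k=2)
print([pid for _, pid in matches])
```

In this sketch the three spellings of the first name fall into one bin, so the query only compares against the two records in that bin and skips the unrelated one. The loop over all bins is linear in the number of bins and is kept only for brevity; a real system would need a faster way to locate neighbouring bins.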


  • De-duplication
  • Demographic Data
  • Edit Distance
  • Levenshtein Distance
  • Phonetics





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Vandana Dixit Kaushik (1)
  • Amit Bendale (2)
  • Aditya Nigam (2)
  • Phalguni Gupta (2)

  1. Department of Computer Science & Engineering, Harcourt Butler Technological Institute, Kanpur, India
  2. Department of Computer Science & Engineering, Indian Institute of Technology Kanpur, Kanpur, India
