An Efficient Algorithm for De-duplication of Demographic Data
This paper proposes an efficient algorithm for de-duplicating demographic records that contain two name strings per individual, viz. GivenName and Surname. The algorithm consists of two stages: enrolment and de-duplication. In both stages, every name string is reduced to a generic name string by means of phonetic reduction rules. Several name strings may therefore share the same generic name, and many individuals may share the same name. A generic name, together with all the name strings that reduce to it and the Ids of the corresponding individuals, forms a bin. At the enrolment stage, the demographic database is built as an array of bins, where each bin is a singly linked list. At the de-duplication stage, the query name strings are reduced and all neighbouring bins of the reduced name strings are searched to determine the top k best matches. To evaluate the proposed algorithm, we have considered a large demographic database of 4,85,136 individuals. The phonetic reduction rules were observed to reduce the name strings of both fields by more than 90%. Experimental results show a very high hit rate at a low penetration rate.
Keywords: De-duplication, Demographic Data, Edit Distance, Levenshtein Distance, Phonetics
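The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation, and several parts are assumptions made only for illustration: the rules in `reduce_name` are hypothetical stand-ins for the paper's phonetic reduction rules, a dictionary of lists stands in for the array of singly linked bins, one index is kept per name field, a "neighbouring" bin is taken to be one whose generic name lies within a small Levenshtein distance (`radius`) of the reduced query, and candidates are ranked by the summed edit distance over both fields.

```python
from collections import defaultdict


def reduce_name(name: str) -> str:
    """Reduce a name to a generic form (hypothetical rules, not the paper's)."""
    s = "".join(ch for ch in name.lower() if ch.isalpha())
    for src, dst in (("ph", "f"), ("sh", "s"), ("ee", "i"), ("oo", "u")):
        s = s.replace(src, dst)
    if not s:
        return s
    out = [s[0]]  # always keep the first letter
    for ch in s[1:]:
        if ch in "aeiouyh":  # drop non-leading vowels and 'h'
            continue
        if ch != out[-1]:  # collapse repeated letters
            out.append(ch)
    return "".join(out)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class NameIndex:
    """Bins for one name field: generic name -> list of (name string, Id)."""

    def __init__(self):
        self.bins = defaultdict(list)

    def enrol(self, name: str, person_id: int) -> None:
        self.bins[reduce_name(name)].append((name, person_id))

    def candidates(self, name: str, radius: int = 1):
        """Yield entries from every bin within `radius` of the reduced query."""
        key = reduce_name(name)
        for generic, entries in self.bins.items():
            if levenshtein(key, generic) <= radius:
                yield from entries


def top_k_matches(given_index, surname_index, given, surname, k=3, radius=1):
    """Return up to k (score, Id) pairs, lowest summed edit distance first."""
    g = {pid: levenshtein(given.lower(), n.lower())
         for n, pid in given_index.candidates(given, radius)}
    s = {pid: levenshtein(surname.lower(), n.lower())
         for n, pid in surname_index.candidates(surname, radius)}
    scored = [(g[pid] + s[pid], pid) for pid in g.keys() & s.keys()]
    return sorted(scored)[:k]


# Enrolment stage on a toy database (names and Ids are made up).
given_index, surname_index = NameIndex(), NameIndex()
for pid, (gn, sn) in enumerate(
    [("Suresh", "Gupta"), ("Sooresh", "Goopta"), ("Ramesh", "Sharma")], start=1
):
    given_index.enrol(gn, pid)
    surname_index.enrol(sn, pid)

# De-duplication stage: look up a query record.
matches = top_k_matches(given_index, surname_index, "Suresh", "Gupta")
```

In this sketch only the entries in nearby bins are compared against the query with the full edit distance, which is the mechanism that keeps the fraction of the database examined per query small. Scanning all bin keys to find neighbours is a simplification; a real system would need a faster way to locate neighbouring bins.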