A Framework for Statistical Entity Identification in R

  • Conference paper

Abstract

Entity identification deals with matching records, from different datasets or within one dataset, that represent the same real-world entity when unique identifiers are not available. Because it enables data integration at the record level as well as the detection of duplicates, entity identification plays a major role in data preprocessing, especially with regard to data quality. This paper presents a framework for statistical entity identification, focusing in particular on probabilistic record linkage and string matching, and its implementation in R. Following the stages of the entity identification process, the framework is structured into seven core components: data preparation, candidate selection, comparison, scoring, classification, decision, and evaluation. Samples of real-world CRM datasets serve as illustrative examples.
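
To make the seven stages concrete, the following is a minimal sketch of how such a pipeline might be wired together. It is written in Python purely for illustration and is not the paper's R implementation. The toy records, the field names, the m- and u-probabilities, the similarity cutoff, and the classification thresholds are all assumptions chosen for this example. Scoring uses log-likelihood-ratio agreement weights in the style of probabilistic record linkage, and the "decision" stage is reduced here to keeping the pairs classified as links, which may be simpler than what the framework actually does.

```python
import math
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records: (id, name, city, true_entity). Not from the paper.
RECORDS = [
    (1, "Anna Meier", "Vienna", "A"),
    (2, "Ana  Meier", "vienna", "A"),
    (3, "Bernd Huber", "Graz", "B"),
    (4, "Bernd Hubber", "Graz", "B"),
    (5, "Anton Wolf", "Vienna", "C"),
]

# Assumed (m, u) probabilities per field: P(agree | match), P(agree | non-match).
M_U = {"name": (0.95, 0.05), "city": (0.90, 0.30)}


def prepare(value):
    """Data preparation: lowercase and collapse repeated whitespace."""
    return " ".join(value.lower().split())


def candidates(records):
    """Candidate selection: block on the first letter of the prepared name."""
    for a, b in combinations(records, 2):
        if prepare(a[1])[0] == prepare(b[1])[0]:
            yield a, b


def similar(x, y, cutoff=0.85):
    """String matching: similarity ratio of the prepared strings vs. a cutoff."""
    return SequenceMatcher(None, prepare(x), prepare(y)).ratio() >= cutoff


def compare(a, b):
    """Comparison: one binary agreement indicator per field."""
    return {"name": similar(a[1], b[1]), "city": similar(a[2], b[2])}


def score(agreement):
    """Scoring: sum of log2 likelihood-ratio weights over all fields."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_U[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total


def classify(weight, lower=0.0, upper=4.0):
    """Classification: two thresholds give link / possible link / non-link."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


def run_pipeline(records):
    """Decision (simplified): keep only the pairs classified as links."""
    links = set()
    for a, b in candidates(records):
        if classify(score(compare(a, b))) == "link":
            links.add((a[0], b[0]))
    return links


def evaluate(links, records):
    """Evaluation: precision and recall against the known true entities."""
    true = {(a[0], b[0]) for a, b in combinations(records, 2) if a[3] == b[3]}
    tp = len(links & true)
    precision = tp / len(links) if links else 0.0
    recall = tp / len(true) if true else 0.0
    return precision, recall


if __name__ == "__main__":
    found = run_pipeline(RECORDS)
    print(sorted(found), evaluate(found, RECORDS))
```

In this sketch, blocking reduces the ten possible record pairs to four candidates, and pairs that fall between the two thresholds would be labelled "possible link" and could be routed to manual review. In practice the m- and u-probabilities would be estimated from the data rather than fixed by hand.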




© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Denk, M. (2008). A Framework for Statistical Entity Identification in R. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78246-9_40
