A Framework for Statistical Entity Identification in R

Denk, Michaela

doi:10.1007/978-3-540-78246-9_40

Conference paper

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

Entity identification is the task of matching records, either across different datasets or within a single dataset, that represent the same real-world entity when unique identifiers are not available. Because it enables data integration at the record level as well as the detection of duplicates, entity identification plays a major role in data preprocessing, especially with regard to data quality. This paper presents a framework for statistical entity identification, focusing in particular on probabilistic record linkage and string matching, together with its implementation in R. Following the stages of the entity identification process, the framework is structured into seven core components: data preparation, candidate selection, comparison, scoring, classification, decision, and evaluation. Samples of real-world CRM datasets serve as illustrative examples.
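
The seven components named in the abstract can be pictured as a small pipeline. The following is only a minimal sketch under stated assumptions, not the paper's implementation: the paper's framework is written in R, whereas this illustration uses Python's standard library; the toy records, the blocking key, the similarity cut-off of 0.85, the m- and u-probabilities, and the two thresholds are all invented; and reading the decision stage as "keep the best-scoring match per record" is an assumption about what that stage covers. The scoring step uses Fellegi-Sunter style log-ratio weights, one common form of probabilistic record linkage.

```python
from difflib import SequenceMatcher
from itertools import product
from math import log2

# Toy records (id, name, postal code); all values are invented for illustration.
SOURCE_A = [(1, "Anna  Schmidt ", "1220"), (2, "Peter Huber", "1010")]
SOURCE_B = [(10, "anna schmid", "1220"), (11, "Petra Hofer", "1010"),
            (12, "Peter Huber", "8010")]
TRUE_LINKS = {(1, 10), (2, 12)}  # assumed ground truth, used for evaluation only

# Assumed m- and u-probabilities per field (not estimated from data).
M_U = {"name": (0.9, 0.05), "zip": (0.8, 0.1)}
UPPER, LOWER = 3.0, -1.0  # assumed classification thresholds


def prepare(record):
    """Data preparation: lower-case names and collapse whitespace."""
    rid, name, zip_code = record
    return {"id": rid, "name": " ".join(name.lower().split()), "zip": zip_code.strip()}


def select_candidates(recs_a, recs_b):
    """Candidate selection: block on the first letter of the name."""
    return [(a, b) for a, b in product(recs_a, recs_b)
            if a["name"][:1] == b["name"][:1]]


def compare(a, b):
    """Comparison: approximate string match on name, exact match on zip."""
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    return {"name": name_sim >= 0.85, "zip": a["zip"] == b["zip"]}


def score(agreement):
    """Scoring: sum of Fellegi-Sunter log2 agreement/disagreement weights."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_U[field]
        total += log2(m / u) if agrees else log2((1 - m) / (1 - u))
    return total


def classify(weight):
    """Classification: match, possible match, or non-match."""
    if weight >= UPPER:
        return "match"
    if weight <= LOWER:
        return "non-match"
    return "possible"


def decide(scored):
    """Decision: keep the best-scoring 'match' per record of source A."""
    best = {}
    for a_id, b_id, weight, label in scored:
        if label == "match" and (a_id not in best or weight > best[a_id][1]):
            best[a_id] = (b_id, weight)
    return {(a_id, b_id) for a_id, (b_id, _) in best.items()}


def evaluate(links, truth):
    """Evaluation: precision and recall against known true links."""
    hits = len(links & truth)
    precision = hits / len(links) if links else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


def run_pipeline():
    recs_a = [prepare(r) for r in SOURCE_A]
    recs_b = [prepare(r) for r in SOURCE_B]
    scored = []
    for a, b in select_candidates(recs_a, recs_b):
        weight = score(compare(a, b))
        scored.append((a["id"], b["id"], weight, classify(weight)))
    links = decide(scored)
    return scored, links, evaluate(links, TRUE_LINKS)


if __name__ == "__main__":
    scored, links, (precision, recall) = run_pipeline()
    for row in scored:
        print(row)
    print(links, precision, recall)
```

In this toy run, the pair with a near-identical name and the same postal code is accepted as a match, while the two remaining candidate pairs fall between the thresholds and would be left for manual review. In practice the m- and u-probabilities would be estimated from the data, for example with the EM algorithm, rather than fixed by hand.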

Author information

Authors and Affiliations

EC3 - E-Commerce Competence Center, Donau-City-Str. 1, 1220, Vienna, Austria
Michaela Denk

Editor information

Editors and Affiliations

Institute of Computer Science and Institute of Business Economics and Information Systems, University of Hildesheim, Marienburgerplatz 22, 31141, Hildesheim, Germany
Christine Preisach
Lehrstuhl für Mustererkennung und Bildverarbeitung, Universität Freiburg, Gebäude 052, 79110, Freiburg i. Br, Germany
Hans Burkhardt
Institute of Computer Science and Institute of Business Economics and Information Systems, Marienburgerplatz 22, 31141, Hildesheim, Germany
Lars Schmidt-Thieme
Fakultät für Wirtschaftswissenschaften, Lehrstuhl für Betriebswirtschaftslehre, insbes. Marketing, Universitätsstraße 25, 33615, Bielefeld, Germany
Reinhold Decker

Cite this paper

Denk, M. (2008). A Framework for Statistical Entity Identification in R. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78246-9_40

DOI: https://doi.org/10.1007/978-3-540-78246-9_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78239-1
Online ISBN: 978-3-540-78246-9
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
