Record Linkage in Medieval and Early Modern Text

Georgala, Kleanthi; van der Burgh, Benjamin; Meeng, Marvin; Knobbe, Arno

doi:10.1007/978-3-319-19884-2_9

Chapter
First Online: 01 January 2015

Abstract

This chapter covers the topic of record linkage in historic texts, specifically documents from the Middle Ages and Early Modern period. The challenge of record linkage, in general, is to analyze large collections of data recording people, with the aim of recognizing links between these people, and deciding whether multiple mentions of people actually refer to one and the same person. The typical record linkage application, for example, involving birth and marriage certificates, deals with well-structured descriptions of people in terms of their first and last name, date and place of birth, and so on. In historic texts, however, specifically the medieval ones, people are not identified systematically, and one has to include a lot of the context of the occurrences in order to decide whether two descriptions actually refer to the same historic person. Here, we report on two recent projects, ChartEx and Traces Through Time, related to these challenges. We have been developing automatic techniques for recognizing links between documents, and determining the confidence that we have in the correctness of these links, based on the evidence provided in the text. Much of the work deals with the varied nature of the evidence, specifically with the role of first and last name being much more limited in the periods involved. We thus had to include identifying properties such as titles, professions, provenances, and family relationships, to determine confident links. This chapter describes the probabilistic record linkage system that was developed for this task, and presents a number of experiments on artificial data to test the workings of the system. Finally, we present some insightful examples of matches that our system was able to find in the Medieval and Early Modern data.

Author information

Authors and Affiliations

Leiden Institute of Advanced Computer Science, Universiteit Leiden, Leiden, The Netherlands
Kleanthi Georgala, Benjamin van der Burgh, Marvin Meeng & Arno Knobbe
Universiteit Utrecht, Utrecht, The Netherlands
Arno Knobbe

Corresponding author

Correspondence to Arno Knobbe.

Editor information

Editors and Affiliations

Utrecht University, Utrecht, The Netherlands
Gerrit Bloothooft
The Australian National University, Canberra, Australian Capital Territory, Australia
Peter Christen
International Institute of Social History, Amsterdam, The Netherlands
Kees Mandemakers
Leiden University, Leiden, The Netherlands
Marijn Schraagen

About this chapter

Cite this chapter

Georgala, K., van der Burgh, B., Meeng, M., Knobbe, A. (2015). Record Linkage in Medieval and Early Modern Text. In: Bloothooft, G., Christen, P., Mandemakers, K., Schraagen, M. (eds) Population Reconstruction. Springer, Cham. https://doi.org/10.1007/978-3-319-19884-2_9

DOI: https://doi.org/10.1007/978-3-319-19884-2_9
Published: 23 July 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19883-5
Online ISBN: 978-3-319-19884-2
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)
