Record Linkage in Medieval and Early Modern Text

Abstract

This chapter covers record linkage in historical texts, specifically documents from the Middle Ages and the Early Modern period. In general, record linkage analyzes large collections of data about people in order to recognize links between them and to decide whether multiple mentions actually refer to one and the same person. The typical record linkage application, for example one involving birth and marriage certificates, deals with well-structured descriptions of people in terms of first and last name, date and place of birth, and so on. In historical texts, however, and in medieval ones in particular, people are not identified systematically, and much of the context of each occurrence has to be taken into account to decide whether two descriptions refer to the same historical person. Here, we report on two recent projects related to these challenges, ChartEx and Traces Through Time. We have been developing automatic techniques for recognizing links between documents and for determining our confidence in the correctness of these links, based on the evidence provided in the text. Much of the work deals with the varied nature of this evidence, and specifically with the much more limited role of first and last names in the periods involved. We therefore had to include identifying properties such as titles, professions, provenances, and family relationships in order to establish links with confidence. This chapter describes the probabilistic record linkage system developed for this task and presents a number of experiments on artificial data that test the workings of the system. Finally, we present some illustrative examples of matches that our system was able to find in the Medieval and Early Modern data.
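To make the idea of evidence-based link confidence concrete, the following is a minimal sketch of a generic Fellegi-Sunter-style scorer in Python. It is not the system described in the chapter, whose details are not given in this abstract. The attribute names follow the properties listed above, but the m/u probabilities, the edit-distance threshold, the prior, and the example records are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical per-attribute probabilities (illustrative only, not from the chapter):
# m = P(attribute agrees | same person), u = P(attribute agrees | different people).
M_U = {
    "forename": (0.90, 0.10),
    "surname": (0.85, 0.02),
    "title": (0.80, 0.05),
    "profession": (0.75, 0.05),
    "provenance": (0.80, 0.03),
}


def levenshtein(a, b):
    """Plain Levenshtein edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def agrees(x, y, max_edits=1):
    """Treat two values as agreeing if they differ by at most max_edits (assumed threshold)."""
    return levenshtein(x.lower(), y.lower()) <= max_edits


def match_weight(rec_a, rec_b):
    """Sum of log2 likelihood ratios over attributes present in both records."""
    total = 0.0
    for attr, (m, u) in M_U.items():
        va, vb = rec_a.get(attr), rec_b.get(attr)
        if va is None or vb is None:
            continue  # missing evidence contributes nothing either way
        if agrees(va, vb):
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def match_probability(weight, prior=0.01):
    """Turn a match weight into a posterior probability under an assumed prior."""
    prior_odds = prior / (1 - prior)
    odds = prior_odds * 2 ** weight
    return odds / (1 + odds)


if __name__ == "__main__":
    # Made-up mentions, not taken from the ChartEx or Traces Through Time data.
    a = {"forename": "Thomas", "surname": "de Burgh", "profession": "clerk"}
    b = {"forename": "Thomas", "surname": "de Burg", "profession": "clerk"}
    c = {"forename": "John", "surname": "Smith", "profession": "clerk"}
    for other in (b, c):
        w = match_weight(a, other)
        print(other["forename"], other["surname"], round(w, 2), round(match_probability(w), 3))
```

Skipping attributes that are absent from either record reflects the situation described above, where mentions rarely carry the same set of properties. A real system would also need to estimate the m and u values from data rather than fix them by hand.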

Correspondence to Arno Knobbe.

© 2015 Springer International Publishing Switzerland

Cite this chapter

Georgala, K., van der Burgh, B., Meeng, M., Knobbe, A. (2015). Record Linkage in Medieval and Early Modern Text. In: Bloothooft, G., Christen, P., Mandemakers, K., Schraagen, M. (eds) Population Reconstruction. Springer, Cham. https://doi.org/10.1007/978-3-319-19884-2_9
