Abstract
This chapter covers record linkage in historical texts, specifically documents from the Middle Ages and the Early Modern period. The general challenge of record linkage is to analyze large collections of data that record people, with the aim of recognizing links between them and deciding whether multiple mentions actually refer to one and the same person. The typical record linkage application, for example one involving birth and marriage certificates, deals with well-structured descriptions of people in terms of first and last name, date and place of birth, and so on. In historical texts, however, and in medieval ones in particular, people are not identified systematically, and much of the context of each occurrence has to be taken into account to decide whether two descriptions refer to the same historical person. Here, we report on two recent projects related to these challenges, ChartEx and Traces Through Time. We have been developing automatic techniques for recognizing links between documents and for determining the confidence we have in the correctness of these links, based on the evidence provided in the text. Much of the work deals with the varied nature of that evidence, specifically with the much more limited role of first and last names in the periods involved. We therefore had to include identifying properties such as titles, professions, provenances, and family relationships in order to establish confident links. This chapter describes the probabilistic record linkage system that was developed for this task and presents a number of experiments on artificial data to test the workings of the system. Finally, we present some insightful examples of matches that our system was able to find in the Medieval and Early Modern data.
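To make the idea of combining varied evidence into a link confidence more concrete, here is a minimal sketch of a generic probabilistic scoring scheme in the Fellegi-Sunter style. It is not the system described in the chapter: the attribute names, the m/u probabilities, the prior, and the functions `match_score` and `match_probability` are all assumptions chosen for illustration.

```python
import math

# Hypothetical per-attribute probabilities (illustrative values only):
#   m = P(attribute agrees | mentions refer to the same person)
#   u = P(attribute agrees | mentions refer to different people)
WEIGHTS = {
    "first_name": (0.90, 0.20),
    "surname": (0.85, 0.05),
    "title": (0.70, 0.10),
    "profession": (0.75, 0.08),
    "provenance": (0.80, 0.15),
}


def match_score(a, b):
    """Sum log-likelihood ratios over attributes present in both mentions."""
    score = 0.0
    for attr, (m, u) in WEIGHTS.items():
        if a.get(attr) is None or b.get(attr) is None:
            continue  # missing evidence contributes nothing either way
        if a[attr] == b[attr]:
            score += math.log(m / u)  # agreement raises the score
        else:
            score += math.log((1 - m) / (1 - u))  # disagreement lowers it
    return score


def match_probability(score, prior=0.01):
    """Turn a log-likelihood ratio into a posterior, given an assumed prior."""
    prior_odds = prior / (1 - prior)
    odds = prior_odds * math.exp(score)
    return odds / (1 + odds)


# Two invented mentions that share a first name, a profession and a provenance.
mention_a = {"first_name": "Thomas", "profession": "smith", "provenance": "York"}
mention_b = {"first_name": "Thomas", "profession": "smith", "provenance": "York"}
confidence = match_probability(match_score(mention_a, mention_b))
```

In this sketch a shared first name alone yields only a small positive score, while agreement on profession and provenance adds further weight, which mirrors the point that names by themselves are weak evidence in these periods. Exact string equality is used only for brevity; a real system would need approximate string comparison to cope with spelling variation.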
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Georgala, K., van der Burgh, B., Meeng, M., Knobbe, A. (2015). Record Linkage in Medieval and Early Modern Text. In: Bloothooft, G., Christen, P., Mandemakers, K., Schraagen, M. (eds) Population Reconstruction. Springer, Cham. https://doi.org/10.1007/978-3-319-19884-2_9
DOI: https://doi.org/10.1007/978-3-319-19884-2_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19883-5
Online ISBN: 978-3-319-19884-2
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)