Multi-Source Entity Resolution for Genealogical Data

Efremova, Julia; Ranjbar-Sahraei, Bijan; Rahmani, Hossein; Oliehoek, Frans A.; Calders, Toon; Tuyls, Karl; Weiss, Gerhard

doi:10.1007/978-3-319-19884-2_7

Abstract

In this chapter, we study the application of existing entity resolution (ER) techniques to a real-world multi-source genealogical dataset. Our goal is to identify all persons involved in various notary acts and link them to their birth, marriage, and death certificates. We analyze the influence of additional ER features, such as name popularity, geographical distance, and co-reference information, on overall ER performance. We study two prediction models: regression trees and logistic regression. To evaluate the performance of the applied algorithms and to obtain a training set for learning the models, we developed an interactive interface for collecting feedback from human experts. We perform an empirical evaluation on the manually annotated dataset in terms of precision, recall, and F-score. We show that using name popularity and geographical distance together with co-reference information significantly improves ER results.
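The features and evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the record layout (`name`, `lat`, `lon`), the name-count table, and all function names are hypothetical, the string similarity uses `difflib` as a stand-in for whatever name-matching measure the authors applied, and co-reference information is left out. The feature vector it builds is the kind of input a logistic regression or regression tree model could be trained on.

```python
import math
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity in [0, 1] between two names (difflib ratio, a stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def name_popularity(name, counts):
    """Relative frequency of a name in a hypothetical name-count table."""
    total = sum(counts.values())
    return counts.get(name.lower(), 0) / total if total else 0.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two coordinates."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(h))


def pair_features(rec_a, rec_b, counts):
    """Feature vector for one candidate pair of person references."""
    return [
        name_similarity(rec_a["name"], rec_b["name"]),
        name_popularity(rec_a["name"], counts),
        haversine_km(rec_a["lat"], rec_a["lon"], rec_b["lat"], rec_b["lon"]),
    ]


def precision_recall_f1(predicted, gold):
    """Precision, recall, and F-score of predicted matches against annotated ones."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative values only; not taken from the dataset.
    counts = {"jan": 3, "piet": 1}
    a = {"name": "Jan", "lat": 51.69, "lon": 5.30}
    b = {"name": "Jan", "lat": 51.44, "lon": 5.47}
    print(pair_features(a, b, counts))
    print(precision_recall_f1({(1, 2), (3, 4), (5, 6)},
                              {(1, 2), (3, 4), (7, 8), (9, 10)}))
```

In this sketch a match is a pair of record identifiers, so precision and recall are computed over sets of pairs; a rarer name or a shorter distance between places would be expected to give a classifier stronger evidence for a match.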

Notes

1. http://www.bhic.nl/ (the BHIC website is available in Dutch only).
2. ‘zoon van’ is the Dutch term for ‘son of’.
3. http://www.meertens.knaw.nl/nvb/
4. ‘onbekend’ is the Dutch term for ‘unknown’.
5. ‘niet vermeld’ is the Dutch term for ‘not mentioned’.
6. ‘zoon van’ and ‘zoontje van’ are Dutch terms for ‘son of’.
7. http://www.iisg.nl/hsn/data/place-names.html
8. https://developers.google.com/maps/documentation/geocoding/
9. http://lucene.apache.org/solr/

Acknowledgements

The authors are grateful to the BHIC Center for its support in data gathering, data analysis, and direction. In particular, we would like to thank Rien Wols and Anton Schuttelaars, whose efforts were instrumental to this research and whose patience and support appeared infinite. This research was carried out under the Mining Social Structures from Genealogical Data project (project no. 640.005.003), part of the CATCH program funded by the Netherlands Organization for Scientific Research (NWO).

Author information

Authors and Affiliations

Eindhoven University of Technology, Eindhoven, The Netherlands
Julia Efremova & Toon Calders
Maastricht University, Maastricht, The Netherlands
Bijan Ranjbar-Sahraei, Hossein Rahmani & Gerhard Weiss
University of Amsterdam, Amsterdam, The Netherlands
Frans A. Oliehoek
Université Libre de Bruxelles, Brussels, Belgium
Toon Calders
University of Liverpool, Liverpool, UK
Frans A. Oliehoek & Karl Tuyls

Corresponding author

Correspondence to Julia Efremova.

Editor information

Editors and Affiliations

Utrecht University, Utrecht, The Netherlands
Gerrit Bloothooft
The Australian National University, Canberra, Australian Capital Territory, Australia
Peter Christen
International Institute of Social History, Amsterdam, The Netherlands
Kees Mandemakers
Leiden University, Leiden, The Netherlands
Marijn Schraagen

About this chapter

Cite this chapter

Efremova, J. et al. (2015). Multi-Source Entity Resolution for Genealogical Data. In: Bloothooft, G., Christen, P., Mandemakers, K., Schraagen, M. (eds) Population Reconstruction. Springer, Cham. https://doi.org/10.1007/978-3-319-19884-2_7

DOI: https://doi.org/10.1007/978-3-319-19884-2_7
Published: 23 July 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19883-5
Online ISBN: 978-3-319-19884-2
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)
