Multi-Source Entity Resolution for Genealogical Data

Population Reconstruction

Abstract

In this chapter, we study the application of existing entity resolution (ER) techniques to a real-world multi-source genealogical dataset. Our goal is to identify all persons involved in various notary acts and link them to their birth, marriage, and death certificates. We analyze the influence of additional ER features, such as name popularity, geographical distance, and co-reference information, on overall ER performance. We study two prediction models: regression trees and logistic regression. To evaluate the performance of the applied algorithms and to obtain a training set for learning the models, we developed an interactive interface for collecting feedback from human experts. We perform an empirical evaluation on the manually annotated dataset in terms of precision, recall, and F-score. We show that using name popularity and geographical distance together with co-reference information significantly improves ER results.
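As an illustration of two ingredients named in the abstract, the following Python sketch computes a geographical distance feature for a pair of place coordinates and scores a set of predicted links against manually annotated links. It is a minimal sketch under stated assumptions, not the chapter's implementation: the haversine formula stands in for whatever distance measure the authors used, the F-score is assumed to be the balanced F1, and the function names and record identifiers are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points.

    Assumption: the haversine formula is used here for illustration only;
    the chapter does not state which distance measure it applies.
    """
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def precision_recall_f1(predicted_links, annotated_links):
    """Score predicted record links against expert-annotated links.

    Each link is a hashable pair of record identifiers (hypothetical format).
    Assumption: "F-score" is taken to mean the balanced F1 measure.
    """
    predicted = set(predicted_links)
    annotated = set(annotated_links)
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical record identifiers: notary-act mentions linked to certificates.
    predicted = {("act1", "birth7"), ("act2", "marriage3"), ("act3", "death9")}
    annotated = {("act1", "birth7"), ("act2", "marriage3"),
                 ("act4", "birth2"), ("act5", "death1")}
    print(precision_recall_f1(predicted, annotated))
```

In a pairwise ER setting, a distance such as `haversine_km` would typically be one column in the feature vector passed to a classifier, alongside name similarity and name popularity, while `precision_recall_f1` would be applied to the links the classifier accepts.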

Notes

  1. http://www.bhic.nl/, the website of BHIC is available in Dutch only.
  2. ‘zoon van’ is the Dutch term for ‘son of’.
  3. http://www.meertens.knaw.nl/nvb/
  4. ‘onbekend’ is the Dutch term for ‘unknown’.
  5. ‘niet vermeld’ is the Dutch term for ‘not mentioned’.
  6. ‘zoon van’ and ‘zoontje van’ are Dutch terms for ‘son of’.
  7. http://www.iisg.nl/hsn/data/place-names.html
  8. https://developers.google.com/maps/documentation/geocoding/
  9. http://lucene.apache.org/solr/

Acknowledgements

The authors are grateful to the BHIC Center for its support in data gathering, data analysis, and direction. In particular, we would like to thank Rien Wols and Anton Schuttelaars, whose efforts were instrumental to this research and whose patience and support appeared infinite. This research was carried out under the Mining Social Structures from Genealogical Data project (project no. 640.005.003), part of the CATCH program funded by the Netherlands Organization for Scientific Research (NWO).

Author information

Corresponding author

Correspondence to Julia Efremova.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Efremova, J. et al. (2015). Multi-Source Entity Resolution for Genealogical Data. In: Bloothooft, G., Christen, P., Mandemakers, K., Schraagen, M. (eds) Population Reconstruction. Springer, Cham. https://doi.org/10.1007/978-3-319-19884-2_7
