Abstract
In this chapter, we study the application of existing entity resolution (ER) techniques to a real-world multi-source genealogical dataset. Our goal is to identify all persons involved in various notary acts and link them to their birth, marriage, and death certificates. We analyze the influence of additional ER features, such as name popularity, geographical distance, and co-reference information, on the overall ER performance. We study two prediction models: regression trees and logistic regression. To evaluate the performance of the applied algorithms and to obtain a training set for learning the models, we developed an interactive interface for collecting feedback from human experts. We perform an empirical evaluation on the manually annotated dataset in terms of precision, recall, and F-score. We show that using name popularity and geographical distance together with co-reference information significantly improves ER results.
Notes
- 1. http://www.bhic.nl/; the BHIC website is available in Dutch only.
- 2. ‘zoon van’ is the Dutch term for ‘son of’.
- 3.
- 4. ‘onbekend’ is the Dutch term for ‘unknown’.
- 5. ‘niet vermeld’ is the Dutch term for ‘not mentioned’.
- 6. ‘zoon van’ and ‘zoontje van’ are Dutch terms for ‘son of’.
- 7.
- 8.
- 9.
Acknowledgements
The authors are grateful to the BHIC Center for its support in data gathering, data analysis, and direction. In particular, we would like to thank Rien Wols and Anton Schuttelaars, whose efforts were instrumental to this research and whose patience and support appeared infinite. This research was carried out under the Mining Social Structures from Genealogical Data project (project no. 640.005.003), part of the CATCH program funded by the Netherlands Organization for Scientific Research (NWO).
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Efremova, J. et al. (2015). Multi-Source Entity Resolution for Genealogical Data. In: Bloothooft, G., Christen, P., Mandemakers, K., Schraagen, M. (eds) Population Reconstruction. Springer, Cham. https://doi.org/10.1007/978-3-319-19884-2_7
DOI: https://doi.org/10.1007/978-3-319-19884-2_7
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19883-5
Online ISBN: 978-3-319-19884-2
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)