Abstract
There is rising demand for the retrieval of genealogical information from semi-structured books. Nuggets of personal interest are currently transcribed piecemeal by volunteers. GreenBook is an effective alternative for recovering most of the desired information from any of the hundreds of thousands of books of ancestral records that have already been scanned and digitized. It minimizes human intervention by letting the user benefit from statistics compiled automatically from the book and by letting computer search benefit from the user’s insights. GreenBook combines enhanced template matching with spreadsheet-based interaction for the rapid specification of the text to be extracted. The accuracy and completeness of the extracted information are limited only by the user’s stamina. About 3 hours of user interaction yielded 99% precision and 97% recall on books of Scottish birth and marriage records, Ohio funeral home records, and a family history spanning 300 years. The system is designed to facilitate the transition to new books.
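To make the reported figures concrete, the following minimal sketch shows how precision and recall can be computed for template-based field extraction. It is an illustration under stated assumptions, not GreenBook's implementation: the record layout, the regular expression `TEMPLATE`, the sample lines, and the helper names `extract` and `precision_recall` are all hypothetical, and a plain regular expression stands in for the enhanced template matching described above.

```python
import re

# Hypothetical template for lines such as "Smith, John, born 12 Mar 1701".
# The field layout is an assumption for illustration, not taken from the books.
TEMPLATE = re.compile(
    r"^(?P<surname>[A-Z][a-z]+), (?P<given>[A-Z][a-z]+), "
    r"born (?P<date>\d{1,2} [A-Z][a-z]{2} \d{4})$"
)

def extract(lines):
    """Return the set of (line_index, field, value) triples the template matches."""
    found = set()
    for i, line in enumerate(lines):
        m = TEMPLATE.match(line.strip())
        if m:
            for field, value in m.groupdict().items():
                found.add((i, field, value))
    return found

def precision_recall(extracted, truth):
    """Precision = correct / extracted; recall = correct / ground truth."""
    correct = len(extracted & truth)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall

# Invented sample records; the third uses an abbreviation the template misses.
lines = [
    "Smith, John, born 12 Mar 1701",
    "Brown, Mary, born 3 Jan 1699",
    "Allan, Jas., born 7 Oct 1710",
]
truth = {
    (0, "surname", "Smith"), (0, "given", "John"), (0, "date", "12 Mar 1701"),
    (1, "surname", "Brown"), (1, "given", "Mary"), (1, "date", "3 Jan 1699"),
    (2, "surname", "Allan"), (2, "given", "Jas."), (2, "date", "7 Oct 1710"),
}

extracted = extract(lines)
precision, recall = precision_recall(extracted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every extracted field is correct, but the abbreviated given name goes unmatched, so recall falls below precision. A user who notices such misses could add or relax a template to recover them, which is the kind of interaction the abstract credits for the trade-off between effort and completeness.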
Acknowledgements
My friendship and collaboration with Professor (now Emeritus) David W. Embley of Brigham Young University dates back so many decades that I can no longer tell which ideas were his and which were mine. This work would not have been possible without his and his team’s essential contributions.
Ethics declarations
Conflict of interest
The author declared that he has no conflict of interest.
About this article
Cite this article
Nagy, G. Green Information Extraction from Family Books. SN COMPUT. SCI. 1, 23 (2020). https://doi.org/10.1007/s42979-019-0024-x
DOI: https://doi.org/10.1007/s42979-019-0024-x