Green Information Extraction from Family Books

  • Original Research
SN Computer Science

Abstract

There is rising demand for the retrieval of genealogical information from semi-structured books. Nuggets of personal interest are currently transcribed piecemeal by volunteers. GreenBook is an effective alternative for recovering most of the desired information from any of the hundreds of thousands of books of ancestral records that have already been scanned and digitized. It minimizes human intervention by letting the user benefit from automatically compiled statistics from the book, and lets computer search benefit from user insights. GreenBook combines enhanced template matching with spreadsheet-based interaction for the rapid specification of the text to be extracted. The accuracy and completeness of the extracted information are limited only by the user’s stamina. About 3 h of user interaction yielded 99% precision and 97% recall on books of Scottish birth and marriage records, Ohio funeral home records, and a family history spanning 300 years. The system is designed to facilitate transition to new books.
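
The precision and recall figures quoted above follow the usual information-retrieval definitions: precision is the fraction of extracted items that are correct, and recall is the fraction of ground-truth items that were extracted. The sketch below is only an illustration under stated assumptions. The function name `precision_recall`, the tuple layout, and the sample records are hypothetical and do not come from GreenBook or from the books evaluated in the article.

```python
def precision_recall(extracted, truth):
    """Return (precision, recall) for two collections of extracted fields.

    Items are compared by exact equality; this is an assumption made for
    the illustration, not necessarily the matching rule used in the article.
    """
    extracted, truth = set(extracted), set(truth)
    true_positives = len(extracted & truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical ground truth: (person, event, year) facts present in a book.
truth = {
    ("Person A", "birth", "1702"),
    ("Person B", "marriage", "1725"),
    ("Person C", "birth", "1730"),
    ("Person D", "birth", "1735"),
    ("Person E", "marriage", "1751"),
}

# Hypothetical extractor output: three correct facts and one wrong year.
extracted = {
    ("Person A", "birth", "1702"),
    ("Person B", "marriage", "1725"),
    ("Person C", "birth", "1730"),
    ("Person D", "birth", "1753"),
}

precision, recall = precision_recall(extracted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case three of the four extracted facts are correct (precision 0.75) and three of the five true facts were found (recall 0.60).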

Acknowledgements

My friendship and collaboration with Professor (now Emeritus) David W. Embley of Brigham Young University date back so many decades that I can no longer tell which ideas were his and which were mine. This work would not have been possible without the essential contributions of Professor Embley and his team.

Author information

Corresponding author

Correspondence to George Nagy.

Ethics declarations

Conflict of interest

The author declares that he has no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Nagy, G. Green Information Extraction from Family Books. SN COMPUT. SCI. 1, 23 (2020). https://doi.org/10.1007/s42979-019-0024-x

  • DOI: https://doi.org/10.1007/s42979-019-0024-x
