Green Information Extraction from Family Books

Nagy, George

doi:10.1007/s42979-019-0024-x

Original Research
Published: 16 September 2019

Volume 1, article number 23, (2020)

SN Computer Science

George Nagy ORCID: orcid.org/0000-0002-0521-1443¹

Abstract

There is rising demand for the retrieval of genealogical information from semi-structured books. Nuggets of personal interest are currently transcribed piecemeal by volunteers. GreenBook is an effective alternative for recovering most of the desired information from any of the hundreds of thousands of books of ancestral records that have already been scanned and digitized. It minimizes human intervention by letting the user benefit from automatically compiled statistics from the book, and lets computer search benefit from user insights. GreenBook combines enhanced template matching with spreadsheet-based interaction for the rapid specification of the text to be extracted. The accuracy and completeness of the extracted information are limited only by the user’s stamina. About 3 h of user interaction yielded 99% precision and 97% recall on books of Scottish birth and marriage records, Ohio funeral home records, and a family history spanning 300 years. The system is designed to facilitate transition to new books.
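To make the reported precision and recall figures concrete, the sketch below matches one hand-written template against a few record lines and scores the result against a reference set. It is a minimal illustration, not GreenBook's method: the record lines, the `TEMPLATE` pattern, the function names, and the ground-truth set are all hypothetical, and a plain regular expression stands in for the enhanced template matching and spreadsheet-based interaction described above.

```python
import re

# Hypothetical record lines, loosely styled after a parish index.
# They are illustrative only and are not taken from the books in the study.
LINES = [
    "Adam, John, m. Mary Smith 12 Jan. 1701",
    "Aiken, James, m. Jean Orr 3 Mar. 1710",
    "Barr, William, m. 1715 Agnes Love",  # date precedes spouse: template misses it
]

# One hand-written template: surname, given name, marriage marker, spouse, date.
TEMPLATE = re.compile(
    r"^(?P<surname>\w+), (?P<given>\w+), m\. "
    r"(?P<spouse>\w+ \w+) (?P<date>\d{1,2} \w+\.? \d{4})$"
)


def extract(lines):
    """Return the set of (person, spouse, date) facts matched by TEMPLATE."""
    facts = set()
    for line in lines:
        m = TEMPLATE.match(line)
        if m:
            person = m["given"] + " " + m["surname"]
            facts.add((person, m["spouse"], m["date"]))
    return facts


def precision_recall(extracted, truth):
    """Precision = correct / extracted; recall = correct / expected."""
    correct = len(extracted & truth)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical ground truth, as a human transcriber might record it.
TRUTH = {
    ("John Adam", "Mary Smith", "12 Jan. 1701"),
    ("James Aiken", "Jean Orr", "3 Mar. 1710"),
    ("William Barr", "Agnes Love", "1715"),
}

extracted = extract(LINES)
precision, recall = precision_recall(extracted, TRUTH)
```

In this toy case every extracted fact is correct, so precision is 1.0, while the third record is missed, so recall is 2/3. Adding a second template for the date-first layout would raise recall, which is the kind of incremental refinement that trades user effort for completeness.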

Acknowledgements

My friendship and collaboration with Professor (now Emeritus) David W. Embley of Brigham Young University dates back so many decades that I can no longer tell which ideas were his and which were mine. This work would not have been possible without his and his team’s essential contributions.

Author information

Authors and Affiliations

Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY, 12180, USA
George Nagy

Corresponding author

Correspondence to George Nagy.

Ethics declarations

Conflict of interest

The author declared that he has no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Nagy, G. Green Information Extraction from Family Books. SN COMPUT. SCI. 1, 23 (2020). https://doi.org/10.1007/s42979-019-0024-x

Received: 07 June 2019
Accepted: 09 September 2019
Published: 16 September 2019
DOI: https://doi.org/10.1007/s42979-019-0024-x
