Abstract
Historical records of daily activities provide intriguing insights into the lives of our ancestors and are useful for demographic and genealogical research. For example, marriage license books have been used for centuries by ecclesiastical and secular institutions to register marriages. The records in these books follow a simple textual structure, but their vocabulary evolves: it is composed mainly of proper names that change over time. This distinctive vocabulary makes automatic transcription and semantic information extraction difficult tasks. In previous work we studied the use of category-based language models and how a Grammatical Inference technique known as MGGI could improve the accuracy of these tasks. In this work we analyze the main causes of the semantic errors observed in previous results and apply a better implementation of the MGGI technique to solve these problems. Using the resulting language model, we carried out transcription and information extraction experiments, and the results support our proposed approach.
Notes
1. It is publicly available at: http://dag.cvc.uab.es/the-esposalles-database/.
References
Eilenberg, S.: Automata, Languages, and Machines, vol. 1. Academic Press, Orlando (1974)
Garcia, P., Vidal, E., Casacuberta, F.: Local languages, the successor method, and a step towards a general methodology for the inference of regular grammars. IEEE Trans. PAMI 6, 841–845 (1987)
Graves, A., Schmidhuber, J.: Offline handwriting recognition with multidimensional recurrent neural networks. In: NIPS, pp. 545–552 (2008)
Jelinek, F.: Statistical Methods for Speech Recognition. MIT Press, Cambridge (1998)
Marti, U.-V., Bunke, H.: Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system. IJPRAI 15(1), 65–90 (2001)
Niesler, T., Woodland, P.: A variable-length category-based n-gram language model. In: Proceedings of ICASSP 1996, vol. 1, pp. 164–167, May 1996
Romero, V., Fornés, A., Serrano, N., Sánchez, J.A., Toselli, A., Frinken, V., Vidal, E., Lladós, J.: The ESPOSALLES database: an ancient marriage license corpus for off-line handwriting recognition. Pattern Recogn. 46, 1658–1669 (2013)
Romero, V., Sánchez, J.A.: Category-based language models for handwriting recognition of marriage license books. In: Proceedings of ICDAR 2013, pp. 788–792 (2013)
Toselli, A.H., Juan, A., Keysers, D., González, J., Salvador, I., Ney, H., Vidal, E., Casacuberta, F.: Integrated handwriting recognition and interpretation using finite-state models. IJPRAI 18(4), 519–539 (2004)
Romero, V., Fornés, A., Vidal, E., Sánchez, J.A.: Using the MGGI methodology for category-based language modeling in handwritten marriage licenses books. In: ICFHR, Shenzhen, China (2016)
Vidal, E., Llorens, D.: Using knowledge to improve N-gram language modelling through the MGGI methodology. In: Miclet, L., Higuera, C. (eds.) ICGI 1996. LNCS, vol. 1147, pp. 179–190. Springer, Heidelberg (1996). doi:10.1007/BFb0033353
Vidal, E., Thollard, F., De La Higuera, C., Casacuberta, F., Carrasco, R.C.: Probabilistic finite-state machines-part I. IEEE Trans. PAMI 27(7), 1013–1025 (2005)
Vidal, E., Thollard, F., De La Higuera, C., Casacuberta, F., Carrasco, R.C.: Probabilistic finite-state machines-part II. IEEE Trans. PAMI 27(7), 1026–1039 (2005)
Acknowledgment
This work has been partially supported through the European Union’s H2020 grant READ (Recognition and Enrichment of Archival Documents) (Ref: 674943), the European project ERC-2010-AdG-20100407-269796, the MINECO/FEDER, UE projects TIN2015-70924-C2-1-R and TIN2015-70924-C2-2-R, and the Ramon y Cajal Fellowship RYC-2014-16831.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Romero, V., Fornés, A., Vidal, E., Sánchez, J.A. (2017). Information Extraction in Handwritten Marriage Licenses Books Using the MGGI Methodology. In: Alexandre, L., Salvador Sánchez, J., Rodrigues, J. (eds) Pattern Recognition and Image Analysis. IbPRIA 2017. Lecture Notes in Computer Science, vol 10255. Springer, Cham. https://doi.org/10.1007/978-3-319-58838-4_32
DOI: https://doi.org/10.1007/978-3-319-58838-4_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58837-7
Online ISBN: 978-3-319-58838-4
eBook Packages: Computer Science, Computer Science (R0)