Language Resources and Evaluation, Volume 52, Issue 1, pp 1–28

Creation of an annotated corpus of Old and Middle Hungarian court records and private correspondence

  • Attila Novák
  • Katalin Gugán
  • Mónika Varga
  • Adrienne Dömötör
Original Paper

Abstract

The paper introduces a novel annotated corpus of Old and Middle Hungarian (16th–18th centuries), the texts of which were selected to approximate the vernacular of the given historical periods as closely as possible. The corpus consists of testimonies of witnesses in trials and samples of private correspondence. In addition to the morphological analysis of the texts, each file contains metadata that facilitate sociolinguistic research. The texts were segmented into clauses, manually normalized, and morphosyntactically annotated using an annotation system consisting of the PurePos PoS tagger and the Hungarian morphological analyzer HuMor, which was originally developed for Modern Hungarian but was adapted to analyze Old and Middle Hungarian morphological constructions. The automatically disambiguated morphological annotation was manually checked and corrected using an easy-to-use web-based manual disambiguation interface. The normalization process and the manual validation of the annotation required extensive teamwork and provided continuous feedback for the refinement of the computational morphology and the iterative retraining of the statistical models of the tagger. The paper discusses some of the typical problems that occurred during normalization and their tentative solutions. We also describe the automatic annotation tools, the process of semi-automatic disambiguation, and the query interface, a special function of which also makes it possible to correct the annotation. The beta version of this first fully normalized and annotated historical corpus of Hungarian, which displays the original, the normalized, and the parsed versions of the selected texts, is freely accessible at http://tmk.nytud.hu/.

Keywords

Historical corpus, Corpus annotation, Morphological analysis, PoS tagging, Middle Hungarian, Old Hungarian, Corpus query tool

Notes

Acknowledgements

The project "Morphologically analysed corpus of Old and Middle Hungarian texts representative of informal language use" was funded by the Hungarian Scientific Research Fund (OTKA), Project Grant No. OTKA 81189. The project participants are mainly historical linguists working at the Department of Finno-Ugric and Historical Linguistics of the Research Institute for Linguistics of the Hungarian Academy of Sciences, but the OTKA funding also made it possible to involve MA and doctoral students and secured the participation of the team's computational linguist. The project greatly benefited from regular consultations with experts in etymology (László Horváth) and historical syntax (Lea Haader). The follow-up project "Competing structures in the Middle Hungarian vernacular: a variationist approach" has been funded by the Hungarian Scientific Research Fund, Project Grant No. OTKA K 116217.

Copyright information

© Springer Science+Business Media B.V. 2017

Authors and Affiliations

  • Attila Novák (1, 2)
  • Katalin Gugán (3)
  • Mónika Varga (3)
  • Adrienne Dömötör (3)

  1. MTA-PPKE Hungarian Language Technology Research Group, Budapest, Hungary
  2. Pázmány Péter Catholic University, Faculty of Information Technology and Bionics, Budapest, Hungary
  3. Research Institute for Linguistics of the Hungarian Academy of Sciences, Budapest, Hungary
