Applying Rule-Based Normalization to Different Types of Historical Texts—An Evaluation

Bollmann, Marcel; Petran, Florian; Dipper, Stefanie

doi:10.1007/978-3-319-08958-4_14

Conference paper
First Online: 01 January 2014

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8387)

Abstract

This paper deals with normalization of language data from Early New High German. We describe an unsupervised, rule-based approach which maps historical wordforms to modern wordforms. Rules are specified in the form of context-aware rewrite rules that apply to sequences of characters. They are derived from two aligned versions of the Luther bible and weighted according to their frequency. Applying the normalization rules to texts by Luther results in 91 % exact matches, clearly outperforming the baseline (65 %). Matches can be improved to 93 % by combining the approach with a word substitution list. If applied to more diverse language data from roughly the same period, performance goes down to 43 % exact matches (baseline: 35 %), and to 46 % using the combined method. The results show that rules derived from a highly different type of text can support normalization to a certain extent.
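
The rule-based idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the rule shape (one character with one character of left and right context), the `#` word-boundary marker, the restriction to same-length word pairs (which sidesteps the alignment step), the toy word pairs, and all function names are assumptions made for illustration. The sketch counts how often each contextual rewrite occurs in training pairs, applies the most frequent rewrite per context, optionally consults a word substitution list first, and scores the output by exact matches.

```python
from collections import Counter, defaultdict

def learn_rules(pairs):
    """Count context-aware character rules: (left, source, right) -> target."""
    rules = defaultdict(Counter)
    for hist, modern in pairs:
        if len(hist) != len(modern):
            continue  # toy simplification: skip pairs that need real alignment
        padded = f"#{hist}#"  # '#' marks word boundaries (assumed convention)
        for i, target in enumerate(modern):
            context = (padded[i], padded[i + 1], padded[i + 2])
            rules[context][target] += 1  # frequency serves as the rule weight
    return rules

def apply_rules(word, rules):
    """Rewrite each character with the most frequent rule for its context."""
    padded = f"#{word}#"
    out = []
    for i in range(len(word)):
        context = (padded[i], padded[i + 1], padded[i + 2])
        if context in rules:
            out.append(rules[context].most_common(1)[0][0])
        else:
            out.append(word[i])  # unseen context: keep the character as-is
    return "".join(out)

def normalize(word, rules, wordlist=None):
    """Combined method: word substitution list first, then character rules."""
    if wordlist and word in wordlist:
        return wordlist[word]
    return apply_rules(word, rules)

def exact_match(pairs, rules, wordlist=None):
    """Fraction of words whose normalization equals the modern form."""
    hits = sum(normalize(h, rules, wordlist) == m for h, m in pairs)
    return hits / len(pairs)

# Hypothetical toy data, not taken from the paper's corpora.
TRAIN = [("vnd", "und"), ("vns", "uns"), ("bey", "bei"), ("seyn", "sein")]
TEST = [("vnter", "unter"), ("zeyt", "zeit"), ("jm", "im"), ("bey", "bei")]
WORDLIST = {"jm": "im"}

RULES = learn_rules(TRAIN)
print(apply_rules("vnter", RULES))           # word-initial v before n -> u
print(exact_match(TEST, RULES))              # rules only
print(exact_match(TEST, RULES, WORDLIST))    # rules plus substitution list
```

In this toy setting, `zeyt` stays unchanged because the context `e-y-t` never occurs in the training pairs. That illustrates, on a very small scale, why rules learned from one type of text transfer only partially to more diverse data.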

Notes

1. This is a revised and extended version of [1], evaluated on a larger test corpus than the original paper. The research reported here was financed by Deutsche Forschungsgemeinschaft (DFG), Grant DI 1558/4-1.
2. Some of these characteristics in fact show up again in specific uses of modern language, such as contributions in chat rooms.
3. http://www.sermon-online.de
4. http://www.linguistics.rub.de/anselm/
5. Identity rules are excluded from the merging process. Otherwise, merging would result in mappings of entire words instead of character sequences, basically identical to a word substitution list.
6. Note that we ignore capitalization for the time being.

References

1. Bollmann, M., Petran, F., Dipper, S.: Applying rule-based normalization to different types of historical texts. An evaluation. In: Proceedings of the 5th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, Poznan, Poland (2011)
2. Bollmann, M., Petran, F., Dipper, S.: Rule-based normalization of historical texts. In: Proceedings of the International Workshop on Language Technologies for Digital Humanities and Cultural Heritage, Hissar, Bulgaria, pp. 34–42 (2011)
3. Braune, F., Fraser, A.: Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora. In: Proceedings of the 23rd International Conference on Computational Linguistics (COLING), Poster Volume, Beijing, China, pp. 81–89 (2010)
4. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)
5. Bollmann, M., Dipper, S., Krasselt, J., Petran, F.: Manual and semi-automatic normalization of historical spelling. Case studies from Early New High German. In: Proceedings of the First International Workshop on Language Technology for Historical Text(s), Vienna, Austria (2012)
6. Piotrowski, M.: Natural Language Processing for Historical Texts. Synthesis Lectures on Human Language Technologies, vol. 17. Morgan & Claypool, San Rafael (2012)
7. Bollmann, M.: (Semi-)automatic normalization of historical texts using distance measures and the Norma tool. In: Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2), Lisbon, Portugal (2012)
8. Bollmann, M.: POS tagging for historical texts with sparse training data. In: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability in Discourse, Sofia, Bulgaria, pp. 11–18 (2013)
9. Baron, A., Rayson, P., Archer, D.: Automatic standardization of spelling for historical text mining. In: Proceedings of Digital Humanities 2009, Maryland, USA (2009)
10. van Halteren, H., Rem, M.: Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters. Lang. Resour. Eval. 47(4), 1233–1259 (2013)
11. Adesam, Y., Ahlberg, M., Bouma, G.: bokstaffua, bokstaffwa, bokstafwa, bokstaua, bokstawa... Towards lexical link-up for a corpus of Old Swedish. In: Proceedings of KONVENS 2012 (LThist 2012 Workshop), Vienna, Austria, pp. 365–369 (2012)
12. Porta, J., Sancho, J.L., Gómez, J.: Edit transducers for spelling variation in Old Spanish. In: Proceedings of the NODALIDA Workshop on Computational Historical Linguistics, Oslo, Norway (2013)
13. Jurish, B.: More than words: using token context to improve canonicalization of historical German. J. Lang. Technol. Comput. Linguist. 25(1), 23–39 (2010)
14. Pettersson, E., Megyesi, B., Tiedemann, J.: An SMT approach to automatic annotation of historical text. In: Proceedings of the NODALIDA Workshop on Computational Historical Linguistics, Oslo, Norway (2013)
15. Scherrer, Y., Erjavec, T.: Modernizing historical Slovene words with character-based SMT. In: Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing (2013)

Author information

Authors and Affiliations

Department of Linguistics, Ruhr-Universität Bochum, 44780, Bochum, Germany
Marcel Bollmann, Florian Petran & Stefanie Dipper

Corresponding author

Correspondence to Stefanie Dipper.

Editor information

Editors and Affiliations

Adam Mickiewicz University, Poznań, Poland
Zygmunt Vetulani
IMMI-CNRS, Orsay, France
Joseph Mariani

Cite this paper

Bollmann, M., Petran, F., Dipper, S. (2014). Applying Rule-Based Normalization to Different Types of Historical Texts—An Evaluation. In: Vetulani, Z., Mariani, J. (eds) Human Language Technology Challenges for Computer Science and Linguistics. LTC 2011. Lecture Notes in Computer Science, vol 8387. Springer, Cham. https://doi.org/10.1007/978-3-319-08958-4_14

DOI: https://doi.org/10.1007/978-3-319-08958-4_14
Published: 26 July 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08957-7
Online ISBN: 978-3-319-08958-4
eBook Packages: Computer Science (R0)
