Applying Rule-Based Normalization to Different Types of Historical Texts—An Evaluation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8387)

Abstract

This paper deals with normalization of language data from Early New High German. We describe an unsupervised, rule-based approach which maps historical wordforms to modern wordforms. Rules are specified in the form of context-aware rewrite rules that apply to sequences of characters. They are derived from two aligned versions of the Luther bible and weighted according to their frequency. Applying the normalization rules to texts by Luther results in 91 % exact matches, clearly outperforming the baseline (65 %). Matches can be improved to 93 % by combining the approach with a word substitution list. If applied to more diverse language data from roughly the same period, performance goes down to 43 % exact matches (baseline: 35 %), and to 46 % using the combined method. The results show that rules derived from a highly different type of text can support normalization to a certain extent.
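The idea of frequency-weighted, context-aware character rewrite rules can be illustrated with a minimal sketch. It is not the authors' implementation: the alignment via Python's `difflib.SequenceMatcher`, the one-character left and right context, the greedy left-to-right application, the function names `derive_rules` and `normalize`, and the toy word pairs are all assumptions made for illustration. Insertions are skipped for brevity, and the sketch assumes that the word substitution list is consulted before the rules.

```python
from collections import Counter
from difflib import SequenceMatcher


def derive_rules(pairs):
    """Count rewrite rules (left, source, right, target) from aligned word pairs."""
    rules = Counter()
    for hist, mod in pairs:
        hist, mod = hist.lower(), mod.lower()  # capitalization is ignored
        padded = "#" + hist + "#"              # '#' marks the word boundary
        matcher = SequenceMatcher(None, hist, mod, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Identity spans yield no rule; insertions are skipped in this sketch.
            if tag in ("equal", "insert"):
                continue
            left = padded[i1]        # character before the rewritten span
            right = padded[i2 + 1]   # character after the rewritten span
            rules[(left, hist[i1:i2], right, mod[j1:j2])] += 1
    return rules


def normalize(word, rules, substitutions=None):
    """Normalize one word: substitution list first, then the rewrite rules."""
    word = word.lower()
    if substitutions and word in substitutions:
        return substitutions[word]
    padded = "#" + word + "#"
    out, i = [], 0
    while i < len(word):
        best = None
        for (left, src, right, tgt), freq in rules.items():
            end = i + len(src)
            if (word.startswith(src, i)
                    and padded[i] == left
                    and padded[end + 1] == right
                    and (best is None or freq > best[0])):
                best = (freq, src, tgt)   # keep the most frequent matching rule
        if best:
            out.append(best[2])
            i += len(best[1])
        else:
            out.append(word[i])           # no rule applies: copy the character
            i += 1
    return "".join(out)


# Toy training pairs (historical form, modern form); hypothetical data.
rules = derive_rules([("vnd", "und"), ("vns", "uns"), ("thun", "tun")])
print(normalize("vnter", rules))               # unter
print(normalize("vater", rules))               # vater (context does not match)
print(normalize("jm", rules, {"jm": "ihm"}))   # ihm (from the substitution list)
```

In this toy setup the rule "v becomes u at the word start before n" is seen twice and therefore outweighs any competing rule for the same context, while "vater" is left untouched because its right context differs.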

Notes

  1. This is a revised and extended version of [1], evaluated on a larger test corpus than the original paper. The research reported here was financed by Deutsche Forschungsgemeinschaft (DFG), Grant DI 1558/4-1.

  2. Some of these characteristics in fact show up again in specific uses of modern language, such as contributions in chat rooms.

  3. http://www.sermon-online.de

  4. http://www.linguistics.rub.de/anselm/

  5. Identity rules are excluded from the merging process. Otherwise, merging would result in mappings of entire words instead of character sequences, basically identical to a word substitution list.

  6. Note that we ignore capitalization for the time being.

References

  1. Bollmann, M., Petran, F., Dipper, S.: Applying rule-based normalization to different types of historical texts. An evaluation. In: Proceedings of the 5th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, Poznan, Poland (2011)

  2. Bollmann, M., Petran, F., Dipper, S.: Rule-based normalization of historical texts. In: Proceedings of the International Workshop on Language Technologies for Digital Humanities and Cultural Heritage, Hissar, Bulgaria, pp. 34–42 (2011)

  3. Braune, F., Fraser, A.: Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora. In: Proceedings of the 23rd International Conference on Computational Linguistics (COLING), Poster Volume, Beijing, China, pp. 81–89 (2010)

  4. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)

  5. Bollmann, M., Dipper, S., Krasselt, J., Petran, F.: Manual and semi-automatic normalization of historical spelling. Case studies from Early New High German. In: Proceedings of the First International Workshop on Language Technology for Historical Text(s), Vienna, Austria (2012)

  6. Piotrowski, M.: Natural Language Processing for Historical Texts. Synthesis Lectures on Human Language Technologies, vol. 17. Morgan & Claypool, San Rafael (2012)

  7. Bollmann, M.: (Semi-)automatic normalization of historical texts using distance measures and the Norma tool. In: Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2), Lisbon, Portugal (2012)

  8. Bollmann, M.: POS tagging for historical texts with sparse training data. In: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability in Discourse, Sofia, Bulgaria, pp. 11–18 (2013)

  9. Baron, A., Rayson, P., Archer, D.: Automatic standardization of spelling for historical text mining. In: Proceedings of Digital Humanities 2009, Maryland, USA (2009)

  10. van Halteren, H., Rem, M.: Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters. Lang. Resour. Eval. 47(4), 1233–1259 (2013)

  11. Adesam, Y., Ahlberg, M., Bouma, G.: bokstaffua, bokstaffwa, bokstafwa, bokstaua, bokstawa... Towards lexical link-up for a corpus of Old Swedish. In: Proceedings of KONVENS 2012 (LThist 2012 Workshop), Vienna, Austria, pp. 365–369 (2012)

  12. Porta, J., Sancho, J.L., Gómez, J.: Edit transducers for spelling variation in Old Spanish. In: Proceedings of the NODALIDA Workshop on Computational Historical Linguistics, Oslo, Norway (2013)

  13. Jurish, B.: More than words: using token context to improve canonicalization of historical German. J. Lang. Technol. Comput. Linguist. 25(1), 23–39 (2010)

  14. Pettersson, E., Megyesi, B., Tiedemann, J.: An SMT approach to automatic annotation of historical text. In: Proceedings of the NODALIDA Workshop on Computational Historical Linguistics, Oslo, Norway (2013)

  15. Scherrer, Y., Erjavec, T.: Modernizing historical Slovene words with character-based SMT. In: Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing (2013)

Author information

Corresponding author

Correspondence to Stefanie Dipper.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Bollmann, M., Petran, F., Dipper, S. (2014). Applying Rule-Based Normalization to Different Types of Historical Texts—An Evaluation. In: Vetulani, Z., Mariani, J. (eds) Human Language Technology Challenges for Computer Science and Linguistics. LTC 2011. Lecture Notes in Computer Science, vol 8387. Springer, Cham. https://doi.org/10.1007/978-3-319-08958-4_14

  • DOI: https://doi.org/10.1007/978-3-319-08958-4_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08957-7

  • Online ISBN: 978-3-319-08958-4

  • eBook Packages: Computer Science, Computer Science (R0)
