Token-based spelling variant detection in Middle Low German texts

  • Fabian Barteld
  • Chris Biemann
  • Heike Zinsmeister
Original Paper

Abstract

In this paper we present a pipeline for the detection of spelling variants, i.e., different spellings that represent the same word, in non-standard texts. For example, in Middle Low German texts in and ihn (among others) are potential spellings of a single word, the personal pronoun ‘him’. Spelling variation is usually addressed by normalization, in which non-standard variants are mapped to a corresponding standard variant, e.g. the Modern German word ihn in the case of in. However, the approach to spelling variant detection presented here does not require such a reference to a standard variant and can therefore be applied to data for which no standard variant exists. The pipeline first generates spelling variants for a given word using rewrite rules and surface similarity. Afterwards, the generated types are filtered. We present a new filter that works on the token level, i.e., taking the context of a word into account. This allows ambiguities on the type level to be resolved. For instance, the Middle Low German word in can be not only the personal pronoun ‘him’ but also the preposition ‘in’, and each of these has different variants. The detected spelling variants can be used in two settings for Digital Humanities research: on the one hand, they can facilitate searching in non-standard texts; on the other hand, they can improve the performance of natural language processing tools on the data by reducing the number of unknown words. To evaluate the utility of the pipeline in both applications, we present two evaluation settings and evaluate the pipeline on Middle Low German texts. We were able to improve the F1 score compared with previous work from 0.39 to 0.52 for the search setting and from 0.23 to 0.30 when detecting spelling variants of unknown words.
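The generation step described in the abstract can be illustrated with a minimal sketch: propose variant candidates for a query word by surface similarity (here plain Levenshtein distance over a corpus lexicon), leaving the token-level contextual filtering to a later stage. This is a hypothetical illustration under simplified assumptions (a fixed toy lexicon, a distance threshold of 1, no rewrite rules), not the authors' implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (Levenshtein 1966)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def candidate_variants(word, lexicon, max_dist=1):
    """Type-level candidates: lexicon entries within max_dist edits.

    A real system would also apply language-specific rewrite rules and
    then filter the candidates in context; this toy lexicon is invented.
    """
    return {w for w in lexicon if w != word
            and levenshtein(word, w) <= max_dist}

lexicon = {"in", "ihn", "jn", "inne", "vnde"}
print(sorted(candidate_variants("in", lexicon)))  # ['ihn', 'jn']
```

Note that this over-generates: ihn is only a variant of in when in is the pronoun ‘him’, which is exactly the type-level ambiguity the paper's token-level filter is designed to resolve.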

Keywords

Spelling variation · Non-standard language · Historical texts · Information retrieval

Acknowledgements

The first author was supported by the German Research Foundation (DFG), grant SCHR 999/5-2. We would like to thank the anonymous reviewers for their helpful remarks and Adam Roussel for improving our English. All remaining errors are ours.


Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. Institut für Germanistik, Universität Hamburg, Hamburg, Germany
  2. Department of Informatics, Language Technology Group, Universität Hamburg, Hamburg, Germany