Abstract
We describe Dsolve, a system for the segmentation of morphologically complex German words into their constituent morphs. Our approach treats morphological segmentation as a classification task, in which the locations and types of morph boundaries are predicted by a Conditional Random Field model trained from manually annotated data. The prediction of morph-boundary types in addition to their locations distinguishes Dsolve from similar approaches previously suggested in the literature. We show that the use of boundary types provides a (somewhat counter-intuitive) performance boost with respect to the simpler task of predicting only segment locations.
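The classification view described in the abstract can be illustrated with a minimal sketch: each character position in a word receives a label that says whether a morph boundary follows it and, if so, of which type. The marker symbols (`~`, `#`, `+`), the label names, and the helper functions below are illustrative assumptions and are not taken from Dsolve itself; the actual label inventory and the CRF training step are not shown here.

```python
# Hypothetical boundary markers and label names (assumptions for illustration;
# Dsolve's real inventory of boundary types may differ).
BOUNDARY_TYPES = {"~": "PREFIX", "#": "STEM", "+": "SUFFIX"}
NO_BOUNDARY = "0"


def encode(annotated):
    """Turn an annotated word such as 'un~glück+lich' into (chars, labels).

    labels[i] describes the position directly after chars[i]:
    a boundary-type name, or NO_BOUNDARY if no morph boundary follows.
    """
    chars, labels = [], []
    for symbol in annotated:
        if symbol in BOUNDARY_TYPES:
            if not chars or labels[-1] != NO_BOUNDARY:
                raise ValueError("misplaced boundary marker in %r" % annotated)
            labels[-1] = BOUNDARY_TYPES[symbol]
        else:
            chars.append(symbol)
            labels.append(NO_BOUNDARY)
    return chars, labels


def strip_types(labels):
    """Collapse typed labels to the simpler location-only (binary) task."""
    return ["1" if label != NO_BOUNDARY else NO_BOUNDARY for label in labels]


def decode(chars, labels):
    """Rebuild the list of morphs from characters and typed or untyped labels."""
    morphs, current = [], ""
    for char, label in zip(chars, labels):
        current += char
        if label != NO_BOUNDARY:
            morphs.append(current)
            current = ""
    if current:
        morphs.append(current)
    return morphs


if __name__ == "__main__":
    chars, labels = encode("un~glück+lich")
    print(list(zip(chars, labels)))          # typed training targets
    print(strip_types(labels))               # location-only training targets
    print(decode(chars, labels))             # recovered morphs
```

In this framing, a sequence classifier trained on the typed labels solves a harder labelling problem than one trained on the output of `strip_types`, yet both decode to the same segmentation, which is what makes the reported performance gain from boundary types somewhat counter-intuitive.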
Notes
1. Although as correctly noted in [23], any class-string c which maximizes P(c, o) will also maximize P(c|o) if the observation string o is held fixed.
2. Note that our use of “model order” in this paper refers only to the context window size used to define the feature function inventory, and is unrelated to the order of linear-chain feature dependencies in the underlying CRF models.
3.
4.
5. http://www.cis.hut.fi/projects/morpho/morfessorflatcat.shtml; FlatCat models were trained with perplexity threshold 10.0 using annotated corpus data in semi-supervised mode.
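The equivalence stated in the first note follows in one step from the definition of conditional probability. The short derivation below is a sketch; the symbol \hat{c} for the selected class-string is introduced here for illustration, and P(o) > 0 is assumed.

```latex
\[
\hat{c}
  \;=\; \arg\max_{c} P(c \mid o)
  \;=\; \arg\max_{c} \frac{P(c, o)}{P(o)}
  \;=\; \arg\max_{c} P(c, o),
  \qquad P(o) > 0 .
\]
```

Because o is held fixed, P(o) is a positive constant with respect to c, so dividing by it does not change which c attains the maximum.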
References
Beesley, K.R., Karttunen, L.: Finite State Morphology. CSLI, Stanford (2003)
Chang, J.Z., Chang, J.S.: Word root finder: a morphological segmentor based on CRF. In: Proceedings of COLING 2012: Demonstration Papers, pp. 51–58 (2012)
Creutz, M., Lagus, K.: Unsupervised discovery of morphemes. In: Proceedings of the ACL 2002 Workshop on Morphological and Phonological Learning, pp. 21–30 (2002)
Creutz, M., Lagus, K.: Unsupervised models for morpheme segmentation and morphology learning. ACM Trans. Speech Lang. Process. 4(1), 3:1–3:34 (2007)
Creutz, M., Lindén, K.: Morpheme segmentation gold standards for Finnish and English. Technical report A77, Helsinki University of Technology (2004)
Daelemans, W.: Grafon: a grapheme-to-phoneme conversion system for Dutch. In: Proceedings of COLING 1988, pp. 133–138 (1988)
Déjean, H.: Morphemes as necessary concept for structures discovery from untagged corpora. In: Proceedings of the Joint Conferences on New Methods in Language Processing and Computational Natural Language Learning, pp. 295–298 (1998)
Frakes, W.B.: Stemming algorithms. In: Frakes, W.B., Baeza-Yates, R. (eds.) Information Retrieval, pp. 131–160. Prentice-Hall, Upper Saddle River (1992)
Geyken, A., Hanneforth, T.: TAGH: a complete morphology for German based on weighted finite state automata. In: Yli-Jyrä, A., Karttunen, L., Karhumäki, J. (eds.) FSMNLP 2005. LNCS (LNAI), vol. 4002, pp. 55–66. Springer, Heidelberg (2006)
Goldsmith, J.: Unsupervised learning of the morphology of a natural language. Comput. Linguist. 27(2), 153–198 (2001)
Green, S., DeNero, J.: A class-based agreement model for generating accurately inflected translations. In: Proceedings of ACL 2012, pp. 146–155 (2012)
Haapalainen, M., Majorin, A.: GERTWOL und morphologische Disambiguierung für das Deutsche. In: Proceedings of the 10th Nordic Conference of Computational Linguistics. University of Helsinki, Department of General Linguistics (1995)
Harris, Z.: From phoneme to morpheme. Language 31, 190–222 (1955)
Klenk, U., Langer, H.: Morphological segmentation without a lexicon. Literary Linguist. Comput. 4(4), 247–253 (1989)
Kohonen, O., Virpioja, S., Lagus, K.: Semi-supervised learning of concatenative morphology. In: Proceedings of SIGMORPHON 2010, pp. 78–86 (2010)
Kurimo, M., Virpioja, S., Turunen, V., Lagus, K.: Morpho challenge competition 2005–2010: evaluations and results. In: Proceedings of SIGMORPHON 2010, pp. 87–95 (2010)
Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289. Morgan Kaufmann (2001)
Lavergne, T., Cappé, O., Yvon, F.: Practical very large scale CRFs. In: Proceedings of ACL 2010, pp. 504–513 (2010)
Müller, C., Gurevych, I.: Semantically enhanced term frequency. In: Gurrin, C., He, Y., Kazai, G., Kruschwitz, U., Little, S., Roelleke, T., Rüger, S., van Rijsbergen, K. (eds.) ECIR 2010. LNCS, vol. 5993, pp. 598–601. Springer, Heidelberg (2010)
Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, Berlin (1999)
Pfeifer, W.: Etymologisches Wörterbuch des Deutschen, 2nd edn. Akademie-Verlag, Berlin (1993)
Porter, M.F.: An algorithm for suffix stripping. Electron. Libr. Inf. Syst. 14(3), 130–137 (1980)
Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–285 (1989)
Reichel, U.D., Weilhammer, K.: Automated morphological segmentation and evaluation. In: Proceedings of LREC, pp. 503–506 (2004)
van Rijsbergen, C.J.: Information Retrieval. Butterworth-Heinemann, Newton (1979)
Ruokolainen, T., Kohonen, O., Virpioja, S., Kurimo, M.: Supervised morphological segmentation in a low-resource learning setting using conditional random fields. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 29–37 (2013)
Ruokolainen, T., Kohonen, O., Virpioja, S., Kurimo, M.: Painless semi-supervised morphological segmentation using conditional random fields. In: Proceedings of EACL 2014, pp. 84–89 (2014)
Schmid, H., Fitschen, A., Heid, U.: SMOR: a German computational morphology covering derivation, composition and inflection. In: Proceedings of LREC (2004)
Selkirk, E.O.: On the nature of phonological representation. In: Myers, T., Laver, J., Anderson, J. (eds.) The Cognitive Representation of Speech, pp. 379–388. North-Holland Publishing Company, Dordrecht (1981)
Tseng, H., Chang, P., Andrew, G., Jurafsky, D., Manning, C.: A conditional random field word segmenter for SIGHAN bakeoff 2005. In: Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing (2005)
Wallach, H.M.: Conditional random fields: an introduction. Technical report MS-CIS-04-21, University of Pennsylvania, Department of Computer and Information Science (2004)
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Würzner, KM., Jurish, B. (2015). Dsolve—Morphological Segmentation for German Using Conditional Random Fields. In: Mahlow, C., Piotrowski, M. (eds) Systems and Frameworks for Computational Morphology. SFCM 2015. Communications in Computer and Information Science, vol 537. Springer, Cham. https://doi.org/10.1007/978-3-319-23980-4_6
DOI: https://doi.org/10.1007/978-3-319-23980-4_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23978-1
Online ISBN: 978-3-319-23980-4
eBook Packages: Computer Science, Computer Science (R0)