Learning Morphology of Natural Language as a Finite-State Grammar

  • Javad Nouri
  • Roman Yangarber
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10583)


We present algorithms that learn to segment words in morphologically rich languages in an unsupervised fashion. The morphology of many languages can be modeled by finite-state machines (FSMs). We start with a baseline learning algorithm based on the minimum description length (MDL) principle. We then formulate well-motivated, general linguistic principles about morphology and incorporate them into the algorithm as heuristics that constrain the search space. We evaluate the algorithm on two highly inflecting languages. The segmentation evaluation shows gains in performance over the state of the art. We conclude by discussing how the learned model relates to a morphological FSM, which is the ultimate goal.
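As a rough illustration of the MDL idea behind such a baseline, the sketch below scores a candidate segmentation with a simple two-part code: the cost of spelling out the morph lexicon plus the cost of encoding the corpus as a sequence of morph tokens. This is not the coding scheme used in the paper; the function name `description_length`, the uniform letter code, and the toy Finnish word forms are all assumptions made for illustration.

```python
import math
from collections import Counter


def description_length(segmentations):
    """Two-part code length, in bits, of a segmented corpus (hypothetical scheme).

    segmentations: list of words, each given as a list of morph strings.
    """
    morphs = Counter(m for word in segmentations for m in word)
    total_tokens = sum(morphs.values())
    alphabet = {ch for m in morphs for ch in m}

    # Model cost: spell each distinct morph letter by letter, followed by an
    # end-of-morph symbol, using a uniform code over the alphabet.
    bits_per_symbol = math.log2(len(alphabet) + 1)
    model_cost = sum((len(m) + 1) * bits_per_symbol for m in morphs)

    # Data cost: encode every morph token with its maximum-likelihood probability.
    data_cost = -sum(c * math.log2(c / total_tokens) for c in morphs.values())

    return model_cost + data_cost


# Toy data (assumed, not from the paper): four Finnish word forms.
unsegmented = [["talossa"], ["talossani"], ["autossa"], ["autossani"]]
segmented = [
    ["talo", "ssa"],
    ["talo", "ssa", "ni"],
    ["auto", "ssa"],
    ["auto", "ssa", "ni"],
]

print(f"unsegmented: {description_length(unsegmented):.1f} bits")
print(f"segmented:   {description_length(segmented):.1f} bits")
```

On this toy input the segmented analysis receives the shorter code, because the shared morphs are stored once in the lexicon and reused. An MDL-based learner searches over segmentations for one that minimizes such a cost.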


Unsupervised morphology induction; Minimum description length principle; MDL; Finite-state automata



This research was supported in part by the FinUgRevita Project, No. 267097, of the Academy of Finland. We thank Hannes Wettig for his contributions to this work.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Department of Computer Science, University of Helsinki, Helsinki, Finland
