Data-Driven Identification of German Phrasal Compounds

  • Adrien BarbaresiEmail author
  • Katrin Hein
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10415)


We present a method to identify and document a phenomenon on which there is very little empirical data: German phrasal compounds occurring in the form of as a single token (without punctuation between their components). Relying on linguistic criteria, our approach implies to have an operational notion of compounds which can be systematically applied as well as (web) corpora which are large and diverse enough to contain rarely seen phenomena. The method is based on word segmentation and morphological analysis, it takes advantage of a data-driven learning process. Our results show that coarse-grained identification of phrasal compounds is best performed with empirical data, whereas fine-grained detection could be improved with a combination of rule-based and frequency-based word lists. Along with the characteristics of web texts, the orthographic realizations seem to be linked to the degree of expressivity.


Corpus linguistics Word segmentation Morphological analysis Web corpora 


  1. 1.
    Agirre, E., Alegria, I., Arregi, X., Artola, X., de Ilarraza, A.D., Maritxalar, M., Sarasola, K., Urkia, M.: XUXEN: a spelling checker/corrector for Basque based on two-level morphology. In: Proceedings of the 3rd Conference on Applied Natural Language Processing, pp. 119–125. Association for Computational Linguistics (1992)Google Scholar
  2. 2.
    Barbaresi, A.: Ad hoc and general-purpose corpus construction from web sources. Ph.D. thesis, École Normale Supérieure de Lyon, France (2015)Google Scholar
  3. 3.
    Barbaresi, A.: An unsupervised morphological criterion for discriminating similar languages. In: Malmasi, S., Zampieri, M., Ljubešić, N., Nakov, P., Ali, A., Tiedemann, J. (eds.) Proceedings of the 3rd VarDial Workshop, pp. 212–220 (2016)Google Scholar
  4. 4.
    Barbaresi, A.: Bootstrapped OCR error detection for a less-resourced language variant. In: Dipper, S., Neubarth, F., Zinsmeister, H. (eds.) Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), pp. 21–26. University of Bochum (2016)Google Scholar
  5. 5.
    Barbaresi, A.: Efficient construction of metadata-enhanced web corpora. In: Cook, P., Evert, S., Schäfer, R., Stemle, E. (eds.) Proceedings of the 10th Web as Corpus Workshop, pp. 7–16. Association for Computational Linguistics (2016)Google Scholar
  6. 6.
    Barbaresi, A., Würzner, K.M.: For a fistful of blogs: discovery and comparative benchmarking of republishable German content. In: Beißwenger, M., Zesch, T. (eds.) KONVENS 2014, NLP4CMC Workshop Proceedings, pp. 2–10. Hildesheim University Press (2014)Google Scholar
  7. 7.
    Ben Hamadou, A.: A compression technique for Arabic dictionaries: the affix analysis. In: Proceedings of the 11th Conference on Computational Linguistics, pp. 286–288. Association for Computational Linguistics (1986)Google Scholar
  8. 8.
    Dasgupta, S., Ng, V.: High-performance, language-independent morphological segmentation. In: HLT-NAACL, pp. 155–163 (2007)Google Scholar
  9. 9.
    Demberg, V.: A language-independent unsupervised model for morphological segmentation. In: Annual Meeting of the Association for Computational Linguistics, vol. 45, pp. 920–927 (2007)Google Scholar
  10. 10.
    Finkbeiner, R., Meibauer, J.: Boris “Ich bin drin” Becker (“Boris I am in Becker”). Syntax, semantics and pragmatics of a special naming construction. Lingua 181, 36–57 (2016)Google Scholar
  11. 11.
    Fredkin, E.: Trie memory. Commun. ACM 3(9), 490–499 (1960)CrossRefGoogle Scholar
  12. 12.
    Geyken, A.: The DWDS corpus: a reference corpus for the German language of the 20th century. In: Fellbaum, C. (ed.) Collocations and Idioms: Linguistic, Lexicographic, and Computational Aspects, pp. 23–41. Continuum Press (2007)Google Scholar
  13. 13.
    Hafer, M.A., Weiss, S.F.: Word segmentation by letter successor varieties. Inf. Storage Retrieval 10, 371–385 (1974)CrossRefGoogle Scholar
  14. 14.
    Harris, Z.S.: From phoneme to morphemes. Language 31(2), 190–222 (1955)CrossRefGoogle Scholar
  15. 15.
    Hein, K.: Phrasenkomposita - ein wortbildungsfremdes Randphänomen zwischen Morphologie und Syntax? Deutsche Sprache 39, 331–361 (2011)Google Scholar
  16. 16.
    Hein, K.: Phrasenkomposita im Deutschen. Empirische Untersuchung und konstruktionsgrammatische Modellierung. Narr (2015)Google Scholar
  17. 17.
    Hein, K.: Modeling the properties of German phrasal compounds within a usage-based constructional approach. In: Trips, C., Kornflit, J. (eds.) Further Investigations into the Nature of Phrasal Compounding. Language Science Press, Berlin (2017, to appear)Google Scholar
  18. 18.
    Henrich, V., Hinrichs, E.W.: Determining immediate constituents of compounds in GermaNet. In: Proceedings of Recent Advances in Natural Language Processing, pp. 420–426 (2011)Google Scholar
  19. 19.
    IDS: Deutsches Referenzkorpus/Archiv der Korpora geschriebener Gegenwartssprache 2011-I. Technical report, Institut für Deutsche Sprache Mannheim (2011).
  20. 20.
    Jones, M.A., Silverman, A.: A spelling checker based on affix classes. In: Agrawal, J.C., Zunde, P. (eds.) Empirical Foundations of Information and Software Science, pp. 373–379. Springer, Boston (1985)CrossRefGoogle Scholar
  21. 21.
    Keshava, S., Pitler, E.: A simpler, intuitive approach to morpheme induction. In: Proceedings of 2nd Pascal Challenges Workshop, pp. 31–35 (2006)Google Scholar
  22. 22.
    Lawrenz, B.: Moderne deutsche Wortbildung. Phrasale Wortbildung im Deutschen: Linguistische Untersuchung und sprachdidaktische Behandlung. Dr. Kovaĉ (2006)Google Scholar
  23. 23.
    Malmasi, S., Zampieri, M., Ljubešić, N., Nakov, P., Ali, A., Tiedemann, J.: Discriminating between similar languages and Arabic dialect identification: a report on the third DSL shared task. In: Proceedings of the 3rd VarDial Workshop (2016)Google Scholar
  24. 24.
    Meibauer, J.: Phrasenkomposita zwischen Wortsyntax und Lexikon. Zeitschrift für Sprachwissenschaft 22, 153–188 (2003)Google Scholar
  25. 25.
    Meibauer, J.: How marginal are phrasal compounds? Generalized insertion, expressivity, and I/Q-interaction. Morphology 17, 233–259 (2007)CrossRefGoogle Scholar
  26. 26.
    Müller, T.: General methods for fine-grained morphological and syntactic disambiguation. Ph.D. thesis, LMU Munich (2015)Google Scholar
  27. 27.
    Olsen, S.: Composition. In: Müller, P.O., Ohnheiser, I., Olsen, S., Rainer, F. (eds.) Word-formation. An International Handbook of the Languages of Europe, II: Units and Processes in Word-formation I: General Aspects, vol. 1, pp. 364–386. De Gruyter Mouton, Berlin/Boston (2015)Google Scholar
  28. 28.
    Ortner, L., Müller-Bollhagen, E.: Substantivkomposita. Deutsche Wortbildung: Typen und Tendenzen in der Gegenwartssprache, Schwann (1991)Google Scholar
  29. 29.
    Particke, H.J.: Phrasenkomposita: eine morphosyntaktische Beschreibung und Korpusstudie am Beispiel des Deutschen. Diplomica-Verlag, Hamburg (2015)Google Scholar
  30. 30.
    Peterson, J.L.: Computer programs for detecting and correcting spelling errors. Commun. ACM 23(12), 676–687 (1980)CrossRefGoogle Scholar
  31. 31.
    Schlücker, B.: Die deutsche Kompositionsfreudigkeit. Übersicht und Einführung. In: Gaeta, L., Schlücker, B. (eds.) Deutsche als kompositionsfreudige Sprache. Strukturelle Eigenschaften und systembezogene Aspekte, pp. 1–25. de Gruyter (2012)Google Scholar
  32. 32.
    Schmid, H., Fitschen, A., Heid, U.: SMOR: a German computational morphology covering derivation, composition and inflection. In: Proceedings of LREC, pp. 233–259 (2004)Google Scholar
  33. 33.
    Steyer, K., Hein, K.: Satzwertige usuelle Wortverbindungen und gebrauchsbasierte Muster. In: Engelberg, S., Lobin, H., Steyer, K., Wolfer, S. (eds.) Wortschätze: Dynamik, Muster, Komplexität, Jahrbuch des Instituts für Deutsche Sprache 2017. de Gruyter (2018, to appear)Google Scholar
  34. 34.
    Trips, C.: The relevance of phrasal compounds for the architecture of grammar. In: ten Hacken, P. (ed.) The Semantics of Compounding, pp. 153–177. Oxford University Press (2016)Google Scholar
  35. 35.
    Trips, C., Kornfilt, J. (eds.): Phrasal compounds from a typological and theoretical perspective. Special issue of STUF. Language Typology and Universals (2015)Google Scholar

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. 1.Zentrum SpracheBerlin-Brandenburg Academy of SciencesBerlinGermany
  2. 2.Lexical DepartmentInstitute for the German LanguageMannheimGermany
  3. 3.Academy CorporaAustrian Academy of SciencesViennaAustria

Personalised recommendations