Data-Driven Identification of German Phrasal Compounds

  • Conference paper
Text, Speech, and Dialogue (TSD 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10415)

Abstract

We present a method to identify and document a phenomenon on which there is very little empirical data: German phrasal compounds occurring as a single token (without punctuation between their components). Relying on linguistic criteria, our approach requires an operational notion of compounds that can be applied systematically, as well as (web) corpora that are large and diverse enough to contain rarely seen phenomena. The method is based on word segmentation and morphological analysis, and it takes advantage of a data-driven learning process. Our results show that coarse-grained identification of phrasal compounds is best performed with empirical data, whereas fine-grained detection could be improved with a combination of rule-based and frequency-based word lists. Along with the characteristics of web texts, the orthographic realizations seem to be linked to the degree of expressivity.
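
The segmentation idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicon, the function names `segment` and `is_candidate`, the `min_parts` threshold, and the example token are all assumptions made for illustration. The sketch splits a single token into known words by dynamic programming and flags tokens that decompose into several parts as candidates for closer inspection.

```python
from typing import Optional

# Hypothetical toy lexicon; a real system would rely on corpus-derived word lists.
LEXICON = {"ich", "mag", "dich", "politik", "haus", "tür"}


def segment(token: str, lexicon: set[str]) -> Optional[list[str]]:
    """Split a token into lexicon words (fewest parts), or return None."""
    token = token.lower()
    # best[i] holds the shortest known segmentation of token[:i], if any.
    best: list[Optional[list[str]]] = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            prefix = best[start]
            piece = token[start:end]
            if prefix is not None and piece in lexicon:
                candidate = prefix + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(token)]


def is_candidate(token: str, lexicon: set[str], min_parts: int = 3) -> bool:
    """Flag tokens that split into at least min_parts known words (assumed threshold)."""
    parts = segment(token, lexicon)
    return parts is not None and len(parts) >= min_parts


# Invented example token, written without punctuation between its components.
print(segment("Ichmagdichpolitik", LEXICON))
print(is_candidate("Haustür", LEXICON))
```

A count-based threshold like this only gives a coarse filter; telling phrasal compounds apart from ordinary multi-part compounds would still need the morphological analysis and word lists the abstract refers to.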

Notes

  1. “One-should-be-able-to-talk-about-everything motto”, “I-like-you-policy”. All examples appear in their original graphic realization, as found in previous studies [19] and billion-token web corpora [5, 6].

  2. One defined PC type is not captured by our automatic detection: word formations whose non-head consists of NPs that are not explicitly coordinated, e.g. Frage-Antwort-Stunde, cf. p. 194 f.

References

  1. Agirre, E., Alegria, I., Arregi, X., Artola, X., de Ilarraza, A.D., Maritxalar, M., Sarasola, K., Urkia, M.: XUXEN: a spelling checker/corrector for Basque based on two-level morphology. In: Proceedings of the 3rd Conference on Applied Natural Language Processing, pp. 119–125. Association for Computational Linguistics (1992)

  2. Barbaresi, A.: Ad hoc and general-purpose corpus construction from web sources. Ph.D. thesis, École Normale Supérieure de Lyon, France (2015)

  3. Barbaresi, A.: An unsupervised morphological criterion for discriminating similar languages. In: Malmasi, S., Zampieri, M., Ljubešić, N., Nakov, P., Ali, A., Tiedemann, J. (eds.) Proceedings of the 3rd VarDial Workshop, pp. 212–220 (2016)

  4. Barbaresi, A.: Bootstrapped OCR error detection for a less-resourced language variant. In: Dipper, S., Neubarth, F., Zinsmeister, H. (eds.) Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), pp. 21–26. University of Bochum (2016)

  5. Barbaresi, A.: Efficient construction of metadata-enhanced web corpora. In: Cook, P., Evert, S., Schäfer, R., Stemle, E. (eds.) Proceedings of the 10th Web as Corpus Workshop, pp. 7–16. Association for Computational Linguistics (2016)

  6. Barbaresi, A., Würzner, K.M.: For a fistful of blogs: discovery and comparative benchmarking of republishable German content. In: Beißwenger, M., Zesch, T. (eds.) KONVENS 2014, NLP4CMC Workshop Proceedings, pp. 2–10. Hildesheim University Press (2014)

  7. Ben Hamadou, A.: A compression technique for Arabic dictionaries: the affix analysis. In: Proceedings of the 11th Conference on Computational Linguistics, pp. 286–288. Association for Computational Linguistics (1986)

  8. Dasgupta, S., Ng, V.: High-performance, language-independent morphological segmentation. In: HLT-NAACL, pp. 155–163 (2007)

  9. Demberg, V.: A language-independent unsupervised model for morphological segmentation. In: Annual Meeting of the Association for Computational Linguistics, vol. 45, pp. 920–927 (2007)

  10. Finkbeiner, R., Meibauer, J.: Boris “Ich bin drin” Becker (“Boris I am in Becker”). Syntax, semantics and pragmatics of a special naming construction. Lingua 181, 36–57 (2016)

  11. Fredkin, E.: Trie memory. Commun. ACM 3(9), 490–499 (1960)

  12. Geyken, A.: The DWDS corpus: a reference corpus for the German language of the 20th century. In: Fellbaum, C. (ed.) Collocations and Idioms: Linguistic, Lexicographic, and Computational Aspects, pp. 23–41. Continuum Press (2007)

  13. Hafer, M.A., Weiss, S.F.: Word segmentation by letter successor varieties. Inf. Storage Retrieval 10, 371–385 (1974)

  14. Harris, Z.S.: From phoneme to morphemes. Language 31(2), 190–222 (1955)

  15. Hein, K.: Phrasenkomposita - ein wortbildungsfremdes Randphänomen zwischen Morphologie und Syntax? Deutsche Sprache 39, 331–361 (2011)

  16. Hein, K.: Phrasenkomposita im Deutschen. Empirische Untersuchung und konstruktionsgrammatische Modellierung. Narr (2015)

  17. Hein, K.: Modeling the properties of German phrasal compounds within a usage-based constructional approach. In: Trips, C., Kornfilt, J. (eds.) Further Investigations into the Nature of Phrasal Compounding. Language Science Press, Berlin (2017, to appear)

  18. Henrich, V., Hinrichs, E.W.: Determining immediate constituents of compounds in GermaNet. In: Proceedings of Recent Advances in Natural Language Processing, pp. 420–426 (2011)

  19. IDS: Deutsches Referenzkorpus/Archiv der Korpora geschriebener Gegenwartssprache 2011-I. Technical report, Institut für Deutsche Sprache Mannheim (2011). www.ids-mannheim.de/dereko

  20. Jones, M.A., Silverman, A.: A spelling checker based on affix classes. In: Agrawal, J.C., Zunde, P. (eds.) Empirical Foundations of Information and Software Science, pp. 373–379. Springer, Boston (1985)

  21. Keshava, S., Pitler, E.: A simpler, intuitive approach to morpheme induction. In: Proceedings of 2nd Pascal Challenges Workshop, pp. 31–35 (2006)

  22. Lawrenz, B.: Moderne deutsche Wortbildung. Phrasale Wortbildung im Deutschen: Linguistische Untersuchung und sprachdidaktische Behandlung. Dr. Kovač (2006)

  23. Malmasi, S., Zampieri, M., Ljubešić, N., Nakov, P., Ali, A., Tiedemann, J.: Discriminating between similar languages and Arabic dialect identification: a report on the third DSL shared task. In: Proceedings of the 3rd VarDial Workshop (2016)

  24. Meibauer, J.: Phrasenkomposita zwischen Wortsyntax und Lexikon. Zeitschrift für Sprachwissenschaft 22, 153–188 (2003)

  25. Meibauer, J.: How marginal are phrasal compounds? Generalized insertion, expressivity, and I/Q-interaction. Morphology 17, 233–259 (2007)

  26. Müller, T.: General methods for fine-grained morphological and syntactic disambiguation. Ph.D. thesis, LMU Munich (2015)

  27. Olsen, S.: Composition. In: Müller, P.O., Ohnheiser, I., Olsen, S., Rainer, F. (eds.) Word-formation. An International Handbook of the Languages of Europe, II: Units and Processes in Word-formation I: General Aspects, vol. 1, pp. 364–386. De Gruyter Mouton, Berlin/Boston (2015)

  28. Ortner, L., Müller-Bollhagen, E.: Substantivkomposita. Deutsche Wortbildung: Typen und Tendenzen in der Gegenwartssprache, Schwann (1991)

  29. Particke, H.J.: Phrasenkomposita: eine morphosyntaktische Beschreibung und Korpusstudie am Beispiel des Deutschen. Diplomica-Verlag, Hamburg (2015)

  30. Peterson, J.L.: Computer programs for detecting and correcting spelling errors. Commun. ACM 23(12), 676–687 (1980)

  31. Schlücker, B.: Die deutsche Kompositionsfreudigkeit. Übersicht und Einführung. In: Gaeta, L., Schlücker, B. (eds.) Das Deutsche als kompositionsfreudige Sprache. Strukturelle Eigenschaften und systembezogene Aspekte, pp. 1–25. de Gruyter (2012)

  32. Schmid, H., Fitschen, A., Heid, U.: SMOR: a German computational morphology covering derivation, composition and inflection. In: Proceedings of LREC, pp. 233–259 (2004)

  33. Steyer, K., Hein, K.: Satzwertige usuelle Wortverbindungen und gebrauchsbasierte Muster. In: Engelberg, S., Lobin, H., Steyer, K., Wolfer, S. (eds.) Wortschätze: Dynamik, Muster, Komplexität, Jahrbuch des Instituts für Deutsche Sprache 2017. de Gruyter (2018, to appear)

  34. Trips, C.: The relevance of phrasal compounds for the architecture of grammar. In: ten Hacken, P. (ed.) The Semantics of Compounding, pp. 153–177. Oxford University Press (2016)

  35. Trips, C., Kornfilt, J. (eds.): Phrasal compounds from a typological and theoretical perspective. Special issue of STUF. Language Typology and Universals (2015)

Author information

Corresponding author

Correspondence to Adrien Barbaresi.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Barbaresi, A., Hein, K. (2017). Data-Driven Identification of German Phrasal Compounds. In: Ekštein, K., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2017. Lecture Notes in Computer Science, vol 10415. Springer, Cham. https://doi.org/10.1007/978-3-319-64206-2_22

  • DOI: https://doi.org/10.1007/978-3-319-64206-2_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64205-5

  • Online ISBN: 978-3-319-64206-2

  • eBook Packages: Computer Science, Computer Science (R0)
