Data-Driven Identification of German Phrasal Compounds

Barbaresi, Adrien; Hein, Katrin

doi:10.1007/978-3-319-64206-2_22

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10415)

Included in the following conference series: International Conference on Text, Speech, and Dialogue

Abstract

We present a method to identify and document a phenomenon on which there is very little empirical data: German phrasal compounds occurring as a single token (without punctuation between their components). Relying on linguistic criteria, our approach requires an operational notion of compounds that can be applied systematically, as well as (web) corpora that are large and diverse enough to contain rarely seen phenomena. The method is based on word segmentation and morphological analysis, and it takes advantage of a data-driven learning process. Our results show that coarse-grained identification of phrasal compounds is best performed with empirical data, whereas fine-grained detection could be improved with a combination of rule-based and frequency-based word lists. Along with the characteristics of web texts, the orthographic realizations seem to be linked to the degree of expressivity.
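
To make the idea of segmenting a single token into word-like components more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a plain dictionary lookup: a token is split into entries of a word list, and it is flagged as a phrasal-compound candidate when the non-head part contains closed-class words. The word lists, the function names `segment` and `is_phrasal_candidate`, the threshold `min_parts`, and the test token (a made-up form modeled on the gloss "I-like-you-policy") are all hypothetical.

```python
from functools import lru_cache

# Hypothetical toy lexicon; a real system would rely on corpus-derived
# word lists or a morphological analyser instead.
LEXICON = {"ich", "mag", "dich", "politik", "haus", "tür"}
# Hypothetical closed-class items whose presence hints at a phrasal non-head.
FUNCTION_WORDS = {"ich", "dich"}


def segment(token, lexicon=LEXICON):
    """Split token into lexicon words (fewest parts); return None if impossible."""
    token = token.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Empty tuple: the whole token has been consumed successfully.
        if start == len(token):
            return ()
        candidates = []
        for end in range(start + 1, len(token) + 1):
            piece = token[start:end]
            if piece in lexicon:
                rest = best(end)
                if rest is not None:
                    candidates.append((piece,) + rest)
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else None


def is_phrasal_candidate(token, min_parts=3):
    """Flag tokens with several parts and a function word in the non-head."""
    parts = segment(token)
    if parts is None or len(parts) < min_parts:
        return False
    # The last part is treated as the head; only the non-head is inspected.
    return any(part in FUNCTION_WORDS for part in parts[:-1])


if __name__ == "__main__":
    for tok in ["Ichmagdichpolitik", "Haustür"]:
        print(tok, segment(tok), is_phrasal_candidate(tok))
```

A lookup of this kind only covers the coarse-grained step. Linking elements, inflected forms, and ambiguous splits would need the morphological analysis and frequency information mentioned in the abstract.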

Notes

1. “One-should-be-able-to-talk-about-everything motto”, “I-like-you-policy”. All examples appear in their original graphic realization, as found in previous studies [19] and billion-token web corpora [5, 6].
2. One of the defined PC types is not captured by our automatic detection: word formations whose non-head consists of NPs that are not explicitly coordinated, e.g. Frage-Antwort-Stunde (cf. p. 194 f.).

References

Agirre, E., Alegria, I., Arregi, X., Artola, X., de Ilarraza, A.D., Maritxalar, M., Sarasola, K., Urkia, M.: XUXEN: a spelling checker/corrector for Basque based on two-level morphology. In: Proceedings of the 3rd Conference on Applied Natural Language Processing, pp. 119–125. Association for Computational Linguistics (1992)
Barbaresi, A.: Ad hoc and general-purpose corpus construction from web sources. Ph.D. thesis, École Normale Supérieure de Lyon, France (2015)
Barbaresi, A.: An unsupervised morphological criterion for discriminating similar languages. In: Malmasi, S., Zampieri, M., Ljubešić, N., Nakov, P., Ali, A., Tiedemann, J. (eds.) Proceedings of the 3rd VarDial Workshop, pp. 212–220 (2016)
Barbaresi, A.: Bootstrapped OCR error detection for a less-resourced language variant. In: Dipper, S., Neubarth, F., Zinsmeister, H. (eds.) Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), pp. 21–26. University of Bochum (2016)
Barbaresi, A.: Efficient construction of metadata-enhanced web corpora. In: Cook, P., Evert, S., Schäfer, R., Stemle, E. (eds.) Proceedings of the 10th Web as Corpus Workshop, pp. 7–16. Association for Computational Linguistics (2016)
Barbaresi, A., Würzner, K.M.: For a fistful of blogs: discovery and comparative benchmarking of republishable German content. In: Beißwenger, M., Zesch, T. (eds.) KONVENS 2014, NLP4CMC Workshop Proceedings, pp. 2–10. Hildesheim University Press (2014)
Ben Hamadou, A.: A compression technique for Arabic dictionaries: the affix analysis. In: Proceedings of the 11th Conference on Computational Linguistics, pp. 286–288. Association for Computational Linguistics (1986)
Dasgupta, S., Ng, V.: High-performance, language-independent morphological segmentation. In: HLT-NAACL, pp. 155–163 (2007)
Demberg, V.: A language-independent unsupervised model for morphological segmentation. In: Annual Meeting of the Association for Computational Linguistics, vol. 45, pp. 920–927 (2007)
Finkbeiner, R., Meibauer, J.: Boris “Ich bin drin” Becker (“Boris I am in Becker”). Syntax, semantics and pragmatics of a special naming construction. Lingua 181, 36–57 (2016)
Fredkin, E.: Trie memory. Commun. ACM 3(9), 490–499 (1960)
Geyken, A.: The DWDS corpus: a reference corpus for the German language of the 20th century. In: Fellbaum, C. (ed.) Collocations and Idioms: Linguistic, Lexicographic, and Computational Aspects, pp. 23–41. Continuum Press (2007)
Hafer, M.A., Weiss, S.F.: Word segmentation by letter successor varieties. Inf. Storage Retrieval 10, 371–385 (1974)
Harris, Z.S.: From phoneme to morphemes. Language 31(2), 190–222 (1955)
Hein, K.: Phrasenkomposita - ein wortbildungsfremdes Randphänomen zwischen Morphologie und Syntax? Deutsche Sprache 39, 331–361 (2011)
Hein, K.: Phrasenkomposita im Deutschen. Empirische Untersuchung und konstruktionsgrammatische Modellierung. Narr (2015)
Hein, K.: Modeling the properties of German phrasal compounds within a usage-based constructional approach. In: Trips, C., Kornfilt, J. (eds.) Further Investigations into the Nature of Phrasal Compounding. Language Science Press, Berlin (2017, to appear)
Henrich, V., Hinrichs, E.W.: Determining immediate constituents of compounds in GermaNet. In: Proceedings of Recent Advances in Natural Language Processing, pp. 420–426 (2011)
IDS: Deutsches Referenzkorpus/Archiv der Korpora geschriebener Gegenwartssprache 2011-I. Technical report, Institut für Deutsche Sprache Mannheim (2011). www.ids-mannheim.de/dereko
Jones, M.A., Silverman, A.: A spelling checker based on affix classes. In: Agrawal, J.C., Zunde, P. (eds.) Empirical Foundations of Information and Software Science, pp. 373–379. Springer, Boston (1985)
Keshava, S., Pitler, E.: A simpler, intuitive approach to morpheme induction. In: Proceedings of 2nd Pascal Challenges Workshop, pp. 31–35 (2006)
Lawrenz, B.: Moderne deutsche Wortbildung. Phrasale Wortbildung im Deutschen: Linguistische Untersuchung und sprachdidaktische Behandlung. Dr. Kovač (2006)
Malmasi, S., Zampieri, M., Ljubešić, N., Nakov, P., Ali, A., Tiedemann, J.: Discriminating between similar languages and Arabic dialect identification: a report on the third DSL shared task. In: Proceedings of the 3rd VarDial Workshop (2016)
Meibauer, J.: Phrasenkomposita zwischen Wortsyntax und Lexikon. Zeitschrift für Sprachwissenschaft 22, 153–188 (2003)
Meibauer, J.: How marginal are phrasal compounds? Generalized insertion, expressivity, and I/Q-interaction. Morphology 17, 233–259 (2007)
Müller, T.: General methods for fine-grained morphological and syntactic disambiguation. Ph.D. thesis, LMU Munich (2015)
Olsen, S.: Composition. In: Müller, P.O., Ohnheiser, I., Olsen, S., Rainer, F. (eds.) Word-formation. An International Handbook of the Languages of Europe, II: Units and Processes in Word-formation I: General Aspects, vol. 1, pp. 364–386. De Gruyter Mouton, Berlin/Boston (2015)
Ortner, L., Müller-Bollhagen, E.: Substantivkomposita. Deutsche Wortbildung: Typen und Tendenzen in der Gegenwartssprache, Schwann (1991)
Particke, H.J.: Phrasenkomposita: eine morphosyntaktische Beschreibung und Korpusstudie am Beispiel des Deutschen. Diplomica-Verlag, Hamburg (2015)
Peterson, J.L.: Computer programs for detecting and correcting spelling errors. Commun. ACM 23(12), 676–687 (1980)
Schlücker, B.: Die deutsche Kompositionsfreudigkeit. Übersicht und Einführung. In: Gaeta, L., Schlücker, B. (eds.) Das Deutsche als kompositionsfreudige Sprache. Strukturelle Eigenschaften und systembezogene Aspekte, pp. 1–25. de Gruyter (2012)
Schmid, H., Fitschen, A., Heid, U.: SMOR: a German computational morphology covering derivation, composition and inflection. In: Proceedings of LREC, pp. 233–259 (2004)
Steyer, K., Hein, K.: Satzwertige usuelle Wortverbindungen und gebrauchsbasierte Muster. In: Engelberg, S., Lobin, H., Steyer, K., Wolfer, S. (eds.) Wortschätze: Dynamik, Muster, Komplexität, Jahrbuch des Instituts für Deutsche Sprache 2017. de Gruyter (2018, to appear)
Trips, C.: The relevance of phrasal compounds for the architecture of grammar. In: ten Hacken, P. (ed.) The Semantics of Compounding, pp. 153–177. Oxford University Press (2016)
Trips, C., Kornfilt, J. (eds.): Phrasal compounds from a typological and theoretical perspective. Special issue of STUF. Language Typology and Universals (2015)

Author information

Authors and Affiliations

Zentrum Sprache, Berlin-Brandenburg Academy of Sciences, Jägerstraße 22/23, 10117, Berlin, Germany
Adrien Barbaresi
Lexical Department, Institute for the German Language, R5, 6-13, 68161, Mannheim, Germany
Katrin Hein
Academy Corpora, Austrian Academy of Sciences, Sonnenfelsgasse 19, 1010, Vienna, Austria
Adrien Barbaresi

Corresponding author

Correspondence to Adrien Barbaresi.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Kamil Ekštein
University of West Bohemia, Pilsen, Czech Republic
Václav Matoušek

About this paper

Cite this paper

Barbaresi, A., Hein, K. (2017). Data-Driven Identification of German Phrasal Compounds. In: Ekštein, K., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2017. Lecture Notes in Computer Science, vol 10415. Springer, Cham. https://doi.org/10.1007/978-3-319-64206-2_22

DOI: https://doi.org/10.1007/978-3-319-64206-2_22
Published: 29 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64205-5
Online ISBN: 978-3-319-64206-2
eBook Packages: Computer Science, Computer Science (R0)
