Vowel Disharmony in Czech Words and Stems

Milička, Jiří; Kalábová, Hana

doi:10.1007/978-3-319-98017-1_3

Chapter
First Online: 10 November 2018

Part of the book series: Quantitative Methods in the Humanities and Social Sciences (QMHSS)

Abstract

This corpus study describes vowel phonotactics in Czech words. The results suggest that Czech employs probabilistic patterns: some vowel combinations are overrepresented, while others are underrepresented. A syllable containing a short front vowel tends to be followed by a syllable with a long front vowel. A long front vowel is typically followed by a back vowel, and a long back vowel tends to be followed by a short vowel; together, these tendencies form a circular dissimilative pattern. The Shannonian theory of communication can help explain these phenomena. The analysis was performed on both words and word stems (i.e., words without endings), and the two yielded different results.

Notes

1.
Paradigm in the Kuhnian sense (Kuhn, 1962).
2.
Phonological Lexical Corpus, which is not a corpus in the traditional sense; it is a list of lexemes (available at http://www.ujc.cas.cz/phword) (Bičan, 2015c).
3.
Details on the Hungarian National Corpus and the data are available at http://corpus.nytud.hu/mnsz/index_eng.html (Oravecz, Váradi, & Sass, 2014).
4.
The full dataset for this study can be found at http://www.milicka.cz/kestazeni/vowels.zip.
5.
As you can see, abbreviations were not excluded from the corpus. This is why some of the rare and underrepresented vowel pairs are instantiated by abbreviations; otherwise, their frequency would be even lower.
6.
The black “smudge” near the /r/ vertex is a thick “loop edge.” This means that the /r/–/r/ pairs are really rare.
7.
The examples show that stem extraction produces some errors; for instance, ubrousek (‘napkin’) is stemmed as ubrous because of the alternations in the (diminutive) suffix –ek (e.g., obdélníček (‘little rectangle nom sg’) versus obdélníčku (‘little rectangle gen sg’)).
8.
The (l → é), (ú → ou), (é → ú), and (ú → é) pairs are so rare within stems that there is no example for them in the corpus; the (r → r) example is a long interjection.
9.
There might be other patterns that affect the word—its paradigm assignment, both diachronically and synchronically. Their effects might be even stronger than the effects of the phenomena under consideration, but this study does not focus on them.
10.
As the number of statistical units in our corpora is very large, even a small effect size yields statistically significant differences. For example, the SYN2010 corpus contains 14,328,194 short front–short front pairs out of 61,503,108 pairs in total. The corresponding figure for SYN2015 is 14,243,894 out of 60,963,320 pairs. According to Fisher’s test, the frequencies are significantly different (p < 0.001), but the real-life significance of the difference is quite low: the 95% confidence interval of the risk ratio lies between 0.9964 and 0.9977 (calculated according to Altman, 1990), which is very close to 1, i.e., the relative frequency of the specified vowel pair is nearly identical in the two corpora.
11.
Here, we mean entropy in the Shannonian sense, i.e., \( H = -\sum_{a \in A} f(a) \log_2 f(a) \), where A is the set of all vowels in the language system and f(a) is the relative frequency of vowel a. If the phonotactics are not taken into account, then the entropy of a vowel pair is simply double the entropy of a single vowel.
12.
The entropy of the vowel pair is calculated like the entropy of a single vowel, i.e., \( H = -\sum_{a \in A} \sum_{b \in A} f(a; b) \log_2 f(a; b) \), where A is the set of all vowels in the language system.
13.
Admittedly, this principle belongs to the generativist linguistic framework rather than to corpus or quantitative linguistics, as it was developed to describe one of the possible transformations of “deep structure” into “surface structure.” It is nonetheless worth noting that even the generativist descriptions suggest that the phenomenon of Czech vowel disharmony is not an isolated linguistic process.
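
The figures in notes 10 to 12 can be checked numerically. The sketch below is not the authors' code: it assumes that the risk-ratio interval in note 10 was obtained with the standard normal approximation on the log scale (our reading of the reference to Altman, 1990), and it does not reproduce Fisher's test. The function names `risk_ratio_ci` and `entropy_bits` and the toy vowel counts are hypothetical, chosen only for illustration; the four pair counts are the ones quoted in note 10.

```python
import math
from collections import Counter


def risk_ratio_ci(a, n1, c, n2, z=1.96):
    """Risk ratio (a/n1) / (c/n2) with a log-scale normal-approximation CI.

    Assumption: this is the standard log method; the source only says the
    interval was "calculated according to Altman, 1990".
    """
    rr = (a / n1) / (c / n2)
    se_log = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


def entropy_bits(counts):
    """Shannon entropy H = -sum f(x) log2 f(x) of a table of counts."""
    total = sum(counts.values())
    return -sum(
        (n / total) * math.log2(n / total) for n in counts.values() if n > 0
    )


# Note 10: short front-short front pairs in SYN2010 versus SYN2015.
rr, lo, hi = risk_ratio_ci(14_328_194, 61_503_108, 14_243_894, 60_963_320)
print(f"risk ratio = {rr:.4f}, 95% CI = ({lo:.4f}, {hi:.4f})")

# Notes 11-12: toy vowel counts (hypothetical, not corpus data).
vowels = Counter({"a": 2, "e": 1, "i": 1})

# Pair counts built as if the two vowels were independent,
# i.e. with phonotactics ignored.
independent_pairs = Counter(
    {(x, y): vowels[x] * vowels[y] for x in vowels for y in vowels}
)

h_single = entropy_bits(vowels)
h_pair = entropy_bits(independent_pairs)
print(f"H(vowel) = {h_single} bits, H(pair) = {h_pair} bits")
```

Under these assumptions the interval should land close to the one reported in note 10, and the independent toy pairs have exactly twice the single-vowel entropy, as note 11 states. With real pair counts, any shortfall of the pair entropy below that doubled value would reflect phonotactic dependence between neighbouring vowels.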

References

Altman, D. G. (1990). Practical statistics for medical research. Cleveland, OH: CRC Press.
Altmann, G. (1980). Prolegomena to Menzerath’s law. In R. Grotjahn (Ed.), Glottometrika 2 (pp. 1–10). Bochum, Germany: Brockmeyer.
Anderson, L. B. (1980). Using asymmetrical and gradient data in the study of vowel harmony. In R. M. Vago (Ed.), Issues in vowel harmony (pp. 271–340). Amsterdam, The Netherlands: John Benjamins.
Bičan, A. (2011). Phonotactics of Czech. Ph.D. thesis, Masaryk University, Brno, Czech Republic. Retrieved October 12, 2017, from https://theses.cz/id/eguqrt
Bičan, A. (2015a). Distribution of vocalic quantity in Czech. Grazer Linguistische Studien, 83, 133–138.
Bičan, A. (2015b). Corpus-based analysis of the Czech syllable. In E. Gutiérrez Rubio (Ed.), Beiträge der Europäischen Slavistischen Linguistik (POLYSLAV) 18 (pp. 26–36). Munich, Germany: Harrassowitz Verlag.
Bičan, A. (2015c). Fonologický lexikální korpus češtiny a slabičná struktura českého slova [Phonological Lexical Corpus of Czech Language and the Syllabic Structure of Czech Words]. Bohemica Olomucensia, 7(3-4), 45–59.
Čermák, F., Doležalová-Spoustová, D., Hlaváčová, J., Hnátková, M., Jelínek, T., Kocek, J., et al. (2005). SYN2005: žánrově vyvážený korpus psané češtiny [SYN 2005: Genre-Balanced Corpus of Written Czech]. Praha, Czech Republic: Ústav Českého národního korpusu FF UK. Retrieved October 12, 2017, from http://www.korpus.cz
Cvrček, V., Čermáková, A., & Křen, M. (2016). Nová koncepce synchronních korpusů psané češtiny [New Conception of the Synchronic Corpora of Written Czech]. Slovo a slovesnost, 77(2), 83–101.
Dankovičová, J. (1999). Czech. In Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet (pp. 70–74). Cambridge, UK: Cambridge University Press.
Goldsmith, J. (1976). Autosegmental phonology. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA.
Hnátková, M., Křen, M., Procházka, P., & Skoumalová, H. (2014). The SYN-series corpora of written Czech. Proceedings of the ninth international conference on Language Resources and Evaluation (LREC’14), 160–164.
Johnson, D. C. (1980). Regular disharmony in Kirghiz. In R. M. Vago (Ed.), Issues in vowel harmony (pp. 89–100). Amsterdam, The Netherlands: John Benjamins.
Křen, M., Bartoň, T., Cvrček, V., Hnátková, M., Jelínek, T., Kocek, J., Novotná, R., Petkevič, V., Procházka, P., Schmiedtová, V., & Skoumalová, H. (2010). SYN2010: žánrově vyvážený korpus psané češtiny [SYN 2010: Genre-Balanced Corpus of Written Czech]. Praha, Czech Republic: Ústav Českého národního korpusu FF UK. Retrieved October 12, 2017, from http://www.korpus.cz
Křen, M., Cvrček, V., Čapka, T., Čermáková, A., Hnátková, M., Chlumská, L., Jelínek, T., Kováříková, D., Petkevič, V., Procházka, P., Skoumalová, H., Škrabal, M., Truneček, P., Vondřička, P., & Zasina, A. J. (2015). SYN2015: reprezentativní korpus psané češtiny [SYN 2015: Representative Corpus of Written Czech]. Praha, Czech Republic: Ústav Českého národního korpusu FF UK. Retrieved October 12, 2017, from http://www.korpus.cz
Křen, M., Cvrček, V., Čapka, T., Čermáková, A., Hnátková, M., Chlumská, L., Jelínek, T., Kováříková, D., Petkevič, V., Procházka, P., Skoumalová, H., Škrabal, M., Truneček, P., Vondřička, P., & Zasina, A. (2016). SYN2015: Representative corpus of contemporary written Czech. Proceedings of the tenth international conference on Language Resources and Evaluation (LREC’16), 2522–2528.
Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago: University of Chicago Press.
Leben, W. R. (1973). Suprasegmental phonology. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA.
MacKay, D. J. C. (2003). Information theory, inference and learning algorithms. Cambridge, UK: Cambridge University Press.
McCarthy, J. J. (1986). OCP effects: Gemination and antigemination. Linguistic Inquiry, 17(2), 207–263.
Menzerath, P. (1928). Über einige phonetische Probleme. In Actes du premier Congrès International de Linguistes. Leiden, Netherlands: Sijthoff.
Milička, J. (2016). Teorie komunikace jakožto explanatorní princip přirozené víceúrovňové segmentace textů [The Theory of Communication as an Explanatory Principle for Natural Multilevel Text Segmentation]. Ph.D. thesis, Charles University, Prague, Czech Republic. Retrieved October 12, 2017, from https://is.cuni.cz/webapps/zzp/detail/104810
Nguyen, N., & Fagyal, Z. (2008). Acoustic aspects of vowel harmony in French. Journal of Phonetics, 36(1), 1–27.
Ohala, J. J. (1994). Towards a universal, phonetically-based, theory of vowel harmony. Third international conference on spoken language processing, 491–494.
Oravecz, C., Váradi, T., & Sass, B. (2014). The Hungarian Gigaword Corpus. In Proceedings of LREC 2014. http://www.lrec-conf.org/proceedings/lrec2014/pdf/681_Paper.pdf
Palková, Z. (1994). Fonetika a fonologie češtiny [Phonetics and Phonology of Czech]. Praha, Czech Republic: Karolinum.
Petkevič, V. (2014). Problémy automatické morfologické disambiguace češtiny [Problems of Automated Disambiguation of Czech Morphology]. Naše řeč, 97, 194–207.
Poldauf, I. (1969). Máme v češtině harmonii samohlásek? [Do We Have Vowel Harmony in Czech?]. Naše řeč, 52, 201–209.
Ringen, C. O., & Kontra, M. (1989). Hungarian neutral vowels. Lingua, 78(2-3), 181–191.
Rounds, C. (2001). Hungarian: An essential grammar. Hove, UK: Psychology Press.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423.
Suomi, K., McQueen, J. M., & Cutler, A. (1997). Vowel harmony and speech segmentation in Finnish. Journal of Memory and Language, 36(3), 422–444.
Vago, R. M. (1976). Theoretical implications of Hungarian vowel harmony. Linguistic Inquiry, 7(2), 243–263.

Acknowledgments

This study was written within the programme Progres Q08 Czech National Corpus implemented at the Faculty of Arts, Charles University. We would like to thank Václav Cvrček and Masako Ueda Fidler (the editors of this volume), Alžběta Růžičková, Jakub Sláma, and Sadie Gold-Shapiro for their suggestions and comments.

Author information

Authors and Affiliations

Institute of Comparative Linguistics, Faculty of Arts, Charles University, Nam. Jana Palacha 2, Praha 1, Czech Republic
Jiří Milička & Hana Kalábová

Editor information

Editors and Affiliations

Department of Slavic Studies, Brown University, Providence, RI, USA
Masako Fidler
Institute of the Czech National Corpus, Charles University, Prague 1, Czech Republic
Václav Cvrček

Cite this chapter

Milička, J., Kalábová, H. (2018). Vowel Disharmony in Czech Words and Stems. In: Fidler, M., Cvrček, V. (eds) Taming the Corpus. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-98017-1_3

DOI: https://doi.org/10.1007/978-3-319-98017-1_3
Published: 10 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98016-4
Online ISBN: 978-3-319-98017-1
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)