Abstract
This corpus study describes vowel phonotactics in Czech words. The results suggest that some probabilistic patterns are employed in Czech: some vowel combinations are overrepresented, while others are underrepresented. A syllable containing a short front vowel tends to be followed by a syllable with a long front vowel. A long front vowel is typically followed by a back vowel and a long back vowel tends to be followed by a short vowel; thus, an interesting circular dissimilative pattern can be observed. An explanation of the phenomena can be facilitated by the Shannonian theory of communication. The analysis was performed both on words and on word stems (i.e., words without endings), and the two analyses yielded different results.
Notes
- 1.
Paradigm in the Kuhnian sense (Kuhn, 1962).
- 2.
Phonological Lexical Corpus, which is not a corpus in the traditional sense; it is a list of lexemes (available at http://www.ujc.cas.cz/phword) (Bičan, 2015c).
- 3.
Details on the Hungarian National Corpus and the data are available at http://corpus.nytud.hu/mnsz/index_eng.html (Oravecz, Váradi, & Sass, 2014).
- 4.
The full dataset for this study can be found at http://www.milicka.cz/kestazeni/vowels.zip.
- 5.
As you can see, abbreviations were not excluded from the corpus. This is why some of the rare and underrepresented vowel pairs are instantiated by abbreviations; otherwise, their frequency would be even lower.
- 6.
The black “smudge” near the /r/ vertex is a thick “loop edge.” This means that the /r/–/r/ pairs are really rare.
- 7.
The examples show that there are some errors in stem extraction; namely, ubrousek (‘napkin’) is stemmed as ubrous because of the alternations in the (diminutive) suffix –ek (e.g., obdélníček (‘little rectangle nom sg’) versus obdélníčku (‘little rectangle gen sg’)).
- 8.
The (l → é), (ú → ou), (é → ú), and (ú → é) pairs are so rare within stems that there is no example for them in the corpus; the (r → r) example is a long interjection.
- 9.
There might be other patterns that affect the word—its paradigm assignment, both diachronically and synchronically. Their effects might be even stronger than the effects of the phenomena under consideration, but this study does not focus on them.
- 10.
As the number of statistical units in our corpora is very large, even a small effect size yields statistically significant differences. For example, the SYN2010 corpus contains 14,328,194 short front–short front pairs out of 61,503,108 pairs in total. The corresponding figure for SYN2015 is 14,243,894 out of 60,963,320 pairs. According to Fisher’s test, the frequencies are significantly different (p < 0.001), but the real-life significance of the difference is quite low: the 95% confidence interval of the risk ratio lies between 0.9964 and 0.9977 (calculated according to Altman, 1990), which is very close to 1, i.e., the relative frequency of the specified vowel pair is nearly identical in the two corpora.
- 11.
Here, we mean entropy in the Shannonian sense, i.e., \( H = -\sum_{a \in A} f(a) \log_2 f(a) \), where A is the set of all vowels in the language system and f(a) is the relative frequency of the vowel a. If the phonotactics are not taken into account, then the entropy of a vowel pair is simply twice the entropy of a single vowel.
- 12.
The entropy of the vowel pair is calculated like the entropy of a single vowel, i.e., \( H = -\sum_{a \in A} \sum_{b \in A} f(a;b) \log_2 f(a;b) \), where A is the set of all vowels in the language system.
- 13.
Admittedly, this principle belongs to the generativist linguistic framework rather than corpus or quantitative linguistics, as it was developed to describe one of the possible transformations of “deep structure” into “surface structure.” However, it is nonetheless worth noting that even the generativist descriptions suggest that the phenomenon of Czech vowel disharmony is not an isolated linguistic process.
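The calculations described in notes 10 to 12 can be illustrated with a short Python sketch. It is not the authors' code. The risk-ratio interval assumes the standard log-transform method for a ratio of two proportions, which may differ in detail from the calculation in Altman (1990). The function names and the toy vowel pairs are hypothetical; only the four counts passed to `risk_ratio_ci` come from note 10.

```python
import math
from collections import Counter


def risk_ratio_ci(a, n1, c, n2, z=1.96):
    """Risk ratio (a/n1)/(c/n2) with a log-transform confidence interval.

    Assumption: standard error of log(RR) = sqrt(1/a - 1/n1 + 1/c - 1/n2).
    """
    rr = (a / n1) / (c / n2)
    se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi


def entropy(counts):
    """Shannon entropy (in bits) of the relative frequencies in `counts`."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values() if n > 0)


# Note 10: short front-short front pairs in SYN2010 vs. SYN2015.
rr, lo, hi = risk_ratio_ci(14_328_194, 61_503_108, 14_243_894, 60_963_320)
print(f"risk ratio = {rr:.4f}, 95% CI = ({lo:.4f}, {hi:.4f})")

# Notes 11-12: hypothetical toy data, NOT taken from the corpora.
# Mixed pairs (a-i, i-a) are deliberately more frequent than identical ones.
toy_pairs = [("a", "i")] * 3 + [("i", "a")] * 3 + [("a", "a"), ("i", "i")]

single_counts = Counter(v for pair in toy_pairs for v in pair)
pair_counts = Counter(toy_pairs)

single_h = entropy(single_counts)  # entropy of a single vowel (note 11)
pair_h = entropy(pair_counts)      # entropy of a vowel pair (note 12)
print(f"2 * H(vowel) = {2 * single_h:.3f} bits, H(pair) = {pair_h:.3f} bits")
```

With the counts from note 10, the interval should land close to the one reported there. In the toy data, the pair entropy falls below twice the single-vowel entropy because the pairs are not independent, which is the kind of gap that phonotactic patterns introduce.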
References
Altman, D. G. (1990). Practical statistics for medical research. Cleveland, OH: CRC Press.
Altmann, G. (1980). Prolegomena to Menzerath’s law. In R. Grotjahn (Ed.), Glottometrika 2 (pp. 1–10). Bochum, Germany: Brockmeyer.
Anderson, L. B. (1980). Using asymmetrical and gradient data in the study of vowel harmony. In R. M. Vago (Ed.), Issues in vowel harmony (pp. 271–340). Amsterdam, The Netherlands: John Benjamins.
Bičan, A. (2011). Phonotactics of Czech. Ph.D. thesis, Masaryk University, Brno, Czech Republic. Retrieved October 12, 2017, from https://theses.cz/id/eguqrt
Bičan, A. (2015a). Distribution of vocalic quantity in Czech. Grazer Linguistische Studien, 83, 133–138.
Bičan, A. (2015b). Corpus-based analysis of the Czech syllable. In E. Gutiérrez Rubio (Ed.), Beiträge der Europäischen Slavistischen Linguistik (POLYSLAV) 18 (pp. 26–36). Munich, Germany: Harrassowitz Verlag.
Bičan, A. (2015c). Fonologický lexikální korpus češtiny a slabičná struktura českého slova [Phonological Lexical Corpus of Czech Language and the Syllabic Structure of Czech Words]. Bohemica Olomucensia, 7(3-4), 45–59.
Čermák, F., Doležalová-Spoustová, D., Hlaváčová, J., Hnátková, M., Jelínek, T., Kocek, J., et al. (2005). SYN2005: žánrově vyvážený korpus psané češtiny [SYN 2005: Genre-Balanced Corpus of Written Czech]. Praha, Czech Republic: Ústav Českého národního korpusu FF UK. Retrieved October 12, 2017, from http://www.korpus.cz
Cvrček, V., Čermáková, A., & Křen, M. (2016). Nová koncepce synchronních korpusů psané češtiny [New Conception of the Synchronic Corpora of Written Czech]. Slovo a slovesnost, 77(2), 83–101.
Dankovičová, J. (1999). Czech. In Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet (pp. 70–74). Cambridge, UK: Cambridge University Press.
Goldsmith, J. (1976). Autosegmental phonology. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA.
Hnátková, M., Křen, M., Procházka, P., & Skoumalová, H. (2014). The SYN-series corpora of written Czech. Proceedings of the ninth international conference on Language Resources and Evaluation (LREC'14), 160–164.
Johnson, D. C. (1980). Regular disharmony in Kirghiz. In R. M. Vago (Ed.), Issues in vowel harmony (pp. 89–100). Amsterdam, The Netherlands: John Benjamins.
Křen, M., Cvrček, V., Čapka, T., Čermáková, A., Hnátková, M., Chlumská, L., Jelínek, T., Kováříková, D., Petkevič, V., Procházka, P., Skoumalová, H., Škrabal, M., Truneček, P., Vondřička, P., & Zasina, A. (2016). SYN2015: Representative corpus of contemporary written Czech. Proceedings of the tenth international conference on Language Resources and Evaluation (LREC'16), 2522–2528.
Křen, M., Bartoň, T., Cvrček, V., Hnátková, M., Jelínek, T., Kocek, J., Novotná, R., Petkevič, V., Procházka, P., Schmiedtová, V., & Skoumalová, H. (2010). SYN2010: žánrově vyvážený korpus psané češtiny [SYN 2010: Genre-Balanced Corpus of Written Czech]. Praha, Czech Republic: Ústav Českého národního korpusu FF UK. Retrieved October 12, 2017, from http://www.korpus.cz
Křen, M., Cvrček, V., Čapka, T., Čermáková, A., Hnátková, M., Chlumská, L., Jelínek, T., Kováříková, D., Petkevič, V., Procházka, P., Skoumalová, H., Škrabal, M., Truneček, P., Vondřička, P., & Zasina, A. J. (2015). SYN2015: reprezentativní korpus psané češtiny [SYN 2015: Representative Corpus of Written Czech]. Praha, Czech Republic: Ústav Českého národního korpusu FF UK. Retrieved October 12, 2017, from http://www.korpus.cz
Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago: University of Chicago Press.
Leben, W. R. (1973). Suprasegmental phonology. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA.
MacKay, D. J. C. (2003). Information theory, inference and learning algorithms. Cambridge, UK: Cambridge University Press.
McCarthy, J. J. (1986). OCP effects: Gemination and antigemination. Linguistic Inquiry, 17(2), 207–263.
Menzerath, P. (1928). Über einige phonetische Probleme. In Actes du premier Congres International de Linguistes. Leiden, Netherlands: Sijthoff.
Milička, J. (2016). Teorie komunikace jakožto explanatorní princip přirozené víceúrovňové segmentace textů [The Theory of Communication as an Explanatory Principle for Natural Multilevel Text Segmentation]. Ph.D. thesis, Charles University, Prague, Czech Republic. Retrieved October 12, 2017, from https://is.cuni.cz/webapps/zzp/detail/104810
Nguyen, N., & Fagyal, Z. (2008). Acoustic aspects of vowel harmony in French. Journal of Phonetics, 36(1), 1–27.
Ohala, J. J. (1994). Towards a universal, phonetically-based, theory of vowel harmony. Third international conference on spoken language processing, 491–494.
Oravecz, C., Váradi, T., & Sass, B. (2014). The Hungarian Gigaword Corpus. In Proceedings of LREC 2014. http://www.lrec-conf.org/proceedings/lrec2014/pdf/681_Paper.pdf
Palková, Z. (1994). Fonetika a fonologie češtiny [Phonetics and Phonology of Czech]. Praha, Czech Republic: Karolinum.
Petkevič, V. (2014). Problémy automatické morfologické disambiguace češtiny [Problems of Automated Disambiguation of Czech Morphology]. Naše řeč, 97, 194–207.
Poldauf, I. (1969). Máme v češtině harmonii samohlásek? [Do We Have Vowel Harmony in Czech?]. Naše řeč, 52, 201–209.
Ringen, C. O., & Kontra, M. (1989). Hungarian neutral vowels. Lingua, 78(2-3), 181–191.
Rounds, C. (2001). Hungarian: An essential grammar. Hove, UK: Psychology Press.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423.
Suomi, K., McQueen, J. M., & Cutler, A. (1997). Vowel harmony and speech segmentation in Finnish. Journal of Memory and Language, 36(3), 422–444.
Vago, R. M. (1976). Theoretical implications of Hungarian vowel harmony. Linguistic Inquiry, 7(2), 243–263.
Acknowledgments
This study was written within the programme Progres Q08 Czech National Corpus implemented at the Faculty of Arts, Charles University. We would like to thank Václav Cvrček and Masako Ueda Fidler (the editors of this volume), Alžběta Růžičková, Jakub Sláma, and Sadie Gold-Shapiro for their suggestions and comments.
Copyright information
© 2018 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Milička, J., Kalábová, H. (2018). Vowel Disharmony in Czech Words and Stems. In: Fidler, M., Cvrček, V. (eds) Taming the Corpus. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-98017-1_3
DOI: https://doi.org/10.1007/978-3-319-98017-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98016-4
Online ISBN: 978-3-319-98017-1
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)