A framework for efficient development of Slovenian written language resources used in speech processing applications
This paper presents a framework for the efficient development and representation of morphological and phonetic lexicons for use in speech technology applications. Developing such lexicons requires analyzing which solutions are most appropriate for building speech technologies for a specific language. The paper addresses the development of resources, good word coverage in general texts, efficient coding and representation of lexicons (with regard to time and memory space), and the integration of lexicons into speech processing applications. The construction process within the proposed framework is based on finite-state machines and heterogeneous relation graph structures. It significantly reduces the time and effort needed to construct large-scale lexicons, minimizes analysis errors, and represents the lexicons efficiently with regard to time and memory usage. The wordlist construction process presented in the paper also guarantees that the constructed lexicons achieve high word coverage in general texts. SIlex lexicons are large-scale phonetic and morphological lexicons for the Slovenian language, constructed within the new framework using a purpose-built toolset. They are valuable language resources for developing various speech processing applications for the Slovenian language.
Keywords: Written language resources; Morphology lexicon; Phonetic lexicon; Heterogeneous relation graphs (HRG); Finite-state machines (FSM); Slovenian language