A Dialectal Chinese Speech Recognition Framework
A framework for dialectal Chinese speech recognition is proposed and studied, in which a relatively small dialectal Chinese speech corpus (that is, Chinese influenced by the speaker's native dialect) and dialect-related knowledge are used to transform a standard Chinese (Putonghua, abbreviated PTH) speech recognizer into a dialectal Chinese speech recognizer. Two kinds of knowledge sources are explored: expert knowledge and a small dialectal Chinese corpus. These sources provide information at four levels: the phonetic level, the lexicon level, the language level, and the acoustic decoder level. The paper takes Wu dialectal Chinese (WDC) as the example target language. The goal is to build a WDC speech recognizer from an existing PTH speech recognizer, based on the Initial-Final structure of the Chinese language and on a study of how Wu dialect speakers speak Putonghua. The authors propose to use context-independent PTH-IF mappings (where an IF is either a Chinese Initial or a Chinese Final), context-independent WDC-IF mappings, and syllable-dependent WDC-IF mappings (obtained from either experts or data), combined with the supervised maximum likelihood linear regression (MLLR) acoustic model adaptation method. To limit the size of the multi-pronunciation lexicon introduced by the IF mappings, which might otherwise increase lexicon confusion and degrade performance, a Multi-Pronunciation Expansion (MPE) method based on the accumulated uni-gram probability (AUP) is proposed. In addition, some commonly used WDC words are selected and added to the lexicon. Compared with the original PTH speech recognizer, the resulting WDC speech recognizer achieves a 10–18% absolute Character Error Rate (CER) reduction when recognizing WDC, with only a 0.62% CER increase when recognizing PTH.
The proposed framework and methods are expected to work not only for Wu dialectal Chinese but also for other dialectal Chinese languages and even other languages.
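The two lexicon-level ideas in the abstract — generating dialect-influenced surface pronunciations via IF-mapping rules, and using accumulated uni-gram probability to restrict which words receive multi-pronunciation expansion — can be sketched as follows. This is an illustrative reading of the abstract only: the lexicon entries, mapping probabilities, threshold, and the helper functions `select_words_for_expansion` and `expand_pronunciations` are hypothetical, not the paper's actual implementation.

```python
from itertools import product

# Canonical PTH lexicon: word -> baseform pronunciation as a list of
# Initials/Finals (IFs). Entries here are illustrative.
lexicon = {
    "上海": ["sh", "ang", "h", "ai"],
    "石头": ["sh", "i", "t", "ou"],
}

# Word uni-gram probabilities (hypothetical values; in practice these
# would come from the language model's training text).
unigram = {"上海": 0.7, "石头": 0.3}

# Context-independent WDC-IF mapping rules: a PTH IF may surface as one
# of several dialect-influenced IFs, each with a probability. The
# retroflex-to-dental confusion shown here is a known trait of
# Wu-accented Putonghua; the numbers are made up.
if_mappings = {
    "sh": [("sh", 0.6), ("s", 0.4)],
}

def select_words_for_expansion(unigram, threshold=0.6):
    """Rank words by uni-gram probability and keep those whose accumulated
    probability first covers the threshold -- a sketch of restricting
    expansion to frequent words so the lexicon does not grow too large."""
    ranked = sorted(unigram.items(), key=lambda kv: kv[1], reverse=True)
    chosen, acc = [], 0.0
    for word, p in ranked:
        chosen.append(word)
        acc += p
        if acc >= threshold:
            break
    return chosen

def expand_pronunciations(baseform, mappings):
    """Apply IF mappings independently to each IF in the baseform and
    return every surface variant with its product probability."""
    alternatives = [mappings.get(u, [(u, 1.0)]) for u in baseform]
    variants = []
    for combo in product(*alternatives):
        units = tuple(u for u, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        variants.append((units, prob))
    return variants

# Expand only the selected high-frequency words; the rest keep one baseform.
for word in select_words_for_expansion(unigram):
    for units, prob in expand_pronunciations(lexicon[word], if_mappings):
        print(word, "-".join(units), round(prob, 3))
```

Restricting expansion to high-AUP words keeps the variant count bounded, since every additional pronunciation of a rare word adds confusability for little coverage gain.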
Keywords: dialectal Chinese speech recognition; initial or final (IF); IF-mapping rule; pronunciation modeling; small quantity of speech data