International Journal of Speech Technology, Volume 18, Issue 2, pp 217–230

Minimum data generation for Telugu speech recognition

  • K. V. N. Sunitha
  • A. Sharada


A morphologically rich language has hundreds of forms of each word, which makes storing and maintaining them costly in time and resources. This abundance of forms also causes confusion during recognition, which raises the word error rate. These issues make it difficult to build speech recognition applications for such languages, so there is a need for a phonetically balanced minimal dataset. This paper describes the generation of a minimum dataset for Telugu, the second most widely spoken language in India. Treating minimum data generation as a set covering problem, a variety of datasets are generated based on different criteria. Among the available set covering algorithms, the greedy algorithm is chosen, and the criterion used for final data selection is the frequency of occurrence of words. Because set covering requires a large pool of data from which the minimum data is selected, a 15-million-word text corpus has been created. This text corpus is analysed thoroughly to ensure that the generated set is phonetically balanced. The generated minimum dataset consists of 21 words and covers every phoneme of the Telugu language. Telugu speech technology researchers can use this dataset to build phoneme-level speech recognition applications with less manual recording effort and time. The paper discusses the role of the minimum dataset in large vocabulary speech recognition (LVSR) systems, the details of the text corpus created, and the proposed algorithm for minimum data generation.
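To make the set covering formulation concrete, the following is a minimal sketch of a generic greedy cover that uses word frequency as a tie-breaker. It is an illustration under stated assumptions and not the authors' published algorithm: the function name `greedy_cover`, the placeholder words `w1` to `w4`, their phoneme sets, and their frequencies are all hypothetical and are not taken from the paper's corpus.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def greedy_cover(
    words: Dict[str, Tuple[FrozenSet[str], int]],
    targets: Set[str],
) -> List[str]:
    """Greedily pick words until every coverable target phoneme is covered.

    `words` maps each word to (its phoneme set, its corpus frequency).
    Each step picks the word that covers the most still-uncovered phonemes.
    Ties go to the more frequent word, then to the alphabetically first
    word, so the result is deterministic.
    """
    uncovered = set(targets)
    chosen: List[str] = []
    while uncovered:
        best = None
        best_key = (0, 0)
        for word in sorted(words):
            phonemes, freq = words[word]
            gain = len(phonemes & uncovered)
            key = (gain, freq)
            if gain > 0 and key > best_key:
                best, best_key = word, key
        if best is None:
            # No remaining word covers anything new; stop early.
            break
        chosen.append(best)
        uncovered -= words[best][0]
    return chosen


# Hypothetical toy data: placeholder words, phoneme symbols, frequencies.
toy_words = {
    "w1": (frozenset({"a", "k", "m"}), 50),
    "w2": (frozenset({"a", "k", "t"}), 90),
    "w3": (frozenset({"m", "u"}), 10),
    "w4": (frozenset({"t", "u"}), 70),
}
toy_targets = {"a", "k", "m", "t", "u"}

print(greedy_cover(toy_words, toy_targets))  # ['w2', 'w3']
```

In this toy run, `w1` and `w2` both cover three phonemes in the first step, and the higher frequency of `w2` decides the tie. Greedy selection gives a small cover but does not guarantee the smallest possible one.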


Keywords: Minimum data generation; Set covering problem; Phonetically balanced data; Text corpus



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. BVRIT Hyderabad College of Engineering for Women, Hyderabad, India
  2. G. Narayanamma Institute of Technology & Science for Women, Hyderabad, India
