International Journal of Speech Technology, Volume 13, Issue 4, pp 219–230

Dual stream speech recognition using articulatory syllable models



Abstract

Recent theoretical developments in neuroscience suggest that sublexical speech processing occurs via two parallel processing pathways. According to this Dual Stream Model of Speech Processing, speech is processed both as a sequence of speech sounds and as a sequence of articulations. We attempt to revise the “beads-on-a-string” paradigm of Hidden Markov Models (HMMs) in Automatic Speech Recognition (ASR) by implementing a system for dual stream speech recognition. A baseline recognition system is enhanced by modeling articulations as sequences of syllables. An efficient and complementary model to HMMs is developed by formulating Dynamic Time Warping (DTW) as a probabilistic model. This DTW Model (DTWM) is improved by enriching the syllable templates with constrained covariance matrices, data imputation, clustering, and mixture modeling. The resulting dual stream system is evaluated on the N-Best Southern Dutch Broadcast News benchmark. Promising results are obtained in DTWM classification and ASR tests. We conclude with a discussion of the remaining problems in implementing dual stream speech recognition.
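The abstract does not spell out how DTW is recast as a probabilistic model. As background only, here is a minimal sketch of classic template-vs-query DTW; the Euclidean frame distance, the symmetric step pattern, and the diagonal-covariance Gaussian local score are illustrative assumptions, not the authors' DTWM.

```python
import numpy as np

def dtw_cost(template: np.ndarray, query: np.ndarray) -> float:
    """Cumulative cost of the best monotonic DTW alignment.

    template: (T, D) array of template frames (e.g. MFCC vectors)
    query:    (Q, D) array of query frames
    """
    T, Q = len(template), len(query)
    # Local distance: Euclidean norm between every frame pair, shape (T, Q).
    dist = np.linalg.norm(template[:, None, :] - query[None, :, :], axis=-1)

    # Dynamic-programming accumulation over a symmetric step pattern.
    acc = np.full((T + 1, Q + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, Q + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # template frame repeated
                acc[i, j - 1],      # query frame skipped past
                acc[i - 1, j - 1],  # diagonal match
            )
    return acc[T, Q]

def neg_log_gauss(frame: np.ndarray, mean: np.ndarray, var: np.ndarray) -> float:
    # Diagonal-covariance Gaussian negative log-likelihood of one frame.
    # Hypothetical illustration: substituting this for the Euclidean local
    # distance turns the accumulated DTW cost into a Viterbi-style negative
    # log-probability, one plausible route from DTW to a probabilistic model.
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (frame - mean) ** 2 / var)
```

Under this reading, a syllable template becomes a sequence of Gaussians rather than raw frames, and the constrained covariance matrices mentioned in the abstract would correspond to restrictions on `var`.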


Keywords: Syllabification · DTW · DTWM · Syllable · Articulatory · Dual stream model · Speech recognition





Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. ESAT, Katholieke Universiteit Leuven, Leuven, Belgium
