
TIMIT and NTIMIT Phone Recognition Using Convolutional Neural Networks

  • Cornelius Glackin
  • Julie Wall
  • Gérard Chollet
  • Nazim Dugan
  • Nigel Cannings
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11351)

Abstract

A novel application of convolutional neural networks to phone recognition is presented in this paper. Both the TIMIT and NTIMIT speech corpora have been employed. The phonetic transcriptions of these corpora were used to label spectrogram segments for training the convolutional neural network. A sliding window extracted fixed-size images from the spectrograms produced for the TIMIT and NTIMIT utterances, and these images were assigned to the appropriate phone class by parsing the TIMIT and NTIMIT phone transcriptions. The GoogLeNet convolutional neural network was implemented and trained using stochastic gradient descent with mini-batches. Post-training, phonetic rescoring was performed to map each phone set to the smaller standard set, i.e. the 61-phone set was mapped to the 39-phone set. Benchmark results on both datasets are presented for comparison with other state-of-the-art approaches. It is shown that this convolutional neural network approach is particularly robust to the network noise and distortion present in telephone speech, as demonstrated by the state-of-the-art benchmark results for NTIMIT.
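The segmentation and labeling pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the window and hop sizes, the frame-based representation, and the helper names are illustrative, and the 61-to-39 phone fold is shown only partially (a few entries of the standard folding, e.g. ao→aa).

```python
import numpy as np

# Partial 61-to-39 phone fold: a handful of the standard mappings, for
# illustration only -- the full fold covers all 61 TIMIT phones.
FOLDS = {"ao": "aa", "ax": "ah", "ix": "ih", "el": "l", "zh": "sh", "hv": "hh"}

def fold_phone(phone):
    """Map a 61-set phone label to its 39-set equivalent (identity if unfolded)."""
    return FOLDS.get(phone, phone)

def label_windows(spectrogram, phn_segments, win_frames=96, hop_frames=8):
    """Slide a fixed-size window over a spectrogram (freq_bins x frames) and
    label each window with the phone covering its centre frame.

    phn_segments: list of (start_frame, end_frame, phone) tuples, as parsed
    from a TIMIT/NTIMIT .PHN transcription (with sample indices converted to
    spectrogram frame indices).
    """
    n_frames = spectrogram.shape[1]
    examples = []
    for start in range(0, n_frames - win_frames + 1, hop_frames):
        centre = start + win_frames // 2
        # Assign the window to the phone whose interval contains its centre.
        phone = next((p for s, e, p in phn_segments if s <= centre < e), None)
        if phone is not None:
            examples.append((spectrogram[:, start:start + win_frames], phone))
    return examples
```

Each (image, label) pair would then be fed to the CNN; after training, predicted 61-set labels can be rescored with `fold_phone` before computing the 39-set error rate.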

Keywords

Phone recognition · Convolutional neural network · TIMIT · NTIMIT


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Cornelius Glackin (1)
  • Julie Wall (2)
  • Gérard Chollet (1)
  • Nazim Dugan (1)
  • Nigel Cannings (1)
  1. Intelligent Voice Ltd., London, UK
  2. University of East London, London, UK
