Learning Frame-Level Recurrent Neural Networks Representations for Query-by-Example Spoken Term Detection on Mobile Devices

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10970)

Abstract

Acoustic models (AMs) based on recurrent neural networks (RNNs) with long short-term memory (LSTM) have achieved state-of-the-art performance in large-vocabulary continuous speech recognition (LVCSR). Their strong ability to capture context information makes the acoustic features extracted from LSTMs more discriminative. Feature extraction, especially at the frame level, is also crucial to query-by-example spoken term detection (QbyE-STD). In this paper, we explore frame-level recurrent neural network representations for QbyE-STD that are more robust than the original features. In addition, the designed model is lightweight, which meets the small-footprint requirements of mobile devices. First, we use a traditional RAE to extract frame-level representations and a correspondence RAE to suppress non-semantic information. Then, we combine the two models to extract more discriminative features. Common tricks such as frame skipping are used to help the model learn more context information. Experiments and evaluations show that the proposed methods outperform conventional ones under the same computational requirements.
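To make the role of frame-level features concrete, the following is a minimal sketch of how such features might be used for matching. It is not the authors' implementation. It assumes that frame skipping means keeping every `step`-th frame, and that a query is matched against an utterance with subsequence dynamic time warping (DTW) under cosine distance, a common choice in QbyE-STD that the abstract itself does not specify. The function names `skip_frames`, `cosine_distance`, and `subsequence_dtw_cost` are hypothetical, and the toy 2-dimensional vectors stand in for learned representations.

```python
import math


def skip_frames(frames, step=2):
    """Keep every `step`-th frame (an assumed, simple frame-skipping scheme)."""
    return frames[::step]


def cosine_distance(a, b):
    """Cosine distance between two feature vectors; 1.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


def subsequence_dtw_cost(query, utterance):
    """Lowest query-length-normalised DTW cost of aligning `query` to any
    contiguous region of `utterance`. Both inputs are non-empty lists of
    frame-level feature vectors (assumption: same dimensionality)."""
    n, m = len(query), len(utterance)
    inf = float("inf")
    # Row 0 is all zeros so the alignment may start at any utterance frame.
    prev = [0.0] * (m + 1)
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            d = cosine_distance(query[i - 1], utterance[j - 1])
            # Allowed steps: vertical, horizontal, diagonal.
            curr[j] = d + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    # The alignment may also end at any utterance frame.
    return min(prev[1:]) / n


# Toy features: the query occurs in the middle of the utterance.
query = [[1.0, 0.0], [0.0, 1.0]]
utterance = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
match_cost = subsequence_dtw_cost(query, utterance)
```

A low cost indicates that some region of the utterance resembles the query. Under this sketch, more discriminative frame-level features would widen the gap between the costs of true and false matches, and frame skipping shortens both sequences, which reduces the quadratic cost of the alignment.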

Acknowledgements

This work is supported by the National Natural Science Foundation of China-Research Grants Council of Hong Kong (NSFC-RGC) joint fund (61531166002, N_CUHK404/15), the National High Technology Research and Development Program of China (2015AA016305), the National Social Science Foundation of China (13&ZD189), and the NSFC (61375027, 61433018).

Author information

Correspondence to Ziwei Zhu.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Zhu, Z., Wu, Z., Li, R., Ning, Y., Meng, H. (2018). Learning Frame-Level Recurrent Neural Networks Representations for Query-by-Example Spoken Term Detection on Mobile Devices. In: Aiello, M., Yang, Y., Zou, Y., Zhang, LJ. (eds) Artificial Intelligence and Mobile Services – AIMS 2018. AIMS 2018. Lecture Notes in Computer Science, vol 10970. Springer, Cham. https://doi.org/10.1007/978-3-319-94361-9_5

  • DOI: https://doi.org/10.1007/978-3-319-94361-9_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-94360-2

  • Online ISBN: 978-3-319-94361-9

  • eBook Packages: Computer Science, Computer Science (R0)
