Learning Frame-Level Recurrent Neural Networks Representations for Query-by-Example Spoken Term Detection on Mobile Devices

Zhu, Ziwei; Wu, Zhiyong; Li, Runnan; Ning, Yishuang; Meng, Helen

doi:10.1007/978-3-319-94361-9_5

Conference paper
First Online: 21 June 2018

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10970)

Abstract

Acoustic models (AMs) based on recurrent neural networks (RNNs) with long short-term memory (LSTM) have achieved state-of-the-art performance in large-vocabulary continuous speech recognition (LVCSR). Their strong ability to capture context information makes acoustic features extracted from LSTMs more discriminative. Feature extraction, especially at the frame level, is also crucial to query-by-example spoken term detection (QbyE-STD). In this paper, we explore frame-level recurrent neural network representations for QbyE-STD that are more robust than the original features. In addition, the designed model is lightweight and meets the small-footprint requirements of mobile devices. First, we use a traditional recurrent autoencoder (RAE) to extract frame-level representations and a correspondence RAE to suppress non-semantic information. Then, we combine the two models to extract more discriminative features. Common tricks such as frame skipping are used to help the model learn more context information. Experiments and evaluations show that the proposed methods outperform conventional ones under the same computation requirements.
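As a rough illustration of two ideas the abstract touches on, the sketch below shows frame skipping and a frame-level comparison between a query and a candidate segment. It is not the authors' implementation. The helper names (skip_frames, cosine_distance, dtw_cost) are hypothetical, and the use of dynamic time warping (DTW) with cosine distance is an assumption about how frame-level features might be matched; the abstract does not state the matching procedure. The feature vectors stand in for whatever representation the RAE would produce.

```python
import math


def skip_frames(frames, step=2, offset=0):
    """Keep every `step`-th frame starting at `offset` (hypothetical helper)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return frames[offset::step]


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


def dtw_cost(query, segment):
    """Length-normalized DTW alignment cost between two frame sequences.

    Assumption: DTW is used here only as a common way to compare
    frame-level features; the abstract does not name a matcher.
    """
    n, m = len(query), len(segment)
    inf = float("inf")
    # acc[i][j] = best accumulated cost aligning query[:i] with segment[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(query[i - 1], segment[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m] / (n + m)


# Toy two-dimensional "features"; real inputs would be learned representations.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
subsampled = skip_frames(frames, step=2)
same = dtw_cost(frames, frames)
different = dtw_cost(frames, frames[::-1])
```

In this toy setup a lower cost means a closer match, so a sequence compared with itself scores lower than the same sequence compared with its reversed copy. Subsampling with a larger step shortens both sequences and therefore reduces the quadratic cost of the alignment, which is one plausible reason frame skipping suits a small-footprint setting.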

Acknowledgements

This work is supported by the National Natural Science Foundation of China-Research Grants Council of Hong Kong (NSFC-RGC) joint fund (61531166002, N_CUHK404/15), the National High Technology Research and Development Program of China (2015AA016305), the National Social Science Foundation of China (13&ZD189), and NSFC (61375027, 61433018).

Author information

Authors and Affiliations

Tsinghua-CUHK Joint Research Center for Media Sciences, Technologies and Systems, Graduate School at Shenzhen, Tsinghua University, Beijing, China
Ziwei Zhu, Zhiyong Wu, Runnan Li, Yishuang Ning & Helen Meng
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Sha Tin, Hong Kong
Helen Meng

Corresponding author

Correspondence to Ziwei Zhu.

Editor information

Editors and Affiliations

University of Stuttgart, Stuttgart, Germany
Marco Aiello
Tsinghua University, Beijing, China
Yujiu Yang
Peking University, Beijing, China
Yuexian Zou
Kingdee International Software Group Co., Ltd., Shenzhen, China
Liang-Jie Zhang

About this paper

Cite this paper

Zhu, Z., Wu, Z., Li, R., Ning, Y., Meng, H. (2018). Learning Frame-Level Recurrent Neural Networks Representations for Query-by-Example Spoken Term Detection on Mobile Devices. In: Aiello, M., Yang, Y., Zou, Y., Zhang, LJ. (eds) Artificial Intelligence and Mobile Services – AIMS 2018. AIMS 2018. Lecture Notes in Computer Science, vol 10970. Springer, Cham. https://doi.org/10.1007/978-3-319-94361-9_5

DOI: https://doi.org/10.1007/978-3-319-94361-9_5
Published: 21 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94360-2
Online ISBN: 978-3-319-94361-9
eBook Packages: Computer Science, Computer Science (R0)
