Weighted fast sequential DTW for multilingual audio Query-by-Example retrieval

Vavrek, Jozef; Viszlay, Peter; Lojka, Martin; Juhár, Jozef; Pleva, Matúš

doi:10.1007/s10844-018-0499-2

Published: 19 February 2018

Journal of Intelligent Information Systems, Volume 51, pages 439–455 (2018)

Jozef Vavrek (ORCID: orcid.org/0000-0003-4380-0801)¹, Peter Viszlay¹, Martin Lojka¹, Jozef Juhár¹ & Matúš Pleva¹

Abstract

This paper examines multilingual audio Query-by-Example (QbE) retrieval using the posteriorgram-based Phonetic Unit Modelling (PUM) approach and the Weighted Fast Sequential Dynamic Time Warping (WFSDTW) algorithm. The PUM approach employs phone recognizers trained in a supervised way on language-specific external resources, so information about the phonetic distribution is embedded in the acoustic modelling process. The resulting acoustic models were also used for language-independent QbE retrieval. The improved WFSDTW algorithm was implemented to perform retrieval for each query (keyword) within a particular utterance file. The main focus is on measuring the retrieval performance of the proposed WFSDTW solution, which employs posteriorgram-based keyword matching with Gaussian mixture modelling (GMM). Score normalization and fusion of four different language-dependent sub-systems were carried out using a simple max-score merging strategy. The results show that the proposed WFSDTW solution has a certain advantage over the two other evaluated techniques, namely the basic DTW and segmental DTW algorithms. The combination of multiple PUM techniques with WFSDTW also proved to be an effective solution for the QbE task.
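
As a rough illustration of the pipeline the abstract outlines, the following minimal sketch shows basic DTW (one of the baselines named above, not the proposed WFSDTW) applied to posteriorgram frames, followed by max-score merging across sub-systems. Several details are assumptions made only for this example and are not taken from the paper: the frame distance (negative log of the inner product of two posterior vectors), the length normalization of the DTW cost, the use of z-normalization as the score normalization step, and all function names.

```python
import math


def frame_distance(p, q, eps=1e-10):
    # Assumed local distance: -log of the inner product of two posterior vectors.
    dot = sum(a * b for a, b in zip(p, q))
    return -math.log(max(dot, eps))


def dtw_cost(query, segment):
    # Basic DTW between two posteriorgrams (lists of posterior vectors).
    # Returns the accumulated cost of the best path, divided by n + m
    # (an assumed normalization so that costs of different lengths compare).
    n, m = len(query), len(segment)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(query[i - 1], segment[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m] / (n + m)


def z_normalize(scores):
    # Assumed score normalization: zero mean, unit variance per sub-system.
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var) or 1.0  # avoid division by zero for constant scores
    return [(s - mean) / std for s in scores]


def max_score_merge(per_system_scores):
    # per_system_scores: one list of candidate scores per sub-system,
    # with candidates in the same order in every list.
    normalized = [z_normalize(scores) for scores in per_system_scores]
    return [max(column) for column in zip(*normalized)]


# Toy two-class posteriorgrams: the query moves from class 0 to class 1.
query = [[0.9, 0.1], [0.1, 0.9]]
similar = [[0.8, 0.2], [0.8, 0.2], [0.2, 0.8]]
dissimilar = [[0.1, 0.9], [0.1, 0.9], [0.9, 0.1]]

# Lower DTW cost means a better match, so negate it to get a similarity score.
system_a = [-dtw_cost(query, similar), -dtw_cost(query, dissimilar)]
system_b = [0.4, 0.1]  # hypothetical scores from a second sub-system
merged = max_score_merge([system_a, system_b])
```

In this toy setup each candidate keeps the highest normalized score that any sub-system assigned to it, which is the essence of a max-score merging strategy; the actual weighting and sequential search of WFSDTW are described in the full article and are not reproduced here.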

Acknowledgements

The research presented in this paper was supported by the Slovak Research and Development Agency under the research project APVV-15-0517 and by the Ministry of Education, Science, Research and Sport of the Slovak Republic under the project VEGA 1/0511/17.

Author information

Authors and Affiliations

Department of Electronics and Multimedia Communications, FEI TU Košice, Technical University of Košice, Park Komenského 13, 041 20, Košice, Slovak Republic
Jozef Vavrek, Peter Viszlay, Martin Lojka, Jozef Juhár & Matúš Pleva

Corresponding authors

Correspondence to Jozef Vavrek or Matúš Pleva.

About this article

Cite this article

Vavrek, J., Viszlay, P., Lojka, M. et al. Weighted fast sequential DTW for multilingual audio Query-by-Example retrieval. J Intell Inf Syst 51, 439–455 (2018). https://doi.org/10.1007/s10844-018-0499-2

Received: 21 July 2017
Revised: 22 January 2018
Accepted: 22 January 2018
Published: 19 February 2018
Issue Date: October 2018
DOI: https://doi.org/10.1007/s10844-018-0499-2
