Journal of Intelligent Information Systems, Volume 51, Issue 2, pp 439–455

Weighted fast sequential DTW for multilingual audio Query-by-Example retrieval

  • Jozef Vavrek
  • Peter Viszlay
  • Martin Lojka
  • Jozef Juhár
  • Matúš Pleva


This paper examines multilingual audio Query-by-Example (QbE) retrieval, utilizing the posteriorgram-based Phonetic Unit Modelling (PUM) approach and the Weighted Fast Sequential Dynamic Time Warping (WFSDTW) algorithm. The PUM approach employs phone recognizers trained on language-specific external resources in a supervised way; the information about the phonetic distribution is thus embedded in the process of acoustic modelling. The resulting acoustic models were also used for language-independent QbE retrieval. The improved WFSDTW algorithm was implemented to perform retrieval for each query (keyword) within a particular utterance file. The main focus is the retrieval performance of the proposed WFSDTW solution, which employs posteriorgram-based keyword matching with Gaussian mixture modelling (GMM). Score normalization and fusion of four language-dependent sub-systems were carried out using a simple max-score merging strategy. The results show that the proposed WFSDTW solution outperforms two other evaluated techniques, namely the basic DTW and segmental DTW algorithms. The combination of multiple PUM techniques with WFSDTW also proved to be an effective solution for the QbE task.
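The core matching step described above can be illustrated with a minimal sketch: a basic DTW over posteriorgram frames (using the common negative-log inner-product local distance) followed by max-score fusion across sub-systems. This is an illustrative simplification, not the paper's weighted WFSDTW recursion; all function names are hypothetical.

```python
import numpy as np

def dtw_posteriorgram_score(query, utterance):
    """Match a query posteriorgram (Q x K) against an utterance
    posteriorgram (U x K) with plain DTW; lower score = better match.
    Sketch only: the paper's WFSDTW adds weighting and sequential search."""
    eps = 1e-10
    # Negative log of the frame-wise inner product of posterior vectors,
    # a standard local distance for posteriorgram matching.
    dist = -np.log(np.maximum(query @ utterance.T, eps))  # (Q, U)

    Q, U = dist.shape
    acc = np.full((Q + 1, U + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Q + 1):
        for j in range(1, U + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # vertical step
                acc[i, j - 1],      # horizontal step
                acc[i - 1, j - 1],  # diagonal step
            )
    # Path-length normalization so scores are comparable across queries.
    return acc[Q, U] / (Q + U)

def max_score_fusion(scores_per_system):
    """Fuse already-normalized detection scores from several
    language-dependent sub-systems by keeping the per-detection maximum."""
    return np.max(np.stack(scores_per_system), axis=0)
```

A query posteriorgram matched against an identical utterance posteriorgram yields a (near-)zero cost along the diagonal path, while mismatched frame sequences accumulate large local distances.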


Query-by-Example retrieval · Phonetic unit modelling · Sequential dynamic time warping



The research presented in this paper was supported by the Slovak Research and Development Agency under the research project APVV-15-0517 and by the Ministry of Education, Science, Research and Sport of the Slovak Republic under the project VEGA 1/0511/17.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Electronics and Multimedia Communications, FEI, Technical University of Košice, Košice, Slovak Republic
