Weighted fast sequential DTW for multilingual audio Query-by-Example retrieval

Journal of Intelligent Information Systems

Abstract

This paper examines multilingual audio Query-by-Example (QbE) retrieval using the posteriorgram-based Phonetic Unit Modelling (PUM) approach and the Weighted Fast Sequential Dynamic Time Warping (WFSDTW) algorithm. The PUM approach employs phone recognizers trained in a supervised way on language-specific external resources, so information about the phonetic distribution is embedded in the acoustic modelling process. The resulting acoustic models were also used for language-independent QbE retrieval. The improved WFSDTW algorithm was implemented to perform retrieval for each query (keyword) within a particular utterance file. The main focus is on measuring the retrieval performance of the proposed WFSDTW solution, which employs posteriorgram-based keyword matching with Gaussian mixture modelling (GMM). Score normalization and fusion of four different language-dependent sub-systems were carried out using a simple max-score merging strategy. The results show that the proposed WFSDTW solution performs somewhat better than the two other evaluated techniques, namely the basic DTW and segmental DTW algorithms. The combination of multiple PUM techniques with WFSDTW also proved to be an effective solution for the QbE task.
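To make the pipeline named in the abstract more concrete, the following is a minimal Python sketch of posteriorgram matching with basic DTW followed by max-score fusion. It is not the WFSDTW algorithm, whose weighting and sequential search are not described on this page. The negative-log inner-product frame distance, the z-score normalization, and all function names (`frame_distance`, `dtw_cost`, `z_normalise`, `max_score_fusion`) are assumptions chosen for illustration, not details taken from the paper.

```python
import math


def frame_distance(p, q, eps=1e-10):
    """Negative log of the inner product of two posterior vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(p, q))
    return -math.log(max(dot, eps))


def dtw_cost(query, utterance):
    """Length-normalised basic DTW cost between two posteriorgrams.

    Each posteriorgram is a list of per-frame posterior vectors.
    A lower cost means the two sequences are more similar.
    """
    n, m = len(query), len(utterance)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(query[i - 1], utterance[j - 1])
            # Standard step pattern: vertical, horizontal, diagonal.
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m] / (n + m)


def z_normalise(scores):
    """Zero-mean, unit-variance normalisation (an assumed normalisation scheme)."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var) or 1.0  # guard against constant score lists
    return [(s - mean) / std for s in scores]


def max_score_fusion(per_system_scores):
    """Fuse sub-system scores by taking the maximum normalised score per utterance.

    per_system_scores holds one list per sub-system, each with one score
    per utterance, where a higher score means a better match.
    """
    normalised = [z_normalise(s) for s in per_system_scores]
    return [max(column) for column in zip(*normalised)]


# Toy posteriorgrams over two phonetic classes.
query = [[0.9, 0.1], [0.1, 0.9]]
same = dtw_cost(query, query)
different = dtw_cost(query, [[0.1, 0.9], [0.9, 0.1]])

# Two hypothetical sub-systems scoring the same three utterances.
fused = max_score_fusion([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]])
```

In this sketch a DTW cost would have to be negated (or otherwise inverted) before fusion, since `max_score_fusion` assumes that higher scores indicate better matches.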

Acknowledgements

The research presented in this paper was supported by the Slovak Research and Development Agency under the research project APVV-15-0517 and by the Ministry of Education, Science, Research and Sport of the Slovak Republic under the project VEGA 1/0511/17.

Author information

Corresponding authors

Correspondence to Jozef Vavrek or Matúš Pleva.

Cite this article

Vavrek, J., Viszlay, P., Lojka, M. et al. Weighted fast sequential DTW for multilingual audio Query-by-Example retrieval. J Intell Inf Syst 51, 439–455 (2018). https://doi.org/10.1007/s10844-018-0499-2

  • DOI: https://doi.org/10.1007/s10844-018-0499-2
