Weighted fast sequential DTW for multilingual audio Query-by-Example retrieval
This paper examines multilingual audio Query-by-Example (QbE) retrieval using the posteriorgram-based Phonetic Unit Modelling (PUM) approach and the Weighted Fast Sequential Dynamic Time Warping (WFSDTW) algorithm. The PUM approach employs phone recognizers trained in a supervised way on language-specific external resources, so information about the phonetic distribution is embedded in the acoustic modelling process. The resulting acoustic models were also used for language-independent QbE retrieval. The improved WFSDTW algorithm was implemented to search for each query (keyword) within a particular utterance file. The main focus is on measuring the retrieval performance of the proposed WFSDTW solution, which employs posteriorgram-based keyword matching with Gaussian mixture modelling (GMM). Score normalization and fusion of four language-dependent sub-systems were carried out using a simple max-score merging strategy. The results indicate that the proposed WFSDTW solution has a certain advantage over the two other evaluated techniques, namely the basic DTW and segmental DTW algorithms. In addition, the combination of multiple PUM techniques with WFSDTW proved to be an effective solution for the QbE task.
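To make the matching and fusion steps more concrete, the following is a minimal sketch, not the WFSDTW algorithm itself. It shows a plain subsequence DTW between a query posteriorgram and an utterance posteriorgram, followed by max-score merging across sub-systems. Several choices here are assumptions that the abstract does not state: the frame distance (negative log of the inner product), the step pattern, normalization by query length, z-score normalization of the sub-system scores, and all function names and sub-system labels (`lang_a`, `lang_b`).

```python
import math


def frame_distance(p, q, eps=1e-10):
    """Distance between two posterior vectors: -log of their inner product.
    Assumed here; a common choice for posteriorgram matching."""
    dot = sum(a * b for a, b in zip(p, q))
    return -math.log(max(dot, eps))


def subsequence_dtw(query, utterance):
    """Align the whole query to any stretch of the utterance.

    query, utterance: lists of posterior vectors (one per frame).
    Returns (cost, end_frame): accumulated cost divided by the query
    length, and the utterance frame where the best match ends.
    """
    n, m = len(query), len(utterance)
    # First query frame may start at any utterance frame (free start).
    prev = [frame_distance(query[0], utterance[j]) for j in range(m)]
    for i in range(1, n):
        cur = [float("inf")] * m
        for j in range(m):
            best = prev[j]  # query advances, utterance frame repeats
            if j > 0:
                # diagonal step, or utterance advances alone
                best = min(best, prev[j - 1], cur[j - 1])
            cur[j] = frame_distance(query[i], utterance[j]) + best
        prev = cur
    end = min(range(m), key=lambda j: prev[j])  # free end
    return prev[end] / n, end


def z_normalize(scores):
    """Zero-mean, unit-variance normalization (assumed normalization scheme)."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var) or 1.0  # guard against constant score lists
    return [(s - mean) / std for s in scores]


def max_score_fusion(subsystem_scores):
    """Max-score merging: normalize each sub-system's scores, then keep
    the highest normalized score per utterance.

    subsystem_scores: dict mapping a sub-system label to a list of
    scores (higher = better), one score per utterance, same order.
    """
    normalized = [z_normalize(s) for s in subsystem_scores.values()]
    return [max(column) for column in zip(*normalized)]
```

Because DTW returns a cost (lower is better) while the fusion step expects scores (higher is better), the costs would be negated before merging, for example `max_score_fusion({"lang_a": [-c for c in costs_a], "lang_b": [-c for c in costs_b]})`.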
Keywords: Query-by-Example retrieving; Phonetic unit modeling; Sequential dynamic time warping
The research presented in this paper was supported by the Slovak Research and Development Agency under the research project APVV-15-0517 and by the Ministry of Education, Science, Research and Sport of the Slovak Republic under the project VEGA 1/0511/17.