Better Phoneme Recognisers Lead to Better Phoneme Posteriorgrams for Search on Speech? An Experimental Analysis

  • Conference paper
Advances in Speech and Language Technologies for Iberian Languages (IberSPEECH 2016)

Abstract

Phoneme posteriorgrams are widely used to represent speech in query-by-example search on speech. They are computed by obtaining the per-frame a posteriori probability of each unit of a phoneme recogniser, regardless of that recogniser's architecture. It is natural to assume that the higher the quality of the phone transcriptions generated by a phoneme recogniser, the higher the quality of the resulting phoneme posteriorgrams; however, to the best of our knowledge, no analysis exists that supports this statement. This paper investigates whether there is a correlation between the phone error rate of a recogniser and the maximum term weighted value obtained when performing query-by-example search on speech. Experiments on the Albayzin corpus of Spanish speech showed a slight correlation between these two metrics, which suggests that the quality of a phoneme posteriorgram representation is related to phone error rate, but that other factors also affect its performance in search on speech tasks.
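The two quantities discussed in the abstract can be illustrated with a minimal sketch. It assumes that the recogniser exposes one score per phonetic unit and frame and that a softmax turns those scores into posteriors, and it uses the Pearson coefficient as the correlation measure; neither choice is stated in the abstract. The function names and all numbers below are hypothetical placeholders, not values from the paper.

```python
import math


def posteriorgram(frame_scores):
    """Convert per-frame unit scores into per-frame posterior distributions.

    Assumption: a numerically stable softmax over the units of each frame.
    """
    result = []
    for scores in frame_scores:
        peak = max(scores)  # subtract the maximum to avoid overflow in exp
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores: two frames, three phonetic units each.
toy_scores = [[2.0, 0.5, -1.0], [0.1, 0.1, 3.0]]
toy_post = posteriorgram(toy_scores)

# Hypothetical, perfectly linear values chosen only to exercise the function.
hypothetical_per = [30.0, 25.0, 20.0, 15.0]
hypothetical_mtwv = [0.1, 0.2, 0.3, 0.4]
toy_r = pearson(hypothetical_per, hypothetical_mtwv)
```

Each row of `toy_post` is a probability distribution over the units of one frame. A negative coefficient would mean that lower phone error rate goes with higher maximum term weighted value; the abstract reports only a slight correlation on real systems.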

Notes

  1. The Spoken Term Detection (STD) 2006 Evaluation Plan, National Institute of Standards and Technology (NIST): http://www.itl.nist.gov/iad/mig/tests/std/2006/docs/std06-evalplan-v10.pdf.

Acknowledgements

This research was funded by the Spanish Government under the project TEC2015-65345-P, the Galician Government through the research contract GRC2014/024 (Modalidade: Grupos de Referencia Competitiva 2014) and AtlantTIC Project CN2012/160, and by the European Regional Development Fund (ERDF).

Author information

Corresponding author

Correspondence to Paula Lopez-Otero.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Lopez-Otero, P., Docio-Fernandez, L., Garcia-Mateo, C. (2016). Better Phoneme Recognisers Lead to Better Phoneme Posteriorgrams for Search on Speech? An Experimental Analysis. In: Abad, A., et al. Advances in Speech and Language Technologies for Iberian Languages. IberSPEECH 2016. Lecture Notes in Computer Science, vol 10077. Springer, Cham. https://doi.org/10.1007/978-3-319-49169-1_13

  • DOI: https://doi.org/10.1007/978-3-319-49169-1_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49168-4

  • Online ISBN: 978-3-319-49169-1

  • eBook Packages: Computer Science, Computer Science (R0)
