Abstract
This paper investigates an approach for fast hybrid human and machine video subtitling based on lattice disambiguation and posterior model adaptation. The approach aims to correct Automatic Speech Recognition (ASR) transcriptions with minimal user effort and to make corrections practical on smartphones. It rests on three key concepts. Firstly, only a portion of the data is sent to the user for correction. Secondly, user action is limited to selecting from a fixed set of options extracted from the ASR word lattice. Thirdly, user feedback is used to update the ASR parameters and further enhance performance. To investigate the potential and limitations of this approach, we carry out experiments employing simulated and real user corrections of TED talk videos. Simulated corrections include both the true reference and the best combination of the options shown to the user. Real corrections are obtained from 30 editors through a special-purpose web interface that displays the options for small video segments. We analyze the fixed-option approach and the trade-off between adapting the model and increasing the amount of corrected data.
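As a rough illustration of the first two concepts and of the simulated "best combination of options" correction, the following minimal sketch may help. It is not the authors' implementation: the `Slot` structure, the confidence threshold, and the assumption that the reference is already aligned slot by slot with the ASR output are all hypothetical simplifications introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Slot:
    """One position in the transcript (hypothetical structure)."""
    options: List[str]   # candidate words taken from the lattice, best first
    confidence: float    # assumed posterior of the top option


def select_for_correction(slots: List[Slot], threshold: float = 0.8) -> List[int]:
    """Concept 1: only low-confidence slots are shown to the user."""
    return [i for i, s in enumerate(slots) if s.confidence < threshold]


def oracle_choice(slot: Slot, reference_word: str) -> str:
    """Concept 2, simulated: the user can only pick among the fixed options.

    If the reference word is not offered, the top ASR option is kept.
    """
    return reference_word if reference_word in slot.options else slot.options[0]


def correct(slots: List[Slot], reference: List[str], threshold: float = 0.8) -> List[str]:
    """Apply simulated corrections to the selected slots only."""
    chosen = [s.options[0] for s in slots]
    for i in select_for_correction(slots, threshold):
        chosen[i] = oracle_choice(slots[i], reference[i])
    return chosen


slots = [
    Slot(["the", "a"], 0.95),          # confident: never shown to the user
    Slot(["cat", "cut", "cap"], 0.4),  # reference word is among the options
    Slot(["sat", "set"], 0.5),         # reference word is not offered
]
reference = ["the", "cap", "sit"]
result = correct(slots, reference)
```

In this toy case the second slot is fixed because the reference word appears among the options, while the third keeps the ASR output because it does not, which is the kind of limitation of fixed options that the simulated experiments are designed to measure. The third concept, model adaptation from the selected corrections, is not covered by this sketch.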
Acknowledgements
This work has been partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) under reference UID/CEC/50021/2013 and grant number SFRH/BPD/68428/2010, and by the TRATAHI Portugal-CMU Project CMUP-EPB/TIC/0065/2013. Ângela Costa was supported by a Ph.D. fellowship from Fundação para a Ciência e a Tecnologia (SFRH/BD/85737/2012).
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Miranda, J. et al. (2016). Crowdsourced Video Subtitling with Adaptation Based on User-Corrected Lattices. In: Abad, A., et al. Advances in Speech and Language Technologies for Iberian Languages. IberSPEECH 2016. Lecture Notes in Computer Science, vol 10077. Springer, Cham. https://doi.org/10.1007/978-3-319-49169-1_14
DOI: https://doi.org/10.1007/978-3-319-49169-1_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49168-4
Online ISBN: 978-3-319-49169-1
eBook Packages: Computer Science (R0)