Abstract
In this paper, we describe and analyse the performance of a simple approach to the alignment of very long speech signals to acoustically inaccurate transcriptions, even when two different languages are employed. The alignment algorithm operates on two phonetic sequences: the first is automatically extracted from the speech signal by means of a phone decoder, and the second is obtained from the reference text by means of a multilingual grapheme-to-phoneme transcriber. The proposed algorithm is compared to a widely known state-of-the-art alignment procedure based on word-level speech recognition. We present alignment accuracy results on two different datasets: (1) the 1997 English Hub4 database; and (2) a set of bilingual (Basque/Spanish) parliamentary sessions. In experiments on the Hub4 dataset, the proposed approach provided only slightly worse alignments than those reported for the state-of-the-art alignment procedure, but at a much lower computational cost and with far fewer resources. Moreover, if the resource to be aligned includes speech in two or more languages and speakers switch between them at any time, applying a speech recognizer becomes infeasible in practice, whereas our approach can still be applied with very competitive performance at no additional cost.
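The abstract does not spell out the alignment procedure itself. As a minimal sketch only, assuming a standard global edit-distance alignment (Needleman-Wunsch style) with unit costs, the following Python shows how a decoded phone sequence could be aligned to a reference phone sequence and how time stamps could then be carried over to the reference side. The names `align_phones` and `transfer_times`, the cost values, and the per-phone start times are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Tuple

# (index into decoded, index into reference); None marks a gap on that side.
Pair = Tuple[Optional[int], Optional[int]]


def align_phones(decoded: List[str], reference: List[str],
                 sub_cost: int = 1, gap_cost: int = 1) -> List[Pair]:
    """Globally align two phone sequences by minimum edit cost (assumed unit costs)."""
    n, m = len(decoded), len(reference)
    # cost[i][j] = cheapest alignment of decoded[:i] with reference[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap_cost
    for j in range(1, m + 1):
        cost[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if decoded[i - 1] == reference[j - 1] else sub_cost
            cost[i][j] = min(cost[i - 1][j - 1] + match,   # match or substitution
                             cost[i - 1][j] + gap_cost,    # extra decoded phone
                             cost[i][j - 1] + gap_cost)    # extra reference phone

    # Trace back from the bottom-right corner to recover one optimal path.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            match = 0 if decoded[i - 1] == reference[j - 1] else sub_cost
            if cost[i][j] == cost[i - 1][j - 1] + match:
                pairs.append((i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + gap_cost:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


def transfer_times(pairs: List[Pair], decoded_times: List[float]) -> Dict[int, float]:
    """Give each paired reference phone the start time of its decoded partner."""
    return {j: decoded_times[i] for i, j in pairs
            if i is not None and j is not None}


if __name__ == "__main__":
    decoded = ["a", "b", "c"]      # hypothetical phone decoder output
    reference = ["a", "c"]         # hypothetical grapheme-to-phoneme output
    pairs = align_phones(decoded, reference)
    print(pairs)
    print(transfer_times(pairs, [0.0, 0.1, 0.2]))
```

This sketch stores the full cost matrix, so memory grows with the product of the two sequence lengths. For recordings that are hours long, a linear-space formulation of the same dynamic program would be needed in practice.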
This work has been supported by the University of the Basque Country, under grant GIU10/18 and project US11/06, by the Government of the Basque Country, under program SAIOTEK (project S-PE11UN065), and by the Spanish MICINN, under Plan Nacional de I+D+i (project TIN2009-07446, partially financed by FEDER funds).
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bordel, G., Penagarikano, M., Rodríguez-Fuentes, L.J., Fernández, M.A.V. (2012). Aligning Very Long Speech Signals to Bilingual Transcriptions of Parliamentary Sessions. In: Torre Toledano, D., et al. Advances in Speech and Language Technologies for Iberian Languages. Communications in Computer and Information Science, vol 328. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35292-8_8
DOI: https://doi.org/10.1007/978-3-642-35292-8_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35291-1
Online ISBN: 978-3-642-35292-8
eBook Packages: Computer Science (R0)