Aligning Very Long Speech Signals to Bilingual Transcriptions of Parliamentary Sessions

Bordel, Germán; Penagarikano, Mikel; Rodríguez-Fuentes, Luis Javier; Fernández, María Amparo Varona

doi:10.1007/978-3-642-35292-8_8


Part of the book series: Communications in Computer and Information Science (CCIS, volume 328)


Abstract

In this paper, we describe and analyse the performance of a simple approach to aligning very long speech signals to acoustically inaccurate transcriptions, even when two different languages are used. The alignment algorithm operates on two phonetic sequences: the first is automatically extracted from the speech signal by means of a phone decoder, and the second is obtained from the reference text by means of a multilingual grapheme-to-phoneme transcriber. The proposed algorithm is compared to a widely known state-of-the-art alignment procedure based on word-level speech recognition. We present alignment accuracy results on two different datasets: (1) the 1997 English Hub4 database; and (2) a set of bilingual (Basque/Spanish) parliamentary sessions. In experiments on the Hub4 dataset, the proposed approach provided only slightly worse alignments than those reported for the state-of-the-art alignment procedure, but at a much lower computational cost and with far fewer resources. Moreover, if the resource to be aligned includes speech in two or more languages and speakers switch between them at any time, applying a speech recognizer becomes infeasible in practice, whereas our approach can still be applied with very competitive performance at no additional cost.
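The abstract does not spell out how the two phonetic sequences are matched, so the following is only an illustrative sketch. It assumes a standard global alignment by dynamic programming (Needleman-Wunsch style) with unit match, mismatch, and gap scores. The function names `align_phones` and `reference_times`, the scoring values, and the toy phone strings are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Each pair is (index into decoded, index into reference); None marks a gap.
Pair = Tuple[Optional[int], Optional[int]]


def align_phones(decoded: Sequence[str], reference: Sequence[str],
                 match: int = 1, mismatch: int = -1, gap: int = -1) -> List[Pair]:
    """Globally align two phone sequences (assumed unit scores, not the paper's)."""
    n, m = len(decoded), len(reference)
    # score[i][j] = best score aligning decoded[:i] with reference[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if decoded[i - 1] == reference[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # pair the two phones
                              score[i - 1][j] + gap,      # decoded phone unmatched
                              score[i][j - 1] + gap)      # reference phone unmatched
    # Trace back from the bottom-right corner, preferring diagonal moves.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if decoded[i - 1] == reference[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                pairs.append((i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


def reference_times(pairs: List[Pair], decoded_times: Sequence[float]) -> Dict[int, float]:
    """Copy the decoder's start time onto every reference phone that was paired."""
    return {r: decoded_times[d] for d, r in pairs if d is not None and r is not None}


if __name__ == "__main__":
    decoded = ["k", "a", "a"]         # phones from a (hypothetical) phone decoder
    reference = ["k", "a", "t", "a"]  # phones from a (hypothetical) G2P transcriber
    pairs = align_phones(decoded, reference)
    print(pairs)
    print(reference_times(pairs, [0.0, 0.1, 0.3]))
```

This version stores the full score table, so its memory use grows with the product of the two sequence lengths. For recordings lasting hours, a linear-space formulation of the same alignment (for example, Hirschberg's divide-and-conquer scheme) would be needed in practice.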

This work has been supported by the University of the Basque Country, under grant GIU10/18 and project US11/06, by the Government of the Basque Country, under program SAIOTEK (project S-PE11UN065), and the Spanish MICINN, under Plan Nacional de I+D+i (project TIN2009-07446, partially financed by FEDER funds).



Author information

Authors and Affiliations

GTTS, Department of Electricity and Electronics, ZTF/FCT, University of the Basque Country UPV/EHU, Barrio Sarriena, 48940, Leioa, Spain
Germán Bordel, Mikel Penagarikano, Luis Javier Rodríguez-Fuentes & María Amparo Varona Fernández


Editor information

Editors and Affiliations

Escuela Politecnica Superior, Universidad Autonoma de Madrid, C/ Francisco, Tomas y Valiente 11, 28049, Madrid, Spain
Doroteo Torre Toledano
Centro Politécnico Superior, Edificio Ada Byron, C/ María de Luna nº 1, 50018, Zaragoza, Spain
Alfonso Ortega Giménez
Universidade de Aveiro, Campus Universitário Aveiro, 3810-193, Aveiro, Portugal
António Teixeira
Escuela Politecnica Superior, Universidad Autonoma de Madrid, C/ Francisco, Tomas y Valiente 11, 28049, Madrid, Spain
Joaquín González Rodríguez
E.T.S.I.Telecomunicacion, Universidad Politécnica de Madrid, Ciudad Universitaria s/n, 28040, Madrid, Spain
Luis Hernández Gómez & Rubén San Segundo Hernández
Escuela Politecnica Superior, Universidad Autonoma de Madrid, C/ Francisco, Tomas y Valiente 11, 28049, Madrid, Spain
Daniel Ramos Castro


About this paper

Cite this paper

Bordel, G., Penagarikano, M., Rodríguez-Fuentes, L.J., Fernández, M.A.V. (2012). Aligning Very Long Speech Signals to Bilingual Transcriptions of Parliamentary Sessions. In: Torre Toledano, D., et al. Advances in Speech and Language Technologies for Iberian Languages. Communications in Computer and Information Science, vol 328. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35292-8_8


DOI: https://doi.org/10.1007/978-3-642-35292-8_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35291-1
Online ISBN: 978-3-642-35292-8
eBook Packages: Computer Science, Computer Science (R0)
