CNN-Based Phonetic Segmentation Refinement with a Cross-Speaker Setup

Cuozzo, Luis Gustavo D.; Silva, Diego Augusto; Neto, Mario Uliani; Simões, Flávio Olmos; Nagle, Edson Jose

doi:10.1007/978-3-319-99722-3_45

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11122)

Included in the following conference series:

International Conference on Computational Processing of the Portuguese Language

Abstract

This work proposes a method to improve the performance of automatic phonetic alignment of speech data. The method uses a deep convolutional neural network (CNN), trained on a combination of acoustic features extracted from labeled data, to fine-tune the position of each boundary within a fixed-size window around its original position. The proposed method is robust to speaker identity: a system trained with enough labeled data can be used to fine-tune the alignment of any speech file, regardless of who the speaker is. With an absolute gain between 20% and 33% in a cross-speaker scenario, our results demonstrate the applicability of deep learning to this task.
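
To make the refinement idea concrete, the following is a minimal sketch of a windowed boundary search, not the authors' implementation. It assumes that a trained model can be wrapped as a scoring function that rates how likely a candidate time is to be the true boundary. The names `candidate_positions`, `refine_boundary`, and `toy_score`, the 30 ms half-window, and the 5 ms step are all hypothetical choices made for illustration; the abstract does not give these values.

```python
from typing import Callable, List


def candidate_positions(initial: float, half_window: float, step: float) -> List[float]:
    """Evenly spaced candidate times (seconds) in a fixed-size window
    centred on the boundary produced by the initial aligner."""
    n = int(round(half_window / step))
    return [initial + k * step for k in range(-n, n + 1)]


def refine_boundary(
    initial: float,
    score_fn: Callable[[float], float],
    half_window: float = 0.030,  # hypothetical window half-width, not from the paper
    step: float = 0.005,         # hypothetical candidate spacing, not from the paper
) -> float:
    """Return the candidate position with the highest score.

    `score_fn` stands in for a trained CNN that would score acoustic
    features extracted around each candidate time.
    """
    candidates = candidate_positions(initial, half_window, step)
    return max(candidates, key=score_fn)


def toy_score(t: float, true_boundary: float = 1.012) -> float:
    """Toy stand-in for the model: peaks at an assumed true boundary."""
    return -abs(t - true_boundary)


# The initial aligner placed the boundary at 1.000 s; the search moves it
# to the candidate closest to the assumed true position.
candidates = candidate_positions(1.000, 0.030, 0.005)
refined = refine_boundary(1.000, toy_score)
```

Because the search is restricted to the window, the refined boundary can never move further than `half_window` from the initial alignment, which is why the quality of the initial aligner still matters in a setup of this kind.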

Author information

Authors and Affiliations

CPqD, Campinas, SP, Brazil
Luis Gustavo D. Cuozzo, Diego Augusto Silva, Mario Uliani Neto, Flávio Olmos Simões & Edson Jose Nagle

Corresponding author

Correspondence to Diego Augusto Silva.

Editor information

Editors and Affiliations

Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
Aline Villavicencio
Instituto de Informática - UFRGS, Porto Alegre, Brazil
Viviane Moreira
INESC-ID, Lisbon, Portugal
Alberto Abad
UFSCAR, Sao Carlos, Brazil
Helena Caseli
Centro Singular de Investigación en Tecnoloxías, Universidade de Santiago de Compostela, Santiago de Compostela, La Coruña, Spain
Pablo Gamallo
Université de Toulon, Parc Scientifique Technologique Luminy, Marseille, France
Carlos Ramisch
Centro de Informática e Sistemas, Universidade de Coimbra, Coimbra, Portugal
Hugo Gonçalo Oliveira
Federal University of Technology, Dois Vizinhos, Paraná, Brazil
Gustavo Henrique Paetzold

About this paper

Cite this paper

Cuozzo, L.G.D., Silva, D.A., Neto, M.U., Simões, F.O., Nagle, E.J. (2018). CNN-Based Phonetic Segmentation Refinement with a Cross-Speaker Setup. In: Villavicencio, A., et al. Computational Processing of the Portuguese Language. PROPOR 2018. Lecture Notes in Computer Science, vol 11122. Springer, Cham. https://doi.org/10.1007/978-3-319-99722-3_45

DOI: https://doi.org/10.1007/978-3-319-99722-3_45
Published: 26 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99721-6
Online ISBN: 978-3-319-99722-3
eBook Packages: Computer Science, Computer Science (R0)
