Emilia: a speech corpus for Argentine Spanish text to speech synthesis

  • Humberto M. Torres (corresponding author)
  • Jorge A. Gurlekian
  • Diego A. Evin
  • Christian G. Cossio Mercado
Original Paper

Abstract

This paper introduces Emilia, a speech corpus created to build a female voice for Buenos Aires Spanish in the Aromo text-to-speech system. Aromo is a unit-selection text-to-speech system that uses diphones as synthesis units. The key design requirement for Emilia was to synthesize any Spanish text into high-quality speech with a minimum corpus size. The text corpus was designed to guarantee phonetic and prosodic coverage, following a three-stage strategy. In the first stage, 741 sentences were designed to include all the syllables of Spanish as spoken in Argentina, stressed and unstressed, in all positions within the word. In the second stage, 852 sentences were added to balance the distribution of diphones. In the third and final stage, after a perceptual evaluation of the quality of the synthesized speech, 625 sentences were added to achieve the specified unit coverage and to introduce sentences with more complex syntactic and prosodic structures. Issues from all three corpus-building stages are reported. The paper also presents the results of the perceptual quality evaluations of the synthesized voice. Emilia has a duration of three hours and 15 minutes; the quality of speech synthesized from it with the Aromo system is similar to that obtained with commercial systems, with a real-time ratio of less than one.
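
As a rough illustration of what designing a text corpus for diphone coverage can involve, the sketch below picks sentences until a target set of diphones is covered. The abstract does not state which selection procedure the authors used; the greedy strategy, the function names `diphones` and `greedy_select`, the `#` silence marker, and the toy phone transcriptions are all assumptions made for this example.

```python
from typing import Dict, List, Set


def diphones(phones: List[str]) -> Set[str]:
    """Return the set of adjacent phone pairs (diphones) in one utterance."""
    padded = ["#"] + phones + ["#"]  # '#' is an assumed utterance-edge silence marker
    return {f"{a}-{b}" for a, b in zip(padded, padded[1:])}


def greedy_select(candidates: Dict[str, List[str]], target: Set[str]) -> List[str]:
    """Pick sentences until no candidate adds a still-missing target diphone."""
    missing = set(target)
    remaining = dict(candidates)
    chosen: List[str] = []
    while missing and remaining:
        # Take the sentence that covers the most still-missing diphones.
        best = max(remaining, key=lambda s: len(diphones(remaining[s]) & missing))
        gain = diphones(remaining[best]) & missing
        if not gain:
            break  # no remaining sentence helps; the target cannot be completed
        chosen.append(best)
        missing -= gain
        del remaining[best]
    return chosen


# Hypothetical candidate sentences with toy phone transcriptions.
candidates = {
    "s1": ["k", "a", "s", "a"],
    "s2": ["m", "e", "s", "a"],
    "s3": ["s", "a"],
}
target = diphones(candidates["s1"]) | diphones(candidates["s2"])
selected = greedy_select(candidates, target)
print(selected)
```

In this toy run the third sentence is never selected, because it contributes no diphone that the target still lacks. A real design would also have to track stress, syllable position, and prosodic context, which the abstract names as coverage criteria and which this sketch ignores.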

Keywords

Speech corpus design; Text-to-speech; Argentine Spanish; Phonetic corpus; Phonetic transcription

Notes

Acknowledgements

The authors would like to thank the anonymous reviewers for their insightful feedback. This research was supported by Ministerio de Ciencia y Tecnología and Consejo Nacional de Investigaciones Científicas y Técnicas, Argentina.

Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. Laboratorio de Investigaciones Sensoriales, INIGEM, CONICET-UBA, Buenos Aires, Argentina
  2. Center for Research and Transfer in Acoustics (CINTRA), UTN-FRC UA CONICET, Córdoba Capital, Argentina
  3. Departamento de Computación, FCEN, UBA, Buenos Aires, Argentina