Attentional Parallel RNNs for Generating Punctuation in Transcribed Speech

  • Alp Öktem
  • Mireia Farrús
  • Leo Wanner
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10583)


Until very recently, punctuation marks for automatic speech recognition (ASR) output have mostly been generated from the syntactic structure of the recognized utterances. Prosodic cues such as breaks, speech rate, and pitch intonation, which influence the placement of punctuation marks in speech transcripts, have seldom been used. We propose a method that uses recurrent neural networks and takes both prosodic and lexical information into account to predict punctuation marks for raw ASR output. Our experiments show that an attention mechanism over parallel sequences of prosodic cues aligned with the transcribed speech improves the accuracy of punctuation generation.


Speech transcription, Recurrent neural networks, Prosody, Punctuation generation, Automatic speech recognition



We would like to thank Francesco Barbieri for offering his technical insights throughout this work. This work is part of the KRISTINA project, which has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under the Grant Agreement number H2020-RIA-645012. The second author is partially funded by the Spanish Ministry of Economy, Industry and Competitiveness through the Ramón y Cajal program.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Universitat Pompeu Fabra, Barcelona, Spain
  2. Catalan Institute for Research and Advanced Studies (ICREA), Barcelona, Spain
