Vietnamese Punctuation Prediction Using Deep Neural Networks

  • Conference paper
SOFSEM 2020: Theory and Practice of Computer Science (SOFSEM 2020)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12011)

Abstract

Adding appropriate punctuation marks to text is an essential step in speech-to-text systems, where such information is usually not available. While this problem has been studied extensively for English, there is no large-scale dataset or comprehensive study of punctuation prediction for the Vietnamese language. In this paper, we collect two massive datasets and conduct a benchmark with both traditional methods and deep neural networks. We aim to publish both our data and all implementation code to facilitate further research, not only in Vietnamese punctuation prediction but also in other related fields. Our project, including datasets and implementation details, is publicly available at https://github.com/BinhMisfit/vietnamese-punctuation-prediction.
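
As an illustration of the task described above, the following is a minimal sketch of one common way to frame punctuation prediction: each word is tagged with the punctuation mark that follows it, and the model sees only the unpunctuated words. This is an assumption for illustration, not the authors' pipeline; the label names, the `to_labeled_tokens` helper, and the example sentence are all hypothetical and do not come from the paper or its repository.

```python
import re
import unicodedata

# Hypothetical label set for illustration; the paper's actual
# label inventory is not stated in this abstract.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def to_labeled_tokens(text):
    """Turn punctuated text into (word, label) pairs.

    The label names the punctuation mark that follows the word,
    or "O" when no mark follows it.
    """
    # NFC keeps Vietnamese letters with diacritics as single code points,
    # so the \w pattern below treats each syllable as one token.
    text = unicodedata.normalize("NFC", text).lower()
    tokens = re.findall(r"\w+|[^\w\s]", text)
    pairs = []
    for tok in tokens:
        if tok in PUNCT_LABELS:
            # Attach the mark to the preceding word, if there is one.
            if pairs:
                word, _ = pairs[-1]
                pairs[-1] = (word, PUNCT_LABELS[tok])
        elif re.fullmatch(r"\w+", tok):
            pairs.append((tok, "O"))
        # Any other symbol is dropped in this sketch.
    return pairs


example = to_labeled_tokens("Hôm nay trời đẹp, phải không?")
for word, label in example:
    print(word, label)
```

A sequence model would then be trained to predict the label column from the word column alone, which mirrors the speech-to-text setting where punctuation is not available in the input.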

Notes

  1. https://vnexpress.net/khoa-hoc/dai-duong-can-thiet-voi-su-song-tren-trai-dat-the-nao-3976195.html

  2. https://gacsach.com/tac-gia/nguyen-nhat-anh.html

  3. https://baomoi.com

  4. https://fasttext.cc/

  5. https://taku910.github.io/crfpp/

  6. https://www.tensorflow.org/

Acknowledgement

We would like to thank the National Foundation for Science and Technology Development (NAFOSTED), University of Science, Inspectorio Research Lab, and AISIA Research Lab for supporting us throughout this work.

Author information

Corresponding author

Correspondence to Binh Nguyen.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Pham, T., Nguyen, N., Pham, Q., Cao, H., Nguyen, B. (2020). Vietnamese Punctuation Prediction Using Deep Neural Networks. In: Chatzigeorgiou, A., et al. SOFSEM 2020: Theory and Practice of Computer Science. SOFSEM 2020. Lecture Notes in Computer Science, vol 12011. Springer, Cham. https://doi.org/10.1007/978-3-030-38919-2_32

  • DOI: https://doi.org/10.1007/978-3-030-38919-2_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-38918-5

  • Online ISBN: 978-3-030-38919-2

  • eBook Packages: Computer Science, Computer Science (R0)
