Sentence Pair Augmentation Approach for Grammatical Error Correction

  • Conference paper
Computational Intelligence for Engineering and Management Applications

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 984)

Abstract

Deep learning models require large amounts of data to learn a task. Training a sentence proofreading model requires pairs of text before and after proofreading. Proofread text is easy to obtain, since most available publications have already been proofread. Text before proofreading, however, rarely appears in general publications and is very difficult to obtain. In this study, we consider the case where sufficient training data for a sentence proofreading task cannot be prepared, as with work procedure manuals. We propose a method that automatically generates both pre- and post-proofreading sentences. We generate pseudo post-proofread sentences with Markov chains or GPT-3. Sentences generated by Markov chains are often semantically incorrect, so we identify and remove them with a gated recurrent unit (GRU). We then generate pseudo pre-proofread sentences by adding noise to the pseudo post-proofread sentences using three different methods. In the experiments, we compare a seq2seq-based grammatical error correction method trained on a small corpus only with the same method trained on the small corpus plus the pseudo-sentence pairs generated in this study. One of our methods improved accuracy by 12.1% as measured by BLEU.
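The pair-generation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a first-order (bigram) Markov chain over whitespace-separated tokens, and the three noise operations shown (token deletion, adjacent swap, duplication) are placeholders, since the abstract does not say which three methods the paper uses. The GRU filtering step and the GPT-3 variant are omitted, and all function names are hypothetical.

```python
import random
from collections import defaultdict

BOS, EOS = "<s>", "</s>"  # sentence boundary markers (assumed convention)


def build_bigram_chain(sentences):
    """Map each token to the list of tokens observed directly after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        tokens = [BOS] + sentence.split() + [EOS]
        for current, following in zip(tokens, tokens[1:]):
            chain[current].append(following)
    return chain


def generate_sentence(chain, rng, max_len=30):
    """Random-walk the chain from BOS until EOS or max_len tokens."""
    token, output = BOS, []
    while len(output) < max_len:
        token = rng.choice(chain[token])
        if token == EOS:
            break
        output.append(token)
    return " ".join(output)


def add_noise(sentence, rng):
    """Apply one randomly chosen token-level corruption (illustrative only)."""
    tokens = sentence.split()
    if len(tokens) < 2:
        return sentence  # too short to corrupt meaningfully
    i = rng.randrange(len(tokens) - 1)
    op = rng.choice(["delete", "swap", "duplicate"])
    if op == "delete":
        del tokens[i]
    elif op == "swap":
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    else:
        tokens.insert(i, tokens[i])
    return " ".join(tokens)


def make_pairs(corpus, n, seed=0):
    """Return n (pseudo pre-proofread, pseudo post-proofread) sentence pairs."""
    rng = random.Random(seed)
    chain = build_bigram_chain(corpus)
    pairs = []
    for _ in range(n):
        target = generate_sentence(chain, rng)
        pairs.append((add_noise(target, rng), target))
    return pairs


if __name__ == "__main__":
    small_corpus = [
        "open the valve slowly",
        "close the valve before inspection",
        "open the cover before inspection",
    ]
    for noisy, clean in make_pairs(small_corpus, 3):
        print(f"{noisy!r} -> {clean!r}")
```

Each pair holds the noised sentence as the source and the generated sentence as the target, which is the shape a seq2seq correction model would train on. Whitespace tokenization is only a convenience here; text without word spacing, such as Japanese, would need a tokenizer first. A chain this simple readily produces semantically odd sentences, which is the problem the abstract addresses with a GRU-based filter.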

Author information

Corresponding author

Correspondence to Akira Maeda.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Nagai, R., Maeda, A. (2023). Sentence Pair Augmentation Approach for Grammatical Error Correction. In: Chatterjee, P., Pamucar, D., Yazdani, M., Panchal, D. (eds) Computational Intelligence for Engineering and Management Applications. Lecture Notes in Electrical Engineering, vol 984. Springer, Singapore. https://doi.org/10.1007/978-981-19-8493-8_46
