Sentence Pair Augmentation Approach for Grammatical Error Correction

Nagai, Ryoga; Maeda, Akira

doi:10.1007/978-981-19-8493-8_46

Part of the book series: Lecture Notes in Electrical Engineering ((LNEE,volume 984))

Abstract

Deep learning models require a large amount of data to learn a task. A sentence proofreading task requires pairs of texts before and after proofreading as training data. Most published text has already been proofread and is easy to obtain, whereas text before proofreading rarely appears in general publications and is very difficult to obtain. In this study, we consider the case where sufficient training data for a sentence proofreading task cannot be prepared, as with work procedure manuals. We propose a method that automatically generates both pre- and post-proofreading sentences. We generate pseudo post-proofread sentences with Markov chains or GPT-3. Sentences generated by Markov chains are often semantically incorrect, so we identify and remove them using a gated recurrent unit (GRU). We then generate pseudo pre-proofread sentences by adding noise to the pseudo post-proofread sentences using three different methods. In the experiments, we compare a seq2seq-based grammatical error correction method trained on a small corpus only with the same method trained on the small corpus plus the pseudo-sentence pairs generated in this study. One of our methods improved accuracy by 12.1% as measured by BLEU.
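The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a first-order (bigram) Markov chain over whitespace-separated tokens, it omits the GRU filtering step, and the two noise operations (`add_noise` deletes one token or swaps two adjacent tokens) are hypothetical stand-ins, since the abstract does not name the three noise methods actually used. All function names are illustrative.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # sentence boundary markers (illustrative)

def build_bigram_model(sentences):
    """Map each token to the list of tokens observed right after it."""
    model = defaultdict(list)
    for sentence in sentences:
        tokens = [START] + sentence.split() + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev].append(nxt)
    return model

def generate_sentence(model, rng, max_len=30):
    """Random-walk the chain from START until END or max_len tokens."""
    token, out = START, []
    while len(out) < max_len:
        token = rng.choice(model[token])
        if token == END:
            break
        out.append(token)
    return out

def add_noise(tokens, rng):
    """Hypothetical noise: delete one token or swap two adjacent tokens."""
    noisy = list(tokens)
    if len(noisy) < 2:
        return noisy  # too short to corrupt safely
    i = rng.randrange(len(noisy) - 1)
    if rng.random() < 0.5:
        del noisy[i]
    else:
        noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy

def make_pairs(corpus, n, seed=0):
    """Return n (pseudo pre-proofread, pseudo post-proofread) token pairs.

    Assumes corpus is a non-empty list of sentences.
    """
    rng = random.Random(seed)
    model = build_bigram_model(corpus)
    pairs = []
    for _ in range(n):
        clean = generate_sentence(model, rng)
        pairs.append((add_noise(clean, rng), clean))
    return pairs
```

For example, `make_pairs(["open the valve slowly", "close the valve before cleaning"], 3)` returns three pairs whose second element is a chain-generated sentence and whose first element is a corrupted copy of it. Real Japanese text would need a tokenizer in place of `str.split`, and the generated sentences would still have to be filtered for semantic correctness before being used as training data.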

Author information

Authors and Affiliations

Formerly Graduate School of Information Science and Engineering, Ritsumeikan University, 1-1-1 Noji-Higashi, Kusatsu, 525-8577, Shiga, Japan
Ryoga Nagai
College of Information Science and Engineering, Ritsumeikan University, 1-1-1 Noji-Higashi, Kusatsu, 525-8577, Shiga, Japan
Akira Maeda

Corresponding author

Correspondence to Akira Maeda.

Editor information

Editors and Affiliations

Department of Mechanical Engineering, MCKV Institute of Engineering, Howrah, West Bengal, India
Prasenjit Chatterjee
Department of Operations Research and Statistics, Faculty of Organizational Sciences, University of Belgrade, Belgrade, Serbia
Dragan Pamucar
Universidad Internacional de Valencia, Valencia, Spain
Morteza Yazdani
Department of Mechanical Engineering, National Institute of Technology Kurukshetra, Kurukshetra, Haryana, India
Dilbagh Panchal

About this paper

Cite this paper

Nagai, R., Maeda, A. (2023). Sentence Pair Augmentation Approach for Grammatical Error Correction. In: Chatterjee, P., Pamucar, D., Yazdani, M., Panchal, D. (eds) Computational Intelligence for Engineering and Management Applications. Lecture Notes in Electrical Engineering, vol 984. Springer, Singapore. https://doi.org/10.1007/978-981-19-8493-8_46

DOI: https://doi.org/10.1007/978-981-19-8493-8_46
Published: 30 April 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-8492-1
Online ISBN: 978-981-19-8493-8
eBook Packages: Intelligent Technologies and Robotics (R0)
