Abstract
Deep learning models require large amounts of training data. For sentence proofreading, the training data consist of pairs of texts before and after proofreading. Most published text has already been proofread and is easy to access, whereas text before proofreading rarely appears in general publications and is very difficult to obtain. In this study, we consider cases in which sufficient training data for a sentence proofreading task cannot be prepared, such as work procedure manuals. We propose a method that automatically generates both pre- and post-proofreading sentences. We generate pseudo post-proofread sentences with Markov chains or GPT-3. Because sentences generated by Markov chains are often semantically incorrect, we identify and remove such sentences using a gated recurrent unit (GRU). We then generate pseudo pre-proofread sentences by adding noise to the pseudo post-proofread sentences using three different methods. In the experiments, we compare a seq2seq-based grammatical error correction method trained on a small corpus only with the same method trained on the small corpus plus the pseudo-sentence pairs generated in this study. One of our methods improved accuracy by 12.1% as measured by BLEU.
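As a rough illustration of the pair-generation idea described in the abstract, the following minimal sketch builds a word-level bigram Markov chain from a small corpus, samples a pseudo post-proofread sentence from it, and derives a pseudo pre-proofread sentence by applying one random edit. It is not the authors' implementation. The whitespace tokenization, the bigram order, the three noise operations (delete, swap, duplicate), and all function names are assumptions made for this example; the abstract does not specify the three noise methods. The GRU-based filtering step and the GPT-3 variant are omitted.

```python
import random
from collections import defaultdict

BOS, EOS = "<s>", "</s>"  # hypothetical sentence boundary markers


def build_bigram_chain(sentences):
    """Map each token to the list of tokens observed right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        tokens = [BOS] + sentence.split() + [EOS]
        for current, following in zip(tokens, tokens[1:]):
            chain[current].append(following)
    return chain


def generate_sentence(chain, rng, max_len=20):
    """Random-walk the chain from BOS until EOS or max_len tokens."""
    token, output = BOS, []
    while len(output) < max_len:
        token = rng.choice(chain[token])
        if token == EOS:
            break
        output.append(token)
    return " ".join(output)


def add_noise(sentence, rng):
    """Apply one assumed word-level edit: delete, swap, or duplicate."""
    tokens = sentence.split()
    if len(tokens) < 2:
        return sentence  # too short to edit safely
    i = rng.randrange(len(tokens) - 1)
    op = rng.choice(["delete", "swap", "duplicate"])
    if op == "delete":
        del tokens[i]
    elif op == "swap":
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    else:
        tokens.insert(i, tokens[i])
    return " ".join(tokens)


def make_pairs(corpus, n_pairs, seed=0):
    """Return (pseudo pre-proofread, pseudo post-proofread) sentence pairs."""
    rng = random.Random(seed)
    chain = build_bigram_chain(corpus)
    pairs = []
    for _ in range(n_pairs):
        target = generate_sentence(chain, rng)
        pairs.append((add_noise(target, rng), target))
    return pairs
```

Calling `make_pairs(corpus, 1000)` on a non-empty list of sentences would yield pairs that could be appended to a small real corpus before training a seq2seq corrector. Because a bigram chain only sees one token of context, many sampled sentences will be locally fluent but globally meaningless, which is the problem the GRU-based filter in the abstract is meant to address.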
Keywords
- Deep learning
- Natural language processing
- Proofreading
- seq2seq
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Nagai, R., Maeda, A. (2023). Sentence Pair Augmentation Approach for Grammatical Error Correction. In: Chatterjee, P., Pamucar, D., Yazdani, M., Panchal, D. (eds) Computational Intelligence for Engineering and Management Applications. Lecture Notes in Electrical Engineering, vol 984. Springer, Singapore. https://doi.org/10.1007/978-981-19-8493-8_46
DOI: https://doi.org/10.1007/978-981-19-8493-8_46
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-8492-1
Online ISBN: 978-981-19-8493-8
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)