An Optimized Regularization Method to Enhance Low-Resource MT
Abstract
Overfitting caused by scarce parallel corpora is a serious problem in low-resource machine translation, leading to translation models with weak generalization ability. Dropout and Dropconnect address this issue by randomly removing neurons or weights during training, which improves generalization. In this paper, we optimize Dropconnect for low-resource machine translation by adopting a Gaussian approximation of its Bernoulli distribution, and we integrate Dropout and Dropconnect to alleviate their uneven-sampling effect, in particular the resulting inadequate-training problem. This approach effectively approximates the mask computation with linear operations while keeping the network fully trained. An interesting finding is that agglutinative languages are more sensitive to our regularization method. Our approach outperforms Dropout and Dropconnect on low-resource translation tasks.
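As a rough illustration of the Gaussian approximation described above, the sketch below replaces per-weight Bernoulli mask sampling in a Dropconnect dense layer with its central-limit-theorem Gaussian: the pre-activation mean and variance are computed with two matrix products, and element-wise noise is added. The function name, layer shapes, and NumPy implementation are assumptions for illustration, not the authors' released code.

```python
import numpy as np

def gaussian_dropconnect_layer(x, W, b, keep_prob=0.5, training=True, rng=None):
    """Dense-layer pre-activation with Dropconnect approximated by a Gaussian.

    Under Bernoulli Dropconnect each weight is kept with probability keep_prob,
    so u = (M * W) @ x is a sum of many independent terms.  By the central
    limit theorem it is approximately Gaussian with
        mean     = keep_prob * W @ x
        variance = keep_prob * (1 - keep_prob) * (W**2) @ (x**2),
    which turns mask sampling into linear operations plus element-wise noise.
    (Hypothetical sketch; not the paper's implementation.)
    """
    rng = np.random.default_rng() if rng is None else rng
    mean = keep_prob * (x @ W.T) + b                       # expected pre-activation
    if not training:
        return mean                                        # inference uses the expectation
    var = keep_prob * (1.0 - keep_prob) * ((x ** 2) @ (W.T ** 2))
    return mean + np.sqrt(var) * rng.standard_normal(mean.shape)

# Toy usage: a batch of 4 inputs through an 8 -> 3 layer.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W = rng.standard_normal((3, 8))
b = np.zeros(3)
u_train = gaussian_dropconnect_layer(x, W, b, keep_prob=0.5, training=True, rng=rng)
u_infer = gaussian_dropconnect_layer(x, W, b, keep_prob=0.5, training=False)
```

Sampling the pre-activation directly avoids drawing a separate mask for every weight, which is why the training cost stays close to that of a standard dense layer.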
Keywords
Low-resource machine translation · Overfitting · Uneven sampling · Regularization method
Acknowledgments
We thank the PDCAT-18 reviewers. This work is supported by the Natural Science Foundation of Inner Mongolia (No. 2018MS06005), the Mongolian Language Information Special Support Project of Inner Mongolia (No. MW-2018-MGYWXXH-302), and the Postgraduate Scientific Research Innovation Foundation of Inner Mongolia (No. 10000-16010109-14).