Language Model for Mongolian Polyphone Proofreading

  • Min Lu
  • Feilong Bao
  • Guanglai Gao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10565)

Abstract

Mongolian text proofreading is particularly difficult because of the language's polyphonic alphabet, morphological ambiguity, and agglutinative nature. Coding errors are currently pervasive in electronic Mongolian corpora, which makes statistical and retrieval research on Mongolian difficult to carry out. Conventional approaches have been proposed to address this problem, but they are limited because they do not consider the proofreading of polyphones. In this paper, we address the problem by constructing a large-scale resource and applying an n-gram language model based approach. Because polyphone proofreading is an important component of the overall proofreading system, the architecture of the entire system is also introduced for ease of understanding. Experimental results show that our method performs well: polyphone correction accuracy improves by a relative 62%, and overall system accuracy improves by a relative 16.1%.
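The idea of using an n-gram language model to choose among the possible readings of a polyphone can be illustrated with a minimal sketch. This is not the authors' implementation: the bigram order, add-one smoothing, greedy left-to-right selection, the function names, and the placeholder tokens (`X`, `x1`, `x2`, and the context words) are all assumptions made for illustration only.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences (with boundary markers)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def log_prob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a token sequence."""
    vocab = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(padded, padded[1:])
    )


def correct(tokens, candidates, unigrams, bigrams):
    """For each ambiguous token, keep the candidate that maximizes the sentence score."""
    result = list(tokens)
    for i, tok in enumerate(tokens):
        options = candidates.get(tok)
        if not options:
            continue  # token is not ambiguous in this toy setup
        result[i] = max(
            options,
            key=lambda c: log_prob(result[:i] + [c] + result[i + 1:], unigrams, bigrams),
        )
    return result


# Hypothetical toy corpus: "x1" and "x2" stand for two correct spellings that
# share one written form "X"; the other tokens are placeholder context words.
corpus = [["a", "x1", "b"], ["a", "x1", "b"], ["c", "x2", "d"]]
candidates = {"X": ["x1", "x2"]}
unigrams, bigrams = train_bigram(corpus)

print(correct(["a", "X", "b"], candidates, unigrams, bigrams))
print(correct(["c", "X", "d"], candidates, unigrams, bigrams))
```

In this toy setup the surrounding words decide which candidate is kept, which is the core intuition behind context-based polyphone correction. A practical system would likely use a higher-order model trained on a large corpus with a stronger smoothing method.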

Keywords

Mongolian, Polyphone, Automatic proofreading system, Morphological ambiguity

Notes

Acknowledgements

This paper is supported by the National Natural Science Foundation of China (No. 61563040), the Inner Mongolia Natural Science Foundation of major projects (No. 2016ZD06), and the Inner Mongolia Natural Science Fund Project (No. 2017BS0601).

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. College of Computer Science, Inner Mongolia University, Hohhot, China