Abstract
Homophonic words are very common in Chinese microblogs, posing a new challenge for Chinese microblog text analysis. However, to date, very little research has been conducted on Chinese homophonic word normalization. In this paper, we treat Chinese homophonic word normalization as a language decoding process and propose an n-gram based approach. To this end, we first employ homophonic-original word or character mapping tables to generate normalization candidates for a given sentence containing homophonic words, and then use n-gram language models to decode the best normalization from the candidate set. Our experimental results show that using the homophonic-original character mapping table and n-grams trained on the microblog corpus helps improve performance in homophonic word recognition and restoration.
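The two steps described in the abstract (candidate generation from a mapping table, then n-gram decoding) can be illustrated with a minimal sketch. This is not the authors' implementation: the mapping table `CHAR_TABLE`, the four-sentence `CORPUS`, the function names, and the choice of a character bigram model with add-one smoothing are all assumptions made for illustration. A real system would use tables and language models built from a large microblog corpus.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical homophonic -> original character table (toy entries only).
CHAR_TABLE = {"神": ["什"], "马": ["么"], "杯": ["悲"], "具": ["剧"]}

# Hypothetical tiny corpus of standard text used to train the bigram model.
CORPUS = ["这是什么", "真是悲剧", "你说什么", "马在跑"]


def train_bigram(sentences):
    """Count character bigrams and their left contexts, with sentence markers."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        chars = ["<s>"] + list(s) + ["</s>"]
        contexts.update(chars[:-1])
        bigrams.update(zip(chars, chars[1:]))
    vocab_size = len({c for s in sentences for c in s} | {"</s>"})
    return contexts, bigrams, vocab_size


def log_prob(sentence, model):
    """Add-one smoothed bigram log-probability of a candidate sentence."""
    contexts, bigrams, vocab_size = model
    chars = ["<s>"] + list(sentence) + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(chars, chars[1:])
    )


def candidates(sentence, table):
    """Each character may stay as written or be replaced by a mapped original."""
    options = [[c] + table.get(c, []) for c in sentence]
    return ["".join(p) for p in product(*options)]


def normalize(sentence, table, model):
    """Decode: return the candidate with the highest language-model score."""
    return max(candidates(sentence, table), key=lambda s: log_prob(s, model))


MODEL = train_bigram(CORPUS)
print(normalize("这是神马", CHAR_TABLE, MODEL))  # 这是什么
```

Because the unchanged sentence is always among the candidates, a character such as 马 is kept when the language model prefers the original reading (for example in 马在跑). Exhaustive enumeration grows exponentially with the number of mappable characters, so a practical decoder would more likely use Viterbi or beam search over the candidate lattice.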
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, X., Song, J., He, Y., Fu, G. (2015). Normalization of Homophonic Words in Chinese Microblogs. In: Wang, H., et al. Intelligent Computation in Big Data Era. ICYCSEE 2015. Communications in Computer and Information Science, vol 503. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46248-5_22
DOI: https://doi.org/10.1007/978-3-662-46248-5_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-46247-8
Online ISBN: 978-3-662-46248-5
eBook Packages: Computer Science, Computer Science (R0)