Normalization of Homophonic Words in Chinese Microblogs

  • Conference paper
Intelligent Computation in Big Data Era (ICYCSEE 2015)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 503)

Abstract

Homophonic words are very popular in Chinese microblogs, posing a new challenge for Chinese microblog text analysis. To date, however, very little research has been conducted on the normalization of Chinese homophonic words. In this paper, we treat Chinese homophonic word normalization as a language decoding process and propose an n-gram based approach. We first employ homophonic-original word or character mapping tables to generate normalization candidates for a given sentence containing homophonic words, and then exploit n-gram language models to decode the best normalization from the candidate set. Our experimental results show that using the homophonic-original character mapping table together with n-grams trained on the microblog corpus helps improve performance in homophonic word recognition and restoration.
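The two-stage idea in the abstract (candidate generation from a mapping table, then n-gram decoding) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy mapping table `MAPPING`, the four-sentence `CORPUS`, the choice of a character bigram model with add-one smoothing, and exhaustive enumeration of candidates in place of whatever decoder the authors actually use.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical homophonic-to-original character mapping table (toy data).
MAPPING = {"杯": ["悲"], "具": ["剧"]}

# Hypothetical stand-in for a microblog training corpus (toy data).
CORPUS = ["这是一个悲剧", "真是悲剧", "一个杯子", "这是工具"]


def train_bigrams(sentences):
    """Count character unigrams (as histories) and bigrams with boundary marks."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        chars = ["<s>"] + list(s) + ["</s>"]
        unigrams.update(chars[:-1])           # every symbol that starts a bigram
        bigrams.update(zip(chars, chars[1:]))
    vocab = {c for s in sentences for c in s} | {"<s>", "</s>"}
    return unigrams, bigrams, len(vocab)


def log_prob(sentence, model):
    """Add-one smoothed bigram log-probability of a sentence (assumed smoothing)."""
    unigrams, bigrams, vocab_size = model
    chars = ["<s>"] + list(sentence) + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(chars, chars[1:])
    )


def candidates(sentence, mapping):
    """Each character may stay as written or be replaced by a mapped original."""
    options = [[ch] + mapping.get(ch, []) for ch in sentence]
    return ["".join(combo) for combo in product(*options)]


def normalize(sentence, mapping, model):
    """Pick the candidate that the language model scores highest."""
    return max(candidates(sentence, mapping), key=lambda c: log_prob(c, model))


model = train_bigrams(CORPUS)
print(normalize("真是杯具", MAPPING, model))  # 真是悲剧
print(normalize("一个杯子", MAPPING, model))  # 一个杯子 (left unchanged)
```

Keeping the original character among the options for every position means recognition and restoration happen in one step: a character is treated as homophonic only when the language model prefers a replacement in context. Exhaustive enumeration grows exponentially with the number of mappable characters, so a dynamic-programming decoder would be the natural substitute for longer sentences.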

Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhang, X., Song, J., He, Y., Fu, G. (2015). Normalization of Homophonic Words in Chinese Microblogs. In: Wang, H., et al. Intelligent Computation in Big Data Era. ICYCSEE 2015. Communications in Computer and Information Science, vol 503. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46248-5_22

  • DOI: https://doi.org/10.1007/978-3-662-46248-5_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-46247-8

  • Online ISBN: 978-3-662-46248-5

  • eBook Packages: Computer Science, Computer Science (R0)
