Abstract
Chinese has generally been regarded as a Subject-Verb-Object (SVO) language whose basic semantic unit is the word, which usually consists of two or more Chinese characters. However, the word-centered structure of Chinese has been controversial in linguistics. Some recent research in Chinese computational linguistics suggests that character-based models perform better than word-based models in some applications, such as word segmentation. In this paper, word-based and character-based topic models are each tested for modeling Chinese. Through empirical studies, we demonstrate the effectiveness of using Chinese characters as the basic semantic units. The two models perform similarly in text classification, while the character-based model achieves better language-modeling quality with a much smaller vocabulary. Using a bilingual corpus, we train three independent topic models based on Chinese words, Chinese characters, and English words, and compare them with each other. These experiments across Chinese and English verify the capability of topic models in modeling semantics. Classification accuracy can also be improved by aggregating the classification results of the three independent topic models.
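As a rough illustration of the two choices of unit and of aggregating several classifiers, the following Python sketch may help. It is not the authors' implementation: the toy corpus, the whitespace-based word segmentation, the function names, and the use of majority voting as the aggregation rule are all assumptions made for this example, since the abstract does not specify them.

```python
from collections import Counter

# Hypothetical toy corpus. Each document is assumed to be already
# word-segmented, with words separated by spaces (illustration only).
segmented_docs = [
    "大学 学生 大人",
    "人生 大学 生人",
]


def word_tokens(doc):
    """Word-based units: split a pre-segmented document on whitespace."""
    return doc.split()


def char_tokens(doc):
    """Character-based units: every non-space character is one token."""
    return [ch for ch in doc if not ch.isspace()]


def vocabulary(docs, tokenizer):
    """Collect the set of distinct units over all documents."""
    return {tok for doc in docs for tok in tokenizer(doc)}


def majority_vote(labels):
    """Aggregate labels predicted by several classifiers.

    Majority voting is an assumed aggregation rule; ties are resolved
    in favour of the label that appears first in the input list.
    """
    counts = Counter(labels)
    best = max(counts.values())
    return next(label for label in labels if counts[label] == best)


word_vocab = vocabulary(segmented_docs, word_tokens)
char_vocab = vocabulary(segmented_docs, char_tokens)

if __name__ == "__main__":
    print("word vocabulary size:", len(word_vocab))
    print("character vocabulary size:", len(char_vocab))
    # Hypothetical predictions from three independently trained models
    # (Chinese words, Chinese characters, English words).
    print("aggregated label:", majority_vote(["sports", "finance", "sports"]))
```

Either token stream could then be passed to a topic model such as LDA; only the definition of a token changes between the word-based and character-based settings.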
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhao, Q., Qin, Z., Wan, T. (2011). What Is the Basic Semantic Unit of Chinese Language? A Computational Approach Based on Topic Models. In: Kanazawa, M., Kornai, A., Kracht, M., Seki, H. (eds) The Mathematics of Language. MOL 2011. Lecture Notes in Computer Science, vol 6878. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23211-4_9
DOI: https://doi.org/10.1007/978-3-642-23211-4_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23210-7
Online ISBN: 978-3-642-23211-4
eBook Packages: Computer Science (R0)