Automatic Word Clustering for Text Categorization Using Global Information

Wenliang, Chen; Xingzhi, Chang; Huizhen, Wang; Jingbo, Zhu; Tianshun, Yao

doi:10.1007/978-3-540-31871-2_1

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3411)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

High dimensionality of the feature space and a shortage of training documents are crucial obstacles for text categorization. To overcome these obstacles, this paper presents a cluster-based text categorization system that uses class distributional clustering of words. We propose a new clustering model that considers global information over all the clusters. The model can be understood as balancing all the clusters according to the number of words in them. It groups words into clusters based on the distribution of class labels associated with each word. Using these learned clusters as features, we develop a cluster-based classifier. We present several experimental results showing that the proposed method performs better than three other text classifiers. The proposed model also yields better results than a model that considers only the information of the two related clusters. In particular, it maintains good performance when the number of features is small and the training corpus is small.
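
To make the idea of class distributional clustering concrete, here is a minimal sketch of its first step: estimating the class-label distribution of each word and comparing two words by how similar those distributions are. It is not the clustering model proposed in the paper, whose global-information criterion is not specified in the abstract. The toy corpus, the function names (`class_distributions`, `js_divergence`, `closest_pair`) and the use of an unweighted Jensen-Shannon divergence as the similarity measure are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def class_distributions(docs):
    """Estimate P(label | word) from (tokens, label) pairs by raw counts."""
    counts = defaultdict(Counter)
    for tokens, label in docs:
        for tok in tokens:
            counts[tok][label] += 1
    dists = {}
    for word, label_counts in counts.items():
        total = sum(label_counts.values())
        dists[word] = {lab: n / total for lab, n in label_counts.items()}
    return dists


def js_divergence(p, q):
    """Unweighted Jensen-Shannon divergence (base 2) of two sparse distributions.

    Assumed similarity measure: 0 means identical class distributions,
    1 means the two words never occur with the same class.
    """
    div = 0.0
    for label in set(p) | set(q):
        pi = p.get(label, 0.0)
        qi = q.get(label, 0.0)
        mi = 0.5 * (pi + qi)
        if pi > 0:
            div += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            div += 0.5 * qi * math.log2(qi / mi)
    return div


def closest_pair(dists):
    """Return (divergence, word_a, word_b) for the most similar word pair.

    A bottom-up clustering would merge this pair first; ties are broken
    alphabetically because the tuples are compared element by element.
    """
    words = sorted(dists)
    return min(
        (js_divergence(dists[a], dists[b]), a, b)
        for i, a in enumerate(words)
        for b in words[i + 1:]
    )


# Hypothetical toy corpus: (tokens, class label).
docs = [
    (["goal", "match", "team"], "sports"),
    (["team", "goal", "coach"], "sports"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "stock", "team"], "finance"),
]
dists = class_distributions(docs)
```

Words whose class distributions have low divergence are candidates for the same cluster, and the cluster then replaces its member words as a single feature. A model that looks only at the two clusters being merged would stop at this pairwise comparison; the abstract indicates that the proposed model additionally balances all clusters by the number of words they contain, which this sketch does not attempt to reproduce.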

Author information

Authors and Affiliations

Natural Language Processing Lab, Northeastern University, Shenyang, 110004, China
Chen Wenliang, Chang Xingzhi, Wang Huizhen, Zhu Jingbo & Yao Tianshun

Editor information

Editors and Affiliations

School of Engineering, Information and Communications University, 119, Munjiro, Yuseong-gu, 305-732, Daejeon, Korea
Sung Hyon Myaeng
The Key Laboratory of Power System Protection and Dynamic Security Monitoring and Control under Ministry of Education, North China Electric Power University, Zhuxinzhuang Dewai, 102206, Beijing, China
Ming Zhou
Department of Systems Engineering and Engineering Management, Shatin, The Chinese University of Hong Kong, Hong Kong, N.T.
Kam-Fai Wong
5F, Beijing Sigma Center, Microsoft Research Asia, No. 49 Zhichun Road Haidian District, 100080, Beijing, China
Hong-Jiang Zhang

About this paper

Cite this paper

Wenliang, C., Xingzhi, C., Huizhen, W., Jingbo, Z., Tianshun, Y. (2005). Automatic Word Clustering for Text Categorization Using Global Information. In: Myaeng, S.H., Zhou, M., Wong, KF., Zhang, HJ. (eds) Information Retrieval Technology. AIRS 2004. Lecture Notes in Computer Science, vol 3411. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31871-2_1

DOI: https://doi.org/10.1007/978-3-540-31871-2_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25065-4
Online ISBN: 978-3-540-31871-2
eBook Packages: Computer Science, Computer Science (R0)