Automatic Category Theme Identification and Hierarchy Generation for Chinese Text Categorization

Yang, Hsin-Chang; Lee, Chung-Hong

doi:10.1007/s10844-005-0859-6

Published: July 2005

Volume 25, pages 47–67, (2005)

Journal of Intelligent Information Systems

Hsin-Chang Yang¹ &
Chung-Hong Lee²

Abstract

Research on text mining has recently attracted considerable attention from both industry and academia. Text mining is concerned with discovering unknown patterns or knowledge in a large text repository. The problem is difficult because the texts under consideration are semi-structured or even unstructured. Many approaches have been devised for mining various kinds of knowledge from texts. One important aspect of text mining is automatic text categorization, which assigns a text document to a predefined category if the document falls within the theme of that category. Traditionally, the categories are arranged hierarchically to support effective searching and indexing and to make them easy for people to comprehend. The category themes and their hierarchical structure have mostly been determined by human experts. In this work, we developed an approach that automatically generates category themes and reveals the hierarchical structure among them. We also used the generated structure to categorize text documents. A self-organizing map was trained on the document collection to form two feature maps. These maps were then analyzed to obtain the category themes and their structure. Although the test corpus contains documents written in Chinese, the proposed approach can be applied to documents written in any language, provided that such documents can be transformed into a list of separated terms.
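
The abstract names the self-organizing map (SOM) as the training step but gives no parameters, so the following is only a minimal generic SOM sketch and not the authors' implementation. The grid size, learning-rate schedule, neighborhood function, binary term vectors, and the function names `train_som` and `best_matching_unit` are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Return the index of the neuron whose weights are closest to vector."""
    return min(
        range(len(weights)),
        key=lambda j: sum((w - x) ** 2 for w, x in zip(weights[j], vector)),
    )


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols self-organizing map on equal-length vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # One weight vector per neuron, stored in row-major grid order.
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighborhood shrinks as well
        for vector in vectors:
            win_row, win_col = divmod(best_matching_unit(weights, vector), cols)
            for j, w in enumerate(weights):
                row, col = divmod(j, cols)
                dist = math.hypot(row - win_row, col - win_col)
                if dist <= radius:
                    # Gaussian neighborhood: neurons near the winner move more.
                    h = math.exp(-(dist ** 2) / (2.0 * radius ** 2))
                    for k in range(dim):
                        w[k] += lr * h * (vector[k] - w[k])
    return weights


# Hypothetical toy corpus: each document is a binary vector over four terms.
documents = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
som = train_som(documents, rows=2, cols=2)
# Map every document to the neuron that responds most strongly to it.
assignments = [best_matching_unit(som, d) for d in documents]
```

In this sketch, documents with identical term vectors always land on the same neuron, which is the property a later labeling or clustering step would rely on. How the two feature maps mentioned in the abstract are derived from the trained map is not stated there and is not modeled here.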

Author information

Authors and Affiliations

Department of Information Management, Chang Jung University, Tainan, Taiwan
Hsin-Chang Yang
Department of Electrical Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan
Chung-Hong Lee

Corresponding author

Correspondence to Hsin-Chang Yang.

About this article

Cite this article

Yang, HC., Lee, CH. Automatic Category Theme Identification and Hierarchy Generation for Chinese Text Categorization. J Intell Inf Syst 25, 47–67 (2005). https://doi.org/10.1007/s10844-005-0859-6

Received: 12 December 2000
Revised: 04 May 2004
Accepted: 07 May 2004
Issue Date: July 2005
DOI: https://doi.org/10.1007/s10844-005-0859-6
