Automatic Category Theme Identification and Hierarchy Generation for Chinese Text Categorization

Journal of Intelligent Information Systems

Abstract

Research on text mining has recently attracted considerable attention from both industry and academia. Text mining is concerned with discovering unknown patterns or knowledge in a large text repository. The problem is hard to tackle because the texts under consideration are semi-structured or even unstructured. Many approaches have been devised for mining various kinds of knowledge from texts. One important aspect of text mining is automatic text categorization, which assigns a text document to a predefined category if the document falls within the theme of that category. Traditionally, the categories are arranged hierarchically to support effective searching and indexing and to make them easy for humans to comprehend. The determination of category themes and their hierarchical structure has mostly been done by human experts. In this work, we developed an approach that automatically generates category themes and reveals the hierarchical structure among them. We also used the generated structure to categorize text documents. A self-organizing map was trained on the document collection to form two feature maps. These maps were then analyzed to obtain the category themes and their structure. Although the test corpus contains documents written in Chinese, the proposed approach can be applied to documents written in any language, provided that such documents can be transformed into a list of separated terms.
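The self-organizing map step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the map size, the training schedule, or how the two feature maps are derived. The sketch assumes binary term vectors, a square "bubble" neighborhood, and one plausible reading of the two maps, namely a document map (each document labels its best-matching neuron) and a word map (each term labels the neuron whose weight for that term is largest). All function names, parameters, and the toy vocabulary are hypothetical.

```python
import random


def best_matching_unit(weights, v):
    """Index of the neuron whose weight vector is closest to v (squared Euclidean)."""
    return min(
        range(len(weights)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(weights[j], v)),
    )


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small self-organizing map on equal-length document vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # One weight vector per neuron, stored row-major on a rows x cols grid.
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                      # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 0.5)    # neighborhood shrinks over time
        for v in vectors:
            br, bc = divmod(best_matching_unit(weights, v), cols)
            for j, w in enumerate(weights):
                r, c = divmod(j, cols)
                # Bubble neighborhood: update every neuron within the radius.
                if (r - br) ** 2 + (c - bc) ** 2 <= radius ** 2:
                    for k in range(dim):
                        w[k] += lr * (v[k] - w[k])
    return weights


def document_map(weights, vectors):
    """Assumed reading: label each document with its best-matching neuron."""
    return [best_matching_unit(weights, v) for v in vectors]


def word_map(weights, vocab):
    """Assumed reading: label each term with the neuron that weights it most."""
    return {
        term: max(range(len(weights)), key=lambda j: weights[j][k])
        for k, term in enumerate(vocab)
    }


# Toy data (hypothetical): four documents over a four-term vocabulary.
vocab = ["stock", "market", "soccer", "goal"]
docs = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]

weights = train_som(docs, rows=2, cols=2)
doc_map = document_map(weights, docs)
w_map = word_map(weights, vocab)
print(doc_map)
print(w_map)
```

Neurons that attract both a group of documents and a group of terms are natural candidates for category themes under this reading; how the hierarchy is then built from the maps is not described in the abstract and is left out here.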

Author information

Corresponding author

Correspondence to Hsin-Chang Yang.

About this article

Cite this article

Yang, HC., Lee, CH. Automatic Category Theme Identification and Hierarchy Generation for Chinese Text Categorization. J Intell Inf Syst 25, 47–67 (2005). https://doi.org/10.1007/s10844-005-0859-6

  • DOI: https://doi.org/10.1007/s10844-005-0859-6
