Text Categorization Based on Subtopic Clusters

  • Francis C. Y. Chik
  • Robert W. P. Luk
  • Korris F. L. Chung
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3513)


The distribution of documents across topic classes is typically highly skewed. Such skew tends to yield good micro-average performance but poorer macro-average performance, because micro-averaging is dominated by the large classes while macro-averaging weights every class equally. Viewing topics as clusters in a high-dimensional space, we propose using clustering to identify subtopic clusters within large topic classes, on the assumption that a large topic cluster is generally a mixture of several subtopic clusters. We used Reuters news articles and support vector machines to evaluate whether subtopic clusters lead to better macro-average performance.
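
The gap between micro- and macro-averaged scores under class skew can be illustrated with a minimal Python sketch. The class names and the per-class counts below are hypothetical and are not taken from the paper's experiments; F1 is used here as the example measure, which is an assumption, since the abstract does not name the metric.

```python
from typing import Dict, Tuple

# Hypothetical per-class counts: (true positives, false positives, false negatives).
# One large topic classified well, two small topics classified poorly.
Counts = Dict[str, Tuple[int, int, int]]

counts: Counts = {
    "large_topic": (900, 50, 100),
    "small_topic_a": (2, 3, 8),
    "small_topic_b": (1, 4, 9),
}


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(per_class: Counts) -> float:
    """Pool the counts over all classes, then compute a single F1."""
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    return f1(tp, fp, fn)


def macro_f1(per_class: Counts) -> float:
    """Compute F1 per class, then take the unweighted mean."""
    scores = [f1(*c) for c in per_class.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(f"micro-F1 = {micro_f1(counts):.3f}")
    print(f"macro-F1 = {macro_f1(counts):.3f}")
```

With these illustrative counts the pooled (micro) score is driven almost entirely by the large topic, while the macro score is pulled down by the two small topics, which is the imbalance the abstract describes.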


Support Vector Machine; Test Document; Text Categorization; Large Topic; Document Vector
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Francis C. Y. Chik (1)
  • Robert W. P. Luk (1)
  • Korris F. L. Chung (1)
  1. Department of Computing, Hong Kong Polytechnic University
