Term Clustering and Confidence Measurement in Document Clustering

  • Kristóf Csorba
  • István Vajk
Conference paper

Document clustering is the classification of documents into several groups based on a given classification criterion, such as topic similarity. In a supervised learning scenario, the system extracts features from labeled examples and learns to identify documents of the same category. A large family of methods is based on vector spaces, where documents are represented by vectors in a space of features, such as the occurrences of the various terms. Every term in use (neither a stopword nor too rare) is assigned to a feature, and the coordinate of a document along this dimension is a function of the occurrences of the term in that document. A frequently used family of weighting schemes is TFIDF (term frequency, inverse document frequency) [1].
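The vector-space representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say which TFIDF variant is used, so the sketch takes one common form, raw term count multiplied by log(N/df). The names `STOPWORDS`, `build_vocabulary`, `tfidf_vectors`, and the `min_df` threshold are hypothetical and chosen for this example.

```python
import math
from collections import Counter

# Illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "of", "and", "in", "is", "to"}


def build_vocabulary(docs, min_df=2):
    """Return the terms used as features, plus document frequencies.

    A term is kept if it is not a stopword and appears in at least
    min_df documents (i.e. it is not too rare).
    """
    df = Counter()
    for doc in docs:
        # Count each term once per document.
        df.update(set(doc.lower().split()) - STOPWORDS)
    vocab = sorted(term for term, n in df.items() if n >= min_df)
    return vocab, df


def tfidf_vectors(docs, min_df=2):
    """Map each document to a vector with one coordinate per kept term.

    Assumed weighting: tf(term, doc) * log(N / df(term)).
    """
    vocab, df = build_vocabulary(docs, min_df)
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc.lower().split())
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


if __name__ == "__main__":
    docs = ["the cat sat", "the cat ran", "a dog ran fast"]
    vocab, vectors = tfidf_vectors(docs)
    print(vocab)
    for vec in vectors:
        print(vec)
```

With the three toy documents in the usage example, only terms that occur in at least two documents and are not stopwords become dimensions, so each document is reduced to a short vector over those terms.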


Document Class; Target Class; Confidence Measurement; Document Cluster; Inverse Document Frequency
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


References

  1. Singhal (2001) Modern information retrieval: A brief overview. In: IEEE Data Engineering Bulletin, vol. 24, no. 4, pp. 35-43.
  2. K. Lang (1995) Newsweeder: Learning to filter netnews. In: ICML, pp. 331-339.
  3. L. Li and W. Chou (2002) Improving latent semantic indexing based classifier with information gain. In: tech. rep., May 16 2002.
  4. N. Slonim and N. Tishby (2000) Document clustering using word clusters via the information bottleneck method. In: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Clustering, pp. 208-215.

Copyright information

© Springer Science+Business Media, LLC 2007
