Term Clustering and Confidence Measurement in Document Clustering
Document clustering is the classification of documents into several groups based on a given classification criterion, such as topic similarity. In a supervised learning scenario, the system extracts features from labeled examples and learns to identify documents of the same category. A large family of methods is based on vector spaces, in which documents are represented by vectors in a feature space whose features are, for example, the occurrences of individual terms. Every term in use (neither a stopword nor too rare) is assigned to a feature, and the coordinate of a document along that dimension is a function of the term's occurrences in the document. A frequently used family of weighting schemes is TFIDF (term frequency, inverse document frequency).
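The vector-space representation described above can be sketched in a few lines. This is a minimal illustration, not the weighting used in the paper: it assumes raw term counts for the term frequency, log(N / df) for the inverse document frequency, whitespace tokenization, and a tiny illustrative stopword list. The names `STOPWORDS`, `tokenize`, `tfidf_vectors` and `min_df` are hypothetical.

```python
import math
from collections import Counter

# Illustrative stopword list only; a real system would use a much larger one.
STOPWORDS = {"the", "a", "of", "and", "in", "is", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords (a simplification)."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def tfidf_vectors(docs, min_df=1):
    """Return (vocab, vectors) with one TF-IDF vector per document.

    Assumed weighting: tf = raw count, idf = log(N / df).
    Terms occurring in fewer than min_df documents are treated as too rare.
    """
    tokenized = [tokenize(d) for d in docs]

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Each remaining term becomes one dimension of the feature space.
    vocab = sorted(t for t, count in df.items() if count >= min_df)
    n_docs = len(docs)
    idf = {t: math.log(n_docs / df[t]) for t in vocab}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors


docs = ["the cat sat", "the dog sat", "a cat and a dog"]
vocab, vectors = tfidf_vectors(docs)
```

Under these assumptions, a term that occurs in every document gets weight zero, while a term concentrated in a few documents gets a larger coordinate. Other members of the TFIDF family differ mainly in how the two factors are scaled or normalized.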
Keywords: Document Class, Target Class, Confidence Measurement, Document Cluster, Inverse Document Frequency
- 1. Singhal (2001) Modern information retrieval: A brief overview. In: IEEE Data Engineering Bulletin, vol. 24, no. 4, pp. 35-43.
- 2. K. Lang (1995) Newsweeder: Learning to filter netnews. In: ICML, pp. 331-339.
- 3. L. Li and W. Chou (2002) Improving latent semantic indexing based classifier with information gain. In: tech. rep., May 16 2002.
- 4. N. Slonim and N. Tishby (2000) Document clustering using word clusters via the information bottleneck method. In: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Clustering, pp. 208-215.