Definition

Text clustering is the task of automatically grouping textual documents (for example, plain-text documents, web pages, and emails) into clusters based on their content similarity. The problem can be defined as follows. Given a set of n documents denoted DS and a pre-defined cluster number K (usually set by users), DS is partitioned into K document clusters DS1, DS2, …, DSK (i.e., DS1 ∪ DS2 ∪ … ∪ DSK = DS) so that documents in the same cluster are similar to one another while documents from different clusters are dissimilar [14].

Historical Background

Text clustering was initially developed to improve the performance of search engines by pre-clustering the entire corpus [2]. It has later also been investigated as a post-retrieval document browsing technique [1, 2, 7].

Foundations

Text clustering involves several important components, including document representation, clustering algorithms, and performance measures. Readers should refer to [6, 8, 13] for more details.

Document Representation

The original form of textual documents (plain text, web pages, emails, etc.) cannot be interpreted directly by clustering algorithms, so a proper document representation method is necessary. The Vector Space Model [6] is generally used to represent a document d as a vector of term weights d = < w1 , w2 , … , w|V| >, where V is the set of terms (sometimes called features) that occur at least once in the document set DS. Representation approaches differ on two issues: (i) what is understood as a term; (ii) how term weights are computed. For issue (i), a straightforward choice is to identify terms with words. This is often called either the set-of-words or the bag-of-words approach to document representation, depending on whether weights are binary or not [11]. Previous work has found that more sophisticated representations are not significantly more effective [5]. For issue (ii), the weight can be binary (1 denoting the presence and 0 the absence of the term in the document) or non-binary. A non-binary weight can be either the Term Frequency (TF) of the term in the document or its TFIDF value, computed according to the following equation, where N(t, d) is the number of times the term t appears in d, |D| is the size of the document corpus D, and nt,D is the number of documents in D containing t:

$$ {w}_t=N\left(t,d\right)* \log \left(|D|/{n}_{t,D}\right) $$

Cosine normalization is sometimes applied to the resulting document vectors [11]. The choice of term-weighting strategy depends on the clustering algorithm used.
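As an illustration, the TFIDF weighting above together with cosine normalization might be sketched in pure Python as follows (the function name and the tokenized-list input format are assumptions for this sketch, not from the source):

```python
import math

def tfidf_vectors(docs):
    """Compute cosine-normalized TFIDF vectors for a list of tokenized documents.

    Each weight is w_t = N(t, d) * log(|D| / n_{t,D}), as in the equation above.
    """
    n_docs = len(docs)
    # n_{t,D}: number of documents containing each term
    df = {}
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    vectors = []
    for doc in docs:
        vec = {}
        for t in doc:
            vec[t] = vec.get(t, 0) + 1          # N(t, d): raw term frequency
        for t in vec:
            vec[t] *= math.log(n_docs / df[t])  # multiply by IDF
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:                            # cosine normalization
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors
```

Note that a term occurring in every document (such as a stopword) receives weight log(|D|/|D|) = 0 under this scheme.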

Text Clustering Algorithms

The various clustering algorithms developed over the past few years (most general clustering algorithms can be applied to text clustering tasks) fall into two categories: hierarchical and partitional approaches. Hierarchical algorithms generate successive clusters in a nested sequence, while partitional algorithms produce all clusters at once.

In the following section, three popular clustering algorithms are briefly introduced to give readers a first impression of basic clustering techniques. Single-Link clustering [3] is a basic approach in the hierarchical category (http://en.wikipedia.org/wiki/Cluster_analysis). K-Means clustering [9] is a typical partitional algorithm that minimizes the squared error to generate clusters. Co-clustering [4] is a graph-theory-based partitional approach that has become popular in recent years. For more clustering algorithms, readers can refer to [6].

Single-Link Clustering

In Single-Link clustering, the distance between two clusters is defined as the minimum of the distances over all linkages drawn between the two clusters, where a linkage connects a pair of patterns/points, one from each cluster. One shortcoming of Single-Link clustering is that it suffers from a chaining effect [10] and tends to produce clusters that are straggly or elongated [6].

The three main steps of Single-Link Clustering algorithm are as follows [6]:

  1. With each pattern/point in its own cluster, construct a list of inter-pattern/point distances for all distinct unordered pairs of patterns/points, and sort this list in ascending order.

  2. Step through the sorted list of distances, forming for each distinct dissimilarity value dk a graph on the patterns in which pairs of patterns closer than dk are connected by a graph edge.

    1. If all the patterns are members of a connected graph, stop.

    2. Otherwise, repeat step 2.

  3. The output of the algorithm is a nested hierarchy of graphs, which can be cut at a desired dissimilarity level to form a clustering; the clusters are identified as the connected components of the corresponding graph.
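The steps above can be sketched in Python using a union-find structure over the sorted pair distances, cutting the hierarchy when a desired number of clusters remains (the function name, the stopping criterion by cluster count, and the callable distance argument are assumptions of this sketch):

```python
def single_link(points, k, dist):
    """Single-Link clustering: repeatedly connect the closest remaining pair of
    points (the minimum linkage) until only k connected components remain."""
    n = len(points)
    parent = list(range(n))          # union-find over points

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Step 1: list of all distinct inter-point distances, sorted ascending.
    pairs = sorted(
        (dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    clusters = n
    # Step 2: connect ever-more-distant pairs; each union merges two components.
    for d, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            clusters -= 1
            if clusters == k:        # cut the hierarchy at the desired level
                break
    # Each point is labeled by the root of its connected component.
    return [find(i) for i in range(n)]
```

For example, five points on a line with two well-separated groups are split into those two groups when k = 2.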

K-Means Clustering

The K-Means clustering algorithm is simple but very efficient, which allows it to run on large datasets. Its main advantages are simplicity and efficiency. Its main disadvantages are that (i) it does not yield the same result on each run, since the resulting clusters depend on the initial random assignments, and (ii) because it minimizes intra-cluster variance only locally, it does not ensure that the result reaches a global minimum of variance (http://en.wikipedia.org/wiki/Cluster_analysis).

The K-Means algorithm clusters n objects (here, textual documents) based on their attributes (the vector-space representation) into K (K < n) partitions. Starting from a set of random initial centers, it assigns each object to the cluster with the nearest center, where a center is defined as the average of all objects in the cluster. It assumes that the object attributes form a vector space, and its objective is to minimize the total intra-cluster variance, or squared error function (http://en.wikipedia.org/wiki/Cluster_analysis):

$$ V=\sum_{i=1}^{K}\sum_{x_j\in S_i}\left(x_j-\mu_i\right)^2 $$

where Si, i = 1, 2, …, K, are the K clusters and μi is the center of cluster Si.

The main steps of K-Means clustering algorithm are as follows [9]:

  1. Set the cluster number K;

  2. Randomly generate K clusters and calculate their centers, or directly pick K random points as the cluster centers;

  3. Assign each point to the nearest cluster center;

  4. Recalculate the cluster centers after the points have been assigned;

  5. Repeat steps 3 and 4 until some convergence criterion is met.
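The steps above can be sketched for 2-D points as follows (a minimal illustration; the function name, the fixed iteration cap, and the convergence test on unchanged assignments are assumptions of this sketch):

```python
import random

def k_means(points, k, n_iter=100, seed=0):
    """K-Means following the steps above: pick K random points as centers,
    assign each point to its nearest center, recompute centers, and repeat
    until the assignments stop changing."""
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))    # step 2: K random points as centers
    labels = None
    for _ in range(n_iter):
        # Step 3: assign each point to the nearest cluster center.
        new_labels = [
            min(range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2
                            + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:             # step 5: convergence criterion
            break
        labels = new_labels
        # Step 4: recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return labels, centers
```

On two well-separated groups of points, the algorithm recovers the two groups regardless of which initial centers are sampled.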

Co-clustering

In the Co-Clustering method, the document collection is modeled as a bipartite graph between documents and words, so the clustering problem can be posed as a graph partitioning problem. Co-Clustering is then developed as a spectral algorithm that simultaneously yields a clustering of documents and words based on this graph. The algorithm uses the second left and right singular vectors of an appropriately scaled word-document matrix to obtain good bipartitionings [4].
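A simplified sketch of the spectral bipartitioning step might look as follows. Note the assumptions: the full algorithm in [4] runs k-means on the scaled singular vectors, whereas this sketch splits on the sign of the second singular vectors, which suffices only for a clean two-cluster structure; the function name and the rows-are-words convention are also assumptions.

```python
import numpy as np

def spectral_bipartition(A):
    """Bipartition words and documents from a word-document matrix A
    (rows = words, columns = documents) using the second left/right singular
    vectors of the scaled matrix A_n = D1^{-1/2} A D2^{-1/2}."""
    d1 = A.sum(axis=1)                    # word degrees (row sums)
    d2 = A.sum(axis=0)                    # document degrees (column sums)
    An = (A / np.sqrt(d1)[:, None]) / np.sqrt(d2)[None, :]
    U, s, Vt = np.linalg.svd(An)
    # Sign of the second singular vectors gives a word split and a document split.
    word_labels = (U[:, 1] >= 0).astype(int)
    doc_labels = (Vt[1, :] >= 0).astype(int)
    return word_labels, doc_labels
```

On a word-document matrix with two dense diagonal blocks plus light cross-block noise, the second singular vectors separate the two word groups and the two document groups simultaneously.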

Performance Measurements

There are generally two types of measures used to evaluate the performance of text clustering algorithms: internal quality measures and external quality measures. The authors of [12] give a thorough introduction to clustering evaluation measures; readers can refer to their work for more details. A brief introduction to both types follows.

Internal Quality Measure

An internal quality measure compares different sets of clusters without reference to external knowledge (such as human-labeled classes/categories). One such measure is the “overall similarity,” computed from the pair-wise similarity of documents within a cluster [12].

External Quality Measure

An external quality measure, as the name suggests, leverages external knowledge such as known classes (categories) for comparison with the clusters generated by a clustering algorithm. Entropy [12] is one external measure; it provides a measure of “goodness” for un-nested clusters or for the clusters at one level of a hierarchical clustering. F-measure is another example of an external quality measure, more oriented toward measuring the effectiveness of a hierarchical clustering.
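The entropy measure can be sketched as follows: for each cluster, compute the entropy of the known class labels it contains, then take the cluster-size-weighted average (the function name and its two-list input format are assumptions of this sketch):

```python
from collections import Counter
from math import log2

def cluster_entropy(clusters, labels):
    """Weighted average entropy of known class labels within each cluster [12].
    `clusters[i]` is the cluster id of item i; `labels[i]` is its true class.
    0.0 means every cluster is pure; higher values are worse."""
    n = len(labels)
    by_cluster = {}
    for c, l in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(l)
    total = 0.0
    for members in by_cluster.values():
        counts = Counter(members)
        # Entropy of the class distribution inside this cluster.
        h = -sum((m / len(members)) * log2(m / len(members))
                 for m in counts.values())
        total += (len(members) / n) * h      # weight by cluster size
    return total
```

Two pure clusters score 0.0, while a single cluster mixing two classes evenly scores 1.0 bit.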

Readers should be aware that there are many other quality measures besides those introduced here. More importantly, the performance of different clustering algorithms can vary substantially depending on which measure is applied [12].

Key Applications

Text clustering has many applications, including search results clustering, topic detection and tracking, and email clustering.

Cross-References