Definition
Text clustering automatically groups textual documents (for example, plain-text documents, web pages, and emails) into clusters based on their content similarity. The problem of text clustering can be defined as follows. Given a set of n documents denoted DS and a predefined cluster number K (usually set by users), DS is partitioned into K document clusters DS1, DS2, …, DSK (i.e., DS1 ∪ DS2 ∪ … ∪ DSK = DS) such that documents in the same cluster are similar to one another while documents from different clusters are dissimilar [14].
Foundations
Text clustering involves several important components, including document representation, clustering algorithms, and performance measurements. Readers should refer to [6, 8, 13] for more details.
Document Representation
The original form of textual documents (plain text, web pages, emails, and so on) cannot be interpreted by clustering algorithms directly, so a proper document representation method is necessary for any text clustering algorithm. The Vector Space Model [6] is generally used to represent a document d as a vector of term weights d = < w1, w2, …, w|V| >, where V is the set of terms (sometimes called features) that occur at least once in the document set DS. Representation approaches vary in two respects: (i) how a term is defined; (ii) how term weights are computed. For issue (i), a straightforward way is to identify terms with words. This is often called either the set-of-words or the bag-of-words approach to document representation, depending on whether weights are binary or not [11]. Previous work has found that representations more sophisticated than this are not significantly more effective [5]. For issue (ii), the weight can be binary (1 denoting the presence and 0 the absence of the term in the document) or non-binary. A non-binary weight can be either the Term Frequency (TF) of the term in a document or TFIDF, computed according to the following equation:

w(t, d) = N(t, d) × log(|D| / n_{t,D})

where N(t, d) is the number of times the term t appears in d, |D| is the size of the document corpus D, and n_{t,D} is the number of documents in D containing the term t.
Cosine normalization is sometimes applied to the document vectors [11]. The choice of term-weighting strategy depends on the clustering algorithm used.
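The weighting schemes above can be sketched as follows (a toy Python implementation; the function name and example corpus are illustrative, not from the entry):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute cosine-normalized TFIDF vectors for a list of tokenized documents."""
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # Document frequency n_{t,D}: number of documents containing term t.
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)  # N(t, d): raw term frequency
        w = [tf[t] * math.log(n_docs / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0  # cosine normalization
        vectors.append([x / norm for x in w])
    return vocab, vectors

docs = [["apple", "fruit", "apple"], ["fruit", "banana"], ["car", "engine"]]
vocab, vecs = tfidf_vectors(docs)
```

Terms that appear in every document receive weight zero under this scheme, reflecting that they carry no discriminative information.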
Text Clustering Algorithms
The many clustering algorithms developed over the past few decades (most general-purpose clustering algorithms can be applied to text clustering tasks) fall into two categories: hierarchical and partitional approaches. Hierarchical algorithms generate successive clusters in a nested sequence, while partitional algorithms produce all clusters at once.
The following sections briefly introduce three popular clustering algorithms to give readers a first impression of basic approaches. Single-Link clustering [3] is a basic hierarchical algorithm (http://en.wikipedia.org/wiki/Cluster_analysis). K-Means clustering [9] is a typical partitional algorithm that generates clusters by minimizing squared error. Co-clustering [4] is a graph-theory-based partitional approach that has become popular in recent years. For more clustering algorithms, readers can refer to [6].
Single-Link Clustering
In Single-Link clustering, the distance between two clusters is defined as the minimum of the distances over all linkages drawn between the two clusters, where a linkage connects a pair of patterns/points, one from each cluster. One shortcoming of Single-Link clustering is that it suffers from a chaining effect [10], with a tendency to produce straggly or elongated clusters [6].
The three main steps of Single-Link Clustering algorithm are as follows [6]:
1. With each pattern/point in its own cluster, construct a list of inter-pattern/point distances for all N(N − 1)/2 distinct unordered pairs of patterns/points, and sort this list in ascending order.
2. Step through the sorted list of distances, forming for each distinct dissimilarity value dk a graph on the patterns in which pairs of patterns closer than dk are connected by a graph edge. If all the patterns are members of a single connected graph, stop; otherwise, repeat this step.
3. The output of the algorithm is a nested hierarchy of graphs, which can be cut at a desired dissimilarity level to form a clustering. The clusters are then identified by the connected components of the corresponding graph.
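The steps above can be sketched in Python using a union-find structure over the sorted pair list (names are illustrative; for brevity the hierarchy is cut at a single dissimilarity level rather than materialized in full):

```python
def single_link(points, cut, dist):
    """Single-link clustering: sweep pair distances in ascending order,
    connecting pairs closer than `cut`. The clusters are the connected
    components of the resulting graph at that dissimilarity level."""
    n = len(points)
    parent = list(range(n))  # union-find over pattern indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Step 1: distances for all distinct unordered pairs, sorted ascending.
    pairs = sorted((dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    # Step 2: sweep the sorted list, linking pairs closer than the cut.
    for d, i, j in pairs:
        if d >= cut:
            break
        parent[find(i)] = find(j)
    # Step 3: clusters are the connected components at this level.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

pts = [0.0, 0.1, 0.2, 5.0, 5.1]
clusters = single_link(pts, cut=1.0, dist=lambda a, b: abs(a - b))
```

Note how the chaining effect shows up here: points 0.0 and 0.2 end up in one cluster only because 0.1 bridges them.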
K-Means Clustering
The K-Means clustering algorithm is simple yet very efficient, which allows it to run on large datasets. Its main advantage is this simplicity and efficiency. Its main disadvantages are that (i) it does not yield the same result on different runs, since the resulting clusters depend on the initial random assignments; and (ii) because it minimizes intra-cluster variance, it does not ensure that the result attains a global minimum of variance (http://en.wikipedia.org/wiki/Cluster_analysis).
The K-Means algorithm clusters n objects (here textual documents) based on their attributes (the vector-space document representation) into K (K < n) partitions. Starting from a set of random initial centers, it assigns each object to the cluster with the nearest center, where a center is defined as the average of all objects in the cluster. It assumes that the object attributes form a vector space, and its objective is to minimize the total intra-cluster variance, or squared error function (http://en.wikipedia.org/wiki/Cluster_analysis):

V = Σ_{i=1}^{K} Σ_{x_j ∈ S_i} ||x_j − μ_i||²

where S_i, i = 1, 2, …, K are the K clusters and μ_i is the center of cluster S_i.
The main steps of K-Means clustering algorithm are as follows [9]:
1. Set the cluster number K;
2. Randomly generate K clusters and calculate their centers, or directly generate K random points as cluster centers;
3. Assign each point to the nearest cluster center;
4. Recalculate the cluster centers after the points have been assigned;
5. Repeat steps 3 and 4 until some convergence criterion is met.
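The steps above can be sketched as a minimal Lloyd-style implementation in Python (function and variable names are illustrative; convergence here means assignments stop changing):

```python
import random

def k_means(points, k, iters=100, seed=0):
    """Plain K-Means: random initial centers, then alternate assignment
    and center recomputation until assignments stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 2: random initial centers
    assign = None
    for _ in range(iters):
        # Step 3: assign each point to the nearest cluster center.
        new_assign = [min(range(k),
                          key=lambda c: sum((p - q) ** 2
                                            for p, q in zip(pt, centers[c])))
                      for pt in points]
        if new_assign == assign:  # step 5: convergence criterion met
            break
        assign = new_assign
        # Step 4: recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [pt for pt, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign, centers

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
labels, centers = k_means(pts, k=2)
```

Running with a different seed can produce different clusters, illustrating the sensitivity to initial random assignments noted above.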
Co-clustering
In the Co-Clustering method, the document collection is modeled as a bipartite graph between documents and words, so that the clustering problem can be posed as a graph-partitioning problem. Co-Clustering is then developed as a spectral algorithm that simultaneously yields a clustering of documents and of words from this document-word graph. The algorithm uses the second left and right singular vectors of an appropriately scaled word-document matrix to produce good bipartitionings [4].
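As a rough sketch of the spectral bipartitioning step, the following assumes the degree scaling described in [4] and a simple sign-based split of the second singular vectors (the function name and the toy matrix are illustrative; a full implementation would run K-Means on the combined embedding instead):

```python
import numpy as np

def spectral_bipartition(A):
    """Scale the document-word matrix as D1^{-1/2} A D2^{-1/2}, take the
    second left/right singular vectors, and split documents and words by
    the sign of the rescaled embedding."""
    d1 = A.sum(axis=1)  # document degrees (row sums)
    d2 = A.sum(axis=0)  # word degrees (column sums)
    An = A / np.sqrt(np.outer(d1, d2))  # D1^{-1/2} A D2^{-1/2}, elementwise
    U, s, Vt = np.linalg.svd(An)
    z_docs = U[:, 1] / np.sqrt(d1)   # second left singular vector, rescaled
    z_words = Vt[1, :] / np.sqrt(d2)  # second right singular vector, rescaled
    return z_docs > 0, z_words > 0    # sign-based bipartition

# Two near-disconnected topics, weakly linked so the spectrum is non-degenerate.
A = np.array([[2.0, 2.0, 0.1, 0.0],
              [2.0, 2.0, 0.0, 0.1],
              [0.1, 0.0, 2.0, 2.0],
              [0.0, 0.1, 2.0, 2.0]])
doc_side, word_side = spectral_bipartition(A)
```

On this matrix the split recovers the two topic blocks, grouping documents {0, 1} with words {0, 1} and documents {2, 3} with words {2, 3}.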
Performance Measurements
Two types of measurements are generally used to evaluate the performance of text clustering algorithms: internal quality measures and external quality measures. The authors of [12] give a thorough introduction to various clustering measurements; readers can refer to their work for more details. A brief introduction to both types follows.
Internal Quality Measure
The internal quality measure compares different sets of clusters without reference to external knowledge (such as human-labeled classes/categories). One such measure calculates an “overall similarity” based on the pairwise similarity of documents in a cluster [12].
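A minimal sketch of such an overall-similarity score, assuming cosine-normalized document vectors (the function name is illustrative):

```python
def overall_similarity(cluster_vectors):
    """Average pairwise dot product (= cosine similarity for unit-normalized
    vectors) of the documents in one cluster; higher means tighter."""
    n = len(cluster_vectors)
    if n < 2:
        return 1.0  # a singleton cluster is trivially coherent
    sims = [sum(a * b for a, b in zip(cluster_vectors[i], cluster_vectors[j]))
            for i in range(n) for j in range(i + 1, n)]
    return sum(sims) / len(sims)

tight = overall_similarity([[1.0, 0.0], [1.0, 0.0]])
loose = overall_similarity([[1.0, 0.0], [0.0, 1.0]])
```

No labeled data is needed, which is what makes this an internal measure.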
External Quality Measure
An external quality measure, as the name suggests, leverages external knowledge such as known classes (categories) for comparison with the clusters generated by a clustering algorithm. Entropy [12] is one external measure; it provides a measure of “goodness” for un-nested clusters or for the clusters at one level of a hierarchical clustering. The F-measure is another example of an external quality measure, one more oriented toward measuring the effectiveness of a hierarchical clustering.
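The entropy measure can be sketched as the size-weighted average of each cluster's class-label entropy (names and the toy labels are illustrative; lower is better, with 0 meaning perfectly pure clusters):

```python
import math
from collections import Counter

def cluster_entropy(clusters, labels):
    """Entropy external measure: per-cluster entropy of the known class
    labels, weighted by cluster size and summed."""
    total = sum(len(c) for c in clusters)
    score = 0.0
    for c in clusters:
        counts = Counter(labels[i] for i in c)
        h = -sum((n / len(c)) * math.log2(n / len(c)) for n in counts.values())
        score += (len(c) / total) * h
    return score

labels = ["sports", "sports", "politics", "politics"]
pure = cluster_entropy([[0, 1], [2, 3]], labels)   # clusters match the classes
mixed = cluster_entropy([[0, 2], [1, 3]], labels)  # each cluster is half-and-half
```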
Readers should be aware that many other quality measures exist beyond those introduced here. More importantly, the performance of different clustering algorithms can vary substantially depending on which measure is applied [12].
Key Applications
Text clustering has many applications, including search results clustering, topic detection and tracking, and email clustering.
Recommended Reading
Croft WB. Organizing and searching large files of documents. Ph.D. thesis, University of Cambridge; 1978.
Cutting DR, Karger DR, Pedersen JO, Tukey JW. Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 1992. p. 318–29.
Day WH, Edelsbrunner H. Efficient algorithms for agglomerative hierarchical clustering methods. J Classif. 1984;1(2):1–24.
Dhillon IS. Co-clustering documents and words using bipartite spectral graph partitioning, UT CS Technical report #TR. Department of Computer Sciences, University of Texas, Austin; 2001.
Dumais S, Platt J, Heckerman D, Sahami M. Inductive learning algorithms and representations for text categorization. In: Proceedings of the 7th International Conference on Information and Knowledge Management; 1998. p. 148–55.
Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comput Surv. 1999;31(3):264–323.
Leouski AV, Croft WB. An evaluation of techniques for clustering search results. Technical report IR-76. Department of Computer Science, University of Massachusetts, Amherst; 1996.
Lewis DD. Representation quality in text classification: an introduction and experiment. In: Proceedings of the Workshop on Speech and Natural Language; 1990. p. 288–95.
MacQueen JB. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability; 1967. p. 281–97.
Nagy G. State of the art in pattern recognition. Proc IEEE. 1968;56(5):836–62.
Sebastiani F. Machine learning in automated text categorization. ACM Comput Surv. 2002;34(1):1–47.
Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. Technique report, University of Minnesota – Computer Science and Engineering; 2000.
van Rijsbergen CJ. Information retrieval. 2nd ed. London: Butterworths; 1979.
Yoo I, Hu XH. A comprehensive comparison study of document clustering for a biomedical digital library MEDLINE. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries; 2006. p. 220–9.
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
Cite this entry
Li, H. (2018). Text Clustering. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_415