Abstract
The problem of text clustering is that of partitioning a corpus into groups of similar documents. Clustering is an unsupervised learning application because no training data are used to provide guidance about specific types of groups (e.g., sports, politics, and so on).
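As a concrete illustration of the task, the sketch below clusters a toy corpus with scikit-learn (the software links at the end of the bibliography point to scikit-learn examples of the same kind). The four-document corpus and the choice of k = 2 are illustrative assumptions, not examples from the chapter.

```python
# Hypothetical sketch: tf-idf vectorization followed by k-means clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "football match and football players",
    "the football team scored a goal",
    "tax policy debated in parliament",
    "parliament passed the tax bill",
]

X = TfidfVectorizer().fit_transform(docs)   # sparse tf-idf document vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

Because TfidfVectorizer L2-normalizes the document vectors, Euclidean k-means on them behaves like clustering by cosine similarity, so the two sports documents receive one label and the two political documents the other.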
Notes
- 1.
The first eigenvector is not discriminative in terms of the clustering structure and can be dropped. Its value can be shown to depend only on the square-root of the frequency of the corresponding term or document.
- 2.
- 3.
Although \(\overline{X_{i}}\) is a binary vector, we are treating it like a set when we use a set-membership notation like \(t_{j} \in \overline{X_{i}}\). Any binary vector can also be viewed as a set of the 1s in it.
- 4.
This model is discussed only in later chapters. The uninitiated reader may choose to skip over this section in the first reading.
- 5.
The sums of the diagonal entries of S and Δ are the same because the trace of a matrix is invariant under similarity transformation [460]. Therefore, the eigenvalues sum to 0. Unless all eigenvalues are 0 (i.e., S = 0), at least one negative eigenvalue will exist.
- 6.
This section requires an understanding of the classification problem. We recommend the uninitiated reader to skip this section at the first reading of the book, and return to it only after covering the material in the next chapter. The notations and terminologies used in this section assume such an understanding.
- 7.
The sigmoid function \(1/(1 + e^{-\lambda s})\) can be used to convert an arbitrary score \(s\) to the range (0, 1), which is followed by normalizing the scores to sum to 1 over all classes.
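The claim in note 1 can be checked numerically. The sketch below assumes the bipartite spectral co-clustering normalization \(D_{1}^{-1/2} A D_{2}^{-1/2}\) of the term-document matrix (as in the Dhillon reference in the bibliography); the random count matrix is illustrative only. The leading singular vector pair comes out proportional to the square roots of the document and term frequencies, so it carries no cluster information and can be dropped.

```python
import numpy as np

# Hypothetical 4-document x 5-term count matrix with positive entries.
rng = np.random.default_rng(0)
A = rng.integers(1, 5, size=(4, 5)).astype(float)

# Bipartite co-clustering normalization: D1^{-1/2} A D2^{-1/2}.
d1 = A.sum(axis=1)                       # document frequencies (row sums)
d2 = A.sum(axis=0)                       # term frequencies (column sums)
An = A / np.sqrt(np.outer(d1, d2))

U, s, Vt = np.linalg.svd(An)

# Predicted leading singular vectors: sqrt of frequencies, normalized.
u1 = np.sqrt(d1) / np.linalg.norm(np.sqrt(d1))
v1 = np.sqrt(d2) / np.linalg.norm(np.sqrt(d2))
print(s[0])                              # leading singular value is 1
```

The identity follows because \(D_{1}^{-1} A D_{2}^{-1} A^{T}\) is row-stochastic, so its spectral radius is 1 and the corresponding singular vectors of the normalized matrix are the square-root frequency vectors.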
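The argument in note 5 is also easy to verify numerically, assuming (as the note implies) that S is symmetric with zero diagonal entries: the trace equals the sum of the eigenvalues, so the eigenvalues sum to 0 and at least one must be negative unless all are 0. The random matrix below is a hypothetical example.

```python
import numpy as np

# Hypothetical symmetric matrix S with zero diagonal (trace 0).
rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
S = (B + B.T) / 2.0
np.fill_diagonal(S, 0.0)

eig = np.linalg.eigvalsh(S)   # eigenvalues of the symmetric matrix
print(eig.sum(), eig.min())   # sum is 0 up to rounding; minimum is negative
```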
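The score-conversion recipe of note 7 can be sketched in a few lines; the value λ = 1 and the input scores are illustrative assumptions.

```python
import math

def scores_to_probs(scores, lam=1.0):
    """Squash each class score s with the sigmoid 1/(1 + exp(-lam * s)),
    then normalize the squashed values to sum to 1 over all classes."""
    squashed = [1.0 / (1.0 + math.exp(-lam * s)) for s in scores]
    total = sum(squashed)
    return [q / total for q in squashed]

probs = scores_to_probs([2.5, -0.3, 0.8])
print(probs)   # three probabilities in (0, 1), preserving the score order
```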
Bibliography
C. Aggarwal. Data mining: The textbook. Springer, 2015.
C. Aggarwal, S. Gates, and P. Yu. On using partial supervision for text categorization. IEEE Transactions on Knowledge and Data Engineering, 16(2), pp. 245–255, 2004. [Extended version of ACM KDD 1998 paper “On the merits of building categorization systems by supervised clustering.”]
C. Aggarwal and C. Reddy. Data clustering: Algorithms and applications. CRC Press, 2013.
C. Aggarwal and S. Sathe. Outlier ensembles: An introduction. Springer, 2017.
C. Aggarwal and P. Yu. On effective conceptual indexing and similarity search in text data. ICDM Conference, pp. 3–10, 2001.
C. Aggarwal and P. Yu. On clustering massive text and categorical data streams. Knowledge and Information Systems, 24(2), pp. 171–196, 2010.
C. Aggarwal and C. Zhai. Mining text data. Springer, 2012.
J. Allan, R. Papka, and V. Lavrenko. Online new event detection and tracking. ACM SIGIR Conference, 1998.
L. Baker and A. McCallum. Distributional clustering of words for text classification. ACM SIGIR Conference, pp. 96–103, 1998.
Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3, pp. 1137–1155, 2003.
P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. Advances in Knowledge Discovery and Data Mining, Eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. AAAI Press/MIT Press, 1996.
W. B. Croft. Clustering large files of documents using the single-link method. Journal of the American Society for Information Science, 28, pp. 341–344, 1977.
D. Cutting, D. Karger, J. Pedersen, and J. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. ACM SIGIR Conference, pp. 318–329, 1992.
I. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. ACM KDD Conference, pp. 269–274, 2001.
I. Dhillon and D. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1–2), pp. 143–175, 2001.
C. Ding, X. He, and H. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. SDM Conference, pp. 606–610, 2005.
C. Ding, T. Li, and W. Peng. On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics and Data Analysis, 52(8), pp. 3913–3927, 2008.
C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix t-factorizations for clustering. ACM KDD Conference, pp. 126–135, 2006.
J. Ghosh and A. Acharya. Cluster ensembles: Theory and applications. Data Clustering: Algorithms and Applications, CRC Press, 2013.
S. Gilpin, T. Eliassi-Rad, and I. Davidson. Guided learning for role discovery (glrd): framework, algorithms, and applications. ACM KDD Conference, pp. 113–121, 2013.
T. Hofmann. Probabilistic latent semantic indexing. ACM SIGIR Conference, pp. 50–57, 1999.
T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine learning, 41(1–2), pp. 177–196, 2001.
Q. Le and T. Mikolov. Distributed representations of sentences and documents. ICML Conference, pp. 1188–1196, 2014.
D. Lee and H. Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, pp. 556–562, 2001.
H. Li and K. Yamanishi. Document classification using a finite mixture model. ACL Conference, pp. 39–47, 1997.
Y. Li, C. Luo, and S. Chung. Text clustering with feature selection by using statistical data. IEEE Transactions on Knowledge and Data Engineering, 20(5), pp. 641–652, 2008.
S. Madeira and A. Oliveira. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 1(1), pp. 24–45, 2004.
A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013. https://arxiv.org/abs/1301.3781
G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography (special issue), 3(4), pp. 235–312, 1990. https://wordnet.princeton.edu/
A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. NIPS Conference, pp. 849–856, 2002.
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification with labeled and unlabeled data using EM. Machine Learning, 39(2), pp. 103–134, 2000.
H. Paulheim and R. Meusel. A decomposition of the outlier detection problem into a set of supervised learning problems. Machine Learning, 100(2–3), pp. 509–531, 2015.
F. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. ACL Conference, pp. 183–190, 1993.
H. Schütze and C. Silverstein. Projections for efficient document clustering. ACM SIGIR Conference, pp. 74–81, 1997.
F. Shahnaz, M. Berry, V. Pauca, and R. Plemmons. Document clustering using nonnegative matrix factorization. Information Processing and Management, 42(2), pp. 378–386, 2006.
G. Strang. An introduction to linear algebra. Wellesley Cambridge Press, 2009.
E. Voorhees. Implementing agglomerative hierarchic clustering algorithms for use in document retrieval. Information Processing and Management, 22(6), pp. 465–476, 1986.
W. Wilbur and K. Sirotkin. The automatic identification of stop words. Journal of Information Science, 18(1), pp. 45–55, 1992.
C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. NIPS Conference, 2000.
W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. ACM SIGIR Conference, pp. 267–273, 2003.
M. Zaki and W. Meira Jr. Data mining and analysis: Fundamental concepts and algorithms. Cambridge University Press, 2014.
O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. ACM SIGIR Conference, pp. 46–54, 1998.
Y. Zhao and G. Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3), pp. 311–331, 2004.
S. Zhong. Efficient streaming text clustering. Neural Networks, 18(5–6), 2005.
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this chapter
Aggarwal, C.C. (2018). Text Clustering. In: Machine Learning for Text. Springer, Cham. https://doi.org/10.1007/978-3-319-73531-3_4
DOI: https://doi.org/10.1007/978-3-319-73531-3_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73530-6
Online ISBN: 978-3-319-73531-3