Text Clustering



Text clustering is the problem of partitioning a corpus into groups of similar documents. Clustering is an unsupervised learning application because no training data is provided to guide the algorithm toward specific types of groups (e.g., sports, politics, and so on).
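As a rough illustration of partitioning a corpus without labels, the sketch below groups a toy corpus with a k-means-style loop over cosine similarity. It is only one of many possible approaches and is not taken from the chapter. The corpus, the whitespace tokenizer, the seed choice, and the function names (`tf_vector`, `cosine`, `centroid`, `cluster`) are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_vector(text):
    """Return an L2-normalized term-frequency vector as a sparse dict."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse unit vectors (their dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(x * v.get(word, 0.0) for word, x in u.items())


def centroid(vectors):
    """Return the normalized sum of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for word, x in vec.items():
            total[word] += x
    norm = math.sqrt(sum(x * x for x in total.values()))
    return {word: x / norm for word, x in total.items()}


def cluster(docs, seeds, max_iterations=10):
    """Assign each document to the most similar centroid until stable.

    `seeds` holds indices of the documents used as initial centroids;
    picking them by hand is an assumption that keeps the toy run repeatable.
    """
    vectors = [tf_vector(d) for d in docs]
    centroids = [vectors[i] for i in seeds]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(vec, centroids[k]))
            for vec in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in range(len(centroids)):
            members = [vec for vec, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = centroid(members)
    return labels


# Hypothetical toy corpus: no topic labels are given to the algorithm.
docs = [
    "the team won the football match",
    "the striker scored in the football match",
    "parliament passed the election bill",
    "senate debated the election bill",
]
labels = cluster(docs, seeds=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

The cluster identifiers 0 and 1 carry no meaning of their own. Reading them as "sports" and "politics" is an interpretation applied afterwards, which is what distinguishes clustering from supervised classification.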



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. IBM T. J. Watson Research Center, Yorktown Heights, USA
