Clustering small-sized collections of short texts


The need to cluster small text corpora composed of a few hundred short texts arises in various applications, e.g., clustering top-retrieved documents based on their snippets. This clustering task is challenging because of the vocabulary mismatch between short texts and because the small corpus size yields insufficient corpus-based statistics (e.g., term co-occurrence statistics). We address this challenge with a framework that utilizes a set of external knowledge resources providing information about term relations. Specifically, we use information induced from the resources to estimate similarity between terms and produce term clusters. We also use the resources to expand the vocabulary of the given corpus and thus enhance term clustering. We then project the texts in the corpus onto the term clusters to cluster the texts. We evaluate various instantiations of the proposed framework by varying the term clustering method, the approach for projecting the texts onto the term clusters, and the way external knowledge resources are applied. Extensive empirical evaluation demonstrates the merits of our approach compared with applying clustering algorithms directly to the text corpus and with state-of-the-art co-clustering and topic modeling methods.
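
The two-stage idea in the abstract (cluster terms first, then project texts onto the term clusters) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the `RELATED` scores stand in for the external knowledge resources, the single-link grouping with a threshold of 0.5 is a simple substitute for the term clustering methods the paper evaluates, and the overlap-count projection is only one conceivable way of associating texts with term clusters. Vocabulary expansion is not shown.

```python
from collections import defaultdict

# Hypothetical term-relation scores standing in for the external knowledge
# resources; the pairs and values are invented for illustration only.
RELATED = {
    frozenset(("cheap", "inexpensive")): 0.9,
    frozenset(("price", "cost")): 0.8,
    frozenset(("cheap", "price")): 0.6,
    frozenset(("slow", "delay")): 0.8,
    frozenset(("delivery", "shipping")): 0.9,
    frozenset(("delay", "shipping")): 0.6,
}


def term_similarity(a, b):
    """Return the (assumed) resource-based similarity of two terms."""
    if a == b:
        return 1.0
    return RELATED.get(frozenset((a, b)), 0.0)


def cluster_terms(vocabulary, threshold=0.5):
    """Group terms linked by a chain of pairs scoring at least `threshold`."""
    terms = sorted(vocabulary)
    parent = {t: t for t in terms}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if term_similarity(a, b) >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(set)
    for t in terms:
        groups[find(t)].add(t)
    # Sort clusters by their sorted members for a deterministic order.
    return sorted(groups.values(), key=sorted)


def project_texts(texts, term_clusters):
    """Assign each text to the term cluster it shares the most terms with."""
    assignment = {}
    for idx, text in enumerate(texts):
        tokens = set(text.lower().split())
        overlaps = [len(tokens & cluster) for cluster in term_clusters]
        best = max(range(len(overlaps)), key=overlaps.__getitem__, default=None)
        if best is not None and overlaps[best] == 0:
            best = None  # the text shares no term with any cluster
        assignment[idx] = best
    return assignment


texts = ["cheap price", "inexpensive cost", "slow delivery", "shipping delay"]
vocabulary = {tok for text in texts for tok in text.split()}
term_clusters = cluster_terms(vocabulary)
assignment = project_texts(texts, term_clusters)
```

In this toy corpus the first two texts share no word, so a bag-of-words similarity between them is zero, yet both are projected onto the same term cluster because the assumed resource relates their terms.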


  1. We note that our approach of associating texts with term clusters is shown below to outperform, in our setting, a method commonly used in co-clustering algorithms.

  2. Implementation available at

  3. The datasets are available at

  4. To account for multi-cluster assignment, we used the adaptation suggested by Rosenberg and Binkowski (2004), according to which each text has partial membership in each of its multiple clusters.

  5. We publish our list of stopwords along with the datasets.

  6. Available for download from

  7. We used the following tools: LingPipe for Complete Link, R implementation of the PAM algorithm, for co-clustering and for LDA. For the KMY baseline, we clustered the texts with the algorithm in Fig. 6, using \(\theta =0.7\) as suggested in the original report (Ye and Young 2006). We also experimented with other thresholds (0 through 1 in steps of 0.1) and found 0.7 to be one of the best for our setting. The algorithm is not very sensitive to threshold values around 0.7, although much higher or lower thresholds in the range [0, 1] degrade performance.

  8. We used the word2vec software accompanying (Mikolov et al. 2013) with context size of 5, the negative-training approach with 15 negative samples (NEG-15), and sub-sampling of frequent words with a parameter of \(10^{-5}\). The parameter settings follow Mikolov et al. (2013).

  9. We also experimented with K randomly selected initial medoids, but using each text as an initial medoid gave better results. Since our text collections are small, this is not computationally expensive.
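
The partial-membership idea mentioned in note 4 can be sketched as follows. This is only an illustration under stated assumptions: a text assigned to k clusters is given an equal weight of 1/k in each (the weighting actually used by Rosenberg and Binkowski (2004) may differ), and the fractional contingency table built from the product of gold and predicted weights is a hypothetical way of feeding such memberships into an evaluation measure. The function names and the example labels are invented.

```python
from collections import defaultdict


def partial_memberships(assignments):
    """Map each text id to {cluster: weight}; weights sum to 1 per text.

    Assumption: uniform weights of 1/k over the k clusters of a text.
    """
    weights = {}
    for text_id, clusters in assignments.items():
        unique = sorted(set(clusters))
        if not unique:
            weights[text_id] = {}
            continue
        share = 1.0 / len(unique)
        weights[text_id] = {cluster: share for cluster in unique}
    return weights


def fractional_contingency(gold, predicted):
    """Build {(gold cluster, predicted cluster): fractional text count}."""
    gold_w = partial_memberships(gold)
    pred_w = partial_memberships(predicted)
    table = defaultdict(float)
    # Only texts present in both assignments contribute.
    for text_id in gold_w.keys() & pred_w.keys():
        for g, g_weight in gold_w[text_id].items():
            for p, p_weight in pred_w[text_id].items():
                table[(g, p)] += g_weight * p_weight
    return dict(table)


# Hypothetical example: text "t2" belongs to two clusters on both sides.
gold = {"t1": ["billing"], "t2": ["billing", "delivery"]}
predicted = {"t1": ["A"], "t2": ["A", "B"]}
table = fractional_contingency(gold, predicted)
```

With uniform weights each text contributes a total mass of 1 to the table, so the entries sum to the number of texts, as they would with single-cluster assignments.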


  1. Aggarwal, C.C., & Zhai, C. (2012). A survey of text clustering algorithms. In Mining text data (pp. 77–128). Springer.

  2. Allan, J., Papka, R., & Lavrenko, V. (1998). On-line new event detection and tracking. In Proceedings of SIGIR (pp. 37–45).

  3. Aslam, J. A., Ekstrand-Abueg, M., Pavlu, V., Diaz, F., McCreadie, R., & Sakai, T. (2014). TREC 2014 temporal summarization track overview. In Proceedings of TREC.

  4. Baker, L. D., & McCallum, A. K. (1998). Distributional clustering of words for text classification. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval (pp. 96–103). ACM.

  5. Barak, L., Dagan, I., & Shnarch, E. (2009). Text categorization from category name via lexical reference. In Proceedings of human language technologies: the 2009 annual conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, NAACL-Short ’09 (pp. 33–36), Association for Computational Linguistics, Stroudsburg, PA, USA.

  6. Berger, A.L., & Lafferty, J.D. (1999). Information retrieval as statistical translation. In Proceedings of SIGIR (pp. 222–229).

  7. Biemann, C. (2006). Chinese whispers: An efficient graph clustering algorithm and its application to natural language processing problems. In Proceedings of the first workshop on graph based methods for natural language processing (pp. 73–80). Association for Computational Linguistics.

  8. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

  9. Boros, E., Kantor, P. B., & Neu, D. J. (2001). A clustering based approach to creating multi-document summaries. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval.

  10. De Boom, C., Van Canneyt, S., Demeester, T., & Dhoedt, B. (2016). Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters, 80, 150–156.

  11. Denkowski, M., & Lavie, A. (2010). Meteor-next and the meteor paraphrase tables: Improved evaluation support for five target languages. In Proceedings of the joint fifth workshop on statistical machine translation and MetricsMATR (pp. 339–342). Association for Computational Linguistics.

  12. Dhillon, I.S., Mallela, S., & Modha, D.S. (2003). Information-theoretic co-clustering. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 89–98). ACM.

  13. Di Marco, A., & Navigli, R. (2013). Clustering and diversifying web search results with graph-based word sense induction. Computational Linguistics, 39(3), 709–754.

  14. Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural computation, 10(7), 1895–1923.

  15. Erkan, G., & Radev, D. R. (2004). Lexrank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22, 457–479.

  16. Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge: MIT Press.

  17. Ferraresi, A., Zanchetta, E., Baroni, M., & Bernardini, S. (2008). Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th web as corpus workshop (WAC-4): Can we beat Google? (pp. 47–54).

  18. Gabrilovich, E., & Markovitch, S. (2006). Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. Proceedings of AAAI, 6, 1301–1306.

  19. Ganitkevitch, J., Van Durme, B., & Callison-Burch, C. (2013). Ppdb: The paraphrase database. In HLT-NAACL (pp. 758–764).

  20. Glickman, O., Shnarch, E., & Dagan, I. (2006). Lexical reference: A semantic matching subtask. In Proceedings of the 2006 conference on empirical methods in natural language processing, EMNLP ’06 (pp. 172–179). Association for Computational Linguistics, Stroudsburg, PA, USA.

  21. Green, S. J. (1999). Building hypertext links by computing semantic similarity. IEEE Transactions on Knowledge and Data Engineering, 11(5), 713–730.

  22. Habash, N., & Dorr, B. (2003). Catvar: A database of categorial variations for english. In Proceedings of the MT summit (pp. 471–474).

  23. Hearst, M.A., Karger, D.R., & Pedersen, J.O. (1995). Scatter/gather as a tool for the navigation of retrieval results. In Working Notes of the 1995 AAAI fall symposium on AI applications in knowledge navigation and retrieval.

  24. Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of SIGIR (pp. 50–57).

  25. Hotho, A., Staab, S., & Stumme, G. (2003). Ontologies improve text document clustering. In Third IEEE international conference on data mining, 2003. ICDM 2003 (pp. 541–544). IEEE.

  26. Hu, J., Fang, L., Cao, Y., Zeng, H.J., Li, H., et al. (2008). Enhancing text clustering by leveraging wikipedia semantics. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (pp. 179–186). ACM.

  27. Hu, X., Zhang, X., Lu, C., Park, E.K., & Zhou, X. (2009). Exploiting wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 389–396). ACM.

  28. Karimzadehgan, M., & Zhai, C. (2010). Estimation of statistical translation models based on mutual information for ad hoc information retrieval. In Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval (pp. 323–330). ACM.

  29. Kaufman, L., & Rousseeuw, P. (1990). Finding groups in data: an introduction to cluster analysis. Applied probability and statistics section (EUA): Wiley series in probability and mathematical statistics.

  30. Kenter, T., & De Rijke, M. (2015). Short text similarity with word embeddings. In Proceedings of the 24th ACM international on conference on information and knowledge management (pp. 1411–1420). ACM.

  31. Kotlerman, L., Dagan, I., Gorodetsky, M., & Daya, E. (2012a). Sentence clustering via projection over term clusters. In *SEM 2012: The first joint conference on lexical and computational semantics, Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the sixth international workshop on semantic evaluation (SemEval 2012) (pp. 38–43). Association for Computational Linguistics, Montréal, Canada.

  32. Kotlerman, L., Dagan, I., Magnini, B., & Bentivogli, L. (2015b). Textual entailment graphs. Natural Language Engineering, 21(5), 699–724.

  33. Kotlerman, L., Dagan, I., Szpektor, I., & Zhitomirsky-Geffet, M. (2010). Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4), 359–389.

  34. Kurland, O. (2009). Re-ranking search results using language models of query-specific clusters. Information Retrieval, 12(4), 437–460.

  35. Levy, O., & Goldberg, Y. (2014). Dependency-based word embeddings. In Proceedings of the 52nd annual meeting of the association for computational linguistics (Vol. 2, pp. 302–308).

  36. Li, W., & McCallum, A. (2006). Pachinko allocation: Dag-structured mixture models of topic correlations. In Proceedings of ICML (pp. 577–584).

  37. Liebeskind, C., Kotlerman, L., & Dagan, I. (2015). Text categorization from category name in an industry-motivated scenario. Language resources and evaluation (pp. 1–35).

  38. Lin, D. (1998). Automatic retrieval and clustering of similar words. In COLING-ACL (pp. 768–774).

  39. Liu, X., & Croft, W.B. (2004). Cluster-based retrieval using language models. In Proceedings of SIGIR (pp. 186–193).

  40. Metzler, D., Dumais, S., & Meek, C. (2007). Similarity measures for short segments of text. Berlin: Springer.

  41. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).

  42. Naughton, M., Kushmerick, N., & Carthy, J. (2006). Clustering sentences for discovering events in news articles. In Advances in information retrieval (pp. 535–538). Springer.

  43. Nomoto, T., & Matsumoto, Y. (2001). A new approach to unsupervised text summarization. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 26–34). ACM.

  44. Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global vectors for word representation. EMNLP, 14, 1532–1543.

  45. Phan, X.H., Nguyen, L.M., & Horiguchi, S. (2008). Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th international conference on world wide web (pp. 91–100). ACM.

  46. Raiber, F., Kurland, O., Radlinski, F., & Shokouhi, M. (2015). Learning asymmetric co-relevance. In Proceedings of ICTIR (pp. 281–290).

  47. Rose, T., Stevenson, M., & Whitehead, M. (2002). The reuters corpus volume 1-from yesterday’s news to tomorrow’s language resources. LREC, 2, 827–832.

  48. Rosenberg, A., & Binkowski, E. (2004). Augmenting the kappa statistic to determine interannotator reliability for multiply labeled data points. In Proceedings of HLT-NAACL 2004: short papers (pp. 77–80). Association for Computational Linguistics.

  49. Sahami, M., & Heilman, T.D. (2006). A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of WWW (pp. 377–386).

  50. Sedding, J., & Kazakov, D. (2004). Wordnet-based text document clustering. In Proceedings of the 3rd workshop on robust methods in analysis of natural language data (pp. 104–113). Association for Computational Linguistics.

  51. Severyn, A., & Moschitti, A. (2015). Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval (pp. 373–382). ACM.

  52. Shehata, S. (2009). A wordnet-based semantic model for enhancing text clustering. In IEEE international conference on data mining workshops, 2009. ICDMW’09 (pp. 477–482). IEEE.

  53. Shnarch, E., Barak, L., & Dagan, I. (2009). Extracting lexical reference rules from wikipedia. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP: Volume 1, ACL ’09 (pp. 450–458). Association for Computational Linguistics, Stroudsburg, PA, USA.

  54. Steinbach, M., Karypis, G., Kumar, V., et al. (2000). A comparison of document clustering techniques. In KDD workshop on text mining (Vol. 400, pp. 525–526). Boston

  55. Tan, B., Velivelli, A., Fang, H., & Zhai, C. (2007). Term feedback for information retrieval with language models. In SIGIR 2007: proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval (pp. 263–270), Amsterdam, The Netherlands, July 23–27, 2007.

  56. Tsur, O., Littman, A., & Rappoport, A. (2013). Efficient clustering of short messages into general domains. In Proceedings of ICWSM.

  57. Udupa, R., Bhole, A., & Bhattacharyya, P. (2009). “A term is known by the company it keeps”: On selecting a good expansion set in pseudo-relevance feedback. In Proceedings of second international conference on the theory of information retrieval, advances in information retrieval theory, ICTIR 2009 (pp. 104–115), Cambridge, UK, September 10–12, 2009.

  58. Whissell, J. S., & Clarke, C. L. (2011). Improving document clustering using okapi bm25 feature weighting. Information retrieval, 14(5), 466–487.

  59. Ye, H., & Young, S. (2006). A clustering approach to semantic decoding. In Ninth international conference on spoken language processing.

  60. Zamir, O., & Etzioni, O. (1998). Web document clustering: A feasibility demonstration. In Proceedings of SIGIR (pp. 46–54).


This work was partially supported by the MAGNETON Grant No. 43834 of the Israel Ministry of Industry, Trade and Labor, the Israel Ministry of Science and Technology, the Israel Science Foundation Grant 1112/08 and Grant 1136/17, the PASCAL-2 Network of Excellence of the European Community FP7-ICT-2007-1-216886, and the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 287923 (EXCITEMENT). We would like to thank NICE Systems, and especially Maya Gorodetsky, Gennadi Lembersky and Ezra Daya, for help in creating the datasets. Finally, we thank the anonymous reviewers for their useful comments and suggestions.

Author information



Corresponding author

Correspondence to Lili Kotlerman.

Cite this article

Kotlerman, L., Dagan, I. & Kurland, O. Clustering small-sized collections of short texts. Inf Retrieval J 21, 273–306 (2018).


  • Clustering
  • Clustering short texts
  • Short text similarities