The need to cluster small text corpora composed of a few hundreds of short texts rises in various applications; e.g., clustering top-retrieved documents based on their snippets. This clustering task is challenging due to the vocabulary mismatch between short texts and the insufficient corpus-based statistics (e.g., term co-occurrence statistics) due to the corpus size. We address this clustering challenge using a framework that utilizes a set of external knowledge resources that provide information about term relations. Specifically, we use information induced from the resources to estimate similarity between terms and produce term clusters. We also utilize the resources to expand the vocabulary used in the given corpus and thus enhance term clustering. We then project the texts in the corpus onto the term clusters to cluster the texts. We evaluate various instantiations of the proposed framework by varying the term clustering method used, the approach of projecting the texts onto the term clusters, and the way of applying external knowledge resources. Extensive empirical evaluation demonstrates the merits of our approach with respect to applying clustering algorithms directly on the text corpus, and using state-of-the-art co-clustering and topic modeling methods.
This is a preview of subscription content, access via your institution.
Buy single article
Instant access to the full article PDF.
Tax calculation will be finalised during checkout.
Subscribe to journal
Immediate online access to all issues from 2019. Subscription will auto renew annually.
Tax calculation will be finalised during checkout.
We note that our approach of associating texts with term clusters is shown below to outperform in our setting a method which is commonly used in co-clustering algorithms.
Implementation available at https://marketplace.gephi.org/plugin/chinese-whispers-clustering/
The datasets are available at http://u.cs.biu.ac.il/~davidol/.
To account for multi-cluster assignment, we used the adaptation suggested by Rosenberg and Binkowski (2004), according to which each text has partial membership in each of its multiple clusters.
We publish our list of stopwords along with the datasets.
Available for download from https://github.com/hltfbk/EOP-1.2.0/wiki/English-Knowledge-Resources.
We used the following tools: LingPipe http://alias-i.com/lingpipe/index.html for Complete Link, R implementation of the PAM algorithm https://cran.r-project.org/web/packages/cluster/cluster.pdf, http://www.cs.utexas.edu/users/dml/Software/cocluster.html for co-clustering and http://mallet.cs.umass.edu for LDA. For the KMY baseline we applied for text clustering the algorithm in Fig. 6 with \(\theta =0.7\) as suggested in the original report (Ye and Young 2006). We experimented with additional thresholds (0 through 1 with the step of 0.1) and found the threshold of 0.7 to be one of the best for our setting. The algorithm is not very sensitive to threshold values around 0.7, although much higher and lower thresholds from the range of [0, 1] result in degraded performance.
We experimented also with K randomly selected initial medoids, but having each text as an initial medoid showed better results. Since our text collections are small this is not computationally expensive.
Aggarwal, C.C., & Zhai, C. (2012). A survey of text clustering algorithms. In Mining text data (pp. 77–128). Springer.
Allan, J., Papka, R., & Lavrenko, V. (1998). On-line new event detection and tracking. In Proceedings of SIGIR (pp. 37–45).
Aslam, J. A., Ekstrand-Abueg, M., Pavlu, V., Diaz, F., McCreadie, R., & Sakai, T. (2014). TREC 2014 temporal summarization track overview. In Proceedings of TREC.
Baker, L. D., & McCallum, A. K. (1998). Distributional clustering of words for text classification. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval (pp. 96–103). ACM (1998)
Barak, L., Dagan, I., & Shnarch, E. (2009). Text categorization from category name via lexical reference. In Proceedings of human language technologies: the 2009 annual conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, NAACL-Short ’09 (pp. 33–36), Association for Computational Linguistics, Stroudsburg, PA, USA. http://dl.acm.org/citation.cfm?id=1620853.1620864
Berger, A.L., & Lafferty, J.D. (1999). Information retrieval as statistical translation. In Proceedings of SIGIR (pp. 222–229).
Biemann, C. (2006). Chinese whispers: An efficient graph clustering algorithm and its application to natural language processing problems. In Proceedings of the first workshop on graph based methods for natural language processing (pp. 73–80). Association for Computational Linguistics.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. the. Journal of machine Learning research, 3, 993–1022.
Boros, E., Kantor, P. B., & Neu, D. J. (2001). A clustering based approach to creating multi-document summaries. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval.
De Boom, C., Van Canneyt, S., Demeester, T., & Dhoedt, B. (2016). Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters, 80, 150–156.
Denkowski, M., & Lavie, A. (2010). Meteor-next and the meteor paraphrase tables: Improved evaluation support for five target languages. In Proceedings of the joint fifth workshop on statistical machine translation and MetricsMATR, (pp. 339–342). Association for Computational Linguistics.
Dhillon, I.S., Mallela, S., & Modha, D.S. (2003). Information-theoretic co-clustering. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 89–98). ACM.
Di Marco, A., & Navigli, R. (2013). Clustering and diversifying web search results with graph-based word sense induction. Computational Linguistics, 39(3), 709–754.
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural computation, 10(7), 1895–1923.
Erkan, G., & Radev, D. R. (2004). Lexrank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22, 457–479.
Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge: MIT Press.
Ferraresi, A., Zanchetta, E., Baroni, M., & Bernardini, S. (2008). Introducing and evaluating ukwac, a very large web-derived corpus of english. In Proceedings of the 4th web as corpus workshop (WAC-4) can we beat Google (pp. 47–54).
Gabrilovich, E., & Markovitch, S. (2006). Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. Proceedings of AAAI, 6, 1301–1306.
Ganitkevitch, J., Van Durme, B., & Callison-Burch, C. (2013). Ppdb: The paraphrase database. In HLT-NAACL (pp. 758–764).
Glickman, O., Shnarch, E., & Dagan, I. (2006). Lexical reference: A semantic matching subtask. In Proceedings of the 2006 conference on empirical methods in natural language processing, EMNLP ’06 (pp. 172–179). Association for Computational Linguistics, Stroudsburg, PA, USA. http://dl.acm.org/citation.cfm?id=1610075.1610103
Green, S. J. (1999). Building hypertext links by computing semantic similarity. IEEE Transactions on Knowledge and Data Engineering, 11(5), 713–730.
Habash, N., & Dorr, B. (2003). Catvar: A database of categorial variations for english. In Proceedings of the MT summit (pp. 471–474).
Hearst, M.A., Karger, D.R., & Pedersen, J.O. (1995). Scatter/gather as a tool for the navigation of retrieval results. In Working Notes of the 1995 AAAI fall symposium on AI applications in knowledge navigation and retrieval.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of SIGIR (pp. 50–57).
Hotho, A., Staab, S., Stumme, G. (2003). Ontologies improve text document clustering. In Third IEEE international conference on data mining, 2003. ICDM 2003 (pp. 541–544). IEEE.
Hu, J., Fang, L., Cao, Y., Zeng, H.J., Li, H., et al. (2008). Enhancing text clustering by leveraging wikipedia semantics. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (pp. 179–186). ACM.
Hu, X., Zhang, X., Lu, C., Park, E.K., & Zhou, X. (2009). Exploiting wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 389–396). ACM.
Karimzadehgan, M., & Zhai, C. (2010). Estimation of statistical translation models based on mutual information for ad hoc information retrieval. In Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval (pp. 323–330). ACM.
Kaufman, L., & Rousseeuw, P. (1990). Finding groups in data: an introduction to cluster analysis. Applied probability and statistics section (EUA): Wiley series in probability and mathematical statistics.
Kenter, T., & De Rijke, M. (2015). Short text similarity with word embeddings. In Proceedings of the 24th ACM international on conference on information and knowledge management (pp. 1411–1420). ACM.
Kotlerman, L., Dagan, I., Gorodetsky, M., & Daya, E. (2012a). Sentence clustering via projection over term clusters. In: *SEM 2012: The first joint conference on lexical and computational semantics—Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the sixth international workshop on semantic evaluation (SemEval 2012) (pp. 38–43). Association for Computational Linguistics, Montréal, Canada (2012). http://www.aclweb.org/anthology/S12-1005
Kotlerman, L., Dagan, I., Magnini, B., & Bentivogli, L. (2015b). Textual entailment graphs. Natural Language Engineering, 21(5), 699–724.
Kotlerman, L., Dagan, I., Szpektor, I., & Zhitomirsky-Geffet, M. (2010). Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4), 359–389. http://dblp.uni-trier.de/db/journals/nle/nle16.html#KotlermanDSZ10
Kurland, O. (2009). Re-ranking search results using language models of query-specific clusters. Information Retrieval, 12(4), 437–460.
Levy, O., & Goldberg, Y. (2014). Dependencybased word embeddings. In Proceedings of the 52nd annual meeting of the association for computational linguistics (Vol. 2, pp. 302–308).
Li, W., & McCallum, A. (2006). Pachinko allocation: Dag-structured mixture models of topic correlations. In Proceedings of ICML (pp. 577–584).
Liebeskind, C., Kotlerman, L., Dagan, I. (2015). Text categorization from category name in an industry-motivated scenario. Language resources and evaluation (pp. 1–35).
Lin, D. (1998). Automatic retrieval and clustering of similar words. In COLING-ACL (pp. 768–774).
Liu, X., & Croft, W.B. (2004). Cluster-based retrieval using language models. In Proceedings of SIGIR (pp. 186–193).
Metzler, D., Dumais, S., & Meek, C. (2007). Similarity measures for short segments of text. Berlin: Springer.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).
Naughton, M., Kushmerick, N., & Carthy, J. (2006). Clustering sentences for discovering events in news articles. In Advances in information retrieval (pp. 535–538). Springer.
Nomoto, T., & Matsumoto, Y. (2001). A new approach to unsupervised text summarization. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 26–34). ACM.
Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global vectors for word representation. EMNLP, 14, 1532–1543.
Phan, X.H., Nguyen, L.M., & Horiguchi, S.: Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th international conference on world wide web, (pp. 91–100). ACM.
Raiber, F., Kurland, O., Radlinski, F., & Shokouhi, M. (2015). Learning asymmetric co-relevance. In Proceedings of ICTIR (pp. 281–290).
Rose, T., Stevenson, M., & Whitehead, M. (2002). The reuters corpus volume 1-from yesterday’s news to tomorrow’s language resources. LREC, 2, 827–832.
Rosenberg, A., & Binkowski, E. (2004). Augmenting the kappa statistic to determine interannotator reliability for multiply labeled data points. In: Proceedings of HLT-NAACL 2004: short papers (pp. 77–80). Association for Computational Linguistics.
Sahami, M., Heilman, T.D.: A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of WWW (pp. 377–386).
Sedding, J., & Kazakov, D. (2004). Wordnet-based text document clustering. In Proceedings of the 3rd workshop on robust methods in analysis of natural language data (pp. 104–113). Association for Computational Linguistics.
Severyn, A., & Moschitti, A. (2015). Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval (pp. 373–382). ACM.
Shehata, S. (2009). A wordnet-based semantic model for enhancing text clustering. In IEEE international conference on data mining workshops, 2009. ICDMW’09, (pp. 477–482). IEEE.
Shnarch, E., Barak, L., & Dagan, I. Extracting lexical reference rules from wikipedia. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP: Volume 1-Volume 1, ACL ’09 (pp. 450–458). Association for Computational Linguistics, Stroudsburg, PA, USA. http://dl.acm.org/citation.cfm?id=1687878.1687942
Steinbach, M., Karypis, G., Kumar, V., et al. (2000). A comparison of document clustering techniques. In KDD workshop on text mining (Vol. 400, pp. 525–526). Boston
Tan, B., Velivelli, A., Fang, H., & Zhai, C. (2007). Term feedback for information retrieval with language models. In SIGIR 2007: proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval (pp. 263–270), Amsterdam, The Netherlands, July 23–27, 2007. https://doi.org/10.1145/1277741.1277788.
Tsur, O., Littman, A., & Rappoport, A. (2013). Efficient clustering of short messages into general domains. In Proceedings of ICWSM.
Udupa, R., Bhole, A., & Bhattacharyya, P. (2009). “A term is known by the company it keeps”: On selecting a good expansion set in pseudo-relevance feedback. In Proceedings of second international conference on the theory of information retrieval, advances in information retrieval theory, ICTIR 2009 (pp. 104–115), Cambridge, UK, September 10–12, 2009. https://doi.org/10.1007/978-3-642-04417-5
Whissell, J. S., & Clarke, C. L. (2011). Improving document clustering using okapi bm25 feature weighting. Information retrieval, 14(5), 466–487.
Ye, H., & Young, S. (2006). A clustering approach to semantic decoding. In: Ninth international conference on spoken language processing.
Zamir, O., & Etzioni, O. (1998). Web document clustering: A feasibility demonstration. In Proceedings of SIGIR (pp. 46–54).
This work was partially supported by the MAGNETON Grant No. 43834 of the Israel Ministry of Industry, Trade and Labor, the Israel Ministry of Science and Technology, the Israel Science Foundation Grant 1112/08 and Grant 1136/17 the PASCAL-2 Network of Excellence of the European Community FP7-ICT-2007-1-216886 and the European Communitys Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 287923 (EXCITEMENT). We would like to thank NICE Systems and especially Maya Gorodetsky, Gennadi Lembersky and Ezra Daya for help in creating the datasets. Finally, we thank the anonymous reviewers for their useful comments and suggestions.
About this article
Cite this article
Kotlerman, L., Dagan, I. & Kurland, O. Clustering small-sized collections of short texts. Inf Retrieval J 21, 273–306 (2018). https://doi.org/10.1007/s10791-017-9324-8
- Clustering short texts
- Short text similarities