The clustering of short texts is an emerging field of research. Because the texts are short and term occurrences are sparse, short texts lack contextual information, which often makes them ambiguous. As most words occur only once in a short text, representing texts as term frequency-inverse document frequency (tf-idf) vectors does not work well in this case. Without contextual information and with only a small number of words available in a document, achieving semantic comparisons at an acceptable level is a challenge.
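To make the sparsity problem concrete, the following minimal sketch (an illustration only, assuming scikit-learn; the snippets are hypothetical) shows that tf-idf vectors of short texts have only a handful of non-zero entries and may share none even for semantically related texts:

```python
# Minimal sketch (illustration): tf-idf vectors of short texts are extremely
# sparse and related snippets may share almost no non-zero features.
from sklearn.feature_extraction.text import TfidfVectorizer

snippets = [
    "jaguar speed engine review",          # hypothetical short texts
    "big cat spotted in the rainforest",
    "new car model fuel consumption",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(snippets)

# Each row has only a few non-zero entries out of the whole vocabulary,
# and related snippets (e.g. the first and third) may overlap on none of them.
print(X.shape)                                # (3, vocabulary size)
print(X.nnz / (X.shape[0] * X.shape[1]))      # fraction of non-zero entries
```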
The problems with short texts can be analyzed from various points of view. An overview of the issues is presented by Shrestha et al. (2012). The authors consider several classical data mining clustering algorithms (Complete Link, Single Link, Average Link hierarchical clustering and Spectral clustering) and analyze various quality measures with respect to the usability of the clustering results. They conclude that the values given by the evaluation methods do not always reflect the readability of the clusters.
The issue of similarity measures is discussed by Sahami and Heilman (2006). The authors propose a novel method for measuring the similarity between short text snippets (even those without any overlapping terms) by leveraging web search results to provide a wider context for short texts. A similarity kernel function is then defined, which operates on the initial texts together with the results returned by the Google search engine.
The issue of similarity measures is also discussed by Metzler et al. (2007). The authors study the problem from an information retrieval perspective. Since standard text similarity measures perform poorly on short texts, the authors analyze the usefulness of purely lexical measures, stemming, and language-modeling-based measures.
One of the approaches to short text clustering is to enrich the representations of texts with semantics. In order to address the problem, some research focuses on the semantic enrichment of the context of data from external knowledge resources (Hu et al. 2009; Hotho et al. 2003). These proposals are still far from achieving high-quality clustering; therefore, we decided to conduct research on semantic enrichment with the use of new semantic-oriented approaches, namely the BabelNet ecosystem and neural-network based distributional semantic models.
In general, one can distinguish three main groups of methods: those based on surface representation; sense-enhanced ones; and those based on enriching texts with additional terms from external resources. Below we discuss them in more detail.
Surface representation methods
Surface representation methods are usually based on collocation frequencies and ignore word senses. The clustering algorithms in this group can be classified as data-centric or description-centric.
Data-centric text clustering
The data-centric approach focuses more on the problem of data clustering than on presenting the results to the user. The Scatter/Gather algorithm (Cutting et al. 1992) is an example of this approach. It divides the data set into a small number of clusters and, after the user selects a group, it performs clustering again, proceeding iteratively with the use of the Buckshot and Fractionation procedures. Other data-centric methods use Bisecting k-means or hierarchical agglomerative clustering. The Bisecting k-means algorithm by Steinbach et al. (2000) starts from a single cluster that contains all points. It iteratively finds divisible clusters at the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. If bisecting all divisible clusters at the last level would give more than k leaf clusters, larger clusters receive higher priority.
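The bisecting strategy can be sketched as follows (an illustrative implementation only, assuming scikit-learn and numpy; splitting the largest cluster first is one possible priority policy):

```python
# Minimal sketch of bisecting k-means: repeatedly split the largest cluster
# with 2-means until k clusters remain (assumes scikit-learn and numpy).
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, random_state=0):
    clusters = [np.arange(X.shape[0])]           # start with one cluster of all points
    while len(clusters) < k:
        # pick the largest cluster (larger clusters receive higher priority)
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        if len(members) < 2:                     # cluster is not divisible
            clusters.append(members)
            break
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=random_state).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

# Example: 100 random 2-D points split into 4 clusters.
X = np.random.RandomState(0).rand(100, 2)
print([len(c) for c in bisecting_kmeans(X, 4)])
```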
Description-centric text clustering
Description-centric approaches focus on describing the resulting clusters. They reveal diverse groups of semantically related documents associated with meaningful, comprehensible and compact text labels. Among the most popular and successful approaches are phrase-based ones, which form clusters based on recurring phrases instead of the numerical frequencies of isolated terms. The Suffix Tree Clustering (STC) algorithm, presented by Zamir and Etzioni (1998), employs frequently recurring phrases as both a document similarity feature and a final cluster description. Clustering with the STC algorithm essentially amounts to finding groups of documents sharing a high ratio of frequent phrases.
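The following is not the suffix-tree algorithm itself, but a simplified sketch of the underlying idea: documents sharing a frequent phrase (here approximated by word n-grams) form a base cluster (plain Python, hypothetical snippets):

```python
# Simplified illustration of the idea behind STC (not the suffix-tree
# construction): group documents by the frequent word n-grams they share.
from collections import defaultdict

def frequent_phrases(docs, n=2, min_df=2):
    """Return word n-grams occurring in at least min_df documents."""
    df = defaultdict(set)
    for i, doc in enumerate(docs):
        words = doc.lower().split()
        for j in range(len(words) - n + 1):
            df[tuple(words[j:j + n])].add(i)
    return {p: ids for p, ids in df.items() if len(ids) >= min_df}

docs = [
    "apple releases new phone",          # hypothetical snippets
    "new phone from apple announced",
    "apple pie recipe with cinnamon",
    "easy apple pie recipe",
]

# Each frequent phrase defines a base cluster of the documents containing it.
for phrase, doc_ids in frequent_phrases(docs).items():
    print(" ".join(phrase), "->", sorted(doc_ids))
```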
A different, 'label-driven' idea of clustering appears in the Clustering By Committee algorithm (Pantel and Lin 2002). A special case within the description-centric approach is the Description-Comes-First (DCF) approach, which reverses the traditional order of cluster discovery. Instead of calculating the proximity between documents and then labeling the revealed groups, DCF first attempts to find good, conceptually varied cluster labels and then assigns documents to the labels to form groups.
A good example of the DCF approach is the algorithm called Lingo (Osinski et al. 2004; Osinski and Weiss 2005). Lingo combines common phrase discovery with latent semantic indexing techniques to separate search results into meaningful groups. Lingo uses singular value decomposition of the term-document matrix to select good cluster labels among the candidates extracted from the text (frequent phrases). The algorithm was designed to cluster results from web search engines (short snippets and fragmented descriptions of original documents), and has proven to provide diverse meaningful cluster labels. Lingo bridges existing phrase-based methods and numerical cluster analysis to form readable and diverse cluster descriptions.
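The label-selection step of Lingo can be roughly sketched as follows (a simplified illustration only, assuming scikit-learn and numpy; the full algorithm also includes phrase extraction, label pruning and snippet assignment):

```python
# Rough sketch of Lingo's label-selection idea: treat the left singular vectors
# of the term-document matrix as "abstract concepts" and pick, for each concept,
# the candidate phrase closest to it (assumes scikit-learn and numpy).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["salsa dance classes", "learn salsa dancing",       # hypothetical snippets
        "salsa recipe with tomato", "spicy tomato salsa sauce"]
candidate_phrases = ["salsa dance", "tomato salsa", "salsa recipe"]

vec = TfidfVectorizer()
A = vec.fit_transform(docs).T.toarray()          # term-document matrix (terms x docs)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

P = vec.transform(candidate_phrases).toarray()   # phrases in the same term space
P = P / np.linalg.norm(P, axis=1, keepdims=True)

k = 2                                            # number of clusters / concepts
for i in range(k):
    scores = np.abs(P @ U[:, i])                 # cosine match of phrase vs. concept
    print("concept", i, "->", candidate_phrases[int(np.argmax(scores))])
```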
Sense-enhanced text clustering
Phrase-based methods run into problems when some senses dominate others, or when texts contain different words with the same meaning. Di Marco and Navigli (2011, 2013) and Navigli and Crisafulli (2010) present a novel approach for clustering snippets, based on automatically uncovering word senses from raw text. Word Sense Induction (WSI) is performed in order to dynamically acquire an inventory of senses for the input set of texts. Instead of clustering texts based on the surface similarity of the snippets, the induced word senses are used to group them. The senses are acquired by a graph-based clustering algorithm that exploits cycles in the co-occurrence graph of the query. The results are then clustered on the basis of their semantic similarity to the induced senses. Methods of this kind usually need representative external corpora in order to build relevant sense representations. They are called sense-enhanced clustering algorithms.
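The sense-induction idea can be illustrated with the following simplified sketch (assuming the networkx library and hypothetical snippets). Unlike the cited approach, which exploits cycles in a co-occurrence graph built from a large corpus, the sketch merely builds a small co-occurrence graph from the snippets themselves and splits it into sense-like communities:

```python
# Simplified sketch of graph-based word sense induction (assumes networkx).
import itertools
import networkx as nx
from networkx.algorithms import community

snippets = [
    "jaguar engine speed test",          # hypothetical snippets for the query "jaguar"
    "jaguar car price review",
    "jaguar habitat in the amazon forest",
    "jaguar big cat prey",
]

G = nx.Graph()
for snippet in snippets:
    words = [w for w in snippet.lower().split() if w != "jaguar"]
    for u, v in itertools.combinations(words, 2):   # co-occurrence within a snippet
        w = G.get_edge_data(u, v, {"weight": 0})["weight"]
        G.add_edge(u, v, weight=w + 1)

# Each community approximates one induced sense of "jaguar"; the snippets can
# then be assigned to the sense they overlap with most.
for i, sense in enumerate(community.greedy_modularity_communities(G)):
    print("sense", i, ":", sorted(sense))
```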
Another approach to sense-enhanced clustering is presented by Kozlowski and Rybinski (2014, 2017b). The proposed algorithm, named SnSRC, also induces word senses, but does so solely from the processed corpus, without relying on any external corpora. As shown, SnSRC is well suited to clustering short and medium-length texts. More information about SnSRC is provided in Sections 3 and 4.
Data expansion based clustering
Several ways of using external information resources have been proposed to resolve the problem of insufficient text representation. We categorize them into two groups: (1) those using knowledge resources; and (2) those applying distributional semantic models. Both approaches are tested in our paper in order to verify the impact of text enrichment on clustering quality. In general, data expansion is defined as inducing new features and integrating them with the original data. In the case of short texts, data expansion consists in producing terms and phrases that do not appear explicitly in the original text. Such new terms extend the original text features, so that the expanded data become the input for clustering methods. Data expansion is usually achieved in two phases: (1) automatic summarization; and (2) given the summarization results, the retrieval of additional information by means of knowledge resources or distributional models. Automatic summarization can be done by modeling a dense representation of the text or through keyword extraction. Using distributional semantic models (e.g. the neural-based distributional models proposed by Mikolov et al. (2013a, b)), we can compute a vector derived from word embeddings for a whole text, and in this way obtain a summarized representation of the text.
Keyword extraction methods can be categorized by the type of technique used to identify important words. In most cases they are based on a linguistic approach (Justeson and Katz 1995), a statistical approach (Andrade and Valencia 1998), machine learning (Hulth 2003), or knowledge resources (Milne and Witten 2013). Retrieving additional information consists in expanding the native features resulting from the summarization phase. In the case of distributional semantic models, where texts are represented as vectors, the top words that are semantically most similar to the text vector (by the cosine measure) are retrieved.
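For illustration, a minimal sketch of such embedding-based expansion (assuming gensim and a pretrained word2vec model; the file name below is only a placeholder) is given here:

```python
# Minimal sketch of embedding-based expansion: average the word vectors of a
# short text and retrieve the most similar words by cosine similarity.
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("pretrained-vectors.bin", binary=True)

def expand(text, topn=5):
    # Summarize the short text as the average of its word vectors ...
    vectors = [kv[w] for w in text.lower().split() if w in kv]
    if not vectors:
        return []
    centroid = np.mean(vectors, axis=0)
    # ... and retrieve the semantically closest words as additional features.
    return [w for w, _ in kv.most_similar(positive=[centroid], topn=topn)]

short_text = "apple releases new phone"      # hypothetical snippet
print(short_text.split() + expand(short_text))
```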
Once keyphrases are extracted from a short text, methods based on well-structured knowledge resources (like Wikipedia or WordNet) search the resource repository for an entry whose title or labels are equal to, or at least fuzzily similar to, the extracted keyphrase, and then enrich the text with the labels, synonyms, categories and glosses associated with the matched concepts.
Extending texts using knowledge resources
The idea of using external knowledge resources for various NLP tasks emerged alongside the dynamic development of very large semantic resources, such as WordNet and, later, Wikipedia. A lot of research has been done with these resources. For example, WordNet was used by Hotho et al. (2003) to improve clustering by extending short texts with synsets, and Wikipedia was used by Gabrilovich and Markovitch (2005, 2009) to improve the semantic representation of texts. Wikipedia was also used by Ferragina and Scaiella (2012) to improve clustering by resolving polysemy and synonymy in short texts through references to Wikipedia articles.
In general, Wikipedia can be exploited in two ways: term-oriented and text-oriented. The former describes a given term with other words that come from Wikipedia content, whereas the latter consists in extracting keywords from a text using a dictionary of terms (defined as Wikipedia article titles/labels). Milne et al. (2006) use Wikipedia to expand data in two steps: (1) given a term extracted from a processed document, the method searches for an article whose title or labels are equal to the term, or at least contain it; (2) from the found article, its labels, senses, translations, or glosses are extracted.
Among the text clustering methods one can also distinguish between those that apply well-formed knowledge resources and those that use them as large text repositories. A good example of the first approach is presented by Hotho et al. (2003), where the authors use WordNet as background knowledge and integrate it into the process of clustering text documents. WordNet is treated here as a general-purpose ontology. The original text representations are extended by applying the following strategies: (1) adding concepts; (2) replacing terms by concepts, where terms that appear in WordNet are counted only at the concept level, while terms that do not appear in WordNet are retained; (3) discarding terms that do not appear in WordNet and replacing the others by concept signatures. Having extended the texts, the documents are clustered with a standard partitional algorithm. This approach is similar to one of the enrichment methods tested in this paper (presented in Section 5.1.3). There are, however, differences: (1) we use BabelNet as a knowledge resource, which is semantically much richer than WordNet, as it integrates WordNet with Wikipedia; and (2) the authors experiment on texts from Reuters news, whereas we test our algorithm on much shorter texts.
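As an illustration of the 'adding concepts' strategy (not the authors' implementation, and using WordNet via NLTK rather than BabelNet), consider the following sketch:

```python
# Minimal sketch of the "add concepts" strategy using NLTK's WordNet interface
# (requires nltk and a prior nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def add_concepts(text, senses_per_word=1):
    tokens = text.lower().split()
    extra = []
    for token in tokens:
        # Take the top-ranked synset(s) for the term and add their lemmas
        # and hypernym names as additional concept-level features.
        for synset in wn.synsets(token)[:senses_per_word]:
            extra.extend(synset.lemma_names())
            extra.extend(h.name().split(".")[0] for h in synset.hypernyms())
    return tokens + [e.lower() for e in extra if e.lower() not in tokens]

print(add_concepts("bank loan rates"))   # hypothetical short text
```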
A method that uses a knowledge repository as an external text resource is presented by Banerjee et al. (2007). In this method the conventional bag-of-words representation of text items is augmented with the titles of selected Wikipedia articles. The titles of the retrieved Wikipedia articles serve as additional features for clustering. The experiments were performed on Google News data with the use of agglomerative and bisecting clustering algorithms.
Hu et al. (2009) also propose Wikipedia as an additional generator of features. The authors propose a framework combining internal and external semantics to prepare new term features that extend the original ones. In the proposed framework, internal semantics represents the features derived from the original text by applying NLP techniques, such as segmentation or part-of-speech (PoS) tagging, while external semantics represents the features derived from external knowledge bases, in this case Wikipedia and WordNet, treated here as text repositories. To evaluate the methods, two clustering algorithms, k-means and EM, are employed on two test collections: Reuters-21578 and a 10-category Web dataset with 1000 texts. The use of external resources improves the original clustering algorithms by some 4%.
All the methods discussed above have been tested on sets containing at least several hundred documents, each composed of at least one paragraph and at least 50 words. Our sets, on the other hand, contain texts composed of 8 words on average.
The work on applying Wikipedia as a knowledge base has been further developed by Flati and Navigli (2014), Krause et al. (2016), and Bovi and Navigli (2017), and made widely available in the form of BabelNet. BabelNet is a multilingual encyclopedic dictionary and semantic network, currently covering 284 languages. It provides both lexicographic and encyclopedic knowledge thanks to the seamless integration of WordNet, Wikipedia, Wiktionary, OmegaWiki, Wikidata and Open Multilingual WordNet. BabelNet encodes knowledge as a labeled directed graph G = (V, E), where V is a set of nodes (concepts) and E is a set of edges connecting pairs of concepts. Each edge is labeled with a semantic relation, and each node contains a set of lexicalizations of the concept in different languages. These multilingually lexicalized concepts are called Babel synsets. At its core, the concepts and relations in BabelNet are harvested from WordNet and Wikipedia.
Babelfy (Bovi and Navigli 2017) is a unified graph-based approach that leverages BabelNet to perform both Word Sense Disambiguation (WSD) and entity linking in any language covered by BabelNet. The WSD algorithm is based on a loose identification of candidate meanings coupled with a densest-subgraph heuristic. As shown by Moro et al. (2014), Babelfy's WSD outperforms many state-of-the-art supervised systems in various applications. It is therefore one of our goals to test how the new semantic tools based on Wikipedia can improve the clustering of very short texts with SnSRC and other typical text-oriented clustering algorithms.
Neural network based distributional semantic models
Distributional semantic models have recently received increased attention alongside the rise of neural architectures for the scalable training of dense vector embeddings. One of the strongest trends in NLP is the use of word embeddings (Huang et al. 2012), i.e. representing words as vectors whose relative similarities correlate with semantic similarity. Distributional semantics modeled with neural networks has attracted particular attention, the main reason being the very promising approach of employing neural network language models trained on large corpora to learn distributional vectors for words (Mikolov et al. 2013b). Mikolov et al. (2013a) introduced the Skip-gram and Continuous Bag-of-Words models, which are efficient methods for learning high-quality vector representations of words from large amounts of unstructured text data.
The word representations computed with neural networks are very interesting, as the learned vectors explicitly encode many linguistic regularities and patterns. Le and Mikolov (2014) propose the Paragraph Vector, an unsupervised framework that learns continuous distributed vector representations for pieces of text. Taghipour and Ng (2015) show that the performance of conventional supervised WSD systems can be increased by taking advantage of word embeddings as new features; the authors show how word embeddings alone can provide a significant improvement over a state-of-the-art WSD system. Iacobacci et al. (2016) also take real-valued word embeddings as new features. This approach inspired us to semantically enrich short texts in a similar way.
The best-known tool in this area is now Word2vec, which allows fast training on huge amounts of raw linguistic data. Word2vec takes a large corpus as input and builds a vector space, typically of a few hundred dimensions, in which each unique word in the corpus is represented by a corresponding vector.
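For illustration, a minimal sketch of training a skip-gram model with gensim (parameter names follow gensim 4.x; the toy corpus is hypothetical) is given below:

```python
# Minimal sketch of training a skip-gram word2vec model with gensim 4.x.
from gensim.models import Word2Vec

# Tokenized corpus; in practice word2vec is trained on a very large raw corpus.
sentences = [
    ["short", "texts", "lack", "context"],
    ["word", "embeddings", "encode", "semantic", "similarity"],
    ["clustering", "groups", "similar", "texts"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the vector space
    window=5,          # context window size
    min_count=1,       # keep all words (only sensible for a toy corpus)
    sg=1,              # 1 = skip-gram, 0 = CBOW
)

print(model.wv["texts"].shape)                 # (100,)
print(model.wv.most_similar("texts", topn=3))  # nearest words by cosine similarity
```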
Bearing in mind the novel approaches in the area of knowledge resources and neural network based distributional semantic models, we decided to evaluate new tools for enriching very short texts and to investigate which of the approaches can improve SnSRC for clustering small corpora of very short texts. We also evaluate the impact of the new approaches on text-oriented, knowledge-poor clustering algorithms, such as STC and Lingo. In particular, we focus on enriching short texts by applying:
1. BabelNet/Babelfy, which are well-structured knowledge resources, integrating Wikipedia and WordNet;
2. neural network based distributional semantic models, built with Word2vec.
To the best of our knowledge, no experiments have been performed on such small repositories of very short texts (usually consisting of just a few words), with the texts enriched at the feature level by the above-mentioned approaches.