Pairwise document similarity measure based on present term set
Abstract
Measuring pairwise document similarity is an essential operation in various text mining tasks. Most similarity measures judge the similarity between two documents based on the term weights and the information content that the two documents share. However, they are insufficient when several documents have an identical degree of similarity to a particular document. This paper introduces a novel text document similarity measure based on the term weights and the number of terms that appear in at least one of the two documents. The effectiveness of our measure is evaluated on two real-world document collections for a variety of text mining tasks, such as text document classification, clustering, and near-duplicates detection. The performance of our measure is compared with that of some popular measures. The experimental results show that our proposed similarity measure yields more accurate results.
Keywords
Similarity measure, Distance metric, Document clustering, Document classification, Near-duplicates detection, Information retrieval
Abbreviations
 AC
ACcuracy
 CMU
Carnegie Mellon University
 CPU
Central Processing Unit
 EJ
extended Jaccard coefficient
 En
entropy
 idf
inverse document frequency
 IR
information retrieval
 ITSim
information-theoretic measure for document similarity
 kNN
k nearest neighbors
 ML
Machine Learning
 NDD
near duplicate detection
 PDSM
pairwise document similarity measure
 R8
Reuters8
 RAM
Random Access Memory
 SMTP
similarity measure for text processing
 STD
Suffix Tree Document
 tf
term frequency
 VSM
Vector Space Model
 WebKB
World Wide Knowledge Base (Web→KB)
Introduction
In text mining, a similarity (or distance) measure is the quintessential way to calculate the similarity between two text documents, and it is widely used in various Machine Learning (ML) methods, including clustering and classification. ML methods help learn from enormous collections, known as big data [1, 2]. In big data, which includes masses of unstructured data, Information Retrieval (IR) is the dominant form of information access [3]. Among ML methods, classification and clustering help discover patterns and correlations and extract information from large-scale collections [1]. These two techniques also benefit different IR applications. For example, document clustering can be applied to the document collection to improve search speed, precision, and recall, or to the search results to present information more effectively to the user [3]. Document classification is also used in vertical search engines [4] and sentiment detection [5].
In large-scale collections, one of the challenging issues is to identify documents with high similarity values, known as near-duplicate documents (or near-duplicates) [6, 7, 8]. Integration of heterogeneous collections, storing multiple copies of the same document, and plagiarism are the main causes for the existence of near-duplicates. These documents increase processing overheads and storage. Detecting and filtering near-duplicates can address these issues and also improve the search quality [6]. Using a similarity measure is a quantitative way to define two documents as near-duplicates [7].
For a collection of text documents, an appropriate document representation model is required, such as the Vector Space Model (VSM) [9]. According to VSM, each document is represented as an M-dimensional vector in which each dimension corresponds to a vocabulary term or a feature. The exact ordering of terms in the document is discarded (the Bag-of-Words model [3]). The vocabulary includes all terms that appear in the document collection. A term or feature can be a single word, multiple words, a phrase, or another indexing unit [9, 10]. The weight of a term represents its importance in the relevant document and is assigned by a term weighting scheme [11]. Term frequency (tf) [3], inverse document frequency (idf) [12], and the product of tf and idf (tf-idf) [13, 14, 15] are commonly used term weighting schemes. In large-scale text document collections, using VSM results in sparse vectors, i.e., most of the term weights in a document vector are zero [16, 17]. High dimensionality can be a problem for computing the similarity between two documents.
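The weighting schemes named above can be illustrated with a minimal sketch. It assumes one common tf-idf variant, tf multiplied by log(N / df), and whitespace tokenization; the function name `tfidf_vectors` and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using tf * log(N / df) weights (assumed variant)."""
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary holds every term that appears in the collection.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency; missing terms count as 0
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors

# Hypothetical three-document collection.
docs = ["garden vegetables garden", "garden health", "vegetables health benefits"]
vocabulary, vectors = tfidf_vectors(docs)
```

Each vector has one entry per vocabulary term, so most entries are zero once the vocabulary grows, which is the sparsity noted above.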
Using VSM, there are numerous measures to calculate pairwise document similarity. For instance, Euclidean distance is a geometric measure of the distance between two vectors [18, 19]. Cosine similarity compares two documents with respect to the angle between their vectors [11]. Like the two previous measures, Manhattan distance is also a geometric measure [20, 21]. For two 0–1 vectors, the Hamming distance [17] is the number of positions at which the stored term weights differ. The Chebyshev distance [16] between two vectors is the greatest of the absolute differences along any dimension. A similarity measure for text processing (SMTP) [17] is used for comparing two text documents. SMTP can also be extended to measure the similarity between two document collections. Heidarian and Dinneen [18] proposed a novel geometric measure to determine the similarity level between two documents. An Information-Theoretic measure for document Similarity (ITSim) is a similarity measure based on information theory [22]. Based on the Suffix Tree Document (STD) model, Chim and Deng [23] proposed a phrase-based measure to compute the similarity between two documents. Sohangir and Wang [16] proposed a new document similarity measure, named Improved Sqrt-Cosine (ISC) similarity. The Jaccard coefficient [24] calculates the ratio of the number of terms used in both documents to the number of terms used in at least one of them.
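As a rough illustration of the verbal definitions above, the following sketch implements the geometric and set-based measures in their textbook forms. The function names and the two example vectors are hypothetical; the Jaccard function follows the term-count ratio described in this paragraph and treats any weight greater than zero as "term used".

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def chebyshev(u, v):
    # Greatest absolute difference along any dimension.
    return max(abs(a - b) for a, b in zip(u, v))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hamming(u, v):
    # Intended for 0-1 vectors: number of positions that differ.
    return sum(1 for a, b in zip(u, v) if a != b)

def jaccard(u, v):
    # Terms used in both documents over terms used in at least one.
    both = sum(1 for a, b in zip(u, v) if a > 0 and b > 0)
    either = sum(1 for a, b in zip(u, v) if a > 0 or b > 0)
    return both / either if either else 0.0

# Hypothetical tf vectors over a four-term vocabulary.
u = [3, 0, 4, 0]
v = [0, 0, 4, 3]
u_bin = [1 if a > 0 else 0 for a in u]
v_bin = [1 if b > 0 else 0 for b in v]
```

All of these run in a single pass over the M vector dimensions.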
In the context of document classification and clustering, numerous studies have examined the effectiveness of different similarity measures. For instance, Subhashini et al. [25] evaluated the clustering performance of different measures on three web document collections. The results of their experiment showed that Cosine similarity performs better than Euclidean distance and Jaccard coefficient. Hammouda and Kamel [9] proposed a system for web document clustering. In their system, a phrase-based similarity measure was used to calculate the similarity between two documents. D’hondt et al. [11] proposed a novel dissimilarity measure for document clustering, called pairwise-adaptive. This Cosine-based measure selects the K most important terms of each document based on their weights and reduces the dimensionality to the document’s most important terms. Thus, pairwise-adaptive has a lower computational burden and is applicable to high-dimensional document collections. Lin et al. [17] used SMTP for text clustering and classification.
Moreover, the effect of various similarity measures on the performance of near-duplicates detection has been investigated in recent studies. For instance, Rezaeian and Novikova [8] used Hamming distance to discover plagiarism in Russian texts. Xiao et al. [7] proposed a new algorithm for detecting near-duplicate documents. They adapted their proposed algorithm to commonly used similarity measures, such as Jaccard coefficient, Cosine similarity, and Hamming distance. Hajishirzi et al. [26] proposed a new vector representation for documents. Using their method, Jaccard coefficient and Cosine similarity provide higher near-duplicates detection accuracy.
As explained in more detail in the “Proposed similarity measure” section, most similarity measures judge the closeness of two documents based on the term weights. Although term weights provide important information about the similarity between two documents, similarity judgment based on the term weights alone is sometimes not sufficient. The motivation behind this work is our belief that pairwise document similarity should be based not only on the term weights but also on the number of terms that appear in at least one of the two documents. In this paper, we propose a new measure to compute the similarity between two text documents. In this symmetric measure, the similarity increases as the number of terms used in both documents (present terms) and the information content associated with these terms increase. Furthermore, the terms used in only one of the two documents (presence-absence terms) contribute to the similarity measurement.
We conduct a comprehensive experiment to evaluate the effect of our proposed measure on the performance of several text mining applications, including near-duplicates detection, single-label classification, and K-means-like clustering. The obtained results show that our proposed similarity measure improves the performance of these algorithms. The rest of this paper is organized as follows: the “Related work” section presents the background of similarity measures. The “Proposed similarity measure” section introduces the proposed similarity measure. The “Methods” and “Results and discussion” sections discuss experiment details and experimental results, respectively. Finally, the conclusion and the discussion of future work are given in the last section.
Related work
In this paper, the VSM is selected as the document representation model; \( \vec{d} \) is the vector of document d. Furthermore, M and N indicate the size of the vocabulary and the number of documents in the document collection, respectively. Some measures were briefly introduced in the previous section; this section covers some popular measures for computing the similarity between two text documents.
Proposed similarity measure
Problem definition
1. The presence or absence of a term is more important than the difference between two nonzero weights of a present term.
2. The similarity degree between two documents should decrease when the difference between two nonzero weights of a present term increases.
3. The similarity degree should decrease when the number of presence-absence terms between two documents increases.
4. Two documents are least similar to each other if there is no present term between them.
5. The similarity measure should be symmetric: $$ \text{Similarity}(d_{1}, d_{2}) = \text{Similarity}(d_{2}, d_{1}) $$
6. The distribution of term weights in the document collection should contribute to the similarity measurement.
According to these properties, the measures described in “Related work” section have one or more deficiencies. For example, Cosine similarity does not satisfy Properties 3 and 6, and Euclidean distance does not meet Properties 1, 3, 4, and 6 [17].
Second, λ is determined according to the properties of the document collection; therefore, all presence-absence terms have identical importance. That is, the weights of presence-absence terms do not contribute to the similarity measurement, and the weight distribution is taken into consideration only for the present terms.
Careful examination of similar documents in different instances showed that two documents with the highest similarity degree use almost identical term sets. In other words, the more terms two documents have in common, the more similar they are. In such cases, some of the well-known measures cannot recognize the most similar documents. The following example clarifies the case. Consider three documents d_{1}, d_{2}, and d_{3} with four terms and tf as the term weighting scheme. Further, the terms are not shared by all the documents of the collection. Our goal is to find the document most similar to d_{1}.
Example 1
Using Manhattan distance, the distance between d_{1} and d_{2} and the distance between d_{1} and d_{3} are both 10. This identical distance suggests that d_{2} and d_{3} share the same amount of information with d_{1}; however, the number of present terms clearly differs: d_{1} and d_{2} use the same term set, whereas d_{1} and d_{3} share only one term. Using Cosine similarity, the similarity between d_{1} and d_{2} is 0.763, and the similarity between d_{1} and d_{3} is 0.831. Although d_{1} and d_{2} use the same term set, Cosine similarity selects d_{3} as the document most similar to d_{1}.
Example 2

d_{1} = “this garden is different because it only has different types of vegetables.”

d_{2} = “different types of vegetables are growing in this garden. This garden only has vegetables.”

d_{3} = “different types of vegetables growing in your garden can have different benefits for your health.”
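The difference in shared term sets among these three sentences can be checked with a short sketch. It assumes plain lowercase word tokenization with no stemming or stop-word removal, which may differ from the preprocessing used in the paper; the helper name `term_frequencies` is hypothetical.

```python
import re
from collections import Counter

d1 = "this garden is different because it only has different types of vegetables."
d2 = ("different types of vegetables are growing in this garden. "
      "This garden only has vegetables.")
d3 = ("different types of vegetables growing in your garden can have "
      "different benefits for your health.")

def term_frequencies(text):
    # Lowercase the text and count alphabetic tokens (assumed tokenization).
    return Counter(re.findall(r"[a-z]+", text.lower()))

tf1, tf2, tf3 = (term_frequencies(d) for d in (d1, d2, d3))

# Present terms: terms used in both documents of a pair.
shared_12 = set(tf1) & set(tf2)
shared_13 = set(tf1) & set(tf3)
print(len(shared_12), len(shared_13))  # prints: 8 5
```

Under this tokenization, d_{1} shares eight terms with d_{2} but only five with d_{3}.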
As can be seen from the previous examples, the measures described in the “Related work” section, such as EJ, Euclidean, Manhattan, and Cosine, are insufficient to determine the most similar documents, because they judge the similarity between two documents based on the term weights alone. Based on these observations, we conclude that an appropriate similarity measure should take the number of present terms into account to achieve a more accurate similarity value.
Although the number of present terms benefits the similarity calculation between two documents, it alone is not sufficient. The third example clarifies this case (under the same assumptions as in Example 1).
Example 3
In this example, if we want to find the document most similar to d_{1}, the number of present terms alone cannot help, because d_{2} and d_{3} both have 4 terms in common with d_{1}. Using Manhattan distance, the distance between d_{1} and d_{2} is 12 and the distance between d_{1} and d_{3} is 3. If we use Cosine similarity, the similarity between d_{1} and d_{2} is 0.9569 and the similarity between d_{1} and d_{3} is 0.989. Accordingly, in this example, both Manhattan distance and Cosine similarity correctly recognize the document most similar to d_{1}. Therefore, two documents may use the same term set but have different contents, because the associated terms have different weights. In such cases, most similarity measures can accurately judge the similarity between documents.
The third example shows that the number of present terms alone cannot capture all the similarity information between documents and the term weights are still required. But in cases like Examples 1 and 2, the number of present terms can help to find documents with the highest similarity degree.
Pairwise document similarity measure
In this section, we introduce the proposed similarity measure. As explained earlier, the number of present terms and their weights play an important role in accurately judging the similarity between documents and help find the most similar ones. The idea is that the shared information content created by a larger number of present terms is more informative than that created by fewer present terms. In other words, if two documents use similar term sets, they tend to have more similar content and theme.
7. Similarity degree should increase when the number of present terms increases.
Deficiencies of some popular measures according to the preferable properties for a similarity measure
Measure  Property 1  Property 3  Property 4  Property 7 

Cosine  ✗  ✗  
Euclidean  ✗  ✗  ✗  ✗ 
Manhattan  ✗  ✗  ✗  ✗ 
Pairwise adaptive  ✗  ✗  
Chebyshev  ✗  ✗  ✗  
EJ  ✗  ✗  
ITSim  ✗  ✗  
SMTP  ✗ 
PF(d_{1}, d_{2}) and AF(d_{1}, d_{2}) are the number of present terms and the number of absent terms, respectively. To avoid a divide-by-zero error, 1 is added to the numerator and the denominator. Like most similarity measures, such as Cosine similarity and Euclidean distance, PDSM has time complexity linear in the dimensionality of the document vector (i.e., O(M)). In the following, we assess PDSM against the modified version of the preferable properties for a similarity measure.
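The term counts that PDSM relies on can be gathered in one pass over the two vectors, which is consistent with the O(M) cost stated above. The sketch below only counts terms and does not reproduce the PDSM formula itself. It assumes that a term is "absent" when its weight is zero in both documents; the function name `term_set_counts` and the example vectors are hypothetical.

```python
def term_set_counts(u, v):
    """Count present, presence-absence, and absent terms in one pass."""
    present = presence_absence = absent = 0
    for a, b in zip(u, v):
        if a > 0 and b > 0:
            present += 1            # term used in both documents
        elif a > 0 or b > 0:
            presence_absence += 1   # term used in only one document
        else:
            absent += 1             # assumed: term used in neither document
    return present, presence_absence, absent

# Hypothetical weight vectors over a five-term vocabulary (M = 5).
d1 = [2, 0, 1, 0, 3]
d2 = [1, 0, 0, 4, 1]
pf, pa, af = term_set_counts(d1, d2)
```

The three counts always sum to M, so any two of them determine the third.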
Property 1
Let d_{1} and d_{2} be two documents over a vocabulary that contains one term, and let w_{i} be the ith term weight. Consider the following two cases (a, b, c > 0): (1) d_{1} = 〈a〉, d_{2} = 〈b〉, and (2) d_{1} = 〈c〉, d_{2} = 〈0〉. In the first case, PDSM(d_{1}, d_{2}) > 0, and in the second case, PDSM(d_{1}, d_{2}) = 0. Hence, PDSM accounts for the importance of the presence and absence of terms.
Property 2
Property 3
Since the number of presence-absence terms between d_{1} and d_{3} is greater than that between d_{1} and d_{2}, the similarity degree between d_{1} and d_{2} is greater than that between d_{1} and d_{3}.
Property 4
Property 5
Property 6
As stated earlier, this property can be involved in the term weighting scheme.
Property 7
Example 4
Methods
Machine specification
RAM (Random Access Memory)  CPU (Central Processing Unit)  Operating System 

4 GB  Intel, Core i5, 2.67 GHz  Windows 7−64 bit 
In the context of kNN and K-means, we compare the performance of PDSM with that of four other measures: Euclidean distance, Cosine similarity, EJ, and Manhattan distance. Moreover, to see the effect of different term weighting schemes on the performance of PDSM, we employ some popular weighting schemes, including tf and tf-idf. We also apply the Cosine normalization factor (16) [33] to tf and tf-idf (tf-normal and tf-idf-normal).
In the shingling algorithm, we use PDSM and Jaccard coefficient as the similarity measure.
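As an illustration of the normalized weighting schemes, the sketch below applies cosine normalization to a weight vector. It assumes the factor is the usual 1 / sqrt(Σ w_i²), so that each normalized vector has unit length; the function name `cosine_normalize` and the example vector are hypothetical.

```python
import math

def cosine_normalize(weights):
    """Scale a weight vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(w * w for w in weights))
    # An all-zero vector has no direction, so it is returned unchanged.
    return [w / norm for w in weights] if norm else list(weights)

# Hypothetical tf vector; the same function would apply to tf-idf weights.
tf = [3, 0, 4]
tf_normal = cosine_normalize(tf)
```

After normalization, long and short documents contribute weights on the same scale.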
Applications
This section gives a brief description of the three applications used in this experiment.
K-means clustering
1. K documents are selected randomly as initial cluster centers or seed points.
2. Each document d in the document collection is assigned to the one of the K clusters whose center is closest to d. The closeness between the cluster centers and d is calculated by a similarity measure or distance metric.
3. For each cluster, c_{i}, its center is set to the mean of the documents in c_{i}.
4. Steps 2 and 3 are repeated until a stopping criterion is met (e.g., until all cluster centers converge or a fixed number of iterations has been completed).
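The four steps above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it takes a pluggable distance function (squared Euclidean by default, as an assumption), whereas a similarity measure would be maximized rather than minimized. The names `k_means` and `squared_euclidean` and the toy vectors are hypothetical.

```python
import random

def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def k_means(vectors, k, distance=squared_euclidean, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick K documents at random as the initial cluster centers.
    centers = [list(v) for v in rng.sample(vectors, k)]
    assignment = None
    for _ in range(max_iter):  # Step 4: stop after a fixed number of iterations...
        # Step 2: assign each document to the cluster with the closest center.
        new_assignment = [
            min(range(k), key=lambda c: distance(v, centers[c]))
            for v in vectors
        ]
        if new_assignment == assignment:  # ...or when assignments stop changing.
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its assigned documents.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Hypothetical two-dimensional document vectors forming two obvious groups.
vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignment, centers = k_means(vectors, k=2)
```

Because the seed points are random, repeated runs with different seeds can produce different clusterings on less clear-cut data.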
kNN classification
1. Using a similarity measure or distance metric, the closeness between d and all the documents in the training set is calculated.
2. The class of d is assigned to the majority class of its k nearest neighbors.
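The two steps above can be sketched as a short function. This is an illustration under assumptions: Manhattan distance is used as the default closeness function, and the training vectors and labels are hypothetical. With a similarity measure, the ranking would be in descending order instead.

```python
from collections import Counter

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def knn_classify(query, train_vectors, train_labels, k, distance=manhattan):
    # Step 1: rank all training documents by their closeness to the query.
    ranked = sorted(range(len(train_vectors)),
                    key=lambda i: distance(query, train_vectors[i]))
    # Step 2: take a majority vote among the k nearest neighbors.
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set with two classes.
train_vectors = [[1, 0], [2, 0], [0, 1], [0, 2], [0, 3]]
train_labels = ["course", "course", "student", "student", "student"]
label_a = knn_classify([0, 2.5], train_vectors, train_labels, k=3)
label_b = knn_classify([3, 0], train_vectors, train_labels, k=3)
```

Every query is compared against the whole training set, so the cost per query grows with both the training set size and the vector dimensionality.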
The shingling algorithm
Shingling [6] is a well-known solution to the problem of detecting near-duplicate documents. In the shingling algorithm, each document is converted to the set of all consecutive sequences of k terms, called k-shingles, where k is a parameter. Two documents are near-duplicates if their k-shingles are nearly the same. In this algorithm, a similarity measure is used to measure the degree of overlap between k-shingles. Jaccard coefficient is the commonly used similarity measure in the shingling algorithm. If the similarity of two documents exceeds a given threshold, the algorithm regards them as near-duplicates; otherwise, it regards them as original documents.
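A minimal sketch of the shingling procedure described above follows. It assumes whitespace tokenization, set-based k-shingles compared with the Jaccard coefficient, and a default threshold of 0.5; the function names and example sentences are hypothetical.

```python
def k_shingles(text, k):
    """Return the set of all consecutive sequences of k terms."""
    terms = text.lower().split()
    return {tuple(terms[i:i + k]) for i in range(len(terms) - k + 1)}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def is_near_duplicate(doc_a, doc_b, k=3, threshold=0.5):
    # Two documents are flagged when their shingle overlap exceeds the threshold.
    return jaccard(k_shingles(doc_a, k), k_shingles(doc_b, k)) > threshold

# Hypothetical documents: b differs from a in a single term.
a = "different types of vegetables are growing in this garden"
b = "different types of vegetables are growing in that garden"
c = "the shingling algorithm detects near duplicate documents"
overlap_ab = jaccard(k_shingles(a, 3), k_shingles(b, 3))
```

A single changed term affects up to k shingles, so larger values of k make the comparison stricter.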
Document collections
Properties of real-world document collections
Document collection  #Documents  #Involved terms (vector dimension)  #Categories  #Test documents  #Train documents 

WebKB  4199  7772  4  1396  2803 
R8  7674  17,387  8  2189  5485 
Category distribution for the R8 collection
Category (class)  #Documents 

Acq  2292 
Crude  374 
Earn  3923 
Grain  51 
Interest  271 
Moneyfx  293 
Ship  144 
Trade  326 
Category distribution for the WebKB collection
Category (Class)  #Documents 

Course  930 
Faculty  1124 
Project  504 
Student  1641 
The Porter stemmer [36] was applied to both collections; stop words [37] and words with fewer than three characters were also removed.
Properties of document sets used in the near-duplicates detection experiment
Document Sets  #Documents  #NearDuplicate Documents 

WebKB_NDD  1000  50 
R8_NDD  1000  50 
Evaluation metrics
In the ith cluster, n_{i} is the cluster size, max_{i} is the number of documents that carry the majority label, and \( n_{i}^{j} \) is the number of documents with label j.
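To make these quantities concrete, the sketch below computes a purity-style accuracy (the sum of max_{i} divided by the total number of documents) and a size-weighted entropy from a cluster-by-category count table. Both definitions are assumptions based on common usage and may differ in detail (for example, in logarithm base or normalization) from the equations used in this paper. The counts are taken from the PDSM cluster distribution for WebKB reported in the clustering results; the function names are hypothetical.

```python
import math

# Rows are clusters; columns are Course, Faculty, Project, Student.
clusters = [
    [40, 36, 33, 379],
    [701, 5, 12, 10],
    [73, 20, 141, 385],
    [12, 14, 107, 7],
    [104, 1049, 211, 860],
]

def clustering_accuracy(table):
    # Sum of majority-label counts (max_i) over the total number of documents.
    total = sum(sum(row) for row in table)
    return sum(max(row) for row in table) / total

def clustering_entropy(table):
    # Size-weighted average of per-cluster label entropy (natural log, assumed).
    total = sum(sum(row) for row in table)
    entropy = 0.0
    for row in table:
        n_i = sum(row)
        h = -sum((n / n_i) * math.log(n / n_i) for n in row if n > 0)
        entropy += (n_i / total) * h
    return entropy

ac = clustering_accuracy(clusters)
en = clustering_entropy(clusters)
```

Under these assumed definitions, this single clustering gives an accuracy of 2621/4199, roughly 0.624; the AC values reported for K-means come from repeated runs, so one table need not reproduce them.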
Results and discussion
In this section, we provide the results of our experiments and compare the performance of PDSM with that of other similarity measures used in kNN, K-means, and the shingling algorithm.
Classification results
In this section, kNN classification results are presented. For WebKB and R8, we trained the kNN classifier on the training set and evaluated it on the test set, using AC defined in Eq. (19).
kNN classification AC (WebKB, tf-idf)
Measure  k = 1  k = 3  k = 5  k = 7  k = 9  k = 11  k = 13  k = 15

PDSM  0.7572  0.8109  0.8367  0.8431  0.8496  0.8517  0.8496  0.8510 
Manhattan  0.6089  0.5193  0.4692  0.4520  0.4427  0.4255  0.4191  0.4097 
Euclidean  0.6211  0.5681  0.5487  0.5351  0.5165  0.5050  0.4893  0.4799 
Cosine  0.6569  0.6913  0.7278  0.5201  0.7307  0.7364  0.7443  0.7500 
EJ  0.6576  0.7070  0.7242  0.7414  0.7421  0.7543  0.7672  0.7665 
kNN classification AC (R8, tf-idf)
Measure  k = 1  k = 3  k = 5  k = 7  k = 9  k = 11  k = 13  k = 15

PDSM  0.9296  0.9434  0.9502  0.9520  0.9511  0.9520  0.9516  0.9529 
Manhattan  0.6862  0.6601  0.6341  0.6158  0.6053  0.5980  0.5898  0.5797 
Euclidean  0.7159  0.6825  0.6715  0.6583  0.6482  0.6377  0.6277  0.6208 
Cosine  0.7926  0.8246  0.8483  0.8483  0.8616  0.8675  0.8748  0.8757 
EJ  0.8237  0.8552  0.8899  0.8890  0.8945  0.9022  0.9045  0.9073 
Clustering results
The results of K-means clustering are presented in this section. In this experiment, we performed K-means on the combination of the test set and the training set, and the quality of the clustering was evaluated by AC and En (Eqs. (17) and (18), respectively). Because seed points are selected randomly, K-means was executed 10 times. Thus, we are confident that the results are not obtained by chance. We also applied different values of K to K-means so that the results are not biased by any single value of K.
K-means clustering AC (WebKB, tf-normal)
Measure  K = 5  K = 10  K = 15  K = 20  K = 25  K = 30

PDSM  0.5941  0.5989  0.5552  0.6060  0.5745  0.5670 
Manhattan  0.4172  0.4296  0.3933  0.4025  0.3947  0.3942 
Euclidean  0.5704  0.5893  0.6053  0.6090  0.6251  0.6263 
Cosine  0.5667  0.5937  0.6047  0.6013  0.6086  0.6103 
EJ  0.5463  0.5754  0.5909  0.5915  0.6160  0.6162 
The category distribution in clusters generated by PDSM (WebKB, tf-normal)
Cluster  Course  Faculty  Project  Student

Cluster 1  40  36  33  379 
Cluster 2  701  5  12  10 
Cluster 3  73  20  141  385 
Cluster 4  12  14  107  7 
Cluster 5  104  1049  211  860 
The category distribution in clusters generated by Euclidean (WebKB, tf-normal)
Cluster  Course  Faculty  Project  Student

Cluster 1  607  21  9  28 
Cluster 2  91  57  52  532 
Cluster 3  84  238  299  195 
Cluster 4  28  360  40  709 
Cluster 5  120  448  104  177 
The category distribution in clusters generated by Cosine (WebKB, tf-normal)
Cluster  Course  Faculty  Project  Student

Cluster 1  72  60  30  444 
Cluster 2  71  365  261  204 
Cluster 3  59  578  63  602 
Cluster 4  152  114  150  377 
Cluster 5  576  7  0  14 
The category distribution in clusters generated by PDSM (R8, tf-idf)
Cluster  Acq  Crude  Earn  Grain  Interest  Moneyfx  Ship  Trade

Cluster 1  1  0  1  0  0  0  0  1 
Cluster 2  1285  150  2892  35  15  69  56  153 
Cluster 3  1  0  262  0  0  0  0  0 
Cluster 4  28  0  1  0  0  0  0  0 
Cluster 5  196  51  82  5  256  222  6  86 
Cluster 6  5  8  680  0  0  0  1  0 
Cluster 7  29  164  0  11  0  2  79  86 
Cluster 8  747  1  5  0  0  0  2  0 
The category distribution in clusters generated by Cosine similarity (R8, tf-idf)
Cluster  Acq  Crude  Earn  Grain  Interest  Moneyfx  Ship  Trade

Cluster 1  71  9  579  6  254  150  4  32 
Cluster 2  4  0  812  0  0  0  1  0 
Cluster 3  771  340  36  0  0  0  4  0 
Cluster 4  2  0  945  0  0  0  0  0 
Cluster 5  1  0  1476  0  0  0  0  0 
Cluster 6  49  19  2  43  17  141  99  294 
Cluster 7  1114  5  66  0  0  2  10  0 
Cluster 8  280  1  7  2  0  0  26  0 
Near-duplicates detection results
In this section, the results of PDSM performance in detecting near-duplicate documents are given. To evaluate the quality of the results, we used precision, recall, and F-measure (Eqs. (20), (21), and (22), respectively). We performed the shingling algorithm with PDSM and Jaccard coefficient on WebKB_NDD and R8_NDD. We also considered different shingle sizes to see the effect of k on the performance of the similarity measures. Furthermore, the similarity threshold was 0.5. The number of occurrences of a shingle s in a document d was used as the weighting scheme.
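The three evaluation metrics can be sketched in their standard forms, where the F-measure is the harmonic mean of precision and recall. The function name `precision_recall_f` and the counts in the example are hypothetical and chosen only for illustration; they are not taken from the experiment logs.

```python
def precision_recall_f(true_positives, false_positives, false_negatives):
    detected = true_positives + false_positives   # documents flagged as near-duplicates
    actual = true_positives + false_negatives     # documents that really are near-duplicates
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure

# Hypothetical outcome: all 50 near-duplicates found, plus 28 false alarms.
p, r, f = precision_recall_f(true_positives=50, false_positives=28, false_negatives=0)
```

With perfect recall, every false alarm lowers precision and therefore the F-measure, which is the pattern a too-permissive similarity measure would show.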
Performance of PDSM in detecting near-duplicate documents (WebKB_NDD)
Metric  k = 3  k = 4  k = 5

Precision  1  1  1
Recall  1  1  1
F-measure  1  1  1
Performance of Jaccard coefficient in detecting near-duplicate documents (WebKB_NDD)
Metric  k = 3  k = 4  k = 5

Precision  0.9259  1  1
Recall  1  1  1
F-measure  0.9616  1  1
Performance of PDSM in detecting near-duplicate documents (R8_NDD)
Metric  k = 3  k = 4  k = 5

Precision  1  1  1
Recall  1  1  1
F-measure  1  1  1
Performance of Jaccard coefficient in detecting near-duplicate documents (R8_NDD)
Metric  k = 3  k = 4  k = 5

Precision  0.641  0.8065  0.8929
Recall  1  1  1
F-measure  0.7813  0.8928  0.9434
Conclusion and future research
We have presented a novel similarity measure between two text documents. For each pairwise comparison between two documents, the term weights and the number of present terms contribute to the similarity between the two related vectors. Moreover, the proposed measure satisfies the preferable properties for a similarity measure. For example, the larger the number of presence-absence terms, the more dissimilar the two documents are. We also added a new property to these preferable properties: the similarity degree should increase when the number of present terms increases. This property is also satisfied by our proposed measure. To investigate the effectiveness of our proposed measure, we applied it in kNN classification, K-means clustering, and the shingling algorithm for near-duplicates detection on two real-world document collections. We also used the new similarity measure with different term weighting schemes. In these experiments, our proposed measure yields better results than the other popular similarity measures considered.
Although the proposed similarity measure is aimed at comparing two text document vectors, it could be easily adapted to any vector type, such as vector representations of terms based on the contexts in which they appear in a collection [39]. Also, the experimental results have shown that the performance of the proposed measure could depend on the type of application, the document collection, and the term weighting scheme. It would be interesting to investigate the performance of our measure in other ML applications that use similarity measures, such as hierarchical agglomerative clustering and recommendation systems. Moreover, we intend to investigate the use of our proposed measure on other popular and large-scale document collections, such as 20 Newsgroups [40], and to see its effect on the performance of ML algorithms compared to traditional measures, such as ITSim.
Notes
Authors’ contributions
MO was the major contributor to the conception and design, and also implemented and ran the algorithms. MZ helped with the interpretation of the results of this experiment, was involved in the preparation of the manuscript, and critically revised it. The manuscript was written by MO. Both authors read and approved the final manuscript.
Acknowledgements
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
The document collections used in the classification and clustering experiments are available in the ‘Datasets for single-label text categorization’ repository, http://ana.cachopo.org/datasetsforsinglelabeltextcategorization. Furthermore, the two document sets used in the near-duplicates detection experiment are available in the ‘NDD_DocSets’ repository, https://github.com/marziehoghbaie/NDD_DocSets.
Funding
Not applicable.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Reyes-Ortiz JL, Oneto L, Anguita D. Big data analytics in the cloud: spark on hadoop vs mpi/openmp on Beowulf. Procedia Comput Sci. 2015;53:121–30.
2. Chen M, Mao S, Liu Y. Big data: a survey. Mobile Netw Appl. 2014;19(2):171–209.
3. Manning CD, Raghavan P, Schütze H. Introduction to information retrieval. Cambridge: CUP; 2008.
4. Chau M, Chen H. Comparison of three vertical search spiders. Comput. 2003;36(5):56–62. https://doi.org/10.1109/MC.2003.1198237.
5. Tang H, Tan S, Cheng X. A survey on sentiment detection of reviews. Expert Syst Appl. 2009;36(7):10760–73.
6. Varol C, Hari S. Detecting near-duplicate text documents with a hybrid approach. J Inf Sci. 2015;41(4):405–14.
7. Xiao C, Wang W, Lin X, Yu JX, Wang G. Efficient similarity joins for near-duplicate detection. ACM Trans Database Syst. 2011;36(3):15.
8. Rezaeian N, Novikova GM. Detecting near-duplicates in russian documents through using fingerprint algorithm Simhash. Procedia Comput Sci. 2017;103:421–5.
9. Hammouda KM, Kamel MS. Efficient phrase-based document indexing for web document clustering. IEEE Trans Knowl Data Eng. 2004;16:1279–96.
10. Chen J, Yeh CH, Chau R. Identifying multi-word terms by text-segments. In: 7th int. conf. on web-age information management workshops, China, Hong Kong, 2006. https://doi.org/10.1109/waimw.2006.16.
11. D’hondt J, Vertommen J, Verhaegen PA, Cattrysse D, Duflou JR. Pairwise-adaptive dissimilarity measure for document clustering. Inf Sci. 2010;180(12):2341–58.
12. Robertson S. Understanding inverse document frequency: on theoretical arguments for IDF. J. Doc. 2004;60(5):503–20.
13. Reed JW, et al. TF-ICF: a new term weighting scheme for clustering dynamic data streams. Orlando: ICMLA; 2006. p. 258–63.
14. Luo Q, Chen E, Xiong H. A semantic term weighting scheme for text categorization. Expert Syst Appl. 2011;38(10):12708–16.
15. Zhang W, Yoshida T, Tang X. A comparative study of TF*IDF, LSI and multi-words for text classification. Expert Syst Appl. 2011;38(3):2758–65.
16. Sohangir S, Wang D. Improved sqrt-cosine similarity measurement. J Big Data. 2017. https://doi.org/10.1186/s4053701700836.
17. Lin YS, Jiang JY, Lee SJ. A similarity measure for text classification and clustering. IEEE Trans Knowl Data Eng. 2014;26:1575–90.
18. Heidarian A, Dinneen MJ. A hybrid geometric approach for measuring similarity level among documents and document clustering. In: IEEE second int. conf. big data computing service and applications (BigDataService), Oxford, 2016, p. 142–151. https://doi.org/10.1109/bigdataservice.2016.14.
19. De Amorim RC, Mirkin B. Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering. Pattern Recognit. 2012;45(3):1061–75.
20. Schoenharl TW, Madey G. Evaluation of measurement techniques for the validation of agent-based simulations against streaming data. Comput Sci. 2008;45:6–15.
21. Francois D, Wertz V, Verleysen M. The concentration of fractional distances. IEEE Trans Knowl Data Eng. 2007;19:873–86.
22. Aslam JA, Frost M. An information-theoretic measure for document similarity. In: Proc. 26th SIGIR, Toronto. 2003. p. 449–50.
23. Chim H, Deng X. Efficient phrase-based document similarity for clustering. IEEE Trans Knowl Data Eng. 2008;20:1217–29.
24. Strehl A, Ghosh J. Value-based customer grouping from large retail datasets. In: Proc. SPIE, Orlando, 2000. vol. 4057, p. 33–42.
25. Subhashini R, Kumar VJ. Evaluating the performance of similarity measures used in document clustering and information retrieval. In: 1st int. conf. integrated intelligent computing, Bangalore, 2010, p. 27–31. https://doi.org/10.1109/iciic.2010.42.
26. Hajishirzi H, Yih W, Kolcz A. Adaptive near-duplicate detection via similarity learning. Geneva: ACM SIGIR’10; 2010. p. 419–26.
27. Han J, Pei J, Kamber M. Data mining: concepts and techniques. 3rd ed. USA: Elsevier; 2011.
28. Lin D. An information-theoretic definition of similarity. San Francisco: ICML; 1998.
29. Lin YS, Liao TY, Lee SJ. Detecting near-duplicate documents using sentence-level features and supervised learning. Expert Syst Appl. 2013;40(5):1467–76.
30. Nagwani NK. A comment on ‘a similarity measure for text classification and clustering’. IEEE Trans Knowl Data Eng. 2015;27:2589–90.
31. Fahim AM, Salem AM, Torkey FA, Ramadan MA. An efficient enhanced k-means clustering algorithm. J Zhejiang Univ Sci. 2006;7(10):1626–33.
32. Žalik KR. An efficient k′-means clustering algorithm. Pattern Recognit Lett. 2008;29(9):1385–91.
33. Singhal A, Buckley C, Mitra M. Pivoted document length normalization. In: Proc. ACM SIGIR’96, NY, USA, 1996. p. 21–9.
34. Cachopo AM. Improving methods for single-label text categorization. M.S. thesis, Instituto Superior Técnico, Portugal, Lissabon, 2007.
35. Datasets for single-label text categorization. http://ana.cachopo.org/datasetsforsinglelabeltextcategorization. Accessed 15 May 2018.
36. Willett P. The Porter stemming algorithm: then and now. Program. 2006;40(3):219–23.
37. Wilbur WJ, Sirotkin K. The automatic identification of stop words. J Inf Sci. 1992;18(1):45–55.
38. NDD_DocSets. https://github.com/marziehoghbaie/NDD_DocSets. Accessed 11 Nov. 2018.
39. Malandrakis N, Potamianos A, Iosif E, Narayanan S. Distributional semantic models for affective text analysis. IEEE Audio Speech Lang Process. 2013;21(11):2379–92. https://doi.org/10.1109/TASL.2013.2277931.
40. Lang K. Newsweeder: learning to filter netnews. In: Machine learning proceeding. 1995, p. 331–9.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.