On Cosine and Tanimoto Near Duplicates Search among Vectors with Domains Consisting of Zero, a Positive Number and a Negative Number
The cosine and Tanimoto similarity measures are widely applied in information retrieval, text and Web mining, data cleaning, chemistry and bio-informatics for finding similar objects, their clustering and classification. Recently, a few very efficient methods were offered to deal with the problem of lossless determination of such objects, especially in large and very high-dimensional data sets. They typically relate to objects that can be represented by (weighted) binary vectors. In this paper, we offer methods suitable for searching vectors with domains consisting of zero, a positive number and a negative number; that is, being a generalization of weighted binary vectors. Our results are not worse than their existing analogs offered for (weighted) binary vectors.
Keywordsthe cosine similarity the Tanimoto similarity nearest neighbors near duplicates exact duplicates non-zero dimensions vector’s length
Unable to display preview. Download preview PDF.
- 1.Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: Proc. of VLDB 2006. ACM (2006)Google Scholar
- 2.Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proc. of WWW 2007, pp. 131–140. ACM (2007)Google Scholar
- 3.Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic Clustering of the Web. Computer Networks 29(8-13) 1157–1166 (1997)Google Scholar
- 4.Chaudhuri, S., Ganti, V., Kaushik, R.L.: A primitive operator for similarity joins in data cleaning. In: Proceedings of ICDE 2006. IEEE Computer Society (2006)Google Scholar
- 6.Gionis, A., Indyk, P., Motwani, R.: Similarity Search in High Dimensions via hashing. In: Proc. of VLDB 1999, pp. 518–529 (1999)Google Scholar
- 11.Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann (1999)Google Scholar
- 12.Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient similarity joins for near duplicate detection. In: Proc. of WWW Conference, pp. 131–140 (2008)Google Scholar