Abstract
The recent growth of the internet has generated a large amount of data on the web. Representations of most such data are high-dimensional and sparse. Many fundamental subroutines of data analytics tasks, such as clustering, ranking, and nearest-neighbour search, scale poorly with the data dimension. In spite of significant growth in computational power, performing such computations on high-dimensional data sets is often infeasible, and at times impossible. Thus, it is desirable to investigate compression algorithms that can significantly reduce the dimension while preserving the similarity between data objects. In this work, we consider the data points as sets and use Jaccard similarity as the similarity measure. Pratap and Kulkarni [10] suggested a compression technique for high-dimensional, sparse, binary data that preserves the inner product and Hamming distance. We show that their algorithm also works well for Jaccard similarity. We present a theoretical analysis of the compression bound and complement it with rigorous experimentation on synthetic and real-world datasets. We also compare our results with the state-of-the-art “min-wise independent permutations” [6] and show that our compression algorithm achieves almost equal accuracy while significantly reducing the compression time and the randomness.
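For readers unfamiliar with the baseline, the following is a minimal sketch of Jaccard similarity and of its estimation with min-wise permutations. It is not the compression algorithm of this paper. It relies on the standard property that, for a uniformly random permutation, the minimum permuted elements of two sets coincide with probability equal to their Jaccard similarity. The function names, the universe size of 1000, the 200 permutations, and the toy sets are all assumptions chosen for illustration.

```python
import random


def jaccard(a, b):
    """Exact Jaccard similarity |a ∩ b| / |a ∪ b| of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here for two empty sets
    return len(a & b) / len(a | b)


def minhash_signature(s, permutations):
    """One value per permutation: the smallest permuted index among elements of s."""
    return [min(perm[x] for x in s) for perm in permutations]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Hypothetical toy setup: sets of active indices in a universe of size 1000.
universe = 1000
rng = random.Random(0)  # fixed seed so the sketch is reproducible
perms = []
for _ in range(200):
    p = list(range(universe))
    rng.shuffle(p)  # explicit random permutation of the universe
    perms.append(p)

A = set(range(0, 60))
B = set(range(20, 80))

exact = jaccard(A, B)  # 40 shared elements out of 80 in the union
approx = estimate_jaccard(minhash_signature(A, perms),
                          minhash_signature(B, perms))
print(exact, approx)
```

Storing explicit permutations, as done here for clarity, costs memory and random bits proportional to the universe size per permutation, which illustrates why the amount of randomness matters for high-dimensional data.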
Notes
1. A document is a string of characters. A k-shingle for a document is defined as a contiguous substring of length k found within the document. For example, if our document is abcd, then the shingles of size 2 are \(\{ab, bc, cd\}\).
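The definition of a k-shingle in the note can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `shingles` and the second document `bcde` are assumptions introduced here, not taken from the paper.

```python
def shingles(document, k):
    """Return the set of contiguous substrings of length k found in document."""
    return {document[i:i + k] for i in range(len(document) - k + 1)}


# The example from the note: shingles of size 2 for the document "abcd".
s1 = shingles("abcd", 2)

# A hypothetical second document, used to compare two shingle sets.
s2 = shingles("bcde", 2)

# Jaccard similarity of the two shingle sets: shared shingles over all shingles.
similarity = len(s1 & s2) / len(s1 | s2)
print(sorted(s1), sorted(s2), similarity)
```

Representing each document by its shingle set is what allows documents to be treated as sets and compared with Jaccard similarity.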
References
1. Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, 8–12 May 2007, pp. 131–140 (2007)
2. Broder, A., Glassman, S., Nelson, C., Manasse, M., Zweig, G.: Method for clustering closely resembling data objects. US Patent 6,119,124, 12 September 2000
3. Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings of the Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), pp. 21–29, June 1997
4. Broder, A.Z.: Identifying and filtering near-duplicate documents. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 1–10. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45123-4_1
5. Broder, A.Z.: Min-wise independent permutations: theory and practice. In: Montanari, U., Rolim, J.D.P., Welzl, E. (eds.) ICALP 2000. LNCS, vol. 1853, pp. 808–808. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45022-X_67
6. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations (extended abstract). In: Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, 23–26 May 1998, pp. 327–336 (1998)
7. Li, P., Mahoney, M.W., She, Y.: Approximating higher-order distances using random projections. In: UAI 2010, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, 8–11 July 2010, pp. 312–321 (2010)
8. Lichman, M.: UCI machine learning repository (2013)
9. Mitzenmacher, M., Pagh, R., Pham, N.: Efficient estimation for high similarities using odd sketches. In: 23rd International World Wide Web Conference, WWW 2014, Seoul, Republic of Korea, 7–11 April 2014, pp. 109–118 (2014)
10. Pratap, R., Kulkarni, R.: Similarity preserving compressions of high dimensional sparse data. CoRR, abs/1612.06057 (2016)
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Pratap, R., Sohony, I., Kulkarni, R. (2018). Efficient Compression Technique for Sparse Sets. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science(), vol 10939. Springer, Cham. https://doi.org/10.1007/978-3-319-93040-4_14
DOI: https://doi.org/10.1007/978-3-319-93040-4_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93039-8
Online ISBN: 978-3-319-93040-4
eBook Packages: Computer Science (R0)