Efficient Compression Technique for Sparse Sets

Pratap, Rameshwar; Sohony, Ishan; Kulkarni, Raghav

doi:10.1007/978-3-319-93040-4_14

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10939)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

The recent growth of the internet has generated a large amount of data on the web. Most such data has high-dimensional, sparse representations. Many fundamental subroutines of data analytics tasks, such as clustering, ranking, and nearest-neighbour search, scale poorly with the data dimension. Despite significant growth in computational power, performing such computations on high-dimensional data sets is often infeasible, and at times impossible. It is therefore desirable to investigate compression algorithms that significantly reduce the dimension while preserving the similarity between data objects. In this work, we treat data points as sets and use Jaccard similarity as the similarity measure. Pratap and Kulkarni [10] suggested a compression technique for high-dimensional, sparse, binary data that preserves inner product and Hamming distance. We show that their algorithm also works well for Jaccard similarity. We present a theoretical analysis of the compression bound and complement it with rigorous experiments on synthetic and real-world datasets. We also compare our results with the state-of-the-art min-wise independent permutations [6], and show that our compression algorithm achieves almost equal accuracy while significantly reducing the compression time and the randomness required.
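For readers unfamiliar with the two notions the abstract relies on, the following is a minimal illustrative sketch, not the compression algorithm of [10] and not the authors' code. It computes the exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets and estimates it with min-wise permutations, using the standard fact that under a uniformly random permutation the minimum elements of A and B coincide with probability equal to their Jaccard similarity. The function names, the `universe_size`, `num_perm`, and `seed` parameters, and the use of explicit shuffled permutations are all assumptions made for illustration.

```python
import random


def jaccard(a, b):
    """Exact Jaccard similarity |a ∩ b| / |a ∪ b| of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)


def minhash_signature(s, universe_size, num_perm, seed=0):
    """Min-wise signature of a non-empty set s over {0, ..., universe_size - 1}.

    Each of the num_perm entries is the smallest image of s under one
    random permutation. The same seed must be used for every set so that
    all signatures are built from the same permutations.
    """
    rng = random.Random(seed)
    signature = []
    for _ in range(num_perm):
        perm = list(range(universe_size))
        rng.shuffle(perm)  # explicit permutation; simple but costly for large universes
        signature.append(min(perm[x] for x in s))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    A = {1, 5, 9, 20, 33}
    B = {1, 5, 9, 21, 34}
    sig_a = minhash_signature(A, universe_size=50, num_perm=200)
    sig_b = minhash_signature(B, universe_size=50, num_perm=200)
    print("exact:", jaccard(A, B))
    print("estimate:", estimate_jaccard(sig_a, sig_b))
```

Storing one full permutation per signature entry is what makes this baseline expensive in both time and randomness, which is the cost the abstract says the proposed compression reduces.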

Notes

1. A document is a string of characters. A k-shingle for a document is defined as a contiguous substring of length k found within the document. For example, if our document is abcd, then the shingles of size 2 are \(\{ab, bc, cd\}\).
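The shingle definition in the note can be sketched in a few lines. This is an illustrative helper only; the name `k_shingles` is hypothetical and does not come from the paper.

```python
def k_shingles(document, k):
    """Return the set of all contiguous substrings of length k in document."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A document shorter than k has no k-shingles, so the range is empty.
    return {document[i:i + k] for i in range(len(document) - k + 1)}


if __name__ == "__main__":
    print(sorted(k_shingles("abcd", 2)))  # the note's example: ab, bc, cd
```

Representing each document by its shingle set is what lets documents be compared as sets, for instance with Jaccard similarity.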

References

1. Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, 8–12 May 2007, pp. 131–140 (2007)
2. Broder, A., Glassman, S., Nelson, C., Manasse, M., Zweig, G.: Method for clustering closely resembling data objects. US Patent 6,119,124, 12 September 2000
3. Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings of the Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), pp. 21–29, June 1997
4. Broder, A.Z.: Identifying and filtering near-duplicate documents. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 1–10. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45123-4_1
5. Broder, A.Z.: Min-wise independent permutations: theory and practice. In: Montanari, U., Rolim, J.D.P., Welzl, E. (eds.) ICALP 2000. LNCS, vol. 1853, pp. 808–808. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45022-X_67
6. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations (extended abstract). In: Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, Dallas, Texas, USA, 23–26 May 1998, pp. 327–336 (1998)
7. Li, P., Mahoney, M.W., She, Y.: Approximating higher-order distances using random projections. In: UAI 2010, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, 8–11 July 2010, pp. 312–321 (2010)
8. Lichman, M.: UCI machine learning repository (2013)
9. Mitzenmacher, M., Pagh, R., Pham, N.: Efficient estimation for high similarities using odd sketches. In: 23rd International World Wide Web Conference, WWW 2014, Seoul, Republic of Korea, 7–11 April 2014, pp. 109–118 (2014)
10. Pratap, R., Kulkarni, R.: Similarity preserving compressions of high dimensional sparse data. CoRR, abs/1612.06057 (2016)

Author information

Authors and Affiliations

Wipro Technologies, Bangalore, India
Rameshwar Pratap
PICT, Pune, India
Ishan Sohony
Chennai Mathematical Institute, Chennai, India
Raghav Kulkarni

Corresponding author

Correspondence to Rameshwar Pratap.

Editor information

Editors and Affiliations

Deakin University, Geelong, Victoria, Australia
Dinh Phung
National Chiao Tung University, Hsinchu City, Taiwan
Vincent S. Tseng
Monash University, Clayton, Victoria, Australia
Geoffrey I. Webb
Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan
Bao Ho
University of Melbourne, Melbourne, Victoria, Australia
Mohadeseh Ganji
University of Melbourne, Melbourne, Victoria, Australia
Lida Rashidi

About this paper

Cite this paper

Pratap, R., Sohony, I., Kulkarni, R. (2018). Efficient Compression Technique for Sparse Sets. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 10939. Springer, Cham. https://doi.org/10.1007/978-3-319-93040-4_14

DOI: https://doi.org/10.1007/978-3-319-93040-4_14
Published: 17 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93039-8
Online ISBN: 978-3-319-93040-4
eBook Packages: Computer Science, Computer Science (R0)
