Consistent Subset Sampling

  • Conference paper
Algorithm Theory – SWAT 2014 (SWAT 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8503)

Abstract

Consistent sampling is a technique for specifying, in small space, a subset S of a potentially large universe U such that the elements in S satisfy a suitably chosen sampling condition. Given a subset \(\mathcal{I}\subseteq U\) it should be possible to quickly compute \(\mathcal{I}\cap S\), i.e., the elements in \(\mathcal{I}\) satisfying the sampling condition. Consistent sampling has important applications in similarity estimation, and estimation of the number of distinct items in a data stream.
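As a minimal sketch of the idea described above, the sampling condition can be taken to be "the hash of the element falls below p". The abstract does not fix a hash function, so the names `unit_hash` and `consistent_sample`, the seed string, and the use of SHA-256 are assumptions made purely for illustration. The subset S is specified only by the seed and p, and \(\mathcal{I}\cap S\) is computed by one pass over \(\mathcal{I}\).

```python
import hashlib

def unit_hash(item: str, seed: str = "seed-0") -> float:
    """Map an item to a pseudo-random point in [0, 1) determined by (seed, item)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    # 48 bits fit exactly in a float, so the result is always strictly below 1.
    return int.from_bytes(digest[:6], "big") / 2**48

def consistent_sample(items, p: float, seed: str = "seed-0"):
    """Return the elements of `items` satisfying the sampling condition h(x) < p."""
    return {x for x in items if unit_hash(x, seed) < p}

a = {"apple", "banana", "cherry", "date", "elderberry"}
b = {"banana", "cherry", "fig", "grape"}
sample_a = consistent_sample(a, 0.5)
sample_b = consistent_sample(b, 0.5)

# Consistency: an element shared by both sets is sampled in both or in neither.
assert sample_a & b == sample_b & a
```

Because membership in the sample depends only on the element and the seed, samples taken independently from different sets agree on their common elements, which is the property that similarity and distinct-count estimators rely on.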

In this paper we generalize consistent sampling to the setting where we are interested in sampling size-k subsets occurring in some set in a collection of sets of bounded size b, where k is a small integer. This can be done by applying standard consistent sampling to the k-subsets of each set, but that approach requires time \(\Theta(b^k)\). Using a carefully designed hash function, for a given sampling probability \(p \in (0,1]\), we show how to improve the time complexity to \(\Theta(b^{\lceil k/2\rceil}\log\log b + pb^k)\) in expectation, while maintaining strong concentration bounds for the sample. The space usage of our method is \(\Theta(b^{\lceil k/4\rceil})\).
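The gap between the baseline and a structured hash function can be illustrated for k = 2. The sketch below is not the construction from the paper, whose details and concentration guarantees are not given in this abstract. It assumes, for illustration only, a sampling condition of the form "the sum of the element hashes, modulo R, is below pR". The names `h`, `R`, `naive_sample`, and `fast_pair_sample` are hypothetical. The baseline tests every k-subset, which takes \(\Theta(b^k)\) tests. For pairs, sorting the elements by hash value lets each element locate its sampled partners by range search, so the work is roughly sorting plus the size of the output.

```python
import hashlib
from bisect import bisect_left
from itertools import combinations

R = 2**32  # hash range (an arbitrary choice for this sketch)

def h(item: str, seed: str = "seed-0") -> int:
    """Hash an item to an integer in [0, R)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def naive_sample(items, k: int, p: float):
    """Baseline: test the sampling condition on every k-subset."""
    t = int(p * R)
    return {
        frozenset(c)
        for c in combinations(sorted(items), k)
        if sum(h(x) for x in c) % R < t
    }

def fast_pair_sample(items, p: float):
    """k = 2 only: sort by hash, then find sampled partners by range search."""
    t = int(p * R)
    order = sorted(items, key=h)
    keys = [h(x) for x in order]
    out = set()
    for x in order:
        # A partner with hash y is sampled iff (y - lo) mod R < t.
        lo = (-h(x)) % R
        hi = lo + t
        # The circular range [lo, hi) splits into at most two plain intervals.
        ranges = [(lo, min(hi, R))]
        if hi > R:
            ranges.append((0, hi - R))
        for start, end in ranges:
            for j in range(bisect_left(keys, start), bisect_left(keys, end)):
                if order[j] != x:  # skip pairing an element with itself
                    out.add(frozenset((x, order[j])))
    return out

items = {f"item{i}" for i in range(40)}
assert fast_pair_sample(items, 0.1) == naive_sample(items, 2, 0.1)
```

Both functions return the same sample for the same seed, so the faster routine is a drop-in replacement for the baseline in this toy setting. It only changes how the qualifying pairs are found.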

We demonstrate the utility of our technique by applying it to several well-studied data mining problems. We show how to efficiently estimate the number of frequent k-itemsets in a stream of transactions and the number of bipartite cliques in a graph given as an incidence stream. Further, building upon recent work by Campagna et al., we show that our approach can be applied to frequent itemset mining in a parallel or distributed setting. We also present applications in graph stream mining.

This work is supported by the Danish National Research Foundation under the Sapere Aude program.

References

  1. Baran, I., Demaine, E.D., Pǎtraşcu, M.: Subquadratic Algorithms for 3SUM. Algorithmica 50(4), 584–596 (2008)

  2. Boley, M., Grosskreutz, H.: A Randomized Approach for Approximating the Number of Frequent Sets. In: ICDM 2008, pp. 43–52 (2008)

  3. Becchetti, L., Boldi, P., Castillo, C., Gionis, A.: Efficient semi-streaming algorithms for local triangle counting in massive graphs. In: KDD 2008, pp. 16–24 (2008)

  4. Bordino, I., Donato, D., Gionis, A., Leonardi, S.: Mining Large Networks with Subgraph Counting. In: ICDM 2008, pp. 737–742 (2008)

  5. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-Wise Independent Permutations. J. Comput. Syst. Sci. 60(3), 630–659 (2000)

  6. Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic Clustering of the Web. Computer Networks 29(8-13), 1157–1166 (1997)

  7. Buriol, L.S., Frahling, G., Leonardi, S., Marchetti-Spaccamela, A., Sohler, C.: Counting triangles in data streams. In: PODS 2006, pp. 253–262 (2006)

  8. Buriol, L.S., Frahling, G., Leonardi, S., Sohler, C.: Estimating Clustering Indexes in Data Streams. In: Arge, L., Hoffmann, M., Welzl, E. (eds.) ESA 2007. LNCS, vol. 4698, pp. 618–632. Springer, Heidelberg (2007)

  9. Campagna, A., Kutzkov, K., Pagh, R.: On Parallelizing Matrix Multiplication by the Column-Row Method. In: ALENEX 2013, pp. 122–132 (2013)

  10. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. Theor. Comput. Sci. 312(1), 3–15 (2004)

  11. Cormode, G., Muthukrishnan, S.: An improved data stream summary: The count-min sketch and its applications. J. Algorithms 55(1), 58–75 (2005)

  12. Dietzfelbinger, M., Gil, J., Matias, Y., Pippenger, N.: Polynomial Hash Functions Are Reliable (Extended Abstract). In: Kuich, W. (ed.) ICALP 1992. LNCS, vol. 623, pp. 235–246. Springer, Heidelberg (1992)

  13. Dinur, I., Dunkelman, O., Keller, N., Shamir, A.: Efficient Dissection of Composite Problems, with Applications to Cryptanalysis, Knapsacks, and Combinatorial Search Problems. In: Safavi-Naini, R., Canetti, R. (eds.) CRYPTO 2012. LNCS, vol. 7417, pp. 719–740. Springer, Heidelberg (2012)

  14. Geerts, F., Goethals, B., Van den Bussche, J.: Tight upper bounds on the number of candidate patterns. ACM Trans. Database Syst. 30(2), 333–363 (2005)

  15. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann (2000)

  16. Han, Y., Thorup, M.: Integer Sorting in \(O(n \sqrt{\log \log n})\) Expected Time and Linear Space. In: FOCS 2002, pp. 135–144 (2002)

  17. Impagliazzo, R., Paturi, R., Zane, F.: Which Problems Have Strongly Exponential Complexity? J. Comput. Syst. Sci. 63(4), 512–530 (2001)

  18. Indyk, P.: A Small Approximately Min-Wise Independent Family of Hash Functions. J. Algorithms 38(1), 84–90 (2001)

  19. Jin, R., McCallen, S., Breitbart, Y., Fuhry, D., Wang, D.: Estimating the number of frequent itemsets in a large database. In: EDBT, pp. 505–516 (2009)

  20. Kane, D.M., Mehlhorn, K., Sauerwald, T., Sun, H.: Counting Arbitrary Subgraphs in Data Streams. In: Czumaj, A., Mehlhorn, K., Pitts, A., Wattenhofer, R. (eds.) ICALP 2012, Part II. LNCS, vol. 7392, pp. 598–609. Springer, Heidelberg (2012)

  21. Kane, D.M., Nelson, J., Woodruff, D.P.: An optimal algorithm for the distinct elements problem. In: PODS 2010, pp. 41–52 (2010)

  22. Pǎtraşcu, M., Williams, R.: On the Possibility of Faster SAT Algorithms. In: SODA 2010, pp. 1065–1075 (2010)

  23. Schroeppel, R., Shamir, A.: A \(T = O(2^{n/2})\), \(S = O(2^{n/4})\) Algorithm for Certain NP-Complete Problems. SIAM J. Comput. 10(3), 456–464 (1981)

  24. Willard, D.E.: Log-Logarithmic Worst-Case Range Queries are Possible in Space Θ(N). Inf. Process. Lett. 17(2), 81–84 (1983)

  25. Woeginger, G.J.: Space and Time Complexity of Exact Algorithms: Some Open Problems (Invited Talk). In: Downey, R.G., Fellows, M.R., Dehne, F. (eds.) IWPEC 2004. LNCS, vol. 3162, pp. 281–290. Springer, Heidelberg (2004)

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Kutzkov, K., Pagh, R. (2014). Consistent Subset Sampling. In: Ravi, R., Gørtz, I.L. (eds) Algorithm Theory – SWAT 2014. SWAT 2014. Lecture Notes in Computer Science, vol 8503. Springer, Cham. https://doi.org/10.1007/978-3-319-08404-6_26

  • DOI: https://doi.org/10.1007/978-3-319-08404-6_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08403-9

  • Online ISBN: 978-3-319-08404-6

  • eBook Packages: Computer Science, Computer Science (R0)
