Data Mining and Knowledge Discovery, Volume 21, Issue 2, pp 310–326

Mining top-K frequent itemsets through progressive sampling

  • Andrea Pietracaprina
  • Matteo Riondato
  • Eli Upfal
  • Fabio Vandin


We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this end, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets’ frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stopping conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples whose sizes are smaller than the general upper bound. In order to test the stopping conditions, this approach maintains the frequency of all itemsets encountered, which is practical only for small w. However, we show how this problem can be mitigated by using a variation of Bloom filters. A number of experiments conducted on both synthetic and real benchmark datasets show that samples substantially smaller than the original dataset (i.e., of the size defined by the upper bound or reached through the progressive sampling approach) make it possible to approximate the actual top-K frequent itemsets with accuracy much higher than what is analytically proved.
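As a rough illustration of the basic idea of mining the top-K frequent itemsets of cardinality at most w from a random sample, a minimal Python sketch might look as follows. It is not the authors' algorithm: the function names `top_k_itemsets` and `sample_top_k`, the brute-force enumeration of itemsets, sampling with replacement, and the free `sample_size` parameter are all assumptions made for illustration. The sample-size bound and the stopping conditions of the progressive approach are not stated in the abstract and are not reproduced here.

```python
import random
from collections import Counter
from itertools import combinations


def top_k_itemsets(transactions, k, w):
    """Return the k most frequent itemsets of cardinality at most w.

    Each result is a pair (itemset, frequency), where frequency is the
    fraction of transactions containing the itemset. Brute-force
    enumeration: practical only for small w (illustrative assumption).
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, w + 1):
            # Count every subset of this transaction with `size` items.
            counts.update(combinations(items, size))
    n = len(transactions)
    # Sort by descending count; break ties lexicographically so the
    # output is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(itemset, count / n) for itemset, count in ranked[:k]]


def sample_top_k(dataset, k, w, sample_size, seed=0):
    """Mine the top-k itemsets from a uniform random sample.

    Hypothetical helper: `sample_size` is chosen by the caller and is
    NOT derived from the paper's upper bound.
    """
    rng = random.Random(seed)
    # Draw transactions uniformly at random, with replacement.
    sample = [rng.choice(dataset) for _ in range(sample_size)]
    return top_k_itemsets(sample, k, w)
```

The frequencies returned by `sample_top_k` are estimates of the true frequencies in the full dataset; how large the sample must be for these estimates to be reliable is what the paper's analysis addresses.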


Keywords: Sampling; Top-K frequent itemsets; Frequent itemsets mining; Bloom filters; Progressive sampling





Copyright information

© The Author(s) 2010

Authors and Affiliations

  • Andrea Pietracaprina (1)
  • Matteo Riondato (2)
  • Eli Upfal (2)
  • Fabio Vandin (2)

  1. Dipartimento di Ingegneria dell’Informazione, Università di Padova, Padova, Italy
  2. Department of Computer Science, Brown University, Providence, USA
