Hierarchical Group-Based Sampling

  • Rainer Gemulla
  • Henrike Berthold
  • Wolfgang Lehner
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3567)


Approximate query processing is an adequate technique to reduce response times and system load in cases where approximate results suffice. In the database literature, sampling has been proposed to evaluate queries approximately by using only a subset of the original data. Unfortunately, most of these methods address either specific problems that arise when samples are used in databases (e.g., data skew) or join operations involving multiple relations, but not both. We describe how well-known sampling techniques dealing with group-by operations can be combined with foreign-key joins such that the join is computed after the generation of the sample. Specifically, we show how senate sampling and small group sampling can be combined efficiently with the idea of join synopses. Additionally, we introduce different algorithms that maintain the sample when the underlying data changes. Finally, we demonstrate the superiority of our method over the naive approach in an extensive set of experiments.
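To make the idea of computing the join after the sample is generated more concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions and not the algorithm of the paper: it assumes senate sampling means allocating an equal number of sample rows to every group, it groups by an attribute stored in the fact table itself, and all names (`senate_sample`, `join_after_sampling`, `orders`, `customers`) and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def senate_sample(rows, group_key, per_group, rng):
    """Draw up to `per_group` rows uniformly at random from each group.

    Assumption: "senate" allocation gives every group the same sample
    budget, so small groups are not drowned out by large ones.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    sample = []
    for key in sorted(groups):  # sorted only to make the order deterministic
        members = groups[key]
        k = min(per_group, len(members))
        sample.extend(rng.sample(members, k))
    return sample


def join_after_sampling(sample, fk, dimension):
    """Follow the foreign key `fk` for the sampled rows only."""
    return [{**row, **dimension[row[fk]]} for row in sample]


# Hypothetical toy data: a skewed fact table and a small referenced table.
orders = [
    {"order_id": i, "cust_id": i % 3, "status": "open" if i % 10 == 0 else "closed"}
    for i in range(100)
]
customers = {0: {"region": "EU"}, 1: {"region": "US"}, 2: {"region": "APAC"}}

rng = random.Random(0)
sample = senate_sample(orders, "status", per_group=5, rng=rng)
synopsis = join_after_sampling(sample, "cust_id", customers)
```

In this sketch the rare `open` group (10 of 100 rows) receives the same number of sample rows as the frequent `closed` group, and the foreign-key lookup touches only the sampled rows. The case the abstract targets, where group-by attributes live in referenced relations and the sample must be maintained under updates, is not covered here.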


Simple Random Sampling; Naive Approach; Reference Table; Successor Node; Source Relation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Rainer Gemulla (1)
  • Henrike Berthold (1)
  • Wolfgang Lehner (1)
  1. Database Technology Group, Dresden University of Technology
