“On-the-fly” VS Materialized Sampling and Heuristics

  • Pedro Furtado
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2737)


Aggregation queries can take hours to return answers in large data warehouses (DW). Users who explore data in several iterative steps with decision support or data mining tools may be frustrated by such long response times. The ability to return fast approximate answers accurately and efficiently is therefore important to these applications. Samples for query answering can be obtained “on-the-fly” (OS) or from a materialized summary of samples (MS). While MS summaries are typically faster than OS, their sampling rates are fixed when the summary is constructed. This paper analyzes the use of OS versus MS for approximate answering of aggregation queries and proposes a sampling heuristic that chooses the sampling rate so as to answer as fast as possible while still meeting the accuracy targets. The experimental section compares OS to MS in terms of response time and accuracy on the TPC-H benchmark and shows the heuristic strategy in action.
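The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the heuristic proposed in the paper, whose details are not given here. It assumes an AVG-style aggregate, a confidence interval based on the normal approximation, and prior estimates of the column mean and standard deviation. The function names `required_sample_size` and `choose_sampling` and all parameters are hypothetical.

```python
import math


def required_sample_size(mean, std, rel_error, z=1.96):
    """Rows needed so that the normal-approximation confidence interval
    half-width of an AVG estimate is at most rel_error * |mean|.
    z=1.96 corresponds to roughly 95% confidence (an assumption)."""
    if mean == 0 or rel_error <= 0:
        raise ValueError("mean must be non-zero and rel_error positive")
    return math.ceil((z * std / (rel_error * abs(mean))) ** 2)


def choose_sampling(population, mean, std, rel_error, materialized_rates):
    """Return (source, rate).

    Prefer the smallest materialized (MS) rate that supplies enough rows,
    since its rates are fixed at construction time. If none is large enough,
    fall back to on-the-fly (OS) sampling at the rate the target requires."""
    needed = required_sample_size(mean, std, rel_error)
    needed_rate = min(1.0, needed / population)
    for rate in sorted(materialized_rates):
        if rate >= needed_rate:
            return ("MS", rate)
    return ("OS", needed_rate)


if __name__ == "__main__":
    # Hypothetical table of one million rows with summaries at 0.1%, 1%, 10%.
    print(choose_sampling(1_000_000, mean=100, std=50, rel_error=0.05,
                          materialized_rates=[0.001, 0.01, 0.1]))
    print(choose_sampling(1_000_000, mean=100, std=50, rel_error=0.001,
                          materialized_rates=[0.001, 0.01, 0.1]))
```

With a loose accuracy target the smallest materialized summary suffices. With a tight target no predefined rate is large enough, so the sketch falls back to on-the-fly sampling, which is the limitation of MS that the abstract points out.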


Sampling Rate, Accuracy Target, Heuristic Strategy, Query Pattern, Query Answering





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Pedro Furtado, Centro de Informática e Sistemas (DEI-CISUC), Universidade de Coimbra
