“On-the-fly” VS Materialized Sampling and Heuristics
Aggregation queries can take hours to return answers in large data warehouses (DW). Users who explore data in several iterative steps with decision support or data mining tools may be frustrated by such long response times. The ability to return fast approximate answers accurately and efficiently is therefore important to these applications. Samples for use in query answering can be obtained “on-the-fly” (OS) or from a materialized summary of samples (MS). While MS are typically faster than OS, they have the limitation that sampling rates are fixed when the summary is constructed. This paper analyzes the use of OS versus MS for approximate answering of aggregation queries and proposes a sampling heuristic that chooses the sampling rate so as to provide answers as fast as possible while still meeting accuracy targets. The experimental section compares OS with MS in terms of response time and accuracy on the TPC-H benchmark and shows the heuristic strategy in action.
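The idea of choosing a sampling rate against an accuracy target can be illustrated with a minimal sketch. This is not the heuristic proposed in the paper, whose details the abstract does not give. It assumes a uniform Bernoulli sample, an AVG aggregate, and a normal-approximation confidence interval without finite-population correction. The names `estimate_avg` and `choose_rate` and the candidate rates are hypothetical.

```python
import math
import random
import statistics


def estimate_avg(values, rate, rng, z=1.96):
    """Estimate AVG from a Bernoulli sample taken at the given rate.

    Returns (estimate, relative half-width of the ~95% confidence interval).
    Assumption: normal approximation, no finite-population correction.
    """
    sample = [v for v in values if rng.random() < rate]
    if len(sample) < 2:
        # Too few rows to estimate the variance.
        return None, math.inf
    mean = statistics.fmean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(len(sample))
    rel_error = half_width / abs(mean) if mean != 0 else math.inf
    return mean, rel_error


def choose_rate(values, rates, target_rel_error, seed=0):
    """Try candidate rates from smallest (fastest) to largest.

    Returns (rate, estimate, estimated relative error) for the first rate
    whose estimated error meets the target; falls back to the exact answer
    (rate 1.0) when no candidate rate is accurate enough.
    """
    rng = random.Random(seed)
    for rate in sorted(rates):
        estimate, rel_error = estimate_avg(values, rate, rng)
        if rel_error <= target_rel_error:
            return rate, estimate, rel_error
    return 1.0, statistics.fmean(values), 0.0


if __name__ == "__main__":
    data_rng = random.Random(42)
    column = [data_rng.gauss(100.0, 20.0) for _ in range(50_000)]
    print(choose_rate(column, rates=[0.001, 0.01, 0.1], target_rel_error=0.02))
```

In this sketch, an MS setting would correspond to a `rates` list restricted to the rates fixed when the summary was built, whereas an OS setting could pass any candidate rates.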
Keywords: Sampling Rate, Accuracy Target, Heuristic Strategy, Query Pattern, Query Answering