Journal of Intelligent Information Systems, Volume 19, Issue 2, pp 145–167

Approximate Query Answering Using Data Warehouse Striping

  • Jorge R. Bernardino
  • Pedro S. Furtado
  • Henrique C. Madeira
Article

Abstract

This paper presents and evaluates a simple but very effective method for implementing large data warehouses on an arbitrary number of computers, achieving very high query execution performance and scalability. The data is distributed and processed across a potentially large number of autonomous computers using our technique, called data warehouse striping (DWS). The major problem of the DWS technique is that it would require a very expensive cluster of computers with fault-tolerant capabilities to prevent a fault in a single computer from stopping the whole system. In this paper, we propose a radically different approach to the problem of the unavailability of one or more computers in the cluster, allowing DWS to be used with a very large number of inexpensive computers. The proposed approach is based on approximate query answering techniques, which make it possible to deliver an approximate answer to the user even when one or more computers in the cluster are unavailable. The evaluation presented in the paper shows, both analytically and experimentally, that the approximate results obtained this way have a very small error that can be negligible in most cases.

data warehousing, distributed query optimization, data partitioning, performance optimization, approximate query answering
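
The idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how rows are assigned to computers or which estimator is used, so the round-robin placement, the scale-up estimator, and the names `stripe` and `approximate_sum` are all assumptions made for illustration.

```python
import random


def stripe(rows, n_nodes):
    """Distribute fact rows round-robin over n_nodes partitions.

    Round-robin placement is an assumption; the abstract only says the
    data is distributed over many computers.
    """
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return nodes


def approximate_sum(nodes, available):
    """Estimate SUM over all nodes using only the available ones.

    The partial sum is scaled by n_nodes / n_available (a hypothetical
    estimator). This is only reasonable when every node holds a similar,
    roughly random share of the rows.
    """
    if not available:
        raise ValueError("no node available")
    partial = sum(sum(nodes[i]) for i in available)
    return partial * len(nodes) / len(available)


# Synthetic fact values striped over 10 nodes, with node 7 unavailable.
random.seed(0)
facts = [random.uniform(0, 100) for _ in range(100_000)]
nodes = stripe(facts, 10)
exact = sum(facts)
estimate = approximate_sum(nodes, available=[0, 1, 2, 3, 4, 5, 6, 8, 9])
rel_error = abs(estimate - exact) / exact
```

With evenly striped data, each node behaves like a sample of the whole fact table, which is why the estimate computed from the remaining nodes stays close to the exact sum in this sketch.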

Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Jorge R. Bernardino (1)
  • Pedro S. Furtado (2)
  • Henrique C. Madeira (2)
  1. Polytechnic of Coimbra, ISEC, DEIS, Coimbra, Portugal
  2. DEI, Pólo II, University of Coimbra, Coimbra, Portugal
