A Global Paradigm for Designing Parallel Relational Data Warehouses in Distributed Environments

  • Soumia Benkrid
  • Ladjel Bellatreche
  • Alfredo Cuzzocrea
Chapter
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8920)

Abstract

Designing a Parallel Relational Data Warehouse (PRDW) involves a set of tasks: (i) choosing the hardware architecture; (ii) fragmenting the data warehouse schema; (iii) allocating the generated fragments; (iv) replicating fragments to ensure high performance; and (v) defining the strategies for load balancing and query processing. The major drawback of this life cycle is that it ignores the inter-dependencies among the sub-problems of PRDW design and relies on heterogeneous metrics to evaluate the “quality” of the final design. In previous research, we introduced an analytical cost model for parallel OLAP query processing in cluster environments. In a second effort, we took into account the inter-dependency between fragmentation and allocation. In this paper, we propose a novel methodology, called \(\mathcal {F}\)&\(\mathcal {A}\)&\(\mathcal {R}\), which extends these results and defines an approach in which the main PRDW design phases (i.e., fragmentation, allocation, and replication) are performed simultaneously, in a global fashion. In particular, our approach determines whether the fragmentation pattern currently generated is relevant to the allocation process. An original method for supporting data replication, based on fuzzy k-means clustering, is also proposed and integrated within the overall design framework. Finally, we experimentally assess the performance of \(\mathcal {F}\)&\(\mathcal {A}\)&\(\mathcal {R}\) on a well-known data warehouse benchmark, with very promising results.
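
To make the replication idea more concrete, here is a minimal Python sketch of how fuzzy clustering could drive replica placement. It is not the authors' algorithm. The abstract does not give the features, the number of clusters, or the replication rule, so everything below is an assumption for illustration: fragments are described by hypothetical two-dimensional access profiles, each cluster stands for one node, and a fragment is copied to every node where its membership degree reaches a hypothetical threshold of 0.3. The function names `fuzzy_c_means` and `replica_placement` are likewise invented here.

```python
import math


def fuzzy_c_means(points, centers, m=2.0, iters=50):
    """Plain fuzzy c-means. Returns (memberships, centers).

    points  : list of equal-length tuples of floats
    centers : initial cluster centers (fixed so the run is deterministic)
    m       : fuzzifier, must be > 1
    """
    centers = [list(c) for c in centers]
    dims = len(points[0])
    u = []
    for _ in range(iters):
        # Membership update: u[i][j] = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))
        u = []
        for p in points:
            d = [math.dist(p, c) for c in centers]
            if 0.0 in d:
                # The point sits exactly on a center: full membership there.
                row = [1.0 if x == 0.0 else 0.0 for x in d]
                s = sum(row)
                row = [x / s for x in row]
            else:
                e = 2.0 / (m - 1.0)
                row = [1.0 / sum((dj / dk) ** e for dk in d) for dj in d]
            u.append(row)
        # Center update: mean of the points weighted by u^m.
        for j in range(len(centers)):
            w = [row[j] ** m for row in u]
            total = sum(w)
            if total > 0.0:
                centers[j] = [
                    sum(wi * p[k] for wi, p in zip(w, points)) / total
                    for k in range(dims)
                ]
    return u, centers


def replica_placement(u, threshold=0.3):
    """For each fragment, list the clusters whose membership >= threshold."""
    return [[j for j, v in enumerate(row) if v >= threshold] for row in u]


# Hypothetical access profiles: two fragments used by one query group,
# two used by another, and one fragment shared by both groups.
fragments = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (9.0, 10.0), (5.0, 5.0)]
u, centers = fuzzy_c_means(fragments, centers=[(0.0, 0.0), (10.0, 10.0)])
placement = replica_placement(u, threshold=0.3)
print(placement)
```

In this toy input the last fragment lies midway between the two groups, so its membership is split roughly evenly and it is placed on both nodes, while the other fragments each go to a single node. Unlike hard k-means, the membership degrees give a natural signal for which fragments are worth replicating.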

Keywords

Data warehouse, Distributed environment, Fragmentation, Allocation, Replication, Load balancing, Analytical cost model, Design methodology

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Soumia Benkrid (1, 2), email author
  • Ladjel Bellatreche (1)
  • Alfredo Cuzzocrea (3)

  1. LIAS/ISAE-ENSMA, Poitiers, France
  2. National High School for Computer Science (ESI), Algiers, Algeria
  3. ICAR-CNR and University of Calabria, Rende, Italy
