A rewrite/merge approach for supporting real-time data warehousing via lightweight data integration

  • Alfredo Cuzzocrea
  • Nickerson Ferreira
  • Pedro Furtado
Article

Abstract

This paper proposes and experimentally assesses a rewrite/merge approach for supporting real-time data warehousing via lightweight data integration. Real-time data warehouses are becoming increasingly relevant, driven by emerging research challenges such as Big Data and Cloud Computing. Our contribution addresses the limitations of current data warehousing architectures, which are not suitable for performing classical operations (e.g., loading, aggregation, indexing, and OLAP query answering) under real-time constraints. The proposed approach is based on the intelligent manipulation of the SQL statements of input queries. In the rewrite phase, each input query is decomposed into suitable sub-queries. In the merge phase, these sub-queries are submitted to an ad hoc component that answers them cooperatively, using a method inspired by parallel query processing. The result is a novel data warehousing framework in which the static phase is separated from the dynamic phase in order to achieve real-time processing. We complement our analytical contributions with an extensive experimental campaign in which we stress the performance of the proposed real-time data warehousing framework against a popular data warehouse benchmark and compare it with traditional architectures. The results confirm the benefits of our proposal.
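
To make the rewrite/merge idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a warehouse split into a static table and a dynamic table (the names `sales_static` and `sales_dynamic`, the `region`/`amount` schema, and the helper functions `rewrite`, `merge`, and `run` are all hypothetical), uses an in-memory SQLite database, and runs the sub-queries sequentially rather than in parallel. An average per group is decomposed into per-table SUM and COUNT sub-queries (rewrite), whose partial results are then combined (merge).

```python
import sqlite3
from collections import defaultdict


def rewrite(query_template, tables):
    """Rewrite phase (illustrative): one sub-query per physical table."""
    return [query_template.format(table=t) for t in tables]


def merge(partials):
    """Merge phase (illustrative): combine partial SUM/COUNT rows per group."""
    totals = defaultdict(lambda: [0.0, 0])
    for rows in partials:
        for key, partial_sum, partial_count in rows:
            totals[key][0] += partial_sum
            totals[key][1] += partial_count
    # AVG is not directly mergeable, so it is derived from SUM and COUNT.
    return {
        key: {"sum": s, "count": c, "avg": s / c}
        for key, (s, c) in totals.items()
    }


def run(conn, query_template, tables):
    """Execute every sub-query and merge the partial answers."""
    subqueries = rewrite(query_template, tables)
    partials = [conn.execute(q).fetchall() for q in subqueries]
    return merge(partials)


def demo():
    # Hypothetical schema: a large static part and a small, fresh dynamic part.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sales_static (region TEXT, amount REAL);
        CREATE TABLE sales_dynamic (region TEXT, amount REAL);
        INSERT INTO sales_static VALUES ('EU', 10.0), ('EU', 20.0), ('US', 5.0);
        INSERT INTO sales_dynamic VALUES ('EU', 30.0), ('ASIA', 7.0);
        """
    )
    template = (
        "SELECT region, SUM(amount), COUNT(*) FROM {table} GROUP BY region"
    )
    return run(conn, template, ["sales_static", "sales_dynamic"])


if __name__ == "__main__":
    for region, stats in sorted(demo().items()):
        print(region, stats)
```

Keeping freshly loaded rows in a separate table means that loading does not have to touch the large, indexed static part, while queries still see both parts through the merge step. How the actual framework partitions data and schedules sub-queries is described in the paper itself, not in this sketch.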

Keywords

  • Real-time data warehousing
  • Data warehouse optimization
  • Data warehouse performance

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Alfredo Cuzzocrea (1), email author
  • Nickerson Ferreira (2)
  • Pedro Furtado (2)
  1. DIA Department, University of Trieste, Trieste, Italy
  2. DEI Department, University of Coimbra, Coimbra, Portugal
