Distributed and Parallel Databases, Volume 37, Issue 4, pp 623–649

A memory-optimal many-to-many semi-stream join

  • M. Asif Naeem (Email author)
  • Gerald Weber
  • Christof Lutteroth
Article

Abstract

Semi-stream join algorithms join a fast stream input with a disk-based master data relation. A common class of these algorithms is derived from hash joins: they use the stream as build input for a main hash table, and also include a cache for frequent master data. The composition of the cache is very important for performance; however, the decision of which master data to cache has so far been solely based on heuristics. We present the first formal criterion, a cache inequality that leads to a provably optimal composition of the cache in a semi-stream many-to-many equijoin algorithm. We propose a novel algorithm, Semi-Stream Balanced Join (SSBJ), which exploits this cache inequality to achieve a given service rate with a provably minimal amount of memory for all stream distributions. We present a cost model for SSBJ and compare its service rate empirically and analytically with other related approaches.
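The general structure described above (stream tuples buffered in a main hash table, plus an in-memory cache of frequently used master data) can be illustrated with a minimal sketch. This is not the SSBJ algorithm and does not implement the cache inequality from the paper: the function name `semi_stream_join`, the `batch_size` parameter, the in-memory `master` dictionary standing in for the disk-based relation, and the caller-supplied `cache_keys` are all assumptions made for illustration.

```python
from collections import defaultdict


def semi_stream_join(stream, master, cache_keys, batch_size=4):
    """Illustrative many-to-many equijoin of a stream with master data.

    stream:     iterable of (key, payload) tuples.
    master:     dict mapping key -> list of master rows; a stand-in for
                the disk-based relation (assumption for this sketch).
    cache_keys: keys whose master rows are kept in memory; how to choose
                them is exactly what the paper studies and is NOT
                modelled here.
    Returns (output tuples, number of simulated disk probes).
    """
    cache = {k: master[k] for k in cache_keys if k in master}
    pending = defaultdict(list)  # main hash table built from stream tuples
    pending_count = 0
    output = []
    disk_probes = 0

    def flush():
        # Probe the master data once per distinct buffered key.
        nonlocal pending_count, disk_probes
        for key, payloads in pending.items():
            disk_probes += 1
            for row in master.get(key, []):
                for payload in payloads:
                    output.append((key, payload, row))
        pending.clear()
        pending_count = 0

    for key, payload in stream:
        if key in cache:
            # Cache hit: join immediately, no disk access needed.
            for row in cache[key]:
                output.append((key, payload, row))
        else:
            # Cache miss: buffer the tuple until the next batch probe.
            pending[key].append(payload)
            pending_count += 1
            if pending_count >= batch_size:
                flush()
    flush()  # join whatever is still buffered at end of stream
    return output, disk_probes


master = {"a": ["A1", "A2"], "b": ["B1"], "c": ["C1"]}
stream = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]

cached_out, cached_probes = semi_stream_join(stream, master, {"a"})
plain_out, plain_probes = semi_stream_join(stream, master, set())
```

Both calls produce the same join result; in this toy run, caching the frequent key `a` only reduces the number of simulated disk probes. Which keys deserve a place in the cache for a given memory budget is the question the cache inequality addresses.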

Keywords

  • Many-to-many semi-stream join
  • Cache optimization
  • Performance evaluation

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • M. Asif Naeem (1, Email author)
  • Gerald Weber (2)
  • Christof Lutteroth (3)

  1. School of Engineering, Computer and Mathematical Sciences, Auckland University of Technology, Auckland, New Zealand
  2. Department of Computer Science, The University of Auckland, Auckland, New Zealand
  3. Department of Computer Science, University of Bath, Bath, UK
