Business Intelligence and Analytics: Big Systems for Big Data

  • Herodotos Herodotou
Chapter
Part of the Palgrave Studies in Democracy, Innovation, and Entrepreneurship for Growth book series (DIG)

Abstract

The amount of data collected by modern industrial, government, and academic organizations has been increasing exponentially and will continue to grow at an accelerating rate for the foreseeable future. At companies across all industries, servers are overflowing with usage logs, message streams, transaction records, sensor data, business operations records, and mobile device data. Effectively analyzing these massive collections of data (“big data”) can create significant value for the world economy by enhancing productivity, increasing efficiency, and delivering more value to consumers. The need to convert raw data into useful information has led to the development of advanced and unique data storage, management, analysis, and visualization technologies, especially over the last decade. This monograph attempts to cover the design principles and core features of systems for analyzing very large datasets for business purposes. In particular, we organize systems into four main categories based on major and distinctive technological innovations. Parallel databases dating back to the 1980s have added techniques like columnar data storage and processing, while new distributed platforms like MapReduce have been developed. Other innovations have aimed at creating alternative system architectures for more generalized dataflow applications. Finally, the growing demand for interactive analytics has led to the emergence of a new class of systems that combine analytical and transactional capabilities.

Keywords

Business intelligence; Big data analytics; MPP databases; MapReduce systems; Dataflow systems; Interactive analytics

Copyright information

© The Author(s) 2017

Authors and Affiliations

  1. Department of Electrical Engineering and Computer Engineering and Informatics (EECEI), Cyprus University of Technology, Limassol, Cyprus
