Abstract
Virtualization is the key concept for providing a scalable and flexible computing environment. In this chapter, we focus on virtualization concepts in the context of data management tasks and review existing concepts and technologies that span multiple software layers.
Notes
- 1.
The term “MapReduce” is spelled in several ways, such as “map/reduce” or “map-reduce”. We use the spelling chosen by the authors of the original paper [7].
- 2.
The Kleene star is a unary operation over an alphabet (set) and denotes all strings that can be built over that alphabet, including the empty string. In our notation, it denotes zero or more occurrences of the elements of the base set. The Kleene plus denotes all of these strings except the empty string. In our notation, the Kleene plus denotes one or more occurrences of the elements of the base set. Both operations are commonly used in regular expressions.
- 3.
- 4.
In principle, MapReduce can obtain its input from other sources as well. The distributed file system is, however, the most common case.
- 5.
Implementations such as Hadoop offer special setup and cleanup functions that allow such data structures to be created and torn down.
- 6.
- 7.
A common heuristic in DAG dataflows is to reduce the number of vertices and edges. For that reason, the code of the vertices I, m, and S might be combined into one vertex. Likewise, the code of G, r, and O could be combined.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
typically specified and encoded using Google’s protocol buffers.
- 16.
- 17.
depending on whether you include the DataSource and DataSink contracts as part of the PACT graphs or not
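The difference between the Kleene star and the Kleene plus described in note 2 can be illustrated with regular expressions. The following is a minimal sketch; the base set {a, b} and the helper name `matches` are assumptions chosen for illustration and do not come from the chapter.

```python
import re

# Kleene star: zero or more occurrences of elements of the base set {a, b}.
STAR = re.compile(r"[ab]*")
# Kleene plus: one or more occurrences, so the empty string is excluded.
PLUS = re.compile(r"[ab]+")


def matches(pattern: re.Pattern, s: str) -> bool:
    """Return True if the whole string s belongs to the language of pattern."""
    return pattern.fullmatch(s) is not None


if __name__ == "__main__":
    # Print membership in the star and plus languages for a few sample strings.
    for s in ["", "a", "abba", "abc"]:
        print(repr(s), matches(STAR, s), matches(PLUS, s))
```

The two languages differ only for the empty string, which the star accepts and the plus rejects.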
References
Battré, D., Ewen, S., Hueske, F., Kao, O., Markl, V., Warneke, D.: Nephele/pacts: a programming model and execution framework for web-scale analytical processing. In: Proceedings of the ACM symposium on Cloud computing, pp. 119–130 (2010)
Beyer, K., Ercegovac, V., Gemulla, R., Balmin, A., Eltabakh, M., Kanne, C.C., Ozcan, F., Shekita, E.J.: Jaql: A scripting language for large scale semistructured data analysis. PVLDB (2011)
Borkar, V.R., Carey, M.J., Grover, R., Onose, N., Vernica, R.: Hyracks: A flexible and extensible foundation for data-intensive computing. In: ICDE, pp. 1151–1162 (2011)
Chattopadhyay, B., Lin, L., Liu, W., Mittal, S., Aragonda, P., Lychagina, V., Kwon, Y., Wong, M.: Tenzing: a sql implementation on the mapreduce framework. PVLDB 4(12), 1318–1327 (2011)
Chaudhuri, S.: An overview of query optimization in relational systems. In: PODS, pp. 34–43 (1998)
Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970)
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. In: Proceedings of the conference on Symposium on Operating Systems Design & Implementation, pp. 10–10 (2004)
DeWitt, D.J., Gerber, R.H., Graefe, G., Heytens, M.L., Kumar, K.B., Muralikrishna, M.: Gamma – a high performance dataflow database machine. In: Proceedings of the International Conference on Very Large Databases (VLDB), pp. 228–237 (1986)
DeWitt, D.J., Gray, J.: Parallel database systems: The future of high performance database systems. Communications of the ACM 35(6), 85–98 (1992)
Dittrich, J., Quiané-Ruiz, J.A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1), 518–529 (2010)
Fushimi, S., Kitsuregawa, M., Tanaka, H.: An overview of the system software of a parallel relational database machine grace. In: Proceedings of the International Conference on Very Large Databases (VLDB), pp. 209–219 (1986)
Garcia-Molina, H., Ullman, J.D., Widom, J.: Database systems – the complete book (2. ed.). Pearson Education (2009)
Gates, A., Natkovich, O., Chopra, S., Kamath, P., Narayanam, S., Olston, C., Reed, B., Srinivasan, S., Srivastava, U.: Building a high-level dataflow system on top of mapreduce: The pig experience. PVLDB 2(2), 1414–1425 (2009)
Ghemawat, S., Gobioff, H., Leung, S.T.: The google file system. SIGOPS 37(5), 29–43 (2003)
Graefe, G.: Parallel query execution algorithms. In: Encyclopedia of Database Systems, pp. 2030–2035. Springer US (2009)
Graefe, G.: Modern b-tree techniques. Foundations and Trends in Databases 3(4), 203–402 (2011)
Graefe, G., Bunker, R., Cooper, S.: Hash joins and hash teams in microsoft sql server. In: VLDB, pp. 86–97 (1998)
Graefe, G., McKenna, W.J.: The volcano optimizer generator: Extensibility and efficient search. In: ICDE, pp. 209–218 (1993)
Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: SIGMOD Conference, pp. 127–138 (1995)
Howard, J., Dighe, S., Hoskote, Y., Vangal, S., Finan, D., Ruhl, G., Jenkins, D., Wilson, H., Borkar, N., Schrom, G.: Dvfs in 45nm cmos. IEEE Technology 9(2), 922–933 (2010)
Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of the ACM SIGOPS/EuroSys European Conference on Computer Systems, pp. 59–72 (2007)
Kemper, A., Eickler, A.: Datenbanksysteme: Eine Einführung. Oldenbourg Wissenschaftsverlag (2006)
Kolb, L., Thor, A., Rahm, E.: Parallel sorted neighborhood blocking with mapreduce. In: BTW, pp. 45–64 (2011)
Kolb, L., Thor, A., Rahm, E.: Load balancing for mapreduce-based entity resolution. In: ICDE, pp. 618–629 (2012)
Maier, D.: The Theory of Relational Databases. Computer Science Press (1983)
Markl, V., Lohman, G.M., Raman, V.: Leo: An autonomic query optimizer for db2. IBM Systems Journal 42(1), 98–106 (2003)
Neumann, T.: Query optimization (in relational databases). In: Encyclopedia of Database Systems, pp. 2273–2278. Springer US (2009)
Neumann, T.: Efficiently compiling efficient query plans for modern hardware. PVLDB 4(9), 539–550 (2011)
Olston, C., Reed, B., Silberstein, A., Srivastava, U.: Automatic optimization of parallel dataflow programs. In: USENIX Annual Technical Conference, pp. 267–273 (2008)
Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 1099–1110 (2008)
Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: Parallel analysis with sawzall. Scientific Programming 13(4), 277–298 (2005)
Rao, J., Ross, K.A.: Reusing invariants: A new strategy for correlated queries. In: SIGMOD, pp. 37–48 (1998)
Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E.J., O’Neil, P.E., Rasin, A., Tran, N., Zdonik, S.B.: C-store: A column-oriented dbms. In: VLDB, pp. 553–564 (2005)
Stonebraker, M., Abadi, D.J., DeWitt, D.J., Madden, S., Paulson, E., Pavlo, A., Rasin, A.: Mapreduce and parallel dbmss: friends or foes? Communications of the ACM 53(1), 64–71 (2010)
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Anthony, S., Liu, H., Murthy, R.: Hive – a petabyte scale data warehouse using hadoop. In: ICDE, pp. 996–1005 (2010)
Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using mapreduce. In: SIGMOD, pp. 495–506 (2010)
Warneke, D., Kao, O.: Nephele: efficient parallel data processing in the cloud. In: Proceedings of the Workshop on Many-Task Computing on Grids and Supercomputers, pp. 1–10 (2009)
Wensel, C.K.: Cascading: Defining and executing complex and fault tolerant data processing workflows on a hadoop cluster (2008). http://www.cascading.org
White, T.: Hadoop: The Definitive Guide. O’Reilly Media (2009)
Yang, H.c., Dasdan, A., Hsiao, R.L., Parker, D.S.: Map-reduce-merge: simplified relational data processing on large clusters. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 1029–1040 (2007)
Copyright information
© 2013 Springer Science+Business Media New York
About this chapter
Cite this chapter
Lehner, W., Sattler, KU. (2013). Web-Scale Analytics for BIG Data. In: Web-Scale Data Management for the Cloud. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6856-1_4
DOI: https://doi.org/10.1007/978-1-4614-6856-1_4
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-6855-4
Online ISBN: 978-1-4614-6856-1
eBook Packages: Computer Science (R0)