Web-Scale Analytics for BIG Data

Lehner, Wolfgang; Sattler, Kai-Uwe

doi:10.1007/978-1-4614-6856-1_4

Abstract

Virtualization is the key concept for providing a scalable and flexible computing environment. In this chapter, we focus on virtualization concepts in the context of data management tasks. We review existing concepts and technologies spanning multiple software layers.

Notes

1. Different spellings of the term “MapReduce” exist, such as “map/reduce” or “map-reduce”. We stick to the spelling used by the authors of the original paper [7].
2. The Kleene star is a unary operation over an alphabet (set) and denotes all strings that can be built over that alphabet, including the empty string. In our notation, it denotes zero, one, or more occurrences of the elements of the base set. The Kleene plus denotes all such strings except the empty string. In our notation, the Kleene plus denotes one or more occurrences of the elements of the base set. The Kleene star and plus operations are commonly used in regular expressions.
3. http://hadoop.apache.org.
4. In principle, MapReduce can obtain its input from other sources as well. The distributed file system is, however, the most common case.
5. Implementations like Hadoop offer special setup and cleanup functions for creating and tearing down such data structures.
6. http://blog.data-miners.com/2008/02/mapreduce-and-k-means-clustering.html.
7. A common heuristic in DAG dataflows is to reduce the number of vertices and edges. For that reason, the code of the vertices I, m, and S might be combined into one vertex. Likewise, the code from G, r, and O could be combined.
8. http://wiki.apache.org/pig/FrontPage.
9. http://www.jaql.org/.
10. http://www-01.ibm.com/software/data/infosphere/biginsights/.
11. http://www.almaden.ibm.com/cs/projects/systemt/.
12. http://wiki.apache.org/hadoop/Hive.
13. http://www.slideshare.net/ragho/hive-icde-2010.
14. http://www.slideshare.net/evamtse/hive-user-group-presentation-from-netflix-3182010-3483386, http://www.youtube.com/watch?v=Idu9OKnAOis.
15. Typically specified and encoded using Google’s protocol buffers.
16. http://code.google.com/p/szl/.
17. Depending on whether you include the DataSource and DataSink contracts as part of the PACT graphs or not.
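
The Kleene star and plus described in note 2 can be illustrated with a small sketch. Because the full Kleene star of a non-empty alphabet is infinite, the sketch enumerates only strings up to a chosen length; the function names `kleene_star` and `kleene_plus` and the `max_len` bound are assumptions made for illustration and do not come from the chapter.

```python
from itertools import product


def kleene_star(alphabet, max_len):
    """Return all strings over `alphabet` with length 0..max_len.

    This is only a finite prefix of the Kleene star, which is infinite
    for any non-empty alphabet. The empty string is always included.
    """
    symbols = sorted(alphabet)
    return [
        "".join(combo)
        for length in range(max_len + 1)
        for combo in product(symbols, repeat=length)
    ]


def kleene_plus(alphabet, max_len):
    """Same enumeration as kleene_star, but without the empty string."""
    return [s for s in kleene_star(alphabet, max_len) if s != ""]


star = kleene_star({"a", "b"}, 2)
plus = kleene_plus({"a", "b"}, 2)
print(star)  # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
print(plus)  # ['a', 'b', 'aa', 'ab', 'ba', 'bb']
```

The only difference between the two results is the empty string, which matches the "zero or more" versus "one or more" reading given in the note.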

References

Battré, D., Ewen, S., Hueske, F., Kao, O., Markl, V., Warneke, D.: Nephele/pacts: a programming model and execution framework for web-scale analytical processing. In: Proceedings of the ACM symposium on Cloud computing, pp. 119–130 (2010)
Beyer, K., Ercegovac, V., Gemulla, R., Balmin, A., Eltabakh, M., Kanne, C.C., Özcan, F., Shekita, E.J.: Jaql: A scripting language for large scale semistructured data analysis. PVLDB (2011)
Borkar, V.R., Carey, M.J., Grover, R., Onose, N., Vernica, R.: Hyracks: A flexible and extensible foundation for data-intensive computing. In: ICDE, pp. 1151–1162 (2011)
Chattopadhyay, B., Lin, L., Liu, W., Mittal, S., Aragonda, P., Lychagina, V., Kwon, Y., Wong, M.: Tenzing: a sql implementation on the mapreduce framework. PVLDB 4(12), 1318–1327 (2011)
Chaudhuri, S.: An overview of query optimization in relational systems. In: PODS, pp. 34–43 (1998)
Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970)
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. In: Proceedings of the Symposium on Operating Systems Design & Implementation, pp. 10–10 (2004)
DeWitt, D.J., Gerber, R.H., Graefe, G., Heytens, M.L., Kumar, K.B., Muralikrishna, M.: Gamma – a high performance dataflow database machine. In: Proceedings of the International Conference on Very Large Databases (VLDB), pp. 228–237 (1986)
DeWitt, D.J., Gray, J.: Parallel database systems: The future of high performance database systems. Communications of the ACM 35(6), 85–98 (1992)
Dittrich, J., Quiané-Ruiz, J.A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1), 518–529 (2010)
Fushimi, S., Kitsuregawa, M., Tanaka, H.: An overview of the system software of a parallel relational database machine grace. In: Proceedings of the International Conference on Very Large Databases (VLDB), pp. 209–219 (1986)
Garcia-Molina, H., Ullman, J.D., Widom, J.: Database systems – the complete book (2. ed.). Pearson Education (2009)
Gates, A., Natkovich, O., Chopra, S., Kamath, P., Narayanam, S., Olston, C., Reed, B., Srinivasan, S., Srivastava, U.: Building a high-level dataflow system on top of mapreduce: The pig experience. PVLDB 2(2), 1414–1425 (2009)
Ghemawat, S., Gobioff, H., Leung, S.T.: The google file system. SIGOPS 37(5), 29–43 (2003)
Graefe, G.: Parallel query execution algorithms. In: Encyclopedia of Database Systems, pp. 2030–2035. Springer US (2009)
Graefe, G.: Modern b-tree techniques. Foundations and Trends in Databases 3(4), 203–402 (2011)
Graefe, G., Bunker, R., Cooper, S.: Hash joins and hash teams in microsoft sql server. In: VLDB, pp. 86–97 (1998)
Graefe, G., McKenna, W.J.: The volcano optimizer generator: Extensibility and efficient search. In: ICDE, pp. 209–218 (1993)
Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: SIGMOD Conference, pp. 127–138 (1995)
Howard, J., Dighe, S., Hoskote, Y., Vangal, S., Finan, D., Ruhl, G., Jenkins, D., Wilson, H., Borkar, N., Schrom, G.: Dvfs in 45nm cmos. IEEE Technology 9(2), 922–933 (2010)
Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of the ACM SIGOPS/EuroSys European Conference on Computer Systems, pp. 59–72 (2007)
Kemper, A., Eickler, A.: Datenbanksysteme: Eine Einführung. Oldenbourg Wissenschaftsverlag (2006)
Kolb, L., Thor, A., Rahm, E.: Parallel sorted neighborhood blocking with mapreduce. In: BTW, pp. 45–64 (2011)
Kolb, L., Thor, A., Rahm, E.: Load balancing for mapreduce-based entity resolution. In: ICDE, pp. 618–629 (2012)
Maier, D.: The Theory of Relational Databases. Computer Science Press (1983)
Markl, V., Lohman, G.M., Raman, V.: Leo: An autonomic query optimizer for db2. IBM Systems Journal 42(1), 98–106 (2003)
Neumann, T.: Query optimization (in relational databases). In: Encyclopedia of Database Systems, pp. 2273–2278. Springer US (2009)
Neumann, T.: Efficiently compiling efficient query plans for modern hardware. PVLDB 4(9), 539–550 (2011)
Olston, C., Reed, B., Silberstein, A., Srivastava, U.: Automatic optimization of parallel dataflow programs. In: USENIX Annual Technical Conference, pp. 267–273 (2008)
Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 1099–1110 (2008)
Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: Parallel analysis with sawzall. Scientific Programming 13(4), 277–298 (2005)
Rao, J., Ross, K.A.: Reusing invariants: A new strategy for correlated queries. In: SIGMOD, pp. 37–48 (1998)
Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E.J., O’Neil, P.E., Rasin, A., Tran, N., Zdonik, S.B.: C-store: A column-oriented dbms. In: VLDB, pp. 553–564 (2005)
Stonebraker, M., Abadi, D.J., DeWitt, D.J., Madden, S., Paulson, E., Pavlo, A., Rasin, A.: Mapreduce and parallel dbmss: friends or foes? Communications of the ACM 53(1), 64–71 (2010)
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Anthony, S., Liu, H., Murthy, R.: Hive – a petabyte scale data warehouse using hadoop. In: ICDE, pp. 996–1005 (2010)
Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using mapreduce. In: SIGMOD, pp. 495–506 (2010)
Warneke, D., Kao, O.: Nephele: efficient parallel data processing in the cloud. In: Proceedings of the Workshop on Many-Task Computing on Grids and Supercomputers, pp. 1–10 (2009)
Wensel, C.K.: Cascading: Defining and executing complex and fault tolerant data processing workflows on a hadoop cluster (2008). http://www.cascading.org
White, T.: Hadoop: The Definitive Guide. O’Reilly Media (2009)
Yang, H.C., Dasdan, A., Hsiao, R.L., Parker, D.S.: Map-reduce-merge: simplified relational data processing on large clusters. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 1029–1040 (2007)

Author information

Authors and Affiliations

Dresden University of Technology, Dresden, Germany
Wolfgang Lehner
Ilmenau University of Technology, Ilmenau, Germany
Kai-Uwe Sattler

About this chapter

Cite this chapter

Lehner, W., Sattler, KU. (2013). Web-Scale Analytics for BIG Data. In: Web-Scale Data Management for the Cloud. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6856-1_4

DOI: https://doi.org/10.1007/978-1-4614-6856-1_4
Published: 19 February 2013
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-6855-4
Online ISBN: 978-1-4614-6856-1
eBook Packages: Computer Science, Computer Science (R0)
