Web-Scale Analytics for BIG Data

Web-Scale Data Management for the Cloud

Abstract

Virtualization is the key concept for providing a scalable and flexible computing environment. In this chapter, we focus on virtualization concepts in the context of data management tasks and review existing concepts and technologies that span multiple software layers.

Notes

  1. Different styles of spelling the term “MapReduce” exist, like “map/reduce” or “map-reduce”. We stick to the spelling used by the authors of the original paper [7].

  2. The Kleene star is a unary operation over an alphabet (set) and denotes all strings that can be built over that alphabet, including the empty string. In our notation, it denotes zero or more occurrences of the elements of the base set. The Kleene plus denotes all such strings except the empty string. In our notation, it denotes one or more occurrences of the elements of the base set. Both operations are commonly used in regular expressions.

  3. http://hadoop.apache.org.

  4. In principle, MapReduce can obtain its input from other sources as well. The distributed file system is, however, the most common case.

  5. Implementations like Hadoop offer special setup and cleanup functions that allow such data structures to be created and torn down.

  6. http://blog.data-miners.com/2008/02/mapreduce-and-k-means-clustering.html.

  7. A common heuristic in DAG dataflows is to reduce the number of vertices and edges. For that reason, the code of the vertices I, m, and S might be combined into one vertex. Likewise, the code of G, r, and O could be combined.

  8. http://wiki.apache.org/pig/FrontPage.

  9. http://www.jaql.org/.

  10. http://www-01.ibm.com/software/data/infosphere/biginsights/.

  11. http://www.almaden.ibm.com/cs/projects/systemt/.

  12. http://wiki.apache.org/hadoop/Hive.

  13. http://www.slideshare.net/ragho/hive-icde-2010.

  14. http://www.slideshare.net/evamtse/hive-user-group-presentation-from-netflix-3182010-3483386, http://www.youtube.com/watch?v=Idu9OKnAOis.

  15. Typically specified and encoded using Google’s protocol buffers.

  16. http://code.google.com/p/szl/.

  17. Depending on whether the DataSource and DataSink contracts are counted as part of the PACT graphs or not.
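
The distinction drawn in Note 2 can be illustrated with a small sketch. The Python below is not taken from the chapter: the helper names `kleene_star` and `kleene_plus` and the `max_len` bound are assumptions introduced for illustration, since the full sets are infinite and can only be enumerated up to a chosen length. The last lines check the same distinction with the `*` and `+` operators of Python's `re` module.

```python
import re
from itertools import product


def kleene_star(alphabet, max_len):
    """Enumerate all strings over `alphabet` with at most `max_len` symbols.

    The empty string is included (zero occurrences of any symbol).
    `max_len` is an illustrative bound; the true Kleene star is infinite.
    """
    return [
        "".join(symbols)
        for length in range(max_len + 1)
        for symbols in product(alphabet, repeat=length)
    ]


def kleene_plus(alphabet, max_len):
    """Same enumeration as kleene_star, but without the empty string."""
    return [s for s in kleene_star(alphabet, max_len) if s]


star = kleene_star(["a", "b"], 2)
plus = kleene_plus(["a", "b"], 2)

# The only difference between the two sets is the empty string.
only_in_star = [s for s in star if s not in plus]

# The same distinction in regular expressions:
# `*` accepts zero occurrences, `+` requires at least one.
star_accepts_empty = re.fullmatch(r"[ab]*", "") is not None
plus_accepts_empty = re.fullmatch(r"[ab]+", "") is not None
```

Here `only_in_star` contains just the empty string, `star_accepts_empty` is true, and `plus_accepts_empty` is false, which mirrors "zero or more" versus "one or more" occurrences.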

References

  1. Battré, D., Ewen, S., Hueske, F., Kao, O., Markl, V., Warneke, D.: Nephele/pacts: a programming model and execution framework for web-scale analytical processing. In: Proceedings of the ACM symposium on Cloud computing, pp. 119–130 (2010)

  2. Beyer, K., Ercegovac, V., Gemulla, R., Balmin, A., Eltabakh, M., Kanne, C.C., Ozcan, F., Shekita, E.J.: Jaql: A scripting language for large scale semistructured data analysis. PVLDB (2011)

  3. Borkar, V.R., Carey, M.J., Grover, R., Onose, N., Vernica, R.: Hyracks: A flexible and extensible foundation for data-intensive computing. In: ICDE, pp. 1151–1162 (2011)

  4. Chattopadhyay, B., Lin, L., Liu, W., Mittal, S., Aragonda, P., Lychagina, V., Kwon, Y., Wong, M.: Tenzing a sql implementation on the mapreduce framework. PVLDB 4(12), 1318–1327 (2011)

  5. Chaudhuri, S.: An overview of query optimization in relational systems. In: PODS, pp. 34–43 (1998)

  6. Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970)

  7. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. In: Proceedings of the conference on Symposium on Operating Systems Design & Implementation, pp. 10–10 (2004)

  8. DeWitt, D.J., Gerber, R.H., Graefe, G., Heytens, M.L., Kumar, K.B., Muralikrishna, M.: Gamma – a high performance dataflow database machine. In: Proceedings of the International Conference on Very Large Databases (VLDB), pp. 228–237 (1986)

  9. DeWitt, D.J., Gray, J.: Parallel database systems: The future of high performance database systems. Communications of the ACM 35(6), 85–98 (1992)

  10. Dittrich, J., Quiané-Ruiz, J.A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1), 518–529 (2010)

  11. Fushimi, S., Kitsuregawa, M., Tanaka, H.: An overview of the system software of a parallel relational database machine grace. In: Proceedings of the International Conference on Very Large Databases (VLDB), pp. 209–219 (1986)

  12. Garcia-Molina, H., Ullman, J.D., Widom, J.: Database systems – the complete book (2. ed.). Pearson Education (2009)

  13. Gates, A., Natkovich, O., Chopra, S., Kamath, P., Narayanam, S., Olston, C., Reed, B., Srinivasan, S., Srivastava, U.: Building a high-level dataflow system on top of mapreduce: The pig experience. PVLDB 2(2), 1414–1425 (2009)

  14. Ghemawat, S., Gobioff, H., Leung, S.T.: The google file system. SIGOPS 37(5), 29–43 (2003)

  15. Graefe, G.: Parallel query execution algorithms. In: Encyclopedia of Database Systems, pp. 2030–2035. Springer US (2009)

  16. Graefe, G.: Modern b-tree techniques. Foundations and Trends in Databases 3(4), 203–402 (2011)

  17. Graefe, G., Bunker, R., Cooper, S.: Hash joins and hash teams in microsoft sql server. In: VLDB, pp. 86–97 (1998)

  18. Graefe, G., McKenna, W.J.: The volcano optimizer generator: Extensibility and efficient search. In: ICDE, pp. 209–218 (1993)

  19. Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: SIGMOD Conference, pp. 127–138 (1995)

  20. Howard, J., Dighe, S., Hoskote, Y., Vangal, S., Finan, D., Ruhl, G., Jenkins, D., Wilson, H., Borkar, N., Schrom, G.: Dvfs in 45nm cmos. IEEE Technology 9(2), 922–933 (2010)

  21. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of the ACM SIGOPS/EuroSys European Conference on Computer Systems, pp. 59–72 (2007)

  22. Kemper, A., Eickler, A.: Datenbanksysteme: Eine Einführung. Oldenbourg Wissenschaftsverlag (2006)

  23. Kolb, L., Thor, A., Rahm, E.: Parallel sorted neighborhood blocking with mapreduce. In: BTW, pp. 45–64 (2011)

  24. Kolb, L., Thor, A., Rahm, E.: Load balancing for mapreduce-based entity resolution. In: ICDE, pp. 618–629 (2012)

  25. Maier, D.: The Theory of Relational Databases. Computer Science Press (1983)

  26. Markl, V., Lohman, G.M., Raman, V.: Leo: An autonomic query optimizer for db2. IBM Systems Journal 42(1), 98–106 (2003)

  27. Neumann, T.: Query optimization (in relational databases). In: Encyclopedia of Database Systems, pp. 2273–2278. Springer US (2009)

  28. Neumann, T.: Efficiently compiling efficient query plans for modern hardware. PVLDB 4(9), 539–550 (2011)

  29. Olston, C., Reed, B., Silberstein, A., Srivastava, U.: Automatic optimization of parallel dataflow programs. In: USENIX Annual Technical Conference, pp. 267–273 (2008)

  30. Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 1099–1110 (2008)

  31. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: Parallel analysis with sawzall. Scientific Programming 13(4), 277–298 (2005)

  32. Rao, J., Ross, K.A.: Reusing invariants: A new strategy for correlated queries. In: SIGMOD, pp. 37–48 (1998)

  33. Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E.J., O’Neil, P.E., Rasin, A., Tran, N., Zdonik, S.B.: C-store: A column-oriented dbms. In: VLDB, pp. 553–564 (2005)

  34. Stonebraker, M., Abadi, D.J., DeWitt, D.J., Madden, S., Paulson, E., Pavlo, A., Rasin, A.: Mapreduce and parallel dbmss: friends or foes? Communications of the ACM 53(1), 64–71 (2010)

  35. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Anthony, S., Liu, H., Murthy, R.: Hive – a petabyte scale data warehouse using hadoop. In: ICDE, pp. 996–1005 (2010)

  36. Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using mapreduce. In: SIGMOD, pp. 495–506 (2010)

  37. Warneke, D., Kao, O.: Nephele: efficient parallel data processing in the cloud. In: Proceedings of the Workshop on Many-Task Computing on Grids and Supercomputers, pp. 1–10 (2009)

  38. Wensel, C.K.: Cascading: Defining and executing complex and fault tolerant data processing workflows on a hadoop cluster (2008). http://www.cascading.org

  39. White, T.: Hadoop: The Definitive Guide. O’Reilly Media (2009)

  40. Yang, H.c., Dasdan, A., Hsiao, R.L., Parker, D.S.: Map-reduce-merge: simplified relational data processing on large clusters. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 1029–1040 (2007)

Copyright information

© 2013 Springer Science+Business Media New York

About this chapter

Cite this chapter

Lehner, W., Sattler, KU. (2013). Web-Scale Analytics for BIG Data. In: Web-Scale Data Management for the Cloud. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6856-1_4

  • DOI: https://doi.org/10.1007/978-1-4614-6856-1_4

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-6855-4

  • Online ISBN: 978-1-4614-6856-1

  • eBook Packages: Computer Science, Computer Science (R0)
