Setting Up a Big Data Project: Challenges, Opportunities, Technologies and Optimization

  • Chapter
  • In: Big Data Optimization: Recent Developments and Challenges

Abstract

In the first part of this chapter we illustrate how a big data project can be set up and optimized. We explain the general value of big data analytics for the enterprise and how that value can be derived by analyzing big data. We then introduce the characteristics of big data projects and describe how such projects can be set up, optimized and managed. The first part closes with two exemplary real-world use cases of big data projects. To help readers choose the optimal big data tools for given requirements, the second part outlines the relevant technologies for handling big data, including NoSQL and NewSQL systems, in-memory databases, analytical platforms and Hadoop-based solutions. Finally, the chapter concludes with an overview of big data benchmarks that allow for performance optimization and evaluation of big data technologies. New big data applications in particular bring requirements that make platforms more complex and more heterogeneous; the last part therefore categorizes the relevant benchmarks designed for big data technologies.

Author information

Corresponding author

Correspondence to Marten Rosselli.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Zicari, R.V. et al. (2016). Setting Up a Big Data Project: Challenges, Opportunities, Technologies and Optimization. In: Emrouznejad, A. (eds) Big Data Optimization: Recent Developments and Challenges. Studies in Big Data, vol 18. Springer, Cham. https://doi.org/10.1007/978-3-319-30265-2_2

  • DOI: https://doi.org/10.1007/978-3-319-30265-2_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-30263-8

  • Online ISBN: 978-3-319-30265-2

  • eBook Packages: Engineering, Engineering (R0)