
DIPAR: A Framework for Implementing Big Data Science in Organizations


Part of the book series: Computer Communications and Networks ((CCN))

Abstract

Cloud computing (CC) is a technology aimed at processing and storing very large amounts of data, also referred to as big data (BD). Although this is not the only aim of the cloud paradigm, one of the most important challenges in CC is how to process and manage BD. By the end of 2012, the amount of data generated was approximately 2.8 zettabytes (ZB), i.e., 2.8 trillion GB. One of the areas that contributes to the analysis of BD is data science. This new study area, also called big data science (BDS), has recently become an important topic in organizations because of the value it can generate, both for the organizations themselves and for their customers. One of the challenges in implementing BDS is the current lack of information to help in understanding this new study area. In this context, this chapter presents the define-ingest-preprocess-analyze-report (DIPAR) framework, which proposes a means to implement BDS in organizations and defines its requirements and elements. The framework consists of five stages: define, ingest, preprocess, analyze, and report. It is based on the ISO/IEC 15939 Systems and Software Engineering: Measurement Process standard, the purpose of which is to collect, analyze, and report data relating to the products to be developed.
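
As a rough illustration of how five sequential stages might be chained, the following minimal Python sketch passes a shared context through the stages in the order named in the abstract. Only the stage names come from the abstract. The function bodies, the sample records, and the helper `run_dipar` are hypothetical placeholders and do not reflect how the chapter defines each stage.

```python
from typing import Callable, Dict, List

# Stage order as named in the abstract.
STAGES: List[str] = ["define", "ingest", "preprocess", "analyze", "report"]


def define(ctx: dict) -> dict:
    # Hypothetical: record the question the analysis should answer.
    ctx["question"] = "mean value per source"
    return ctx


def ingest(ctx: dict) -> dict:
    # Hypothetical: stand-in for collecting raw records from data sources.
    ctx["raw"] = [("a", "1"), ("a", "3"), ("b", ""), ("b", "4")]
    return ctx


def preprocess(ctx: dict) -> dict:
    # Hypothetical: drop empty values and convert the rest to floats.
    ctx["clean"] = [(src, float(val)) for src, val in ctx["raw"] if val]
    return ctx


def analyze(ctx: dict) -> dict:
    # Hypothetical: compute the mean value for each source.
    grouped: Dict[str, List[float]] = {}
    for src, val in ctx["clean"]:
        grouped.setdefault(src, []).append(val)
    ctx["result"] = {src: sum(vals) / len(vals) for src, vals in grouped.items()}
    return ctx


def report(ctx: dict) -> dict:
    # Hypothetical: format the results as human-readable lines.
    ctx["report"] = [f"{src}: {mean:.1f}" for src, mean in sorted(ctx["result"].items())]
    return ctx


def run_dipar() -> dict:
    # Run every stage in order, passing the shared context along.
    handlers: Dict[str, Callable[[dict], dict]] = {
        "define": define,
        "ingest": ingest,
        "preprocess": preprocess,
        "analyze": analyze,
        "report": report,
    }
    ctx: dict = {}
    for name in STAGES:
        ctx = handlers[name](ctx)
    return ctx


if __name__ == "__main__":
    for line in run_dipar()["report"]:
        print(line)
```

Keeping each stage as a separate function over a shared context is only one possible design. It makes the stage order explicit and lets any single stage be replaced without touching the others.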

Corresponding author

Correspondence to Luis Eduardo Bautista Villalpando.

Copyright information

© 2014 Springer-Verlag London

About this chapter

Cite this chapter

Bautista Villalpando, L., April, A., Abran, A. (2014). DIPAR: A Framework for Implementing Big Data Science in Organizations. In: Mahmood, Z. (ed) Continued Rise of the Cloud. Computer Communications and Networks. Springer, London. https://doi.org/10.1007/978-1-4471-6452-4_8

  • DOI: https://doi.org/10.1007/978-1-4471-6452-4_8

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-6451-7

  • Online ISBN: 978-1-4471-6452-4

  • eBook Packages: Computer Science, Computer Science (R0)
