DIPAR: A Framework for Implementing Big Data Science in Organizations

  • Luis Eduardo Bautista Villalpando
  • Alain April
  • Alain Abran
Chapter
Part of the Computer Communications and Networks book series (CCN)

Abstract

Cloud computing (CC) is a technology aimed at processing and storing very large amounts of data, also referred to as big data (BD). Although this is not the only aim of the cloud paradigm, one of the most important challenges in CC is how to process and manage BD. By the end of 2012, the amount of data generated was approximately 2.8 zettabytes (ZB), i.e., 2.8 trillion GB. One of the areas that contribute to the analysis of BD is data science. This new study area, also called big data science (BDS), has recently become an important topic in organizations because of the value it can generate, both for the organizations themselves and for their customers. One of the challenges in implementing BDS is the current lack of information to help in understanding this new study area. In this context, this chapter presents the define-ingest-preprocess-analyze-report (DIPAR) framework, which proposes a means to implement BDS in organizations and defines its requirements and elements. The framework consists of five stages: define, ingest, preprocess, analyze, and report. It is based on ISO 15939 (Systems and Software Engineering: Measurement Process), a standard whose purpose is to collect, analyze, and report data relating to the products to be developed.

Keywords

Big data science; Data cleaning; DIPAR framework; ISO 15939; Security; System requirements

Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  • Luis Eduardo Bautista Villalpando (1, 2), corresponding author
  • Alain April (2)
  • Alain Abran (2)
  1. Department of Electronic Systems, Autonomous University of Aguascalientes, Aguascalientes, Mexico
  2. Department of Software Engineering and Information Technology, ETS, University of Quebec, Montreal, Canada
