DIPAR: A Framework for Implementing Big Data Science in Organizations

Bautista Villalpando, Luis Eduardo; April, Alain; Abran, Alain

doi:10.1007/978-1-4471-6452-4_8

Chapter
First Online: 01 January 2014

Part of the book series: Computer Communications and Networks (CCN)

Abstract

Cloud computing (CC) is a technology aimed at processing and storing very large amounts of data, also referred to as big data (BD). Although this is not the only aim of the cloud paradigm, one of the most important challenges in CC is how to process and manage BD. By the end of 2012, the amount of data generated was approximately 2.8 zettabytes (ZB), i.e., 2.8 trillion GB. One of the areas that contributes to the analysis of BD is data science. This new study area, also called big data science (BDS), has recently become an important topic in organizations because of the value it can generate, both for the organizations themselves and for their customers. One of the challenges in implementing BDS is the current lack of information to help in understanding this new study area. In this context, this chapter presents the define-ingest-preprocess-analyze-report (DIPAR) framework, which proposes a means to implement BDS in organizations and defines its requirements and elements. The framework consists of five stages: define, ingest, preprocess, analyze, and report. It is based on the ISO/IEC 15939 measurement process standard for systems and software engineering, the purpose of which is to collect, analyze, and report data relating to the products to be developed.
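
As a rough illustration of the five-stage ordering named in the abstract, the following Python sketch chains one handler per stage. The abstract only names the stages, so everything each handler does here (the metric name, the sample records, the dropping of missing values, the use of a mean) is a hypothetical assumption for illustration and is not taken from the chapter.

```python
from enum import Enum
from statistics import mean
from typing import Any, Callable, Dict


class Stage(Enum):
    """The five DIPAR stages, in the order the abstract lists them."""
    DEFINE = "define"
    INGEST = "ingest"
    PREPROCESS = "preprocess"
    ANALYZE = "analyze"
    REPORT = "report"


# All handlers below are hypothetical placeholders, not the chapter's method.
def define(_: Any) -> Dict[str, Any]:
    # Assumed: decide what is to be measured.
    return {"metric": "mean_latency_ms"}


def ingest(state: Dict[str, Any]) -> Dict[str, Any]:
    # Assumed: collect raw records (hard-coded sample data here).
    return {**state, "raw": [10, None, 20, 30]}


def preprocess(state: Dict[str, Any]) -> Dict[str, Any]:
    # Assumed: drop missing values before analysis.
    return {**state, "clean": [x for x in state["raw"] if x is not None]}


def analyze(state: Dict[str, Any]) -> Dict[str, Any]:
    # Assumed: compute a single summary statistic.
    return {**state, "mean": mean(state["clean"])}


def report(state: Dict[str, Any]) -> str:
    # Assumed: render the result as a one-line text report.
    return f"{state['metric']}={state['mean']:.1f} over {len(state['clean'])} records"


HANDLERS: Dict[Stage, Callable[[Any], Any]] = {
    Stage.DEFINE: define,
    Stage.INGEST: ingest,
    Stage.PREPROCESS: preprocess,
    Stage.ANALYZE: analyze,
    Stage.REPORT: report,
}


def run_pipeline(handlers: Dict[Stage, Callable[[Any], Any]]) -> Any:
    """Apply one handler per stage; Enum iteration follows definition order."""
    state: Any = None
    for stage in Stage:
        state = handlers[stage](state)
    return state


if __name__ == "__main__":
    print(run_pipeline(HANDLERS))
```

Keeping the stage order in a single enumeration means a handler can be swapped (for example, a different ingest source) without touching the sequencing logic.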

Author information

Authors and Affiliations

Department of Electronic Systems, Autonomous University of Aguascalientes, Av. Universidad 940, Ciudad Universitaria, Aguascalientes, AGS, Mexico
Luis Eduardo Bautista Villalpando
Department of Software Engineering and Information Technology, ETS—University of Quebec, 1100 Notre-Dame St., Montreal, Canada
Luis Eduardo Bautista Villalpando, Alain April & Alain Abran

Corresponding author

Correspondence to Luis Eduardo Bautista Villalpando.

Editor information

Editors and Affiliations

University of Derby, United Kingdom
Zaigham Mahmood

About this chapter

Cite this chapter

Bautista Villalpando, L., April, A., Abran, A. (2014). DIPAR: A Framework for Implementing Big Data Science in Organizations. In: Mahmood, Z. (ed.) Continued Rise of the Cloud. Computer Communications and Networks. Springer, London. https://doi.org/10.1007/978-1-4471-6452-4_8

DOI: https://doi.org/10.1007/978-1-4471-6452-4_8
Published: 08 July 2014
Publisher Name: Springer, London
Print ISBN: 978-1-4471-6451-7
Online ISBN: 978-1-4471-6452-4
eBook Packages: Computer Science, Computer Science (R0)
