The CRISP-DCW Method for Distributed Computing Workflows

  • Conference paper
Research & Innovation Forum 2019 (RIIFORUM 2019)

Part of the book series: Springer Proceedings in Complexity ((SPCOM))

Abstract

Big data analysis is becoming a crucial part of many organizations, popularizing the distributed computing paradigm. Within the emerging research field of Applied Data Science, several notable methods help analysts and scientists create their analytical processes. However, no such methods are yet available for distributed computing problems. Therefore, to support data analysts, scientists and software engineers in creating distributed computing processes, we present the CRoss-Industry Standard Process for Distributed Computing Workflows (CRISP-DCW) method. The CRISP-DCW method lets users create distributed computing workflows by following a predefined cycle and using reference manuals, in which the critical elements of such a workflow are developed for the context at hand. Using our method’s reference manuals and predefined steps, data scientists can spend less time developing big data processing workflows, thus increasing efficiency. Results were evaluated with experts and found to be satisfactory. Therefore, we argue that the CRISP-DCW method provides a good starting point for applied data scientists to develop and document their distributed computing workflows, making their processes both more efficient and more effective.

Author information

Corresponding author

Correspondence to Marco Spruit.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Spruit, M., Meijers, S. (2019). The CRISP-DCW Method for Distributed Computing Workflows. In: Visvizi, A., Lytras, M. (eds) Research & Innovation Forum 2019. RIIFORUM 2019. Springer Proceedings in Complexity. Springer, Cham. https://doi.org/10.1007/978-3-030-30809-4_30

  • DOI: https://doi.org/10.1007/978-3-030-30809-4_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30808-7

  • Online ISBN: 978-3-030-30809-4

  • eBook Packages: Education, Education (R0)
