Using Software Visualization for Supporting the Teaching of MapReduce
The growing number of cybersecurity threats is fueling the development of new detection and countermeasure techniques based on the analysis of Big Data. In this setting, the MapReduce paradigm has quickly become the de facto standard for carrying out such processing. This has led to a surge in the number of job openings requiring this skill. Moreover, a significantly increasing number of computer science courses now cover this paradigm as well as its most popular implementations, Spark and Hadoop.
This paper presents a solution for supporting the teaching of MapReduce through the use of software visualization. The proposed solution has two main goals. The first is to help students understand how the MapReduce paradigm solves a complex problem by decomposing it into simpler subproblems, each of which is solved by means of map and/or reduce operations. The second is to show how an input dataset is partitioned into blocks and processed in parallel by the different computing units of a distributed computing system. In both cases, software visualization techniques with proper graphical metaphors help students understand what is going on, by providing a graphical representation that, on the one hand, describes how the considered algorithm works on an input dataset and, on the other hand, illustrates the speed-up achieved thanks to the distributed approach.
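The two ideas above (decomposition into map and reduce steps, and block-wise parallel processing) can be illustrated with a minimal word-count sketch in plain Python. This is not the visualization system proposed in the paper, nor Spark or Hadoop code: the function names (`split_into_blocks`, `map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the block size are hypothetical, and a local thread pool merely stands in for the computing units of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, block_size):
    """Partition the input dataset into fixed-size blocks (hypothetical helper)."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def map_phase(block):
    """Map step: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(mapped_blocks):
    """Group the values emitted by all map tasks by their key."""
    groups = defaultdict(list)
    for pairs in mapped_blocks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the counts collected for each word."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, block_size=2):
    blocks = split_into_blocks(lines, block_size)
    # Each block is mapped independently; the thread pool is only an
    # assumption standing in for the nodes of a distributed system.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, blocks))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["a b a", "b c", "a"], block_size=2))
```

Because the map tasks do not share state, the blocks could be processed on separate machines without changing the result, which is the property the visualization aims to make visible.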
Keywords: Big data security, MapReduce, Spark, Software visualization
We are thankful to Francesco Palini for his help in developing a prototype of the proposed visualization system.
This work was supported in part by University of Rome - “La Sapienza” under project “Analisi, sviluppo e sperimentazione di algoritmi sperimentalmente efficienti”.
It was also supported in part by INdAM - GNCS under project “Algoritmi e tecniche efficienti per l’organizzazione, la gestione e l’analisi di Big Data in ambito biologico” (2017) and project “Elaborazione ed analisi di Big Data modellati come grafi in vari contesti applicativi” (2018).