Appraising SPARK on Large-Scale Social Media Analysis

  • Loris Belcastro
  • Fabrizio Marozzo
  • Domenico Talia
  • Paolo Trunfio
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10659)

Abstract

Software systems for social media analysis provide algorithms and tools for extracting useful knowledge from user-generated social media data. ParSoDA (Parallel Social Data Analytics) is a Java library for developing parallel data analysis applications that extract such knowledge. The library aims to reduce the programming skills needed to implement scalable social data analysis applications. This work describes how ParSoDA has been extended to execute applications on Apache Spark. On a cluster of 12 workers, the Spark version of the library reduces the execution time of two case study applications that exploit social media data by up to 42% compared with the Hadoop version.

Keywords

Social data analysis, Scalability, Spark, Cloud computing, Parallel library, Big Data

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Loris Belcastro (1)
  • Fabrizio Marozzo (1)
  • Domenico Talia (1)
  • Paolo Trunfio (1)
  1. DIMES, University of Calabria, Rende, Italy
