Survey of Machine Learning Algorithms on Spark Over DHT-based Structures

  • Conference paper
Algorithmic Aspects of Cloud Computing (ALGOCLOUD 2016)

Abstract

Over the past few years, several solutions have been proposed for data storage, data management, and data retrieval. These solutions can process massive amounts of data stored in relational or distributed database management systems. In addition, decision-making analytics and predictive computational statistics are among the most common and well-studied fields in computer science. In this paper, we demonstrate the implementation of machine learning algorithms over an open-source distributed database management system that can run in parallel on a cluster. To accomplish this, we propose a system architecture that layers a cluster computing framework (e.g. Apache Spark) over Apache Cassandra. The paper also presents a survey of the most common machine learning algorithms and the results of experiments performed on a Point-Of-Sales (POS) data set.

Author information

Corresponding author

Correspondence to Andreas Kanavos.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Sioutas, S. et al. (2017). Survey of Machine Learning Algorithms on Spark Over DHT-based Structures. In: Sellis, T., Oikonomou, K. (eds) Algorithmic Aspects of Cloud Computing. ALGOCLOUD 2016. Lecture Notes in Computer Science, vol 10230. Springer, Cham. https://doi.org/10.1007/978-3-319-57045-7_9

  • DOI: https://doi.org/10.1007/978-3-319-57045-7_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-57044-0

  • Online ISBN: 978-3-319-57045-7

  • eBook Packages: Computer Science; Computer Science (R0)
