Abstract
Over the past few years, several solutions have been proposed for data storage, data management, and data retrieval systems. These solutions can process massive amounts of data stored in relational or distributed database management systems. In addition, decision-making analytics and predictive computational statistics are among the most common and well-studied fields in computer science. In this paper, we demonstrate the implementation of machine learning algorithms over an open-source distributed database management system that can run in parallel on a cluster. To accomplish this, we propose a system architecture scheme (e.g. Apache Spark) over Apache Cassandra. The paper also presents a survey of the most common machine learning algorithms and the results of experiments performed on a Point-of-Sale (POS) data set.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Sioutas, S. et al. (2017). Survey of Machine Learning Algorithms on Spark Over DHT-based Structures. In: Sellis, T., Oikonomou, K. (eds) Algorithmic Aspects of Cloud Computing. ALGOCLOUD 2016. Lecture Notes in Computer Science, vol 10230. Springer, Cham. https://doi.org/10.1007/978-3-319-57045-7_9
DOI: https://doi.org/10.1007/978-3-319-57045-7_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57044-0
Online ISBN: 978-3-319-57045-7
eBook Packages: Computer Science, Computer Science (R0)