Abstract
With the rapid development of information technology, recommender systems have become critical components to solve information overload. As an important branch, weighted hybrid recommender systems are widely used in electronic commerce sites, social networks and video websites such as Amazon, Facebook and Netflix. In practice, developers typically set a weight for each recommendation algorithm by repeating experiments until obtaining better accuracy. Despite the method could improve accuracy, it overly depends on experience of developers and the improvements are poor. What worse, workload will be heavy if the number of algorithms rises. To further improve performance of recommender systems, we design an optimal hybrid recommender system on Spark. Experimental results show that the system can improve accuracy, reduce execution time and handle large-scale datasets. Accordingly, the hybrid recommender system balances accuracy and execution time.
You have full access to this open access chapter, Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
Along with the popularization of the Internet, a sharp increase in the amount of data leads to information overload [1]. Thus, recommender systems [2] were proposed to relieve the stress of massive data. To improve recommender systems performance, researchers put forward the weighted hybrid method. Despite performance boost has been brought by the method, there are still several problems affecting performance, including weight setting and computation load. Hence, we implement a weighted hybrid recommender system on Spark. In the system, we design a new method to compute weights, using cluster analysis and user similarity. Besides, the execution time can be reduced by deploying the system on Spark.
1.1 Hybrid Recommender Systems
Hybrid recommender systems combine two or more recommendation algorithms to overcome weaknesses of each algorithm. It is generally classified as Switching, Mixed, Feature Combination, Meta-Level, and Weighted [3].
The weighted hybrid technique combines different algorithms with different weights [3]. The main idea is that the algorithm with better accuracy has a higher weight. At present, developers always set a weight for an algorithm manually and repeat experiments until achieving superior accuracy. Thus, the method depends on developers’ experience to determine accuracy of an algorithm in different datasets. Due to large-scale datasets, sparsity of rating data and the number of algorithms, it’s generally hard to obtain appropriate weights. Eventually the improvements of accuracy are poor.
In addition, to improve user experience, the system should return recommendation results efficiently. In other words, it has to quickly locate information which can appeal users in massive data. Thus, execution time is another evaluation standard of performance. However, the weighted hybrid technique needs to execute two or more algorithms and compute hybrid results, it’s tough to reduce execution time.
Apart from accuracy and execution time of the system, scalability is also an important consideration. With the increasing of data scale and the algorithm complexity, the system requires more storage space and computing resources. It’s difficult to meet the actual demand by only optimizing algorithms.
To address the above-mentioned issues, we design a hybrid recommender system on Spark. In the system, we propose an optimized method to improve accuracy. It computes weights and hybrid results based on cluster analysis and user similarity. Meanwhile, we deploy the system on Spark which is a fast and general engine for large-scale data processing [4] to accelerate the training process and improve scalability.
1.2 Work of Paper
The rest of this paper is organized as five sections. Section 2 reviews recommendation algorithms and introduces the Spark. Section 3 describes the design of the optimized method. Section 4 shows how we implement the system on Spark. Section 5 gives experimental results and our analysis. Section 6 presents our conclusions and future work.
2 Related Work
In this section, we first review and compare recommendation algorithms and recommender systems. Then, we briefly analyze predicting ratings of algorithms. Finally, we introduce the distributed computing platform Spark and compare Hadoop and Spark.
2.1 Recommender Systems
Recommendation algorithms are the basis of recommender systems. In this section, we first introduce several representative algorithms.
Collaborative recommendation is almost the most popular algorithm. Based on overlapped ratings, it computes similarities among users. And then, it uses similarities to predict the rating that the current user on an item [5]. Tapestry [6], Ringo [7] and GroupLens [8] are typical systems with the algorithm.
Content-based recommendation pays attention to connections between items. It analyses descriptions of items that have been rated by users [9] and calculates similarities between items. The representation of an item’s feature and the way to classify a new item are two important sub-problems [9].
Demographic-based recommendation [9] is a simple algorithm. It focuses on types of users that like a certain item. The technique identifies features of users such as age, gender, nationality, education, etc. It measures user similarity by taking those features into consideration. Table 1 shows strengths and weaknesses of each algorithm [5, 9].
As the simple and effective technique, the weighted hybrid recommender system has been widely used in numerous fields. P-Tango and Pazzani are two typical systems. P-Tango is an online news system. It combines collaborative and content-based recommendation algorithms. The system adjusts weights of algorithms in the process of operation. Until the system obtains the expected accuracy, it determines weights. Pazzani is the other weighted hybrid recommender system. It combines collaborative, content-based and demographic-based recommendation algorithms. The system uses voting to determine recommendation results.
2.2 Weight Analysis
As previously described in Sect. 1, we give the formalized representation of the weighted hybrid technique as follows:
where j represents the j’th algorithm, it ranges from 1 to n. \(\alpha _j\) corresponds to the weight of the j’th algorithm. \(r_{ui}^{j}\) is the predicting rating of user u on item i by the j’th algorithm. \(\widetilde{R_{ui}}\) indicates the final hybrid result. From the formula (1), we can recognize that each algorithm just has a certain weight. That means the technique presupposes that predicting ratings of an algorithm are all greater or less than their ratings. However, this condition evaluates to false. Here we give some empirical evidence. We implement the User-based Collaborative Filtering (User-CF) and the Alternating Least Squares (ALS) in Python2.7, and use MovieLens-100K as observed data.
In the Table 2, countH is the number of predicting ratings which are greater than real ratings. The countL is less than real ratings and countE is equivalent amounts. From the empirical results, we know that:
-
(1)
In these algorithms, there are little predicting ratings that equal to ratings.
-
(2)
A part of predicting ratings are greater than ratings, and another are less than ratings.
-
(3)
Only a weight for an algorithm may affect accuracy.
Thus, it is essential to optimize weights.
2.3 Spark
Spark is a fast and general-purpose cluster computing platforms for large-scale data processing [4] which is developed by UC Berkeley. In the environment of Spark, it includes Spark SQL [10], Spark Streaming [11], Mllib [12], GraphX [13], etc. Based on resilient distributed dataset (RDD) [14], it achieves memory-based computing, fault tolerance and scalability. Currently, Spark is deployed in Amazon, ebay and Yahoo! to process large-scale datasets.
For a hybrid recommender system, performance is affected by data scale, the number of algorithms and the complexity of algorithms. Deploy the system on Spark can mitigate above affects.
-
(1)
In the system, large-scale datasets could be stored in distributed storage.
-
(2)
Algorithms are independent with each other, they are supposed to be performed in parallel.
-
(3)
Intermediate values can be cached in memory to decrease execution time.
Therefore, in this paper, we design an optimized hybrid recommender system on Spark.
3 Design Overview
The empirical evidence from Sect. 2 suggests that accuracy still has chance to be improved. The predicting ratings are higher or lower than corresponding ratings. Thus, we use cluster analysis to obtain more accurate weights. The principle of cluster analysis is that according to the properties of samples, using mathematical methods to determine relationship between samples, and according to the relationship to cluster samples. Based on cluster analysis, we present an optimized method for calculating personalized weights. Now let us discuss the method in detail.
3.1 Objective Function
In this section, we first give explanations of several concepts. In the following statement:
-
(1)
Assume that there are n algorithms in the system and j is the j’th algorithm.
-
(2)
u for user, i for item and (u,i) represents the data item of u and i.
-
(3)
\(R_{ui}\) is the rating of u on i, \(r_{ui}^{j}\) is the predicting rating of u on i which is computed by the j’th algorithm.
-
(4)
For the j’th algorithm, the error between the rating and the predicting rating is: \(D_{ui}^{j} = R_{ui} - r_{ui}^{j}\). In order to reduce \(\sum _{j=1}^{n} \sum _{u,i}D_{ui}^{j}\), similar errors are expected to get same weights. Based on errors, we divide (u,i) into k clusters and design \(\varvec{C_{ui}} = (c_{1}, c_{2}, \cdots , c_{k})\) to reflect the cluster of (u,i). For the j’th algorithm, \(\varvec{\alpha _{j}} = (\alpha _{j1}, \alpha _{j2}, \cdots , \alpha _{jk})\) represents k weights of the algorithm. \(\varvec{\alpha _{j}}\) \(\varvec{C_{ui}^{T}}\) finally determines the weight for \(r_{ui}^{j}\).
According to our analysis, we define the objective function as formula (2):
3.2 Weight Calculation
According to \(D_{ui}^{j}\), the optimized method classifies all (u,i) into k clusters. For each (u,i), it has a vector \(\varvec{C_{ui}} = (c_{1}, c_{2}, \cdots , c_{k})\) and is initialized to \(\varvec{C_{ui}} = (0, 0, \cdots , 0)\). The value which corresponds to (u,i)’s cluster is set to 1. For instance, if (u,i) belongs to the cluster 2, \(\varvec{C_{ui}} = (0, 1, 0, \cdots , 0)\). The weight for \(r_{ui}^{j}\) is \(\alpha _{j2}\) which is computed by \(\varvec{\alpha _{j}}\varvec{C_{ui}}^{T}\). Therefore, \(\varvec{C}\) could map weights to predicting ratings and achieve multiple weights for an algorithm. Figure 1 shows the pipeline of the method.
After calculating \(\varvec{C}\), the optimized method requires to compute \(\varvec{\alpha _{j}}\). For the purpose of minimizing the objective function, we make use of the Lagrange theory and minimum theory [15, 16]. Based on formula (2), the method constructs the Lagrange function \(L(\varvec{\alpha })\).
For each j, let \(\frac{\partial {L}}{\partial {(\varvec{\alpha _{j}} \varvec{C_{ui}}^{T})}} = 0\). We can get an equation:
The Eq. (6) can be represented by matrix:
Thus the weight matrix X can be calculated by
The optimized method uses ratings which have already stored in the system to compute weights. However, these weights aren’t entirely appropriate for a new (u,i). We further introduce user similarity to compute weights. The user similarity is computed by cosine similarity:
where \(sim_{u,v}\) is the similarity between u and v. \(N_{(u)}\) means the number of items that u have rated. \(N_{(v)}\) is the same as \(N_{(u)}\). For the \((u^{'},I^{'})\), the optimized method calculates the hybrid result as:
The Eq. (10) is able to filter the interference of non similar weights and get a personalized weight for the \((u^{'},I^{'})\).
4 Implementation
According to the design overview, we deploy the hybrid recommender system on Spark. The system contains data storage, prediction, cluster, weight, model fusion and recommendation, totally 6 modules. Figure 2 shows the architecture of the system.
4.1 Modules
Data storage module is the basis of the system. It stores input data, including historical data and ratings. We use HDFS which is a distributed file system to store raw data [17]. The pre-processed data are put in the database such as HBase, Redis, Hive, etc. [18–20]. Topside modules read data from the database. Prediction module is used to compute predicting ratings. It performs recommendation algorithms in parallel. Outputs are predicting ratings.
The cluster module concentrates on errors of (u,i). It exploits k-means to classify (u,i). Output of the module is \(\varvec{C}\). The weight module accepts \(\varvec{C}\) to compute weights. With \(\varvec{C}\) and \(\varvec{\alpha }\), the module can get a weight for each \(r_{ui}^{j}\). Output of it is \(\varvec{\alpha }\).
The model fusion calculates hybrid results based on predicting ratings, \(\varvec{C}\), \(\varvec{\alpha }\) and user similarity. According to these parameters, it determines hybrid results by logistic regression [21]. Recommendation is used to recommend items for users. Based on hybrid results, it generates recommendation lists. Besides, it also outputs an evaluation for results.
4.2 Discussion
In the hybrid recommender system on Spark, data are translated into RDDs. Because of the characteristics of memory-based computing and parallel operations, RDDs can be processed in parallel to reduce execution time. The read-only and fault tolerance of RDD make the system more reliable. Besides, due to the distributed storage of Spark, the system is able to handle large-scale datasets. It improves scalability of the system. Therefore, deploy the hybrid recommender system on Spark could decrease execution time and further improve scalability.
5 Performance
5.1 Evaluation Index
Accuracy. The system accuracy is measured by root mean square error (RMSE) [22]. It is defined as:
where \(R_{ui}\) and \(\hat{r_{ui}}\) is the rating and the hytbrid result that u on i respectively. |T| denotes the number of \(\hat{r_{ui}}\).
Execution Time. The execution time includes time of algorithms, clustering, calculating weights and hybrid results. It is measured in minutes.
5.2 Experimental Setup
In this experiment, we choose Spark as our platform. All experiments were performed using a local cluster with 7 nodes (1 master and 6 worker nodes): each node has Xeon(R) dual-core 2.53Â GHz processor and 6Â GB memory.
Dataset. In Table 3, we list datasets that were used in the experiment. For each dataset, we divide it into 2 training sets and a test set randomly.
Algorithms. We implement 3 recommendation algorithms: User-CF, Item-based Collaborative Filtering (Item-CF) and ALS. We perform them in training sets and test sets to compute predicting ratings, weights and hybrid results.
Nodes. We compare execution time of the stand-alone system and the distributed system. For the former, we use the server with Xeon(R) dual-core 2.53Â GHz processor and 6Â GB memory. For the latter, we use a local cluster with 7 nodes (1 master and 6 worker nodes): each node has Xeon(R) dual-core 2.53Â GHz processor and 6Â GB memory.
5.3 Performance Comparision
In this section, we evaluate performance of the hybrid recommender system on Spark, including accuracy and execution time.
Figure 3 shows impacts of data scales on accuracy. In the experiment, we performe the combination of User-CF and ALS on four MovieLens datasets. In the Fig. 3, with the increasing of data scale, RMSE generally decreases. Due to the sparsity of MovieLens-700K, the hybrid recommender system obtains the best result. Compare with User-CF and ALS, the system improves accuracy of 8.21
Figure 4 gives the RMSE of different types of datasets. In the experiment, we performe the combination of User-CF and ALS on MovieLens-400K and BookCrossing. The Fig. 4 shows that the hybrid recommender system can improve accuracy of different types of datasets. And there are significant improvements on BookCrossing. The improvements demonstrate that the system is available for sparse datasets.
Figure 5 shows correlations between accuracy and combinations of algorithms. In the experiment, four combinations of algorithms are performed on MovieLens-100K, MovieLens-400K and BookCrossing respectively. In Fig. 5, the hybrid recommender system obtains better accuracy than single algorithm. When accuracy of single algorithm is favorable, the hybrid recommender system also obtains better accuracy.
Figure 6 compares execution time of stand-alone mode and local cluster mode. The experiment performs the combination of User-CF and ALS on MovieLens-100K to MovieLens-1M. For the stand-alone system, execution time increases sharply with the expansion of data scale. However, execution time of local cluster mode remains relatively constant. When the data scale is larger than MovieLens-900K, the stand-alone mode couldn’t handle it. The local cluster mode could handle MovieLens-10M or larger datasets. From Fig. 6, we can recognize that memory-based computing, parallel operations and distributed storage of Spark are helpful to decrease execution time and improve scalability.
6 Conclusion and Future Work
Improving performance of recommender systems is a crucial solution for information overload. This paper designs a new weighted hybrid recommender system to solve this problem. We are the first to compute weights by using cluster analysis, user similarity and minimum theory. Besides, we deploy the hybrid recommender system on Spark. The system improves accuracy by optimizing weights and reduces execution time by memory-based computing and parallel operations. And distributed storage of the system is helpful to improve scalability. The experiment results demonstrate the performance of our hybrid recommender system.
In future work, we will consider to improve and extend the system: expansion of algorithm to process more complex scenes. Further research on factors influencing weights to improve accuracy. Meanwhile, optimize the implementation of the system on Spark.
References
Eppler, M.J., Mengis, J.: The concept of information overload: a review of literature from organization science, accounting, marketing, mis, and related disciplines. Inf. Soc. 20(5), 325–344 (2004)
Cosley, D., Lam, S.K., Albert, I., Konstan, J.A., Riedl, J.: Is seeing believing?: how recommender system interfaces affect users’ opinions. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 585–592. ACM (2003)
Burke, R.: Hybrid recommender systems: survey and experiments. User Model. User-Adap. Inter. 12(4), 331–370 (2002)
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud 10, 10 (2010)
Burke, R.: Hybrid systems for personalized recommendations. In: Mobasher, B., Anand, S.S. (eds.) ITWP 2003. LNCS (LNAI), vol. 3169, pp. 133–152. Springer, Heidelberg (2005)
Goldberg, D., Nichols, D., Oki, B.M., Terry, D.: Using collaborative filtering to weave an information tapestry. Commun. ACM 35(12), 61–70 (1992)
Shardanand, U., Maes, P.: Social information filtering: algorithms for automating word of mouth. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 210–217. ACM Press/Addison-Wesley Publishing Co. (1995)
Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: Grouplens: an open architecture for collaborative filtering of netnews. In: Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, pp. 175–186. ACM (1994)
Pazzani, M.J.: A framework for collaborative, content-based and demographic filtering. Artif. Intell. Rev. 13(5–6), 393–408 (1999)
Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A., et al.: Spark SQL: relational data processing in spark. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1383–1394. ACM (2015)
Zaharia, M., Das, T., Li, H., Shenker, S., Stoica, I.: Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. Presented as Part of the (2012)
Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen, S., et al.: MLlib: machine learning in apache spark. arXiv preprint arXiv:1505.06807 (2015)
Xin, R.S., Gonzalez, J.E., Franklin, M.J., Stoica, I.: GraphX: a resilient distributed graph system on spark. In: First International Workshop on Graph Data Management Experiences and Systems, p. 2. ACM (2013)
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, p. 2. USENIX Association (2012)
Borneas, M.: On a generalization of the lagrange function. Am. J. Phys. 27(4), 265–267 (1959)
Mitra, N.J., Nguyen, A.: Estimating surface normals in noisy point cloud data. In: Proceedings of the Nineteenth Annual Symposium on Computational Geometry, pp. 322–328. ACM (2003)
Borthakur, D.: The hadoop distributed file system: architecture and design. Hadoop Project Website 11(2007), 21 (2007)
Zhang, D.W., Sun, F.Q., Cheng, X., Liu, C.: Research on hadoop-based enterprise file cloud storage system. In: 2011 3rd International Conference on Awareness Science and Technology (iCAST), pp. 434–437. IEEE (2011)
Han, J., Haihong, E., Le, G., Du, J.: Survey on NoSQL database. In: 2011 6th International Conference on Pervasive Computing and Applications (ICPCA), pp. 363–366. IEEE (2011)
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow. 2(2), 1626–1629 (2009)
Tsukimoto, H.: Logical regression analysis: from mathematical formulas to linguistic rules. In: Chu, W., Lin, T.Y. (eds.) Foundations and Advances in Data Mining. SFSC, vol. 180, pp. 21–61. Springer, Heidelberg (2005)
Willmott, C.J., Matsuura, K.: Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Res. 30(1), 79 (2005)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 IFIP International Federation for Information Processing
About this paper
Cite this paper
Wang, J. et al. (2016). A Fast and Better Hybrid Recommender System Based on Spark. In: Gao, G., Qian, D., Gao, X., Chapman, B., Chen, W. (eds) Network and Parallel Computing. NPC 2016. Lecture Notes in Computer Science(), vol 9966. Springer, Cham. https://doi.org/10.1007/978-3-319-47099-3_12
Download citation
DOI: https://doi.org/10.1007/978-3-319-47099-3_12
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47098-6
Online ISBN: 978-3-319-47099-3
eBook Packages: Computer ScienceComputer Science (R0)