A novel web page recommender using data automatic clustering and Markov process
Abstract
As the web expands and networked data grow, finding useful knowledge on the World Wide Web has become a major challenge. Recommender systems are intelligent systems that help users find their favorite resources among a large body of information. This research aimed at providing a new page recommender system that enhances the accuracy of the suggested pages based on user interest, an Automatic Clustering-based Genetic Algorithm (ACGA), and a modified all 3rd-order Markov model. The frequency and duration of page visits by each user have been extracted from the log file of the web server, and the user's interest in a page during a session has been computed as a linear combination of the two. ACGA involves 2 clustering phases based on (1) a similarity measure and (2) a genetic algorithm. The novel clustering partitions the vectorized sessions into separate clusters through a new fitness function. Moreover, the modified all 3rd-order Markov model uses all orders of the Markov model simultaneously to predict the next page in the recommender system. The experiments have been conducted on the real CTI dataset. Experimental results and their comparison with other approaches showed the superior accuracy of the proposed system.
Keywords
Recommender system (RS), User interest, Automatic clustering, Genetic algorithm, Markov process
1 Introduction
Given the huge volume of information resources, the World Wide Web is a major environment for distributing information. Studies conducted so far show that information overload occurs whenever a search is done in a browser. Recommender systems (RSs) manage this information for the user's access: an RS is a tool that offers useful products, out of many possible options, to its users [25]. One of the most fundamental goals of recommender systems is to collect various kinds of information about users' interests and the items in the system. There are different resources and methods for gathering such information, including explicit and implicit methods [19]. In the explicit method of information collection, users state their interests themselves. The implicit method, in contrast, is somewhat harder: the system must infer users' interests by monitoring their behaviors and activities [5]. Some researchers divide RSs into 5 main categories: content-based, collaborative filtering, demographic filtering, knowledge-based, and hybrid filtering techniques [2]. These systems are used by most large companies, such as Facebook and Google [28].
Clustering is among the most widely used algorithms in RSs [10]. A clustering technique partitions session data into a number of similar clusters, which influences RS performance. In a web browser, a user's activity in navigating the desired pages is expressed as a session; hence, a user session comprises a number of pages that show the user's sequential pattern, and modeling user sessions is performed by clustering. Many RSs cluster sessions with the K-means technique or modified versions of it [1, 8, 20, 23, 30, 31, 32]. These algorithms require the number of clusters K as an input and then randomly choose K initial seeds (cluster centers) from the dataset. However, guessing the number of clusters K in a big dataset with no prior knowledge is difficult, and the result is highly sensitive to the quality of the initial seeds. Therefore, this research proposes a novel clustering method that finds the number of clusters automatically and uses high-quality initial seeds. One main contribution of the novel RS is the vectorization of web user sessions using a linear combination: web server log files carry important information about the user's surfing, and this research uses the frequency and duration of page views to calculate user interest. Another major contribution is the GA-based session clustering, called ACGA, which finds the clusters automatically. Cast as an optimization problem, ACGA seeks the best value of a validation index and yields the most acceptable result. With the help of a novel fitness function, the research finds clusters with high internal similarity; this function serves both as the fitness function and as a cluster-evaluation technique.
The fitness function favors a clustering solution with compact clusters and large separations (gaps) among them. However, since the GA takes a long time to run on big data, this research proposes a similarity-based clustering step to find the initial clusters; the GA then uses the average of the data in each initial cluster as its input. The output of ACGA, however, is a set of clusters that carries no information about the probability of future page views. Thus, sequential-pattern discovery models, such as the Markov Model (MM), are needed to predict web pages. Markov models are a method for discovering sequential patterns, and researchers have used various orders of MM for predicting the next pages in RSs [21, 30]. They found that low-order MMs (1st or 2nd order) have low coverage and accuracy in prediction, while high orders require large memory and have numerous states. The all k-th order MM overcomes these problems. The authors of [7, 16, 22] used this model in the prediction phase, but they did not use the different orders of the MM simultaneously. Hence, the present research uses a modified all k-th order MM that increases prediction accuracy by exploiting all orders of the MM concurrently, reinforcing the probability of pages that occur in every order of the MM.
This research compared its novel RS with the Harmony Session Clustering Recommender (HSCR) [8], Harmony K-means Session Clustering Recommender (HKSCR) [8], Interleaved Harmony K-means Session Clustering Recommender (IHKSCR) [8], K-means with all 3rd-order MM (KMM), and K-means with MM and Popularity and Similarity-based Page Rank (KMMPSPR) [30] on the CTI dataset under 3 evaluation criteria: accuracy, coverage, and F-measure. The proposed RS outperforms the other techniques on the CTI dataset for all these criteria. Moreover, the results of ACGA have been compared with the Automatic Clustering based on Differential Evolution (ACDE) [29] and K-means [14] algorithms through the Chou-Su (CS) index [4] and the novel cluster-evaluation criterion Compactness and Separation Measure of Clustering (CSMC), which indicated its high performance.
Section 2 reviews and analyzes the previous studies in the field. Section 3 deals with the proposed algorithm and its steps. Section 4 discusses time complexity of the proposed algorithms. Section 5 evaluates and analyzes the output of the proposed recommender system. Section 6 concludes the research and provides a number of suggestions for further research.
2 Literature review
Clustering web sessions is one of the important parts of web usage mining, a technique to discover clusters from web data. Web sessions are represented by two models, vector and non-vector. Many authors used a vector model with binary representation for web sessions, while several web session clustering works used a non-vector model. Discovering the sequential patterns enclosed in the sessions has been used to predict the probability of page views with various Markov models. One main goal of this research has been to propose a weighted vector model for sessions: since frequency and page visiting time are indicators of user interest, their combination can improve recommendation accuracy. Another main objective has been to design an automatic session clustering and to use a modified all k-th order MM for predicting pages. Sessions are clustered through ACGA, cast as an optimization problem that searches the solution space for an optimum of the objective function while clustering the sessions. Each solution in the population has a fitness value derived from the function to be optimized, and this research takes a cluster-evaluation criterion as the fitness function. The CSMC method is defined as a new validation index covering both the compactness and the separation of the clusters. As a sequential decision process, the recommendation step is implemented by the modified all k-th order MM, which combines page probabilities across different orders of the MM and recommends the pages with high probability. Below, the research reviews the literature on web page recommender systems, clustering techniques based on metaheuristic algorithms, and Markov models.
2.1 Web page recommender systems
The RS is a kind of decision-support and intelligent system. It can be used to suggest the web pages to be visited next on a large website, guiding web users to the relevant information they need. This type of RS is called a web page recommender system. These systems use various techniques to recommend pages to a user.
There are numerous studies on recommender systems that employed K-means algorithms for clustering sessions [1, 24, 31, 32]. Researchers used them for their simplicity; however, on a huge dataset they are generally highly sensitive to the number of clusters and to the initial seeds.
Selvi et al. [27] proposed a new RS that builds on the collaborative filtering approach in “A novel optimization algorithm for recommender system using modified fuzzy c-means clustering approach”. The rating users have been clustered with a minimal error rate using a proposed modified fuzzy c-means (MFCM) clustering approach, and the users in each cluster have been further optimized through the proposed modified cuckoo search (MCS) algorithm in fewer iterations. By combining MCS with MFCM, the presented RS reduced the recommendation error rate and provided a list of recommendations with high accuracy.
In their research “Web user session clustering using modified K-means algorithm”, Poornalatha et al. [23] presented a new method for clustering web users' sessions. Since user sessions have variable lengths, a new distance criterion, the Variable Length Vector Distance (VLVD), has been defined, and the K-means method modified accordingly for clustering web sessions. This approach needed fewer iterations than regular K-means, and the sessions were placed in appropriate clusters according to their similarity. However, the two problems of the K-means technique mentioned above have not been addressed in this research.
Mishra et al. [17] conducted a study titled “A web recommendation system considering sequential information” and presented a recommender system using the Sequence and Set Similarity Measure (S^{3}M) similarity function and Singular Value Decomposition (SVD). After identifying the different users, their clicks have been turned into separate sequences, which have then been clustered via the S^{3}M similarity function. Upon the entrance of a new user, the M most similar clusters are found first, and a response matrix is formed from these clusters. A weight vector for the current user is created according to the positions of their pages. Before suggesting pages, the SVD algorithm first reduces the dimensionality, after which the pages with the highest weights are displayed as the output of this step. Mishra et al. thus considered sequential information in web navigation along with content information.
The authors of “Web page access prediction based on an integrated approach” provided a recommender system based on KMMPSPR [30]. The main idea of that research has been to solve the Markov model's problem of providing suggestions when several pages have the same probability. After cleaning the data and identifying users and sessions, the algorithm used K-means for clustering the sessions, and then 1st- and 2nd-order Markov models have been applied to each cluster. When the pages identified by the Markov model were likely to be identical, PSPR has been applied and the algorithm decided which page to propose based on their ranks. The limitation of this research is that it considers only low orders of the Markov model for predicting pages.
KMM is a method that uses K-means and the all 3rd-order MM for recommendation. In this algorithm, the user sessions are initially clustered by the K-means algorithm; then, MMs of orders 1 to 3 are applied to each of the clusters. In the prediction phase, the best cluster is determined upon the entrance of a new user, and the pages are suggested using the all 3rd-order Markov model. This model vectorizes sessions using only frequency. Moreover, the number of clusters has to be given by the user as an input to K-means, whereas the novel RS uses the information in the log file and does not require the number of clusters to be defined.
In their paper “An effective web page recommender using binary data clustering”, Forsati et al. [8] proposed a new recommender system called harmonic session clustering. The main contribution of that research is clustering user sessions, represented as zero-one vectors, with the harmony search optimization algorithm. Three algorithms, HSC, HKSC, and IHKSC, have been presented. In HSC, users' sessions have been clustered by the harmony search algorithm, with the least-squares error as the objective function. In HKSC, the centers have been introduced as problem solutions and the K-means algorithm has been run on each of them, with the best response ultimately reported as the top solution. IHKSC consists of two stages: first, clustering is performed by the HSC algorithm for N iterations; then, the best generation is taken from the harmonic algorithm as the cluster centers and given as input to the K-means algorithm, which performs the clustering. The criteria for assessing the clustering have been the Average Distance of Sessions to the Cluster (ADSC) and Visit-Coherence (VC), while the recommender system has been evaluated by accuracy, coverage, and F-measure. In each algorithm, the number of clusters has been given as input, and instead of using the information in the log file, they only considered whether or not a page has been seen by the user.
The authors of “Personal recommender system based on user interest community in social network model” [3] provided a novel time-weighted score matrix, in which the users and items with higher correlation are clustered into the same community by using difference equations. First, users' interest is calculated based on a rounding-forgetting function, with the time and score matrices as its inputs. Second, a difference equation is applied to follow the clustering evolutionary process and group the users and items into several communities. Finally, a recommendation list is proposed according to the predicted scores. The results indicated that the system could improve on collaborative-filtering-based recommender systems.
2.2 Clustering technique based on Metaheuristic algorithm
Maulik et al. [15] published “Genetic algorithms-based clustering technique” and designed a genetic algorithm for clustering data. The number of clusters is received as an input from the user, and K cluster centers are randomly chosen from the dataset; each chromosome in the population represents a set of cluster centers. The researchers used the inverse of the Sum of Squared Errors (SSE) as the fitness function. The method has been run on artificial and real datasets and compared with the K-means algorithm, showing more acceptable results.
Lin et al. [11] reported “An efficient GA-based clustering technique” and proposed a GA-based clustering technique that chooses cluster centers directly from the dataset, with the number of clusters defined by the user. The length of a chromosome equals the size of the dataset, and the ith gene of the chromosome represents the ith data point. For a data point of index i to be a candidate cluster center, the value of the corresponding ith gene in the chromosome is set to “1”; otherwise, it is set to “0”. Each chromosome in the population is evaluated by the inverse of the Davies–Bouldin (DB) measure as the fitness function. This algorithm was tested on an artificial dataset.
Nonetheless, the techniques of Maulik et al. [15] and Lin et al. [11] share two limitations: they require the number of clusters as a user input, and they select the initial seeds randomly from the dataset.
Rahman et al. [26] presented a new clustering technique in “A hybrid clustering technique combining a novel genetic algorithm with K-means”, called GenClust, which combines K-means and a GA. It aims at reaching higher-quality clusters without requiring the number of clusters as a user input. The genes of the initial population are found by the genetic algorithm both randomly and deterministically. GenClust succeeded in automatically finding the right number of clusters and identifying the right genes via a novel initial-population determination approach. To achieve an even higher-quality clustering, the resulting centers have been given to K-means as starter seeds, permitting the adjustment of the initial seeds as required.
GenClust suffers from the following limitations. First, its initial-population selection procedure has a time complexity of O(n^{2}), which may be problematic for large datasets. Second, it uses a set of user-defined cluster radii to obtain the initial population, whereas actual clusters may well have radii that vary from one dataset to another, likely depending on several factors, including the dataset's dimension.
Swagatam et al. [29] developed a novel automatic clustering method called ACDE in their study “Automatic clustering using an improved differential evolution algorithm”. This method employs activation thresholds (control genes) in the range between 0 and 1 to mark centers as active or inactive, an idea that finds the number of clusters automatically. Moreover, the cluster centroids are randomly initialized between X_{max} and X_{min}, the highest and lowest values of any attribute of the dataset under test, and the solutions are evaluated with the DB and CS measures. According to the results, this method is capable of finding the number of clusters across various runs; however, the accuracy of the clusters is not perfect.
Thus, for evaluating the clustering strategies, the analysis has been done on a standard dataset, namely Wifi-localization.^{1}
Comparison of clustering methods for Wifi-localization dataset
2.3 Markov model
Markov models are robust mathematical models suggested for discovering sequential patterns and for studying and understanding random processes, and they have also proven useful for predicting web pages. Nonetheless, low-order (1st or 2nd order) Markov models cannot precisely predict the next page the user will visit, because they do not look deeply enough into the user's history. Higher-order Markov models, in turn, suffer from a large number of states and low coverage.
“Mining longest repeating subsequences to predict World Wide Web surfing” has been reported by Pitkow et al., who employed the all k-th order MM to resolve the problem of low-order MMs [22]. This technique first applies the highest-order model for prediction; if that order cannot cover the case, the process is repeated with a lower order. The model does not utilize the information of all orders simultaneously.
Moreover, Mamoun et al. [16] designed a new modified MM in their study “Prediction of user's web-browsing behavior: application of Markov model”, where sessions consisting of the same pages are considered the same regardless of order. The researchers argued that an action on the web may be performed by various paths irrespective of the order the users select. In addition, they decreased the dimension of the prediction model by eliminating sessions in which pages are repeated. This reduced the size of the model without directly affecting its accuracy.
In addition, Dhyani et al. published an article titled “Modelling and predicting web page accesses using Markov processes”. Their assumption has been that the probability of viewing pages does not change with time in the MM [7]. Therefore, the matrices of orders 2 to k are obtained by raising the first-order matrix to the corresponding power, after which the all k-th order MM is employed to suggest pages to the user. Dhyani et al.'s study suffered from a major problem related to the way the probabilities are calculated; moreover, they did not consider the sequence of viewed pages.
3 The proposed algorithm
3.1 Data pre-processing
Web server logs are regarded as one of the main sources of information in the field of web mining. Extensive studies have considered preparing, collecting, and integrating these data sources for different analyses. Preparing the data raises certain challenges, which have led to algorithms and rational techniques for initial processing, including data combination and cleaning, user and session identification, and identification of the viewed pages [6]. After cleaning the data, user identification is the most significant step of data processing; once the users have been identified, the user sessions must be identified. A user session is a set of pages visited by a user during a certain visit to the website [9].
3.2 Making session vectors
If P = {p_{1}, p_{2}, …, p_{n}} is the set of pages of a website, each associated with a unique URL, then S = {s_{1}, s_{2}, …, s_{m}} is the set of web users' sessions and every s_{i} ∈ S is a subset of P. Each session s_{i} is represented by an n-dimensional vector s_{i} = {w(p_{1}, s_{i}), w(p_{2}, s_{i}), …, w(p_{n}, s_{i})}, where w(p_{j}, s_{i}) is the weight defined for the j-th web page visited in session s_{i}.
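As an illustration, the weighting above can be sketched as follows. The paper only states that the weight is a linear combination of visit frequency and viewing time; the normalization scheme, the parameter `alpha`, and the function name below are assumptions for the sketch.

```python
def session_vector(session, all_pages, alpha=0.5):
    """Map one session to an n-dimensional weight vector w(p_j, s_i).

    session:   list of (page, duration_seconds) click records.
    all_pages: ordered list of the n pages P = {p_1, ..., p_n}.
    alpha:     assumed mixing weight of the linear combination.
    """
    freq = {p: 0 for p in all_pages}
    time = {p: 0.0 for p in all_pages}
    for page, duration in session:
        freq[page] += 1
        time[page] += duration

    max_freq = max(freq.values()) or 1      # avoid division by zero
    max_time = max(time.values()) or 1.0
    # Linear combination of normalized frequency and normalized duration.
    return [alpha * freq[p] / max_freq + (1 - alpha) * time[p] / max_time
            for p in all_pages]

# Example session: p1 viewed twice (50 s total), p2 once (10 s), p3 never.
vec = session_vector([("p1", 30), ("p2", 10), ("p1", 20)], ["p1", "p2", "p3"])
```

A page never visited in the session receives weight 0, and the most-revisited, longest-viewed page receives weight 1 under this normalization.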
Ultimately, the identified sessions would be divided into 2 sets of ‘training sessions’ and ‘test sessions’. The training sessions set is employed to implement the suggested method, whereas the test sessions set is applied to evaluate it.
3.3 Automatic web session clustering
The ACGA is a GA-based algorithm for clustering sessions on the web. In general, this algorithm contains 2 phases; clustering based on similarity is implemented in the first phase.
3.3.1 Clustering-based similarity
A sample dataset
| | A1 | A2 | A3 | A4 | | A1 | A2 | A3 | A4 |
|---|---|---|---|---|---|---|---|---|---|
| R1 | 5.1 | 3.5 | 1.4 | 0.2 | R11 | 5.4 | 3.7 | 1.5 | 0.2 |
| R2 | 4.9 | 3.0 | 1.4 | 0.2 | R12 | 4.8 | 3.4 | 1.6 | 0.2 |
| R3 | 4.7 | 3.2 | 1.3 | 0.2 | R13 | 4.8 | 3.0 | 1.4 | 0.1 |
| R4 | 4.6 | 3.1 | 1.5 | 0.2 | R14 | 4.3 | 3.0 | 1.1 | 0.1 |
| R5 | 5.0 | 3.6 | 1.4 | 0.2 | R15 | 5.8 | 4.0 | 1.2 | 0.2 |
| R6 | 5.4 | 3.9 | 1.7 | 0.4 | R16 | 5.7 | 4.4 | 1.5 | 0.4 |
| R7 | 4.6 | 3.4 | 1.4 | 0.3 | R17 | 5.4 | 3.9 | 1.3 | 0.4 |
| R8 | 5.0 | 3.4 | 1.5 | 0.2 | R18 | 5.1 | 3.5 | 1.4 | 0.3 |
| R9 | 4.4 | 2.9 | 1.4 | 0.2 | R19 | 5.7 | 3.8 | 1.7 | 0.3 |
| R10 | 4.9 | 3.1 | 1.5 | 0.1 | R20 | 5.1 | 3.8 | 1.5 | 0.3 |
Nearest data for each data
| Data | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nearest | 18 | 13 | 4 | 3 | 1 | 19 | 3 | 1 | 4 | 2 | 20 | 8 | 2 | 9 | 17 | 15 | 11 | 1 | 6 | 5 |
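One plausible reading of the two tables above is that each data point is linked to its nearest neighbor, and the connected groups that emerge form the initial clusters. The following sketch implements that reading with Euclidean distance and a union-find grouping; the exact merging rule of the paper's first phase may differ.

```python
import math

def nearest_neighbour_clusters(data):
    """Link every point to its nearest other point; connected groups
    become the initial clusters (an illustrative reading of the tables)."""
    n = len(data)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(n):
        # index of the closest other point under Euclidean distance
        j = min((k for k in range(n) if k != i),
                key=lambda k: math.dist(data[i], data[k]))
        union(i, j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Two well-separated pairs collapse into two initial clusters.
clusters = nearest_neighbour_clusters([(0, 0), (0, 1), (10, 10), (10, 11)])
```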
3.3.2 Clustering-based genetic algorithm
Since the clustering in the first phase is based only on the similarities and relationships between the data, a technique is needed to combine such clusters under an appropriate measure and to extract the clusters with the most acceptable combination. The technique applied in this research is the genetic algorithm.
The input of the genetic algorithm is the average of the data of each cluster found in the first phase. For the example above, where clustering-based similarity grouped the 20 data points into 4 clusters, the input of the genetic algorithm is a dataset with 4 points, each representing the data of its cluster. The benefits are a reduced problem complexity, and the data do not leave their clusters during the cluster-integration procedure, so clusters do not break apart. Furthermore, the genetic algorithm is easier to implement and runs on less data.
3.3.2.1 Representation of the chromosomes
3.3.2.2 Fitness function
The fitness function is used to compute the value of each chromosome in the genetic algorithm. First, the distance between each sample and each cluster center is computed, and each sample is allocated to the cluster whose center is closest. This research assesses the final clusters with a novel evaluation criterion: a suitable clustering evaluation criterion follows both the compactness and the separation parameters, so the evaluation function has been created accordingly and is called CSMC; it also serves as the fitness function.
To calculate the separation of a cluster from another cluster, the distances between all of its components and those of the other cluster are computed; Eq. (6) represents the separation of 2 clusters, and the total distance gives the separation of a cluster with respect to another one. \(\left| {c_i} \right|\) is the number of components belonging to cluster \({c_i}\), and \({\vec x_{c_i}}\) stands for the data belonging to it. K represents the number of active clusters in a chromosome, obtained at the end; in fact, it is the number of active centers whose activation value is greater than 0.5.
Also, the smaller the difference between the maximum and minimum distances within a cluster, the more similar the data in that cluster.
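Since Eq. (6) is not reproduced in this excerpt, the following is a hypothetical stand-in that follows only the verbal description: it rewards large between-cluster separation (mean pairwise distance between members of different clusters) and penalizes within-cluster spread (the gap between the largest and smallest distances to the cluster center). The function names and the exact way the two factors are combined are assumptions, not the paper's CSMC formula.

```python
import math

def centre(cluster):
    """Mean point of a cluster of equal-dimension tuples."""
    dim = len(cluster[0])
    return [sum(x[d] for x in cluster) / len(cluster) for d in range(dim)]

def csmc_like(clusters):
    """Higher value = tighter, better-separated clustering (illustrative)."""
    compact = 0.0
    for c in clusters:
        mu = centre(c)
        dists = [math.dist(x, mu) for x in c]
        compact += max(dists) - min(dists)   # small gap = similar members
    sep = 0.0
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            # mean pairwise distance between the two clusters' members
            sep += sum(math.dist(a, b)
                       for a in clusters[i] for b in clusters[j]) / (
                           len(clusters[i]) * len(clusters[j]))
    return sep / (1.0 + compact)

good = csmc_like([[(0, 0), (0, 1)], [(10, 10), (10, 11)]])
bad = csmc_like([[(0, 0), (10, 10)], [(0, 1), (10, 11)]])
```

Under this stand-in, a partition with compact, well-separated groups scores higher than one that mixes distant points in the same cluster.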
3.3.2.3 Primary population production
A primary population must exist to initiate the genetic algorithm. A random procedure is commonly employed for producing the primary population, and since such a technique generates it quickly, it has been employed in this research.
3.3.2.4 Choosing parents
In this phase, some parent chromosomes are chosen on the basis of the fitness degree they have received from the evaluation function; after applying the genetic operators of crossover and mutation, these chromosomes produce the offspring. The roulette wheel method, one of the most convenient selection algorithms, has been chosen. In this technique, the selection probability values are first sorted and accumulated; then a random number is generated within the interval from 0 to 1. The interval selection relies on the fact that the probability values always sum to 1, so the index of the chosen chromosome is determined by comparing the random number with the roulette wheel intervals.
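The selection step above can be sketched as a standard roulette wheel: fitness values define cumulative intervals, a random number is drawn, and the chromosome whose interval contains it is returned. This is the textbook operator, not code from the paper.

```python
import random

def roulette_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    r = rng.random() * total          # scale the draw instead of normalizing
    cumulative = 0.0
    for individual, fit in zip(population, fitnesses):
        cumulative += fit
        if r < cumulative:
            return individual
    return population[-1]             # guard against floating-point rounding
```

Over many draws, an individual with fitness 3 should be chosen about three times as often as one with fitness 1.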
3.3.2.5 Crossover operator
3.3.2.6 Mutation operator
3.3.2.7 Updating population
Result of the ACGA algorithm
| Clustering technique | Number of clusters | Initial cluster centers | Execution time (s) | Accuracy (%) |
|---|---|---|---|---|
| ACGA | Automatic | Random | 78 | 76.2 |
According to Table 4, ACGA has shown the best accuracy in comparison with the GA based on SSE, the GA based on DB, GenClust, and ACDE. The run time of the ACGA algorithm is lower than those of GenClust and ACDE, but higher than those of the SSE-based and DB-based GA techniques.
3.4 The all k-th order Markov model for the clusters
In Eq. (9), \(1 \le j \le k\), where \(k\) stands for the order of the Markov model. \(w_{\left\{ j_{1}, j_{2}, \ldots, j_{k} \right\} \to i}\) refers to the number of times page \(i\) appears after the set \(\left\{ j_{1}, j_{2}, \ldots, j_{k} \right\}\) across all sessions.
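A sketch of estimating these k-th order transition probabilities from a cluster's training sessions: the probability of page i given a length-k context is the count of the context followed by i, divided by the total number of continuations of that context. The function name is an assumption for the sketch.

```python
from collections import defaultdict

def kth_order_model(sessions, k):
    """Estimate P(next page | previous k pages) from training sessions.

    sessions: list of page-id sequences (one list per session).
    Returns {context_tuple: {page: probability}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for t in range(len(session) - k):
            context = tuple(session[t:t + k])
            counts[context][session[t + k]] += 1   # w_{context -> page}

    model = {}
    for context, nxt in counts.items():
        total = sum(nxt.values())                  # all continuations
        model[context] = {page: c / total for page, c in nxt.items()}
    return model

# Four sessions starting with p5: p2 follows it in 2 of 4 cases.
m = kth_order_model([["p5", "p1"], ["p5", "p2"], ["p5", "p2"], ["p5", "p4"]], 1)
```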
3.5 Suggestion of pages based on user behavior
The probability of pages after one to three sequences
| Sequence | Predicted pages for that order | Combined probability |
|---|---|---|
| P5 | P1 (0.25), P2 (0.5), P4 (0.25) | P1 (0.25/1 = 0.25) |
| P3 P5 | P4 (1) | P2 (0.5/1 = 0.5) |
| P4 P3 P5 | P4 (1) | P4 (2.25/3 = 0.75) |
Hence, page P4 with a probability of 0.75 would be offered to the user as the next page.
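The combination step in the worked example can be sketched as follows: each order's model scores the candidate pages, and a page's combined score is the average of its probabilities over the orders in which it appears (the reading suggested by the table, where P4 scores (0.25 + 1 + 1)/3 = 0.75). The function name is an assumption.

```python
def combine_orders(predictions):
    """Average each page's probability over the orders that predict it.

    predictions: list of {page: probability} dicts, one per MM order
                 (1st order first, highest order last).
    """
    sums, hits = {}, {}
    for per_order in predictions:
        for page, prob in per_order.items():
            sums[page] = sums.get(page, 0.0) + prob
            hits[page] = hits.get(page, 0) + 1
    return {page: sums[page] / hits[page] for page in sums}

# The worked example above: contexts P5, P3 P5, and P4 P3 P5.
scores = combine_orders([
    {"P1": 0.25, "P2": 0.5, "P4": 0.25},  # 1st-order MM
    {"P4": 1.0},                           # 2nd-order MM
    {"P4": 1.0},                           # 3rd-order MM
])
# scores["P4"] == 0.75, so P4 is recommended as the next page.
```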
4 Time complexity analyses
\(S\) refers to the number of sessions.
\(MaxIteration\) denotes the maximum number of iterations of the GA.
\(NPOP\) stands for the population size of the GA.
\(N_{f}\) is the number of final clusters.
\(N_{T}\) represents the number of the testing data that recommender system recommends.
\(k\) indicates the number of orders.
\(N_{P}\) refers to the number of pages.
\(N_{R}\) stands for the number of pages recommended to a user.
Lemma 1
The time complexity of ACGA algorithm is \(O\left( {S + MaxIteration*NPOP} \right)\).
Proof
In the first part of ACGA, the complexity of the clustering-based-similarity algorithm largely depends on the number of sessions, \(O\left( S \right)\). In the second part, the GA depends on \(MaxIteration\) and \(NPOP\): within a given iteration, a solution is produced and evaluated for each of the \(NPOP\) population members, and this is repeated \(MaxIteration\) times. Hence, the complexity of the ACGA algorithm is \(O\left( {S + MaxIteration*NPOP} \right)\).□
Lemma 2
The time complexity of modified all k-th order MM algorithm is \(O\left( {k*N_{f} *S} \right)\).
Proof
The running time of the modified all k-th order MM is linear, because building each order of the MM takes time linear in the number of sessions; thus, building the all k-th order MM needs \(O\left( {k*S} \right) = O\left( S \right)\) for constant \(k\). This model is built for each cluster found by the ACGA algorithm, so the time complexity also depends on \(N_{f}\). Hence, the complexity of the modified all k-th order MM is \(O\left( {k*N_{f}*S} \right)\).□
Lemma 3
The time complexity of the suggested recommender system is \(O\left( {ACGA} \right) + O\left( {\text{modified all }k\text{-th order MM}} \right) + O\left( {N_{P}*N_{R}*N_{T}} \right)\).
Proof
Prior to running the main recommendation procedure, the sessions must be clustered and a probability matrix created for each cluster; the time complexities of the clustering algorithm and the Markov procedure are therefore summed when analyzing the recommender system. During the online phase, an active session must first be assigned to its best cluster by comparing the 2 vectors of the active session and a cluster centroid; this comparison depends on the number of pages in the system, with time complexity \(O\left( {N_{P} } \right)\). After determining the cluster of the active session, the high-probability pages are recommended to the user, giving \(O\left( {N_{P}*N_{R} } \right)\). If the recommendation is repeated for \(N_{T}\) test sessions, the time complexity is \(O\left( {N_{P}*N_{R}*N_{T} } \right)\).□
5 Evaluation and simulation results
Excel 2010 and MATLAB R2013a were used to simulate the introduced system. The system was first trained on a training dataset; it was then evaluated, and the active user simulated, on a set of test data that had not contributed to building the sequential model.
5.1 The CTI dataset
Dataset used in the experiment
Dataset | Period | Number of sessions | Number of pages |
---|---|---|---|
CTI | Two-week period during April 2002 | 13,745 | 683 |
5.2 Effective parameters in evaluation
A low CS value together with a high CSMC value indicates that the clustering maintains acceptable compactness and separation at the same time.
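The CS index [4] used here is the ratio of intra-cluster scatter to inter-centroid separation, so lower values mean more compact, better-separated clusters. A minimal sketch follows; CSMC is the paper's own fitness measure and is not reproduced here:

```python
import math

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cs_measure(clusters, centroids):
    """CS validity index (Chou et al. [4]): scatter / separation.

    Scatter: for each cluster, the mean of each point's distance to
    its farthest cluster-mate. Separation: for each centroid, the
    distance to its nearest other centroid. Lower CS is better.
    """
    scatter = sum(
        sum(max(euclid(x, y) for y in cluster) for x in cluster) / len(cluster)
        for cluster in clusters
    )
    separation = sum(
        min(euclid(m, n) for j, n in enumerate(centroids) if j != i)
        for i, m in enumerate(centroids)
    )
    return scatter / separation
```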
5.3 Results of clustering and evaluation of suggestion pages
The parameter values for ACGA and ACDE
ACGA parameter | Value | ACDE parameter | Value |
---|---|---|---|
Population size | 30 | Population size | 10*dimension of dataset |
Number of iterations | 50 | Number of iterations | 30 |
Crossover probability (\(\mu_{\text{c}}\)) | 0.20 | Crossover rate (\(CR_{max}\)) | 1 |
Mutation probability \(\mu_{m}\) | 0.15 | Crossover rate (\(CR_{min}\)) | 0.5 |
\(K_{min}\) | 2 | \(K_{min}\) | 2 |
\(K_{max}\) | 20 | \(K_{max}\) | 20 |
Final solution for each algorithm with CS and CSMC measures
Algorithm | Avg no. of clusters found | CS measure | Avg no. of clusters found | CSMC measure |
---|---|---|---|---|
K-means | 11.00 | 2.1333 | 6.00 | 0.3202 |
ACDE | 2.4 | 1.6223 | 3.00 | 0.3764 |
ACGA | 8.8 | 1.5126 | 6.00 | 0.4029 |
Effect of the number of clusters (K) in the K-means algorithm on the CS and CSMC measures
Number of clusters | CS measure | CSMC measure | Number of clusters | CS measure | CSMC measure |
---|---|---|---|---|---|
K = 2 | 2.8421 | 0.2979 | K = 17 | 2.5892 | 0.2897 |
K = 3 | 3.1079 | 0.2927 | K = 18 | 2.5026 | 0.3058 |
K = 4 | 2.8392 | 0.2916 | K = 19 | 2.4960 | 0.3022 |
K = 5 | 2.9024 | 0.3000 | K = 20 | 2.2442 | 0.2980 |
K = 6 | 2.8468 | 0.3202 | K = 21 | 2.2653 | 0.3015 |
K = 7 | 2.7692 | 0.2843 | K = 22 | 2.3247 | 0.3008 |
K = 8 | 2.6648 | 0.2849 | K = 23 | 2.4944 | 0.2983 |
K = 9 | 2.3685 | 0.2842 | K = 24 | 2.4453 | 0.3126 |
K = 10 | 2.8997 | 0.3070 | K = 25 | 2.3090 | 0.3174 |
K = 11 | 2.1333 | 0.3031 | K = 26 | 2.3642 | 0.3136 |
K = 12 | 2.5040 | 0.2959 | K = 27 | 2.2449 | 0.3050 |
K = 13 | 2.5520 | 0.3047 | K = 28 | 2.1834 | 0.3160 |
K = 14 | 2.6190 | 0.2966 | K = 29 | 2.3111 | 0.2989 |
K = 15 | 2.2384 | 0.3201 | K = 30 | 2.2711 | 0.2972 |
K = 16 | 2.5795 | 0.3099 | | Min(CS) = 2.1333 | Max(CSMC) = 0.3202 |
The main objective was to examine whether the final clusters obtained by ACGA are better than those of the other clustering methods with respect to the fitness function (CSMC) and the cost function (CS). The outputs report the average number of clusters and the measure values over five independent runs of the 3 methods. ACGA attained lower CS and higher CSMC values across these runs.
CS measure of ACGA for 10, 30, and 50 population size \(\left| {P_{s} } \right|\)
\(\left| {P_{s} } \right|\) | ACGA \(\left( {\left| {P_{s} } \right| = 10} \right)\) | ACGA \(\left( {\left| {P_{s} } \right| = 30} \right)\) | ACGA \(\left( {\left| {P_{s} } \right| = 50} \right)\) |
---|---|---|---|
CS measure | 2.1891 | 1.5126 | 1.2972 |
Avg no. of clusters found | 10.2 | 8.8 | 9.6 |
CSMC measure of ACGA for 10, 30, and 50 population size \(\left| {P_{s} } \right|\)
\(\left| {P_{s} } \right|\) | ACGA \(\left( {\left| {P_{s} } \right| = 10} \right)\) | ACGA \(\left( {\left| {P_{s} } \right| = 30} \right)\) | ACGA \(\left( {\left| {P_{s} } \right| = 50} \right)\) |
---|---|---|---|
CSMC measure | 0.3580 | 0.4029 | 0.4300 |
Avg no. of clusters found | 9.2 | 6.00 | 8.00 |
CS measure of ACGA for 20, 30, and 40 generations
\(N\) | ACGA \(\left( {N = 20} \right)\) | ACGA \(\left( {N = 30} \right)\) | ACGA \(\left( {N = 40} \right)\) |
---|---|---|---|
CS measure | 1.5563 | 1.5126 | 1.5208 |
Avg no. of clusters found | 7.6 | 8.8 | 10.00 |
CSMC measure of ACGA for 20, 30, and 40 generations
\(N\) | ACGA \(\left( {N = 20} \right)\) | ACGA \(\left( {N = 30} \right)\) | ACGA \(\left( {N = 40} \right)\) |
---|---|---|---|
CSMC measure | 0.4046 | 0.4029 | 0.4095 |
Avg no. of clusters found | 7.8 | 6.00 | 7.00 |
CS measure of ACGA for 10, 50, and 100 iterations
\(Iter\) | ACGA \(\left( {iter = 10} \right)\) | ACGA \(\left( {iter = 50} \right)\) | ACGA \(\left( {iter = 100} \right)\) |
---|---|---|---|
CS measure | 1.8542 | 1.5126 | 1.3218 |
Avg no. of clusters found | 10.2 | 8.8 | 7.2 |
CSMC measure of ACGA for 10, 50, and 100 iterations
\(Iter\) | ACGA \(\left( {iter = 10} \right)\) | ACGA \(\left( {iter = 50} \right)\) | ACGA \(\left( {iter = 100} \right)\) |
---|---|---|---|
CSMC measure | 0.3558 | 0.4029 | 0.4057 |
Avg no. of clusters found | 9.8 | 6.00 | 7.00 |
Results of different Markov models for page suggestion with 9 clusters under the CS measure
Number of page recommendation | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1st-MM | Accuracy | 0.41 | 0.31 | 0.28 | 0.24 | 0.22 | 0.20 | 0.20 | 0.19 | 0.17 | 0.18 | 0.18 |
Coverage | 0.98 | 0.98 | 0.99 | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 1.00 | 0.99 | |
2nd-MM | Accuracy | 0.36 | 0.29 | 0.27 | 0.26 | 0.24 | 0.23 | 0.21 | 0.21 | 0.19 | 0.19 | 0.19 |
Coverage | 0.80 | 0.82 | 0.82 | 0.80 | 0.79 | 0.79 | 0.78 | 0.74 | 0.75 | 0.72 | 0.73 | |
3rd-MM | Accuracy | 0.34 | 0.28 | 0.27 | 0.25 | 0.23 | 0.23 | 0.20 | 0.22 | 0.20 | 0.19 | 0.18 |
Coverage | 0.57 | 0.55 | 0.51 | 0.48 | 0.47 | 0.48 | 0.44 | 0.44 | 0.41 | 0.41 | 0.39 | |
New all-3rd-order MM | Accuracy | 0.44 | 0.32 | 0.29 | 0.26 | 0.23 | 0.22 | 0.22 | 0.20 | 0.18 | 0.19 | 0.18 |
Coverage | 0.98 | 0.98 | 0.99 | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 1.00 | 0.99 |
Results of different Markov models for page suggestion with 8 clusters under the CSMC measure
Number of page recommendation | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1st-MM | Accuracy | 0.43 | 0.32 | 0.30 | 0.26 | 0.24 | 0.22 | 0.21 | 0.19 | 0.18 | 0.18 | 0.18 |
Coverage | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.98 | 0.99 | 0.99 | 1.00 | 0.98 | |
2nd-MM | Accuracy | 0.39 | 0.32 | 0.29 | 0.27 | 0.26 | 0.24 | 0.22 | 0.20 | 0.20 | 0.19 | 0.19 |
Coverage | 0.81 | 0.83 | 0.82 | 0.79 | 0.78 | 0.77 | 0.77 | 0.74 | 0.73 | 0.70 | 0.71 | |
3rd-MM | Accuracy | 0.35 | 0.29 | 0.27 | 0.26 | 0.25 | 0.23 | 0.22 | 0.22 | 0.19 | 0.19 | 0.18 |
Coverage | 0.57 | 0.56 | 0.50 | 0.47 | 0.46 | 0.47 | 0.43 | 0.43 | 0.39 | 0.40 | 0.39 | |
New all-3rd-order MM | Accuracy | 0.45 | 0.34 | 0.30 | 0.28 | 0.25 | 0.24 | 0.23 | 0.21 | 0.21 | 0.19 | 0.19 |
Coverage | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.98 | 0.99 | 0.99 | 1.00 | 0.98 |
Notably, the evaluation of the suggestions strongly depends on the number of pages that must be presented to the current users; each set of pages has distinct coverage and accuracy values, which vary with that number. According to the results, the proposed technique improves the set of suggested pages and identifies the users' interests precisely while keeping high accuracy. Combining the 2 factors of accuracy and coverage in the F-measure harmonic function demonstrates the advantage of the proposed algorithm over the other algorithms.
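Assuming the F-measure here is the standard harmonic mean of the two factors, it can be computed as follows (the function name is illustrative):

```python
def f_measure(accuracy, coverage):
    """Harmonic mean of accuracy and coverage; defined as 0 when both are 0."""
    if accuracy + coverage == 0:
        return 0.0
    return 2 * accuracy * coverage / (accuracy + coverage)
```

Because the harmonic mean is dominated by the smaller factor, a model such as the 3rd-order MM with high accuracy but low coverage scores worse than the all-3rd-order model, which keeps both high.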
Also, the complexity of IHKSCR is \(O\left( {g*T_{hs} *T_{KM} *d*K*S^{2} } \right) + O\left( {N_{P} *N_{R} *N_{T} } \right)\), where \(g\) is the number of generations for which the recommender system produces recommendations [8].
The complexity of the KMMPSPR algorithm is \(O\left( {T_{KM} *d*K*S} \right) + O\left( {k*K*S} \right) + O\left( {P*k*K*S} \right) + O\left( {N_{P} *N_{R} *N_{T} } \right)\) [30], where \(P\) is the number of pages in the dataset. The complexity of the KMM algorithm is \(O\left( {T_{KM} *d*K*S} \right) + O\left( {k*K*S} \right) + O\left( {N_{P} *N_{R} *N_{T} } \right)\).
The results indicate that the proposed algorithm is computationally more expensive than the KMMPSPR [30] and KMM algorithms, but less expensive than HSCR [8], HKSCR [8] and IHKSCR [8].
6 Conclusion
The present research proposed a recommender system for offering web page suggestions to web users. In the first phase, the user sessions were vectorized on the basis of user interest, and an automatic clustering algorithm was used to cluster them. The modified Markov models of orders 1 to 3 were then applied to build the probability matrix. During the online phase, the recommender system identified the nearest cluster as soon as the user logged in, and then proposed new pages through the all 3rd-order Markov model. The suggested system was finally compared with the baseline algorithms in terms of coverage, accuracy, and F-measure, and it achieved greater accuracy on the set of introduced pages. Two limitations offset the advantages of the proposed recommender system. First, the probability matrix of the Markov models requires a large amount of memory: some columns of this matrix hold zero probabilities and are never used. A possible solution is to use a hash table to store only the non-zero probabilities. Second, the initial cluster centers in the second phase of the ACGA algorithm are selected at random. To increase the clustering accuracy, a possible solution is to choose the initial centers by a combined deterministic and random process. These issues will be addressed in future work.
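The hash-table remedy suggested for the memory limitation could be realized along the following lines; the class and method names are illustrative, not from the paper:

```python
class SparseProbMatrix:
    """Store only non-zero transition probabilities, keyed by (state, page).

    Avoids allocating a dense N_P x N_P matrix whose columns are
    mostly zero; absent entries read back as probability 0.0.
    """
    def __init__(self):
        self._data = {}

    def set(self, state, page, prob):
        if prob > 0.0:  # zero probabilities are simply not stored
            self._data[(state, page)] = prob

    def get(self, state, page):
        return self._data.get((state, page), 0.0)

    def __len__(self):
        return len(self._data)
```

A Python dict is itself a hash table, so lookups stay O(1) on average while memory grows only with the number of observed transitions.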
Acknowledgements
The author thanks Dr. Ali Harounabadi and Fateme Hamzeheian for helpful discussions of this research, and sincerely thanks the reviewers for their comments, which helped improve the work.
Compliance with ethical standards
Conflict of interest
The corresponding author states that there is no conflict of interest.
References
- 1.Almurtadha Y, Bin Sulaiman M, Mustapha N, Udzir N (2011) IPACT: improved web page recommendation system using profile aggregation based on clustering of transactions. Am J Appl Sci 8(3):277–283CrossRefGoogle Scholar
- 2.Alyari F, Jafari Navimipour N (2018) Recommender systems: a systematic review of the state of the art literature and suggestions for future research. Kybernetes 47(5):985–1017CrossRefGoogle Scholar
- 3.Chen J, Wang B, Liji U, Ouyang Z (2019) Personal recommender system based on user interest community in social network model. Phys A 526:1–14MathSciNetCrossRefGoogle Scholar
- 4.Chou C, Su M, Lai E (2004) A new cluster validity measure and its application to image compression. Pattern Anal Appl 7(2):205–220MathSciNetCrossRefGoogle Scholar
- 5.Ciobanu D, Dinuca CE (2012) Predicting the next page that will be visited by a web surfer using page rank algorithm. Int J Comput Commun 1(6):60–67Google Scholar
- 6.Deshpande M, Karypis G (2004) Selective markov models for predicting web page accesses. ACM Trans Internet Technol (TOIT). 4(2):163–184CrossRefGoogle Scholar
- 7.Dhyani D, Bhowmick SS, Ng W (2003) Modelling and predicting web page accesses using markov processes. In: 14th international workshop IEEE database and expert systems applications, pp 332–336Google Scholar
- 8.Forsati R, Moayedikia A, Shamsfard M (2015) An effective web page recommender using binary data clustering. Springer, New York, pp 1–48Google Scholar
- 9.Hu M, Liu B (2004) Mining and summarizing reviews. In: Proc of ACMSIGKDD Intl on Knowledge Discovery and Data Mining (KDD’ 04), pp 168–172Google Scholar
- 10.Isinkaye F, Folajimi Y, Ojokoh B (2015) Recommendation systems: principles, methods and evaluation. Egypt Inform J 16:261–273CrossRefGoogle Scholar
- 11.Lin H, Yang F, Kao Y (2005) An efficient GA-based clustering technique. Tamkang J Sci Eng 8(2):113–122Google Scholar
- 12.Liu H, Keŝelj V (2007) Combined mining of web server logs and web contents for classifying user navigation patterns and predicting user's future requests. Data Knowl Eng 61:304–330CrossRefGoogle Scholar
- 13.Liu Y, Li Z, Xiong H, Gao X, Wu S (2013) Understanding and enhancement of internal clustering validation measures. IEEE Trans Cybern 43(3):982–993CrossRefGoogle Scholar
- 14.Lloyd SP (1992) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–136MathSciNetCrossRefGoogle Scholar
- 15.Maulik U, Bandyopadhyay S (2000) Genetic algorithms-based clustering technique. Pattern Recogn 33:1455–1465CrossRefGoogle Scholar
- 16.Mamoun A, Khalil I (2012) Prediction of user’s web-browsing behavior: application of Markov model. IEEE Trans Syst Man Cybern Part B Cybern 42(2):1131–1142Google Scholar
- 17.Mishra R, Kumar P, Bhasker B (2015) A web recommendation system considering sequential information. Decis Support Syst 75:1–10CrossRefGoogle Scholar
- 18.Milovanĉevi NS, Graĉanac A (2019) Time and ontology for resource recommendation system. Phys A 525:752–760CrossRefGoogle Scholar
- 19.Modarresi K (2016) Recommendation system based on complete personalization. Proc Comput Sci 80:2190–2204CrossRefGoogle Scholar
- 20.Mulyawan B, Christani V, Wenas R (2019) Recommendation product based on customer categorization with K-means clustering method. In: IOP conference series materials science and engineering, vol 508, pp 1–7Google Scholar
- 21.Narvekar M, Banu SS (2015) Predicting user’s web navigation behavior using hybrid approach. Int Conf Adv Comput Technol Appl (ICACTA) 45:3–12Google Scholar
- 22.Pitkow J, Pirplli P (1999) Mining longest repeating subsequence to predict World Wide Web surfing. In: 2^{nd} USENIX symposium on internet technologies and systems, Boulder, COGoogle Scholar
- 23.Poornalatha G, Raghavendra RS (2011) Web user session clustering using modified K-means algorithm. In: Communications in computer and information science (CCIS), Advances in computing and communications (ACC 2011), vol 191, pp 243–252Google Scholar
- 24.Poonalatha G, Prakash SR (2017) Session based collaborative filtering for web page recommender (SCFR) system based on clustering. Int J Control Theory Appl (IJCTA) 10(8):679–687Google Scholar
- 25.Kumar P, Kumar V, Thakur RS (2019) A new approach for rating prediction system using collaborative filtering. Iran J Comput Sci 2:81CrossRefGoogle Scholar
- 26.Rahman A, Islam Z (2014) A hybrid clustering technique combining a novel genetic with K-means. Knowl-Based Syst 71:345–365CrossRefGoogle Scholar
- 27.Selvi C, Sivansankar E (2017) A novel optimization algorithm for recommender system using modified fuzzy c-means clustering approach. Soft Comput 23(6):1901–1916CrossRefGoogle Scholar
- 28.Singh M (2004) The practical handbook of internet computing, 1st edn. CRC Press, New YorkCrossRefGoogle Scholar
- 29.Swagatam D, Ajith A, Amit K (2008) Automatic clustering using an improved differential evolution algorithm. IEEE Trans Syst Man Cybern Part A Syst Hum 38(1):218–237CrossRefGoogle Scholar
- 30.Thew P (2014) Web page access prediction based on an integrated approach. Int J Comput Sci Bus Inform 12(1):55–64Google Scholar
- 31.Thiyagarajan R, Thangavel K, Rathipriya R (2014) Recommendation of web pages using weighted K-Means clustering. Int J Comput Appl 86(14):44–48Google Scholar
- 32.Vijaya Kumar T, Guruprasad HS (2015) Clustering of web usage data using hybrid K-means and PACT algorithm. Int J Inf Technol 7(2):871–876Google Scholar
- 33.Yang C, Thiphuong Quyen N (2018) Data analysis framework of sequential clustering and classification using non-dominated sorting genetic algorithm. Applied Soft Computing. Elsevier, New York, pp 1–15Google Scholar
- 34.Zheng N, Li Q (2011) A recommender system based on tag and time information for social tagging systems. Expert Syst Appl 38:4575–4587CrossRefGoogle Scholar