Keywords

1 Introduction

With the rapid development of mobile Internet and mobile terminal location technology in recent years, location-based service (LBS), which has been applied to various applications, is becoming the standard configuration of mobile Internet applications. All kinds of LBS applications, which can be mainly categorized as entertainment, social networking, life service, and business service, are merged into social networking services (SNS), thereby changing the way people socialize. For instance, the traditional relationship of sociality is based on friends or some kind of stable social relation constructed by interests. However, through the LBS, we can find a user who is engaged in the same activity, in the same place, and at the same time. As a result, we build a new social relation based on location. Moreover, as a dimension of information filter, location can be used to improve the validity and accuracy of a user’s fetching and sharing of information. Furthermore, relying on their significant relationship and interest spectra, location-based social networks (LBSNs) construct an online to offline (O2O) pattern [1] through releasing text, images, videos, and audios that possess location information. The O2O pattern improves the propagandist strength of a businessman as well as makes finding valuable locations easy for a user.

2 Background

As LBSNs develop, information overload appears inevitable because of the promotion of user-based and location information. To solve such a problem, a recommender system based on location service is proposed and is gradually becoming a popular subject of current studies. In current mainstream LBSNs, a user’s location preferences are mainly measured by the number of check-ins. For example, Berjani and Strufe [2] established a user-site score matrix by crawling Gowalla’s data, after which they gain results by using orthogonal matrix decomposition. Ying et al. [3] established UPOI-Mine, a recommender system driven by a prediction model based on a regression tree. Ye et al. [4] combined the geographic feature of social relation and location, and proposed the geo-measured friend-based collaborative filtering algorithm. While the above methods are only concerned with check-in records and ignore the semantic information between locations, which may easily lead to a problem of identifying similar locations, Lee and Chung [5] used semantic information of sites to compute the similarity of users for the first time. However, they neglected the effect of distance. Currently, research on LBSNs is mainly based on collaborative filtering [6], but the large data in LBSNs will result in sparsity of check-in matrix. Thus, Leung et al. [7] used community location model (CLM) to show the relations among users, activities, and locations. They used community-based agglomerative-divisive clustering to cluster CLM and reduce the sparsity. In addition, for a better universal method of reducing the dimension of user check-in matrix, Zhou et al. [8] proposed the use of PLSA topic model to determine the implicit subject between users and locations.

In LBSNs, each user’s check-in information of location becomes easy to obtain because of the openness of such information. By analyzing different users’ check-in behaviors, we can recommend users with similar location preferences and possible interesting locations, which makes finding valuable user and information in massive data effective for users as well as increases O2O business through location recommendation. SNS data are characterized by a large order of magnitude, authenticity, and instantaneity. Therefore, we believe it can reflect users’ behavior effectively.

3 RC Algorithm

In this paper, clustering all the sites is an important part of the location recommendation algorithm because it can reduce the dimension and improve the recommendation accuracy. Clustering, which is mainly based on the similarity of data objects, involves dividing data objects with similar characteristics based on certain rules into different subsets, which are usually called clusters or groups. The goals of clustering are to make object similarity in a cluster as high as possible and make the object similarity among clusters as low as possible [9]. The common clustering algorithm can be divided into partitioning methods [10], hierarchical methods [11], density-based methods [12], grid-based methods [13] and model-based methods.

The RC algorithm clusters sites based on the domain density according to the aggregation effect of geography so that it can reduce the dimension and avoid user similarity reduction because of check-in error. Subsequently, it calculates the signed location for each cluster from the clustering result and transfers a cluster that contains multiple locations into an abstract signed location with the geographic coordinate and hierarchical category. Then, it transfers the check-in vector of a user relative to the location into the check-in vector of the signed location of the location cluster and calculates the similarity among users. Finally, it selects users with Top-K similarity as a user recommendation, after which we use the recommended users for collaborative filtering calculation based on location preferences. We obtain the Top-N score locations as the location recommendation when we have finished adding the preference of the target user relative to this class of location to the calculation of unvisited sites.

Fig. 1.
figure 1

Hierarchical categories tree based on location cluster

Here i would like to introduce user similarity calculation based on hierarchical category tree. As shown in Fig. 1 we can construct a hierarchical tree based on the location cluster by transferring the category property to the location cluster. The bottom of the tree is the location cluster layer, while CategorySmall, CategoryMedium, and CategoryLarge are the small, middle, and large categories, respectively, to which the location cluster belongs. A high hierarchy corresponds to a large grain size and more location clusters. Constructing the hierarchical category tree improves the calculation of the similarity among users. For instance, user \( u_{a} \) always checked in location cluster \( c_{1} \), whose signed location is Commercial Street of Beijinglu, while user \( u_{b} \) always checked in the location cluster \( c_{2} \), whose signed location is Commercial Street of Shangxiajiu; the abovementioned two users will have no similarity because they have not checked in the same location cluster if we consider the cluster layer only. However, if we consider the small category property of \( c_{1} \) and \( c_{2} \), then we will find that both \( c_{1} \) and \( c_{2} \) belong to the characteristic commercial walking street. Thus, users \( u_{a} \) and \( u_{b} \) possess a similarity in the CategorySmall layer. For the other hierarchy, if users do not possess a common check-in category in a hierarchy, then we could consider a higher hierarchy. If users have a similarity in the lower hierarchy, then their similarity is higher. For example, users \( u_{a} \) and \( u_{b} \) have common check-in location clusters in the cluster layer, while users \( u_{a} \) and \( u_{c} \) do not possess a similarity in the same layer. But if users \( u_{a} \) and \( u_{c} \) have a common check-in category in the CategorySmall layer, then the similarity between users \( u_{a} \) and \( u_{b} \) is higher than the similarity between users \( u_{a} \) and \( u_{c} \). Therefore, to reduce the weight of hierarchy in the user similarity calculation, the hierarchy with a higher grain size should be multiplied by the corresponding coefficient.

4 DCC Algorithm

The RC algorithm contains different kinds of locations after clustering, and the category property of the signed location would cover the category property of other locations when using the signed location conversion algorithm. Therefore, the final check-in vector of the user to the location cluster cannot represent the true preference of the user in each category. To maintain the user’s original location category preference when clustering the location, we propose the DCC algorithm. DCC algorithm clusters the locations that are near each other or have similar categories to improve the similarity among locations in the cluster. As shown in Figs. 2 and 3, while the RC algorithm can find only conglomerated location clusters, which are not overlapped in the geographical space, the DCC algorithm can find the crossed location clusters, thereby obtaining higher flexibility in location clustering.

Fig. 2.
figure 2

location cluster result of RC algorithm

Fig. 3.
figure 3

Location cluster result of DCC algorithm

4.1 Similarity Based on Distance and Category

  1. (1)

    Similarity in location distance

    Locations have geographical position attributes, and they are represented by two points in a two-dimensional orthogonal coordinate system. In the two-dimensional space, we can use the Euclidean distance to measure the distance between two locations. However, we should use spherical distance formula to calculate the actual distance of two locations because Earth is a sphere. If location \( p_{i} \)’s geographic coordinate is \( (log_{i} ,lat_{i} ) \), location \( p_{j} \)’s geographic coordinate is \( (log_{j} ,lat_{j} ) \), and R = 6370856 (meters) is the rough radius of Earth, then the distance formula of the two locations can be shown as:

    $$ Dis\left( {p_{i} ,p_{j} } \right) = R\cdot\cos^{ - 1} (\sin \left( {lat_{i} } \right) \cdot \sin \left( {lat_{j} } \right) + \cos (lat_{i} ) \cdot \cos (lat_{j} ) \cdot \cos (lon_{i} - lon_{j} )) $$
    (1)

    According to the geographical agglomeration effect, the locations with closer distances will have higher similarities. Thus, location distance similarity is inversely related to their distance.

    $$ Sim_{distance} \left( {p_{i} ,p_{j} } \right) = \frac{1}{{1 + Dis\left( {p_{i} ,p_{j} } \right)}} $$
    (2)
  2. (2)

    Similarity in location category

    Aside from the similarity in distance, locations have similarity in category. According to Fig. 1, for independent trees “catering service” and “sports and leisure service,” calculating the similarity between the \( c_{3} \) in “catering service” and \( c_{4} \) in “sports and leisure service” is possible. However, it is not different from the common classification recommendation if we cannot calculate the similarity among different large categories. As a solution, we defined that it has similarities with different degrees with each location in the location model based on distance and category. Therefore, we add a root layer upon the CategoryLarge layer so that all the nodes in the hierarchical category tree possess a common ancestor node, that is, the root category. As a result, the hierarchical category tree becomes a connected graph.

    Intuitively, the category similarity of the two locations can be transferred into the distance between the two leaf nodes in the hierarchical category tree, and a higher distance, which is indicated by the number of edges between the two leaf nodes, corresponds to lower similarity. However, we cannot obtain a satisfactory accuracy if we use only the edge number between the two leaf nodes as the measurement standard of the similarity. For instance, the two leaf nodes, (cinema) and (Cantonese restaurant) have an edge number of eight. The nodes (cinema) and (colleges and universities) also have an edge number of eight, but obviously, we are more likely to have dinner in a nearby Cantonese restaurant after watching a movie. Therefore, to show this kind of similarity, we set different weights for each edge.

    1. (1)

      Division of the category of edges

      First, we divide the edges from the cluster layer to the CategoryLarge layer into two categories: one is the \( e_{cluster} \), the edges from the cluster layer to the CategorySmall layer, which is a known value; the others are the \( e_{categorysmall} , e_{categorymedium} , \) and \( e_{categorylarge} \), which correspond to each hierarchy in CategorySmall, and this kind of edges should be constructed from low to high.

    2. (2)

      Construction of the weights of \( e_{categorysmall} \)

      In this paper, we use the dataset crawled from Sina microblog to construct \( e_{categorysmall} \). We count the total number of check-ins of every category in the CategorySmall layer, and we label it \( CheckinNum_{j} \)

      $$ e_{categorysmall} = \frac{1}{{1 + \log_{10} CheckinNum_{j} }} $$
      (3)

      We use the log function to reduce the effect on the calculation of similarity because of significant data diversity. We add one to avoid a zero denominator. The \( e_{categorysmall} \) is inversely related to the check-in number, that is, a higher check-in number corresponds to lower weights of its edges. Therefore, clustering is simplified because of the close distance to other categories.

    3. (3)

      Construction of the weights of \( e_{categorymedium} \) and \( e_{categorylarge} \)

      We take the average of the edge weights of the lower class that is connected directly to the edge as the weights of \( e_{categorymedium} \) and \( e_{categorylarge} \).

      $$ w_{categorymedium} = \frac{{\mathop \sum \nolimits w_{categorysmall} }}{{n_{categorysmall} }} $$
      (4)
      $$ w_{categorylarge} = \frac{{\mathop \sum \nolimits w_{categorymedium} }}{{n_{categorymedium} }} $$
      (5)
    4. (4)

      Construction of the weights of \( e_{cluster} \)

      We define the weights of \( e_{cluster} \) as half of the minimum value in \( w_{categorysmall} \).

      $$ w_{cluster} = 0.5 \times { \hbox{min} }(w_{categorysmall} ) $$
      (6)
    5. (5)

      Category similarity calculation among the locations

      After finishing the construction of the weights of each hierarchy’s edge, we calculate the category similarity among the locations in the dataset. We define the category similarity by the following formula because it is inversely related to distance.

      $$ Sim_{category} \left( {p_{i} ,p_{j} } \right) = \frac{1}{{\mathop \sum \nolimits w_{cluster} + \mathop \sum \nolimits w_{category} }} $$
      (7)
  3. (3)

    Total similarity of locations

    Location similarity based on distance and category essentially involves clustering related locations through the balance of distance and category. If two locations \( p_{a} \) and \( p_{b} \) have a close range but they belong to two large categories with significant difference, then they cannot be clustered. By contrast, if the above locations have a far range but they belong to the same small category, then they are likely to be clustered.

    $$ Sim\left( {p_{i} ,p_{j} } \right) = \alpha \cdot Sim_{category} \left( {p_{i} ,p_{j} } \right) \cdot Sim_{distance} \left( {p_{i} ,p_{j} } \right) $$
    (8)

4.2 Location Clustering Based on Affinity Propagation (AP) Algorithm

We will obtain a location similarity matrix \( S \) after calculating the similarity of N locations. This section will cluster the location similarity matrix through the AP algorithm and select the clustering center automatically through the information delivered by the locations. Compared with the DBSCAN algorithm, the AP algorithm effectively confirms the parameters of the clustering algorithm, and we could obtain a satisfactory result if we use only the default value. In addition, selecting a signed location after clustering to ensure a practical clustering effect is unnecessary. Furthermore, the AP algorithm can handle a massive dataset in a short time and obtain an ideal result.

We add the similarity of distance and category into the definition of similarity among locations, after which we use the AP algorithm to select the most representative location as the typical point of clustering. Therefore, the nodes in the result set may have a close range or similar category. In conclusion, the RC algorithm has a better effect on location clustering than the algorithm based on space density.

4.3 Calculation of User Similarity and Recommendation Result

After clustering the location, we transfer the user check-in vector, calculate the location preference of each user, and calculate the Top-K similar users of each user. Finally, we calculate the prediction score of the unvisited location according to the location preference of the target recommendatory user and show the locations with Top-N prediction scores to the target user as the final recommendation.

5 Experimental Results and visualization

5.1 Evaluation criteria

We use an offline experiment method to evaluate the effect of the algorithm in this paper. The data for the experiment is crawled through the Sina microblog API. For each target user, we use 80 % of their check-in record as the training set and the remaining 20 % for the test set, which is used to verify the effect of the recommendation. In addition, we select precision and recall, which are the most common index in the evaluation of the recommender system performance.

$$ Precision = \frac{{\mathop \sum \nolimits_{u \in U} \left| {R(u)\mathop \cap \nolimits T(u)} \right|}}{{\mathop \sum \nolimits_{u \in U} \left| {R(u)} \right|}} $$
(9)
$$ Recall = \frac{{\mathop \sum \nolimits_{u \in U} \left| {R(u)\mathop \cap \nolimits T(u)} \right|}}{{\mathop \sum \nolimits_{u \in U} \left| {T(u)} \right|}} $$
(10)

Here, \( R\left( u \right) \) is the location recommendation based on the user’s check-in record in the training set, and \( T(u) \) is the user’s check-in record in the test set.

5.2 Visualization

To strengthen the ability to display the multidimensional information of this recommender system and improve the intuition of the recommendation, we created a visualization model for the user recommendation that is based on force-directed algorithm, and a visualization for location recommendation is implemented.

Fig. 4.
figure 4

Visualization of historic check-in record

We concentrated on elaborating the design of the recommendatory algorithm. Thus, we display only the interfaces of historic check-in record (Fig. 4), DCC user recommendation (Fig. 5), and DCC location recommendation (Fig. 6).

Fig. 5.
figure 5

Visualization of DCC user recommendation

Fig. 6.
figure 6

Visualization of DCC location recommendation

6 Conclusion and Future Work

Social networking based on location service is a network platform with multidimensional information. It reflects extensive information about an individual in society to facilitate a thorough understanding of problems and laws in our daily life through the study of LBSNs.

Except for distance and category, we did not include additional feature information in the business of location into location clustering, such as the law of check-in time, location tag, and user comments. We expect to enhance the effect of location clustering by including additional feature information in future research.