
A new method for detection of clustering based on four zones Apollonius circle

  • Shahin Pourbahrami
  • Sohrab Azimpour
Original Article

Abstract

In many fields of machine learning, such as classification and clustering, neighborhood construction algorithms are used to model local relationships between data samples and to build global structure from local information. Because neighborhoods capture the connections among data points, finding them accurately is a central problem in data processing. If the geometric relationships between the data points in a neighborhood area are explored accurately, the behavioral rules and similarities among the data can be observed, and direct and indirect neighborhood ranges can be identified. This study aims to construct neighborhoods accurately by means of Apollonius circle zones. Experimental validation against the well-known k-nearest neighbor and ε-neighborhood methods on real data sets also indicates the robustness of the method.

Keywords

Apollonius circle zones · Data mining · Neighborhood connection · Clustering

1 Introduction

In any neighborhood, data points are usually located so that there is local mutual similarity among them, and the nature of the relationships among them is a major way of characterizing them. One of the main tasks of clustering is to analyze data and categorize it into similar groups through a thorough analysis of neighborhoods. The notion of neighborhood can be used to construct accurate neighborhood models by adapting to the data features of similar categories. By similarity, we mean the intrinsic dependence within the data.

One of the most well-known algorithms for defining and clustering neighborhood data by means of Euclidean distance is the k-nearest neighbor (k-NN) algorithm [1]. The algorithm is very simple and highly useful for determining neighborhoods. However, the only criteria it uses for determining a neighborhood are the distance and the geometric location of the points; statistical rules are not taken into account.
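For reference, the following is a minimal sketch of how such a k-NN neighborhood is typically obtained with scikit-learn; the toy data and the choice k = 3 are arbitrary and only for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 2-D data; the points and k = 3 are arbitrary choices for illustration.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [3.0, 3.0], [3.2, 2.9], [2.9, 3.1]])

nn = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(X)
distances, indices = nn.kneighbors(X)

for i, (d, idx) in enumerate(zip(distances, indices)):
    # idx[0] is the point itself (distance 0); the rest are its nearest neighbors.
    print(f"point {i}: neighbors {idx[1:].tolist()} at distances {np.round(d[1:], 3).tolist()}")
```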

The main criterion in such data sets is the extent of similarity within the data of each group [2], while the data located in two different groups are not similar to each other. The value of the parameter k can be determined dynamically, or it can be fixed before the clustering algorithm starts. The method proposed here resolves the shortcomings of existing methods, namely their fixed neighborhood area and their high complexity. To be more specific, the geometric structure of the Apollonius circle is used so that all direct and indirect relationships among the data points in a set are identified. Our algorithm is applied to clustering problems to assess its performance, and no prior information about the data set is assumed. The absence of fixed neighborhood zones when finding neighborhoods among data points is one advantage; the other is that no parameter tuning is necessary.

The Apollonius circle offers a suitable structure for neighborhood formation: it geometrically separates the defined zone from the neighboring data in the next nearest zone, so that the most similar data are located within the Apollonius zones defined for them. Neighborhood structures constructed by our algorithm work well when (1) there is no need to examine each individual point in the neighborhood data set, (2) density varies within a data set, and (3) a geometric structure can be extracted for assessing the local similarities among the data points. The proposed algorithm is attractive because interesting zone connections can be extracted for locating neighborhoods, and it reduces the complexity of previous algorithms. Moreover, it is able to detect outliers in noisy data sets.

The rest of this paper is organized as follows. Section 2 reviews the literature on neighborhood construction, and Sect. 3 presents the proposed algorithm. Section 4 applies the method to several real data sets and presents and discusses the results. The last section (Sect. 5) elaborates on the conclusions obtained from this study.

2 Review of related literature

Finding the neighborhoods and connections of data points is highly important for clustering and grouping data, identifying social network communities, and bundling interrelated edges in data visualization, and it remains a challenging issue. In big data sets and in data mining generally, neighborhood construction should lead to accurate grouping within the database. Different methods for constructing data point neighborhoods and clustering have been elaborated in many papers and help to analyze the data.

Several other methods have been proposed for improving classification and clustering by means of graphs and neighborhoods. Graphs such as the Gabriel graph [4, 5] are among the geometric methods used for examining direct and indirect relationships among points. As shown in Fig. 1, the Gabriel graph considers the circle whose diameter is the segment between points a and b; the two points have a direct relationship provided that no other point lies inside this neighborhood circle. If such a relationship between a and b cannot be established, the number of points falling inside the Gabriel circle is taken as the density between the two points. In the NC algorithm [3], the geometric structure and this density are used as the criteria for clustering based on the Gabriel graph.
Fig. 1

Gabriel graph [3]
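As a concrete illustration of this rule, the sketch below tests whether two points are Gabriel neighbors: the circle with segment ab as diameter must contain no third point. The helper name and the toy data are ours, not taken from [3, 4, 5].

```python
import numpy as np

def gabriel_neighbors(a, b, points):
    """Return True if a and b are Gabriel neighbors: no other point lies
    strictly inside the circle whose diameter is the segment ab."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    center = (a + b) / 2.0
    radius = np.linalg.norm(a - b) / 2.0
    for p in points:
        p = np.asarray(p, float)
        if np.allclose(p, a) or np.allclose(p, b):
            continue
        if np.linalg.norm(p - center) < radius:  # p falls inside the diameter circle
            return False
    return True

pts = [(0, 0), (2, 0), (1, 0.4), (5, 5)]
print(gabriel_neighbors((0, 0), (2, 0), pts))  # False: (1, 0.4) lies inside the circle
print(gabriel_neighbors((2, 0), (5, 5), pts))  # True in this toy configuration
```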

Figure 2a shows the relative neighborhood graph (RNG), in which points a and b are the centers of two circles whose radius is the distance between them. Points a and b are neighbors as long as no other point lies in the lune formed by the two circles; the problem with both this construction and the Gabriel graph, however, is their fixed neighborhood area, which limits the efficiency of locating neighborhood points. Figure 2b displays the β-skeleton, in which different β values are used for neighborhood construction between a and b [6, 7]. The RNG and the Gabriel graph correspond to particular values of β and are used when there is no need to regulate this parameter.
Fig. 2

a RNG and b β-skeleton [6]
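A similar check can be written for the RNG lune: a and b are RNG neighbors if no third point is closer to both of them than they are to each other. This sketch and its toy data are ours; the β-skeleton generalizes the same idea by scaling the empty region with β.

```python
import numpy as np

def rng_neighbors(a, b, points):
    """Relative neighborhood graph test: a and b are neighbors if there is no
    point c with max(d(a, c), d(b, c)) < d(a, b), i.e. the lune between a and b is empty."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    d_ab = np.linalg.norm(a - b)
    for c in points:
        c = np.asarray(c, float)
        if np.allclose(c, a) or np.allclose(c, b):
            continue
        if max(np.linalg.norm(a - c), np.linalg.norm(b - c)) < d_ab:
            return False
    return True

pts = [(0, 0), (4, 0), (2, 1), (10, 10)]
print(rng_neighbors((0, 0), (4, 0), pts))    # False: (2, 1) lies in the lune
print(rng_neighbors((4, 0), (10, 10), pts))  # True in this toy configuration
```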

Data (information) classification in the framework of data mining has been used in different areas including data analysis [8], identification of communities in social networks [9, 10, 11], and bundling edges in data (information) visualization [12, 13].

The ε-neighborhood method determines a neighborhood by a small radius, the epsilon distance [14]. In other words, the simple epsilon (ε) algorithm uses a fixed radius around each point to define its neighborhood, and the data points falling in this area are labeled as its neighbors. If an unsuitable parameter is selected, the efficiency of the method drops, and the resulting neighborhood structure may be weaker and less accurate than that of other algorithms. With a small epsilon value, a point may have no neighbors within the radius and may mistakenly be treated as an outlier; with a large radius, different groups may merge into a single one. In this method of sample clustering, there are three types of samples: core samples, border samples, and noise samples.
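A minimal illustration of this radius-based neighborhood uses scikit-learn's DBSCAN, which implements exactly the core/border/noise distinction mentioned above; the parameter values below are arbitrary examples, and the second call shows the sensitivity to a poorly chosen eps described in the text.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.1], [9.0, 0.0]])

# eps is the neighborhood radius, min_samples controls what counts as a core point;
# both values here are arbitrary illustrations.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 -1]; label -1 marks points treated as noise/outliers

# Too small a radius: every point may end up labeled as noise (-1).
print(DBSCAN(eps=0.05, min_samples=2).fit_predict(X))
```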

In the K-associated optimal graph, the parameter K determines the number of neighbors. Starting from K = 1 and increasing K, groups of data points are formed and their purity value is estimated; the purity is calculated from the in- and out-connections of the data points and the number associated with them in each group [15]. As K increases at each step, groups are likely to merge, which requires recalculating the purity value. If the purity increases, the merge is kept; if there is no increase in purity, the two groups are not merged, the previous state is maintained, and a suitable combination is obtained.
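The purity measure itself is not spelled out here; under one common reading of [15], the purity of a component is its average (in plus out) degree divided by 2K. The sketch below follows that reading and should be treated as an assumption rather than the exact formula of the cited work.

```python
import networkx as nx

def component_purity(G, nodes, K):
    """Purity of a component of a K-associated (directed) graph, read as the
    average total degree of its nodes divided by 2K (assumption based on [15])."""
    degrees = [G.in_degree(v) + G.out_degree(v) for v in nodes]
    return sum(degrees) / (len(degrees) * 2 * K)

# Tiny directed toy graph standing in for a K-associated graph with K = 2.
G = nx.DiGraph([(0, 1), (1, 0), (0, 2), (2, 1), (3, 4), (4, 3)])
print(component_purity(G, [0, 1, 2], K=2))  # component {0, 1, 2}
print(component_purity(G, [3, 4], K=2))     # component {3, 4}
```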

Metaheuristic methods are used for clustering social networks at large scale. Since these networks are built from social behavior and social communications, hierarchical clustering, spectral clustering, and partitioning approaches are applied to the social network graph, and algorithms such as ACO and PSO have been used for extracting community structures. The positions and velocities of the particles are used to analyze the network topology and form sub-graphs, and the quality of the sub-graphs is evaluated with the modularity criterion [16, 17].
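For completeness, modularity, the quality criterion these swarm-based methods optimize, can be computed directly; below is a small illustration with NetworkX on a toy graph (the graph and partition are arbitrary examples of ours).

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Toy social graph: two tightly knit triangles joined by a single edge.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
partition = [{0, 1, 2}, {3, 4, 5}]

# Higher modularity means the partition captures denser-than-expected groups.
print(modularity(G, partition))
```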

Neighborhood construction by Apollonius region (NCAR) first finds high-density points, known as target points. For each target point, the NCAR algorithm finds the farthest point, and from the target point and this distant point the center and radius of the Apollonius circle are determined, leading to the formation of the original Apollonius circles [18]. However, this algorithm is not highly accurate for high-dimensional data sets.

3 Proposed method

Obtaining high precision when locating data point neighborhoods is highly important. The new method proposed here finds data point neighborhoods precisely and avoids the problems of older methods (e.g., a fixed neighborhood area and high complexity).

In addition, the method extracts patterns of direct and indirect connections and detects their intensity. By detecting the groups having direct and indirect connections, the algorithm can identify groups that look similar and locate precise data point neighborhoods (in particular, outliers can be located). Moreover, it can create unique groups without any need to regulate a neighborhood parameter or to repeat redundant groupings of the data, and it has a higher level of reliability. Motivated by the problems mentioned above, this paper proposes a method for determining neighborhoods that needs no fixed neighborhood area and is not complicated; based on the distance, the angle, and the intensity between points, accurate decisions can be made. The Apollonius circle, a famous construction in geometry, is used to solve the problems of previous studies, with the goal of constructing unique neighborhoods. The newly proposed method is called neighborhood construction by four-zone Apollonius circles (NCFZA) and operates purely on the geometry of the data points: the relationships among the points are not known in advance, and some databases may even lack topological information. The proposed method can lead to more accurate neighborhood construction than k-nearest neighbor and ε-neighborhood, examines all important neighborhood states comprehensively, involves fewer calculations than those methods, and in this way obtains different neighborhood zones from different angles.

What is the Apollonius circle? The Apollonius circle is the locus of points X in the Euclidean plane whose distances to two fixed points A and B have a constant ratio k, that is, \( d(X, A)/d(X, B) = k \) with \( k \ne 0 \) and \( k \ne 1 \) [18].

How is the Apollonius circle defined in an infinite domain? When k = 1, the locus degenerates into the perpendicular bisector of the segment AB, i.e., a straight line, which can be regarded as an Apollonius circle of infinite radius.
For a point X equidistant from A and B, i.e., d(A, X) = d(X, B), let α denote the angle \( {\text{A}}\widehat{\text{X}}{\text{B}} \); then:
  (a) if α = 60°, then d(A, B) = d(X, B) = d(A, X);
  (b) if α > 60°, then d(A, B) > d(X, B) = d(A, X), so A and B have an indirect connection;
  (c) if α < 60°, then d(A, B) < d(X, B) = d(A, X), so A and B have a direct connection.
Proof by geometric construction Suppose A and B are two fixed points in the plane ℙ. Draw the straight line through A and B, and take points C and D on this line that divide the segment AB internally and externally in the ratio k, i.e., \( \frac{CA}{CB} = \frac{DA}{DB} = k \). The circle whose diameter is the segment CD is the desired locus: every point of this circle has distances to A and B in the ratio k. The Apollonius circle is displayed in Fig. 3.
Fig. 3

Apollonius circle
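The center and radius of this circle follow directly from the ratio definition: squaring d(X, A) = k·d(X, B) gives a circle with center (A − k²B)/(1 − k²) and radius k·d(A, B)/|1 − k²|. The small numerical check below uses names and sample values of our own choosing.

```python
import numpy as np

def apollonius_circle(A, B, k):
    """Center and radius of the Apollonius circle {X : d(X, A) / d(X, B) = k}, k > 0, k != 1."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    center = (A - k**2 * B) / (1 - k**2)
    radius = k * np.linalg.norm(A - B) / abs(1 - k**2)
    return center, radius

A, B, k = np.array([0.0, 0.0]), np.array([4.0, 0.0]), 0.5
center, radius = apollonius_circle(A, B, k)

# Verify: every point on the circle has d(X, A) / d(X, B) = k.
theta = np.linspace(0, 2 * np.pi, 8, endpoint=False)
X = center + radius * np.c_[np.cos(theta), np.sin(theta)]
ratios = np.linalg.norm(X - A, axis=1) / np.linalg.norm(X - B, axis=1)
print(center, radius)          # [-1.333  0.] and 2.667 for these sample values
print(np.allclose(ratios, k))  # True
```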

As shown in Fig. 4, the Apollonius circle is divided into four zones based on the angles \( \gamma ,\;\beta \;{\text{and}}\;\alpha \). By determining in which angular zone a point lies, and hence whether its connection is direct or indirect, the algorithm can identify similar groups.
Fig. 4

Four zones neighborhood Apollonius circle

Note: the angle \( {\text{A}}{\widehat{\text{M}}}{\text{B}} = \gamma \) corresponds to the case d(A, M) = d(A, B); therefore:
$$ \begin{aligned} d\left( {A,B} \right)^{2} & = d\left( {M,A} \right)^{2} + d\left( {M,B} \right)^{2} \\ & \quad - 2d\left( {A,M} \right)d\left( {M,B} \right)\cos \gamma , \\ \end{aligned} $$
(1)
$$ \cos \gamma = \frac{{d\left( {M,B} \right)}}{{2d\left( {A,M} \right)}} = \frac{1}{2k}. $$
(2)
Note: the angle \( {\text{A}}\widehat{\text{M}}^{\prime } {\text{B}} = \beta \) corresponds to the case d(A, B) = d(\( M^{\prime} \), B); therefore:
$$ \begin{aligned} d\left( {A,B} \right)^{2} & = d\left( {M^{\prime } ,A} \right)^{2} + d\left( {M^{\prime } ,B} \right)^{2} \\ & \quad - 2d\left( {A,M^{\prime } } \right)d\left( {M^{\prime } ,B} \right)\cos \beta . \\ \end{aligned} $$
(3)
$$ \cos \beta = \frac{{d\left( {A,M^{\prime } } \right)}}{{2d\left( {M^{\prime } ,B} \right)}} = \frac{k}{2}. $$
(4)
It can be concluded that, using an angle α with \( \frac{k}{2} < \,\cos \alpha \, < \frac{1}{2k} \) together with the angles γ and β defined above, four zones can be identified on the Apollonius circle, so that the direct and indirect connections of A, B, and M can be identified and examined further:
$$ {\text{if}}\,\,\,\,\alpha > \gamma \to d\left( {A,B} \right) > d\left( {A,M} \right) > d\left( {M,B} \right), $$
(5)
$$ {\text{if}}\,\,\,\gamma \le \alpha \le \beta \to d\left( {M,B} \right) \le d\left( {A,B} \right) \le d\left( {A,M} \right), $$
(6)
$$ {\text{if}}\,\,\,\alpha < \beta \to d\left( {A,B} \right) < d\left( {M,B} \right) < d\left( {A,M} \right). $$
(7)
The angles mentioned above and shown in Fig. 4 divide the Apollonius circle into the four zones \( M^{\prime\prime\prime}\,T\,N \), \( M\,P\,M^{\prime} \), \( M^{\prime}\,Q\,M^{\prime\prime} \), and \( M^{\prime\prime}\,R\,M^{\prime\prime\prime} \). Let a data point X lie on the Apollonius circle (Fig. 5); then:
Fig. 5

Data point X on Apollonius circle

Zone 1:
$$ {\text{if}}\,\,\,X \in M^{\prime } \,Q\,M^{\prime \prime } \to d\left( {A,B} \right) < d\left( {B,X} \right) < d\left( {A,X} \right) \equiv \alpha < \beta . $$
Zone 2:
$$ {\text{if}}\,\,\,X \in \,M\,P\,M^{\prime} \to d\left( {B,X} \right) \le d\left( {A,B} \right) \le d\left( {A,X} \right) \equiv \gamma \le \alpha \le \beta . $$
Zone 3:
$$ {\text{if}}\,\,\,X \in M^{\prime \prime \prime } \,T\,N \to d\left( {A,B} \right) > d\left( {A,X} \right) > d\left( {B,X} \right) \equiv \alpha > \gamma . $$
Zone 4:
$$ {\text{if}}\,\,\,\,X \in M^{\prime \prime } \,R\,M^{\prime \prime \prime } \to d\left( {B,X} \right) \le d\left( {A,B} \right) \le d\left( {A,X} \right) \equiv \alpha > \beta . $$
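To make the three angular regimes concrete, the sketch below ranks d(A, B) against d(A, X) and d(B, X) for a candidate point X and reports the corresponding regime from Eqs. (5)-(7); mapping each regime to one of the four named circle zones follows Fig. 4, so the interpretation printed here is an illustration of our reading rather than part of the derivation.

```python
import numpy as np

def connection_regime(A, B, X):
    """Classify point X with respect to head points A and B using Eqs. (5)-(7):
    the position of d(A, B) among the three pairwise distances decides the regime."""
    A, B, X = (np.asarray(p, float) for p in (A, B, X))
    d_ab = np.linalg.norm(A - B)
    d_ax = np.linalg.norm(A - X)
    d_bx = np.linalg.norm(B - X)
    if d_ab < min(d_ax, d_bx):     # Eq. (7): alpha < beta, A and B directly connected
        return "alpha < beta (A and B directly connected)"
    if d_ab > max(d_ax, d_bx):     # Eq. (5): alpha > gamma, A and B indirectly connected
        return "alpha > gamma (A and B indirectly connected)"
    return "gamma <= alpha <= beta (intermediate zone)"  # Eq. (6)

A, B = (0.0, 0.0), (4.0, 0.0)
print(connection_regime(A, B, (2.0, 6.0)))  # X far away: d(A, B) is the smallest distance
print(connection_regime(A, B, (2.0, 0.5)))  # X between A and B: d(A, B) is the largest
print(connection_regime(A, B, (4.5, 2.0)))  # mixed case
```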

The angles specified on the circle \( \left( {\gamma ,\beta \;{\text{and}}\;\alpha } \right) \) in zone 1 indicate that B has a direct relationship with the points X in this zone, so all points located in this zone form a new cluster, since the formula indicates that the distance of these points to B is less than their distance to A. A and B definitely have an indirect relationship with each other.

In zone 2, B has a direct relationship with the Xs in this zone. It also has a direct relationship with A, but Xs in this zone have indirect relationship with A. If the circle was located in the direction of A, all these relationships would be reversed, and points X and A would be located in the same cluster.

In zone 3, A and B have a direct relationship with each other, and in case the Xs in this zone approach another fixed point on the right, A and B would be located in the same cluster while the Xs in this zone can be located inside another cluster.

The formula and angles specified on the circle for zone 4 indicate that the points X in this zone definitely have a direct relationship with B and form a cluster with this point. Most importantly, if the circle is drawn on the side of A, the mentioned points will instead form a new cluster with A.

This happens gradually in the following way. First, points A and B are randomly selected as cluster heads from among all the points in the data set. Then all the other points are ordered hierarchically (from least to most distant) based on their distance from these two points. In the second step, the Apollonius circle is drawn and the positions of the points on the circle (points like X) and inside the circle are identified; the points that have a direct relationship with point B are placed in cluster B (according to the distances defined for the different zones), and the points that have a direct relationship with point A are placed in cluster A. This process continues for all points of the head clusters, the parameter k is updated accordingly (with k < 1 for the points assigned to B), and new cluster heads are defined. The process is repeated for all other cluster heads, and the selection of cluster heads and their neighborhood points in the third step is not independent of the points determined in the second step. In the final step, the points lying outside the circle radius are allotted to the cluster with which they share the largest number of common neighbors. A simplified, single-round sketch of this flow is given below.
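The following is a highly simplified, single-round sketch of the flow just described: two head points are chosen, the remaining points are ordered by distance, and each point is tentatively attached to the head with which it has a direct relation. The choice of assignment rule (nearest head as a stand-in for the zone test), the handling of k, and the stopping behavior are our own simplifications, not the exact NCFZA procedure.

```python
import numpy as np

def head_cluster_round_sketch(points, rng=None):
    """One illustrative round: pick two head points at random, order the rest by
    distance to the heads, and attach each point to the nearer head (a stand-in
    for the zone-based direct-relation rule; the real NCFZA rules are richer)."""
    rng = np.random.default_rng(rng)
    points = np.asarray(points, float)
    a, b = rng.choice(len(points), size=2, replace=False)
    rest = [i for i in range(len(points)) if i not in (a, b)]
    # hierarchical ordering from least to most distant, as described in the text
    rest.sort(key=lambda i: min(np.linalg.norm(points[i] - points[a]),
                                np.linalg.norm(points[i] - points[b])))
    clusters = {int(a): [int(a)], int(b): [int(b)]}
    for i in rest:
        d_ai = np.linalg.norm(points[i] - points[a])
        d_bi = np.linalg.norm(points[i] - points[b])
        head = int(a) if d_ai <= d_bi else int(b)  # "direct relation" proxy (assumption)
        clusters[head].append(i)
    return clusters

pts = [[0, 0], [0.3, 0.2], [0.1, 0.4], [5, 5], [5.2, 4.9], [4.8, 5.1]]
print(head_cluster_round_sketch(pts, rng=0))
```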

4 Experimental results

To evaluate the efficiency of NCFZA against two well-known methods (k-NN and ε-neighborhood) on six real-world data sets (Table 1), we conducted several experiments. As the results in Fig. 6 show, NCFZA proved to be highly accurate in all cases except Sonar. As an illustration, on the Wine data set NCFZA reached an accuracy index of 0.794, while in the same situation k-NN and ε-neighborhood reached 0.6452 and 0.5012, respectively.
Table 1

Details of real-world datasets taken from UCI

Dataset      #Instances   #Features   #Clusters
Iris         150          4           3
Wine         178          13          3
Heart        270          13          2
Waveform     5000         21          3
Sonar        208          60          2
Glass        214          10          6

Fig. 6

Comparison of RI (Rand Index) values obtained by different clustering methods

To evaluate the clustering accuracy of the constructed neighborhoods, where neighbors should be similar to one another and points in different groups should be far apart, we use the criterion proposed by Rand (1971), the Rand Index given in Eq. 8.
$$ RI = \frac{a + d}{a + b + c + d}, $$
(8)
where a is the number of point pairs with the same cluster label and the same class label; b is the number of pairs with the same cluster label and different class labels; c is the number of pairs with different cluster labels and the same class label; and d is the number of pairs with different cluster labels and different class labels.
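Equation (8) can be computed directly from the two labelings; below is a small self-contained helper of our own, equivalent to the pair counts defined above.

```python
from itertools import combinations

def rand_index(cluster_labels, class_labels):
    """Rand Index of Eq. (8): fraction of point pairs on which the clustering
    and the ground-truth classes agree (both together or both apart)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        if same_cluster and same_class:
            a += 1
        elif same_cluster and not same_class:
            b += 1
        elif not same_cluster and same_class:
            c += 1
        else:
            d += 1
    return (a + d) / (a + b + c + d)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0: perfect agreement
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # lower: the pairings disagree
```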

Using the RI metric, we obtained the results shown in Fig. 6. As the figure shows, the proposed method clearly outperforms the previous methods; in particular, it shows the best performance on the Waveform data set.

5 Conclusion

Most neighborhood detection methods are based on distance or on parameter-based learning methods that are sensitive to changes in the data point values; they do not provide a reliable and adjustable neighborhood zone, in other words, they lack adaptability. The big advantage of the proposed method is that, for studying the relationships among the data points, both their distances and the connections among them can be used. Moreover, it defines a flexible and changeable neighborhood area by means of the Apollonius circle.


References

  1. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley-Interscience, New York (2001)
  2. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco (2001)
  3. İnkaya, T., Kayalıgil, S., Özdemirel, N.E.: An adaptive neighbourhood construction algorithm based on density and connectivity. Pattern Recogn. Lett. 52, 17–24 (2015)
  4. İnkaya, T.: A density and connectivity based decision rule for pattern classification. Expert Syst. Appl. 42, 906–912 (2015)
  5. İnkaya, T.: A parameter-free similarity graph for spectral clustering. Expert Syst. Appl. 42, 9489–9498 (2015)
  6. Cardinal, J., Collette, S., Langerman, S.: Empty region graphs. Comput. Geom. Theory Appl. 42, 183–195 (2009)
  7. Kirkpatrick, D.G., Radke, J.D.: A framework for computational morphology, vol. 13, pp. 43–72 (1984)
  8. Tsai, W.-P., Huang, S.-P., Shao, S.-T., Cheng, K.-T., Chang, F.-J.: A data-mining framework for exploring the multi-relation between fish species and water quality through self-organizing map. Sci. Total Environ. 8, 474–483 (2016)
  9. Singh, K., Shakya, H.K., Biswas, B.: Clustering of people in social network based on textual similarity. Perspect. Sci. 8, 570–573 (2016)
  10. Zhou, H., Li, J., Li, J., Zhang, F., Cui, Y.: A graph clustering method for community detection in complex networks. Phys. A 469, 551–562 (2016)
  11. Wang, M., Zuo, W., Wang, Y.: An improved density peaks-based clustering method for social circle discovery in social networks. Neurocomputing 179, 219–227 (2016)
  12. Arleo, A., Didimo, W., Liotta, G., Montecchiani, F.: Large graph visualizations using a distributed computing platform. Inf. Sci. 381, 124–141 (2016)
  13. Guo, H., Yu, Y., Skitmore, M.: Visualization technology-based construction safety management: a review. Autom. Constr. 73, 135–144 (2016)
  14. Pedrycz, W.: The design of cognitive maps: a study in synergy of granular computing and evolutionary optimization. Expert Syst. Appl. 37, 7288–7294 (2010)
  15. Mohammadi, M., Raahemi, B., Mehraban, S.A., Bigdeli, E., Akbari, A.: An enhanced noise resilient K-associated graph classifier. Expert Syst. Appl. 42, 8283–8293 (2015)
  16. Cai, Q., Gong, M., Ma, L., Ruan, S., Yuan, F., Jiao, L.: Greedy discrete particle swarm optimization for large-scale social network clustering. Inf. Sci. 316, 503–516 (2015)
  17. Gong, M., Cai, Q., Chen, X., Ma, L.: Complex network clustering by multiobjective discrete particle swarm optimization based on decomposition. IEEE Trans. Evol. Comput. 18, 82–97 (2014)
  18. Pourbahrami, S., Khanli, L.M., Azimpour, S.: Novel and efficient data point neighborhood construction algorithm based on Apollonius circle. Expert Syst. Appl. 115, 57–67 (2019)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Faculty of Computer Engineering, University of Tabriz, Tabriz, Iran
  2. Faculty of Mathematics, University of Farhangian, Tehran, Iran
