# A new method for detection of clustering based on four zones Apollonius circle


## Abstract

In many fields of machine learning, such as classification and clustering, neighborhood construction algorithms are used to model local relationships between data samples and to build global structure from local information. Neighborhoods are undeniably useful for finding connections among data points, so a major issue is finding a novel approach to locating the neighborhood of each data point. If the geometric relationships existing between the data points in the neighborhood area are accurately explored, it becomes feasible to observe the behavioral rules as well as the similarities among the data, and to identify direct and indirect neighborhood ranges. This study aims to find neighborhoods accurately by means of Apollonius circle zones. Experimental validation against the well-known *k*-nearest neighbor and ε-neighborhood methods indicates the robustness of the method on real data sets.

## Keywords

Apollonius circle zones · Data mining · Neighborhood connection · Clustering

## 1 Introduction

In any neighborhood, data points are usually located so that there is local mutual similarity among them, and the nature of the relationships among them is a major way of defining them. One of the main features of clustering is data analysis and categorization into similar groups through a thorough analysis of neighborhood. The notion of neighborhood can be used to construct very accurate neighborhood models by adapting to the data features within similar categories. By similarity, we mean the intrinsic dependence within the data.

One of the most well-known algorithms used to define neighborhoods and cluster data by means of Euclidean distance is the *k*-nearest neighbor (*k*-NN) algorithm [1]. This algorithm is very simple and highly useful in determining neighborhood. However, the only criteria it uses for determining neighborhood are the distance and the geometric position of the points; statistical rules are not taken into account.
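As a baseline for later comparison, a minimal *k*-NN neighborhood query over Euclidean distance can be sketched as follows (a generic illustration, not the authors' implementation; the point set and `k` are arbitrary example values):

```python
import numpy as np

def knn_neighbors(points, k):
    """Indices of the k nearest neighbors (Euclidean distance) of each point."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distance matrix via broadcasting.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]  # k closest indices per row

pts = [[0, 0], [0, 1], [5, 5], [0.5, 0.5]]
print(knn_neighbors(pts, 2))  # point 0's two nearest neighbors are points 3 and 1
```

Note that the result depends only on distances and geometric positions, which is exactly the limitation discussed above.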

The main criterion in such data sets is the extent of similarity within the data in each group [2]; data located in two different groups are not similar to each other. The parameter *k* can be determined dynamically, or it can be fixed before starting the clustering algorithm. In the proposed method, the shortcomings of existing methods, including their fixed neighborhood area and their high complexity, are resolved. To be more specific, the geometric structure of the Apollonius circle is used so that all direct and indirect relationships among the data points in a set are identified. Our algorithm is applied to clustering problems to demonstrate its performance, and it is assumed that there is no prior information about the data set. One advantage is the absence of fixed neighborhood zones when optimizing the process of finding neighborhoods among data points; another is that no parameter tuning is necessary.

The Apollonius circle offers a suitable structure for geometrically separating a defined zone of neighboring data from the next nearest zone, so that the most similar data are located within the Apollonius zones defined for them. Neighborhood structures constructed by our algorithm work well when (1) there is no need to examine each individual point in the neighborhood data set, (2) density varies within a data set, and (3) a geometric structure must be extracted to assess the local similarities among the data points. The proposed algorithm was adopted because interesting zone connections can be extracted for locating neighborhoods, and it reduces the complexity of previous algorithms. Moreover, the proposed algorithm is able to detect outlier data in different noisy data sets.

In the remainder of this paper, a review of the literature on neighborhood construction is provided, and an overview of the proposed algorithm is presented. Section 4 applies the method to several real data sets and presents and discusses the results. The last section (Sect. 5) elaborates on the conclusions obtained from this study.

## 2 Review of related literature

Finding data point neighborhoods and connections is highly important for clustering domains and grouping data, identifying social network communities, and bundling interrelated edges in data visualization; it is therefore a challenging issue. Among big data sets, finding data point neighborhoods should generally lead to accurate grouping within the database. Different methods of constructing data point neighborhoods and clusters are among the important issues elaborated upon in different papers that help to analyze the data.

In Gabriel's graph, two points *a* and *b* define a circle whose diameter is the segment between them, and this circle is used to measure the relationship between the two points: *a* and *b* have a direct relationship provided that no other point lies inside Gabriel's neighborhood circle. If such a relationship between point *a* and point *b* cannot be identified, the number of points falling inside Gabriel's circle is defined as the density between the two points. In the NC algorithm [3], the geometric structure and density based on Gabriel's graph are used as the criteria for clustering.

In the relative neighborhood graph (RNG), point *a* and point *b* are the centers of two circles, and the distance between the two points is the radius of each circle. Point *a* and point *b* are neighbors as long as no other point lies between them (in the region shared by the two circles). The problem, in both this method and Gabriel's graph, is the fixed neighborhood area, which prevents high efficiency in locating neighborhood points. In Fig. 2b, the *β*-skeleton is displayed with different *β* values for neighborhood construction between *a* and *b* [6, 7]. Like Gabriel's graph and RNG, the *β*-skeleton is used when there is no need to regulate parameters.

Data (information) classification in the framework of data mining has been used in different areas including data analysis [8], identification of communities in social networks [9, 10, 11], and bundling edges in data (information) visualization [12, 13].

The ε-neighborhood method determines neighborhoods by a small radius, the epsilon (*ε*) distance [14]: the neighbors of a point are the data points that lie within this radius. The efficiency of the epsilon method drops when unsuitable parameters are selected, so the resulting algorithm may be weaker and less accurate in its neighborhood structure than other algorithms. With a small epsilon value, a point may have no neighbors within the given radius and may mistakenly be taken as an outlier, whereas with a large radius, different groups of points may merge into a single group. In this method of sample clustering, there are three types of samples: core samples, border samples, and noise samples.
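A minimal sketch of the ε-neighborhood query described above (a generic illustration, not tied to any particular implementation; the point set and radius are example values):

```python
import numpy as np

def epsilon_neighborhood(points, eps):
    """For each point, the set of other points within distance eps of it."""
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    n = len(points)
    return [set(np.flatnonzero(dist[i] <= eps)) - {i} for i in range(n)]

pts = [[0, 0], [0, 0.5], [0.4, 0.2], [10, 10]]
nbrs = epsilon_neighborhood(pts, 0.6)
# With eps = 0.6, point 3 has no neighbors and risks being labeled an outlier,
# illustrating the small-radius failure mode discussed above.
```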

The *K*-associated optimal graph, obtained from the *K*-associated graph algorithm, determines the number of neighborhoods. Starting from *K* = 1 and moving to higher *K* values, the algorithm groups the data points and estimates the purity value, which is calculated from the edges entering and leaving each data point and the number associated with them in each group [15]. As the *K* value increases at each step, groups are likely to merge, necessitating recalculation of the purity value. If the purity value increases, the merge is kept; if there is no increase in purity value, the two groups are not integrated, the previous state is maintained, and a suitable combination is obtained.

Metaheuristic methods are used for clustering the social networks at large scale. Since these social networks are constructed according to their social behavior and social communications, hierarchical clustering, spectral clustering, and partition approaches are used in social networks graph. Therefore, the algorithms such as ACO and PSO have been used for extracting community structures. The position and speed of the particles have been used for analyzing the network topology and making sub-graphs, and the quality of sub-graphs is evaluated using modularity criterion [16, 17].

Neighborhood construction by Apollonius region (NCAR) initially finds high-density points known as target points. For each target point, the NCAR algorithm finds the farthest point, and from the target and farthest points the center and radius of the Apollonius circles are determined, leading to the formation of the original Apollonius circles [18]. However, this algorithm is not highly accurate for high-dimensional data sets.

## 3 Proposed method

Obtaining a high level of precision in locating data point neighborhoods is highly important. This new method provides a rather precise way of finding data point neighborhoods that avoids the problems of older methods (e.g., a fixed neighborhood area and high complexity).

In addition, it extracts the patterns of direct and indirect connections and detects their intensity. By detecting the groups with direct and indirect connections, the algorithm can identify groups that look similar and help locate precise data point neighborhoods (i.e., outliers can be located). Moreover, it can create unique groups without any need to regulate a neighborhood parameter or to repeat redundant groupings of the data, and it has a higher reliability level. Given the problems mentioned above, this paper proposes a method for determining neighborhoods that needs no fixed neighborhood areas and is not complicated; based on the distance, angle, and intensity between the points, accurate decisions can be made. The Apollonius circle, a famous geometric construction, is used to help solve the problems of previous studies, with the purpose of constructing unique neighborhoods. The newly proposed method, neighborhood construction by four-zone Apollonius (NCFZA), is applicable to data-point-based geometry in which the relationships among points are unclear, and even to databases lacking topological information. The proposed method can yield more accurate neighborhood construction than *k*-nearest neighbor and ε-neighborhood: it examines all important neighborhood states comprehensively, it involves fewer calculations than the mentioned methods, and in this way different neighborhood zones viewed from different angles are obtained.

**What is the Apollonius circle?** The Apollonius circle is the geometric locus of points in the Euclidean plane whose distances to two fixed points *A* and *B* have a constant ratio *k* (\( k \ne 0 \) and \( k \ne 1 \)) [18].
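For illustration, this locus has a well-known closed form: for ratio *k*, the center is \((A - k^2 B)/(1 - k^2)\) and the radius is \(k\,|AB| / |1 - k^2|\). A small sketch verifying the defining ratio property numerically (the points and *k* are example values, not from the paper):

```python
import math
import numpy as np

def apollonius_circle(A, B, k):
    """Center and radius of the locus {X : d(X, A) / d(X, B) = k}, k > 0, k != 1."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    center = (A - k**2 * B) / (1 - k**2)
    radius = k * np.linalg.norm(A - B) / abs(1 - k**2)
    return center, radius

A, B, k = (0.0, 0.0), (1.0, 0.0), 0.5
c, r = apollonius_circle(A, B, k)
# Check the ratio property at a sample point on the circle.
X = (c[0] + r, c[1])
print(math.dist(X, A) / math.dist(X, B))  # ≈ 0.5
```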

**How is the Apollonius circle defined in an infinite domain?** The Apollonius circle is defined in an infinite domain when *k* = 1: the locus becomes the perpendicular bisector of the segment between *A* and *B*, i.e., a straight line, which can be viewed as an Apollonius circle of infinite radius.

- (a) if *α* = 60°, then *d*(*A*, *B*) = *d*(*X*, *B*) = *d*(*A*, *X*);
- (b) if *α* > 60°, then *d*(*A*, *B*) > *d*(*X*, *B*) = *d*(*A*, *X*); therefore *A*, *B* have an indirect connection;
- (c) if *α* < 60°, then *d*(*A*, *B*) < *d*(*X*, *B*) = *d*(*A*, *X*); therefore *A*, *B* have a direct connection.
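The three cases follow from the law of cosines in the isosceles triangle *AXB* with equal sides *d*(*X*, *A*) = *d*(*X*, *B*). A small sketch of the check, assuming the apex angle *α* is measured at *X* (a generic illustration, not the authors' code; the coordinates are example values):

```python
import math

def connection_type(A, B, X):
    """Classify the A-B connection from the apex angle at X in triangle AXB.
    Assumes d(X, A) == d(X, B) (X equidistant from A and B)."""
    a, b, ab = math.dist(X, A), math.dist(X, B), math.dist(A, B)
    # Law of cosines: ab^2 = a^2 + b^2 - 2*a*b*cos(alpha).
    alpha = math.degrees(math.acos((a * a + b * b - ab * ab) / (2 * a * b)))
    if alpha > 60:
        return "indirect"  # d(A, B) exceeds the equal sides
    if alpha < 60:
        return "direct"    # d(A, B) is shorter than the equal sides
    return "equilateral"

print(connection_type((0, 0), (2, 0), (1, 0.2)))    # wide apex angle -> indirect
print(connection_type((0, 0), (0.5, 0), (0.25, 2)))  # narrow apex angle -> direct
```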

**Proof by geometric method** Suppose there are two fixed points *A* and *B*, both located on surface *ℙ*. Draw a straight line from point *A* to point *B*; then consider point *C* and point *D* on line AB such that they divide the line in ratio *k*, i.e., \( \frac{CA}{CB} = \frac{DA}{DB} = k \). The circle whose diameter is segment *CD* is the desired geometric locus; that is, it consists of the points whose distances to point *A* and point *B* are in ratio *k*. In Fig. 3, the Apollonius circle is displayed.
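The construction in this proof can be checked numerically: *C* divides AB internally and *D* externally in ratio *k*, and the circle on diameter *CD* coincides with the Apollonius circle. A sketch under these definitions (the points and *k* are example values):

```python
import numpy as np

def diameter_construction(A, B, k):
    """Internal point C and external point D with CA/CB = DA/DB = k;
    the circle on diameter CD is the Apollonius circle of A, B, k (k != 1)."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    C = (A + k * B) / (1 + k)  # internal division of AB in ratio k
    D = (A - k * B) / (1 - k)  # external division of AB in ratio k
    center, radius = (C + D) / 2, np.linalg.norm(D - C) / 2
    return C, D, center, radius

C, D, center, radius = diameter_construction((0, 0), (1, 0), 0.5)
# Both division points satisfy the ratio k = 0.5 with respect to A and B.
```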

Consider the angle \( {\text{A}}\widehat{\text{M}}{\text{B}} = \alpha \) for the case when *d*(*A*, *M*) = *d*(*A*, *B*). **Note:** the angle \( {\text{A}}\widehat{\text{M}}^{\prime } {\text{B}} = \beta \) is stated for the case when *d*(*A*, *B*) = *d*(\( M^{\prime} \), *B*). In this way, the relationships among *A*, *B*, and *M* may be identified, and further examination can be conducted. Now let data point *X* be on an Apollonius circle (Fig. 5); then the four zones of the circle and the angles marked on it determine the connection types, as follows.

The angles specified on the circle \( \left( {\gamma ,\beta \;{\text{and}}\;\alpha } \right) \) in zone 1 indicate that *B* has a direct relationship with the *X*s (points) in this zone, so all points located in this zone form a new cluster, since the formula indicates that the distance of the *X*s to *B* is less than their distance to *A*. *A* and *B* definitely have an indirect relationship with each other.

In zone 2, *B* has a direct relationship with the *Xs* in this zone. It also has a direct relationship with *A*, but *Xs* in this zone have indirect relationship with *A*. If the circle was located in the direction of *A*, all these relationships would be reversed, and points *X* and *A* would be located in the same cluster.

In zone 3, *A* and *B* have a direct relationship with each other, and in case the *Xs* in this zone approach another fixed point on the right, *A* and *B* would be located in the same cluster while the *Xs* in this zone can be located inside another cluster.

The specified formula and angles on the circle in zone 4 indicate that the *X*s in this zone definitely have a direct relationship with *B* and form a cluster with that point. Most importantly, if the circle is located on side *A*, the mentioned points will form a new cluster with *A*.

This happens gradually in the following way. First, points *A* and *B* are randomly selected as cluster heads from among all the points in the data set. Then all the other points are ordered hierarchically (from least distant to most distant) by their distance from these two points. In the second step, after drawing the Apollonius circle for each of the selected zones and identifying the position of the points on the Apollonius circle (points like point *X*) and the points inside it, the remaining Apollonius circles are drawn as well. The points that have a direct relationship with point *B* are placed in cluster *B* (according to the distances defined in the different zones), and the points that have a direct relationship with point *A* are placed in cluster *A*. This process continues for all cluster-head points, so that parameter *k* changes (for *B*, the number of points with *k* < 1) and new cluster heads are defined. The process also continues for all other cluster heads, and the selection of cluster heads and their neighborhood points in the third step is not independent of the points determined in the second step. In the final step, the points outside the circle radius are allotted to the cluster with the largest number of common neighbors.
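A heavily simplified sketch of the first assignment step: given two randomly chosen cluster heads *A* and *B*, each remaining point *X* can be assigned by its Apollonius ratio \(k = d(X, A)/d(X, B)\), where *k* < 1 means *X* falls on *A*'s side of the *k* = 1 boundary. This illustrates only the ratio test, not the full NCFZA zone logic; the point generator and head positions are arbitrary example values:

```python
import math
import random

def assign_by_ratio(points, A, B):
    """Assign each point to head A if its Apollonius ratio k = d(X,A)/d(X,B) < 1,
    otherwise to head B. Simplified stand-in for the zone-based assignment."""
    clusters = {"A": [], "B": []}
    for X in points:
        k = math.dist(X, A) / math.dist(X, B)
        clusters["A" if k < 1 else "B"].append(X)
    return clusters

random.seed(0)
pts = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(5)] + \
      [(random.gauss(5, 0.3), random.gauss(5, 0.3)) for _ in range(5)]
out = assign_by_ratio(pts, (0, 0), (5, 5))  # two well-separated blobs split cleanly
```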

## 4 Experimental results

To evaluate the proposed NCFZA algorithm against the well-known methods (*K*-NN and ε-neighborhood) on six real-world data sets (Table 1), researchers conducted several experiments. As the results in Fig. 6 show, NCFZA proved to be highly accurate in all cases except Sonar. As an illustration, on the Wine data set, NCFZA had an accuracy index of 0.794, while in the same situation *K*-NN and ε-neighborhood had accuracy indices of 0.6452 and 0.5012, respectively.

Details of real-world data sets taken from UCI

| Dataset | #Instances | #Features | #Clusters |
|---|---|---|---|
| Iris | 150 | 4 | 3 |
| Wine | 178 | 13 | 3 |
| Heart | 270 | 13 | 2 |
| Waveform | 5000 | 21 | 3 |
| Sonar | 208 | 60 | 2 |
| Glass | 214 | 10 | 6 |

Clustering quality is measured with the Rand index, \( {\text{RI}} = (a + d)/(a + b + c + d) \), where:

- *a* is the number of point pairs with the same cluster label and the same class label;
- *b* is the number of point pairs with the same cluster label and different class labels;
- *c* is the number of point pairs with different cluster labels and the same class label;
- *d* is the number of point pairs with different cluster labels and different class labels.
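The four pair counts and the resulting Rand index can be computed directly from their definitions above (a standard implementation; the label vectors are toy examples):

```python
from itertools import combinations

def rand_index(class_labels, cluster_labels):
    """Rand index RI = (a + d) / (a + b + c + d) over all point pairs."""
    a = b = c = d = 0
    for i, j in combinations(range(len(class_labels)), 2):
        same_class = class_labels[i] == class_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_cluster and same_class:
            a += 1   # agree: pair together in both partitions
        elif same_cluster:
            b += 1   # same cluster, different class
        elif same_class:
            c += 1   # same class, different cluster
        else:
            d += 1   # agree: pair apart in both partitions
    return (a + d) / (a + b + c + d)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 1]))  # perfect agreement -> 1.0
```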

Using the RI metric, researchers obtained consistent results, as shown in Fig. 6: the proposed method clearly outperformed the previous methods. In particular, the proposed method showed better performance than the other methods on the Waveform data set.

## 5 Conclusion

Most neighborhood detection methods are based on distance or on parameter-based learning methods and are sensitive to changes in data point values. They do not have a stable yet changeable neighborhood zone; in other words, they lack adaptability. The great advantage of the proposed method is that, in studying the relationships among the data points, both their distances and the connections among them are used. Moreover, it defines a flexible and changeable neighborhood area by means of the Apollonius circle.

## References

- 1. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley-Interscience, New York (2001)
- 2. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco (2001)
- 3. İnkaya, T., Kayalıgil, S., Özdemirel, N.E.: An adaptive neighbourhood construction algorithm based on density and connectivity. Pattern Recogn. Lett. **52**, 17–24 (2015)
- 4. İnkaya, T.: A density and connectivity based decision rule for pattern classification. Expert Syst. Appl. **42**, 906–912 (2015)
- 5. İnkaya, T.: A parameter-free similarity graph for spectral clustering. Expert Syst. Appl. **42**, 9489–9498 (2015)
- 6. Cardinal, J., Collette, S., Langerman, S.: Empty region graphs. Comput. Geom. Theory Appl. **42**, 183–195 (2009)
- 7. Kirkpatrick, D.G., Radke, J.D.: A framework for computational morphology, vol. 13, pp. 43–72 (1984)
- 8. Tsai, W.-P., Huang, S.-P., Shao, S.-T., Cheng, K.-T., Chang, F.-J.: A data-mining framework for exploring the multi-relation between fish species and water quality through self-organizing map. Sci. Total Environ. **8**, 474–483 (2016)
- 9. Singh, K., Shakya, H.K., Biswas, B.: Clustering of people in social network based on textual similarity. Perspect. Sci. **8**, 570–573 (2016)
- 10. Zhou, H., Li, J., Li, J., Zhang, F., Cui, Y.: A graph clustering method for community detection in complex networks. Phys. A **469**, 551–562 (2016)
- 11. Wang, M., Zuo, W., Wang, Y.: An improved density peaks-based clustering method for social circle discovery in social networks. Neurocomputing **179**, 219–227 (2016)
- 12. Arleo, A., Didimo, W., Liotta, G., Montecchiani, F.: Large graph visualizations using a distributed computing platform. Inf. Sci. **381**, 124–141 (2016)
- 13. Guo, H., Yu, Y., Skitmore, M.: Visualization technology-based construction safety management: a review. Autom. Constr. **73**, 135–144 (2016)
- 14. Pedrycz, W.: The design of cognitive maps: a study in synergy of granular computing and evolutionary optimization. Expert Syst. Appl. **37**, 7288–7294 (2010)
- 15. Mohammadi, M., Raahemi, B., Mehraban, S.A., Bigdeli, E., Akbari, A.: An enhanced noise resilient K-associated graph classifier. Expert Syst. Appl. **42**, 8283–8293 (2015)
- 16. Cai, Q., Gong, M., Ma, L., Ruan, S., Yuan, F., Jiao, L.: Greedy discrete particle swarm optimization for large-scale social network clustering. Inf. Sci. **316**, 503–516 (2015)
- 17. Gong, M., Cai, Q., Chen, X., Ma, L.: Complex network clustering by multiobjective discrete particle swarm optimization based on decomposition. IEEE Trans. Evol. Comput. **18**, 82–97 (2014)
- 18. Pourbahrami, S., Khanli, L.M., Azimpour, S.: Novel and efficient data point neighborhood construction algorithm based on Apollonius circle. Expert Syst. Appl. **115**, 57–67 (2019)