Data Clustering Based on Maximization of Outlier Factor
Many data clustering algorithms exist, but they do not adequately handle the number of clusters or the cluster shapes, and their performance depends mainly on the choice of algorithm parameters. The proposed clustering approach and algorithm do not require such a parameter choice; the approach can be treated as a natural adaptation to the existing structure of distances between data points. The outlier factor, introduced by the author, specifies for each data point the degree to which it is an outlier. The notion is based on the difference between the frequency distribution of interpoint distances in a given dataset and the corresponding distribution for uniformly distributed points. Data clusters can then be determined by maximizing the outlier factor function: the data points in the dataset are divided into clusters according to the attractor regions of the local optima. An experimental evaluation shows that the proposed method can identify complex cluster shapes. Key advantages of the approach are good clustering properties for datasets with a comparatively large amount of noise (additional data points) and the absence of important parameters whose adequate choice would determine the quality of the results.
Keywords: clustering, global optimization, local optimization, outlier detection
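The idea of dividing points according to the attractor regions of local optima can be illustrated with a small Python sketch. It is not the author's algorithm: the abstract does not give the outlier factor formula, so `toy_score` (a Gaussian kernel sum with a hypothetical `bandwidth`) is only a placeholder objective, and the k-nearest-neighbour graph with its parameter `k` is an assumption introduced here to make the hill climbing discrete. Only the labelling step, in which each point inherits the local optimum it climbs to, reflects the description above.

```python
import math


def knn_indices(points, k):
    """Return, for each point, the indices of its k nearest neighbours (brute force)."""
    result = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        result.append(others[:k])
    return result


def attractor_labels(points, score, k=3):
    """Label each point with the index of the local optimum it reaches.

    Hill climbing runs on the k-NN graph (an assumption of this sketch):
    from a point, move to the best-scoring neighbour while that strictly
    improves the score. Points ending at the same optimum share a cluster.
    """
    values = [score(i, points) for i in range(len(points))]
    neighbours = knn_indices(points, k)

    def climb(i):
        while True:
            best = max(neighbours[i], key=lambda j: values[j])
            if values[best] <= values[i]:
                return i  # no neighbour is strictly better: local optimum
            i = best

    return [climb(i) for i in range(len(points))]


def toy_score(i, points, bandwidth=1.0):
    """Placeholder objective (Gaussian kernel sum), NOT the paper's outlier factor."""
    return sum(
        math.exp(-math.dist(points[i], q) ** 2 / (2 * bandwidth ** 2))
        for q in points
    )


# Two well-separated groups of four points each (made-up illustration data).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
          (5.0, 5.0), (5.2, 5.1), (5.1, 4.8), (4.9, 5.2)]
labels = attractor_labels(points, toy_score, k=3)
print(labels)
```

Each label is the index of the point acting as the attractor, so the number of clusters falls out of the number of distinct local optima rather than being specified in advance. Swapping `toy_score` for the actual outlier factor function would be the step needed to approximate the method described in the abstract.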