Journal of Global Optimization, Volume 35, Issue 4, pp 625–635

Data Clustering Based on Maximization of Outlier Factor

  • Vydunas Saltenis


Many data clustering algorithms exist, but they cannot adequately handle the number of clusters or the cluster shapes, and their performance depends mainly on the choice of algorithm parameters. Our clustering approach and algorithm do not require such a parameter choice; the approach can be treated as a natural adaptation to the existing structure of distances between data points. The outlier factor, introduced by the author, specifies the degree to which each data point is an outlier. It is based on the difference between the frequency distribution of interpoint distances in a given dataset and the corresponding distribution for uniformly distributed points. Data clusters can then be determined by maximizing the outlier factor function: the data points in the dataset are divided into clusters according to the attractor regions of the local optima. An experimental evaluation shows that the proposed method can identify complex cluster shapes. The key advantages of the approach are good clustering properties for datasets with a comparatively large amount of noise (additional data points) and the absence of important parameters whose adequate choice determines the quality of the results.
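The abstract does not give the outlier factor formula, so the following is only a rough sketch of the two ideas it names: scoring each point by how far its distance distribution departs from a uniform reference, and splitting the data by the attractor regions of local maxima of that score. Everything here is an assumption made for illustration: `uniform_gap_score` is a hypothetical stand-in (a Kolmogorov-Smirnov-style gap), not the author's outlier factor, and `knn_graph` with its `k` parameter is a convenience for defining neighborhoods that the paper itself may not use.

```python
import numpy as np

def uniform_gap_score(points, n_ref=500, seed=0):
    """Hypothetical stand-in score, NOT the paper's outlier factor.

    For each point, returns the largest gap between the empirical CDF of its
    distances to the other data points and the empirical CDF of its distances
    to reference points drawn uniformly from the data's bounding box.
    """
    rng = np.random.default_rng(seed)
    lo, hi = points.min(axis=0), points.max(axis=0)
    ref = rng.uniform(lo, hi, size=(n_ref, points.shape[1]))
    scores = np.empty(len(points))
    for i, p in enumerate(points):
        d_data = np.sort(np.linalg.norm(np.delete(points, i, axis=0) - p, axis=1))
        d_ref = np.sort(np.linalg.norm(ref - p, axis=1))
        grid = np.concatenate([d_data, d_ref])
        cdf_data = np.searchsorted(d_data, grid, side="right") / len(d_data)
        cdf_ref = np.searchsorted(d_ref, grid, side="right") / len(d_ref)
        scores[i] = np.abs(cdf_data - cdf_ref).max()
    return scores

def knn_graph(points, k=5):
    """Indices of the k nearest neighbors of each point (k is an assumption)."""
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    return np.argsort(d, axis=1)[:, 1:k + 1]

def assign_to_attractors(scores, neighbors):
    """Label each point by the local maximum reached via steepest ascent."""
    n = len(scores)
    step = np.arange(n)
    for i in range(n):
        # A point moves only if some neighbor has a strictly higher score.
        step[i] = max([i, *neighbors[i]], key=lambda j: scores[j])
    labels = step.copy()
    while True:
        # Follow the pointers until every point sits on a local maximum.
        nxt = step[labels]
        if np.array_equal(nxt, labels):
            return labels
        labels = nxt

# Toy 1-D chain: each point's neighbors are the adjacent indices.
toy_scores = np.array([1.0, 3.0, 2.0, 0.0, 1.0, 5.0, 4.0])
toy_neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4, 6], [5]]
toy_labels = assign_to_attractors(toy_scores, toy_neighbors)
```

In the toy chain the score has local maxima at indices 1 and 5, so points 0 to 3 receive label 1 and points 4 to 6 receive label 5. On real data one would call `assign_to_attractors(uniform_gap_score(X), knn_graph(X))`, keeping in mind that both helper functions are placeholders for whatever the paper actually defines.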


clustering, global optimization, local optimization, outlier detection





Copyright information

© Springer 2006

Authors and Affiliations

  1. Institute of Mathematics and Informatics, Vilnius, Lithuania
