Outlier Detection and Clustering by Partial Mixture Modeling

  • Conference paper
COMPSTAT 2004 — Proceedings in Computational Statistics

Abstract

Clustering algorithms based upon nonparametric or semiparametric density estimation are of more theoretical interest than some of the distance-based hierarchical or ad hoc algorithmic procedures. However, density estimation is subject to the curse of dimensionality, so care must be exercised. Clustering algorithms are sometimes described as biased, since solutions may be highly influenced by initial configurations. Clusters may be associated with modes of a nonparametric density estimator or with components of a (normal) mixture estimator. Mode-finding algorithms are related to, but different from, Gaussian mixture models. In this paper, we describe a hybrid algorithm that finds modes by fitting incomplete mixture models, or partial mixture component models. Problems with bias are reduced because the partial mixture model is fitted many times using carefully chosen random starting guesses. Many of these partial fits offer unique diagnostic information about the structure and features hidden in the data. We describe the algorithms and present some case studies.
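The abstract does not state the fitting criterion or the search procedure, so the following is only a rough sketch of the idea of fitting one partial mixture component many times from random starts. It assumes a univariate normal component w * N(mu, sigma^2) with a free weight w, an integrated squared error (L2E-style) criterion, and a simple pattern search. The function names, the starting-guess rule, and the lower bound on sigma are all hypothetical choices made for illustration, not the paper's algorithm.

```python
import math
import random
import statistics


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def l2e_criterion(data, mu, sigma):
    """Assumed criterion for the partial model w * N(mu, sigma^2):
    w^2 * a - 2 * w * b, with a = integral of the squared normal density
    and b = average model density at the data. It is quadratic in w,
    so w is profiled out in closed form. Returns (criterion, w)."""
    a = 1.0 / (2.0 * sigma * math.sqrt(math.pi))
    b = sum(normal_pdf(x, mu, sigma) for x in data) / len(data)
    w = b / a
    return -b * b / a, w


def fit_partial_component(data, mu0, sigma0, min_sigma, tol=1e-4):
    """Greedy pattern search over (mu, log sigma) from one starting guess."""
    floor = math.log(min_sigma)
    mu, log_s = mu0, max(math.log(sigma0), floor)
    best, _ = l2e_criterion(data, mu, math.exp(log_s))
    step = 0.5
    for _ in range(10_000):  # hard cap on iterations as a safety guard
        if step <= tol:
            break
        sigma = math.exp(log_s)
        moves = [
            (mu + step * sigma, log_s),
            (mu - step * sigma, log_s),
            (mu, log_s + step),
            (mu, max(log_s - step, floor)),
        ]
        improved = False
        for cand_mu, cand_ls in moves:
            value, _ = l2e_criterion(data, cand_mu, math.exp(cand_ls))
            if value < best:
                mu, log_s, best, improved = cand_mu, cand_ls, value, True
                break
        if not improved:
            step *= 0.5  # no neighbour is better: refine the search scale
    sigma = math.exp(log_s)
    _, w = l2e_criterion(data, mu, sigma)
    return {"mu": mu, "sigma": sigma, "w": w, "criterion": best}


def multistart_partial_fits(data, n_starts=20, seed=0):
    """Refit the single partial component from several random starts."""
    rng = random.Random(seed)
    spread = statistics.pstdev(data)
    # Assumed floor: with w free, the criterion can be driven down by a
    # very narrow spike on one observation, so sigma is bounded below.
    min_sigma = 0.05 * spread
    fits = []
    for _ in range(n_starts):
        mu0 = rng.choice(data)                   # start at a random observation
        sigma0 = spread * rng.uniform(0.1, 0.5)  # hypothetical starting scale
        fits.append(fit_partial_component(data, mu0, sigma0, min_sigma))
    return sorted(fits, key=lambda f: f["criterion"])


if __name__ == "__main__":
    gen = random.Random(0)
    # Synthetic illustration data only: two well-separated groups.
    sample = [gen.gauss(0.0, 1.0) for _ in range(280)]
    sample += [gen.gauss(10.0, 1.0) for _ in range(120)]
    for fit in multistart_partial_fits(sample, n_starts=8):
        print({k: round(v, 3) for k, v in fit.items()})
```

In this sketch, different starts can settle on different local solutions, each with its own location, scale, and weight. A fitted weight well below 1 suggests that the component describes only part of the data, which is one way the separate fits can carry diagnostic information about clusters and outlying observations.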

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Scott, D.W. (2004). Outlier Detection and Clustering by Partial Mixture Modeling. In: Antoch, J. (eds) COMPSTAT 2004 — Proceedings in Computational Statistics. Physica, Heidelberg. https://doi.org/10.1007/978-3-7908-2656-2_37

  • DOI: https://doi.org/10.1007/978-3-7908-2656-2_37

  • Publisher Name: Physica, Heidelberg

  • Print ISBN: 978-3-7908-1554-2

  • Online ISBN: 978-3-7908-2656-2

  • eBook Packages: Springer Book Archive
