Outlier Detection and Clustering by Partial Mixture Modeling

Scott, David W.

doi:10.1007/978-3-7908-2656-2_37

Abstract

Clustering algorithms based on nonparametric or semiparametric density estimation are of more theoretical interest than some of the distance-based hierarchical or ad hoc algorithmic procedures. However, density estimation is subject to the curse of dimensionality, so care must be exercised. Clustering algorithms are sometimes described as biased, since solutions may be highly influenced by initial configurations. Clusters may be associated with modes of a nonparametric density estimator or with components of a (normal) mixture estimator. Mode-finding algorithms are related to, but different from, Gaussian mixture models. In this paper, we describe a hybrid algorithm that finds modes by fitting incomplete mixture models, or partial mixture component models. Problems with bias are reduced because the partial mixture model is fitted many times using carefully chosen random starting guesses. Many of these partial fits offer unique diagnostic information about the structure and features hidden in the data. We describe the algorithms and present some case studies.
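
The idea of fitting a partial mixture model from many random starts can be illustrated with a small one-dimensional sketch. The abstract does not state the fitting criterion, the optimizer, or how starting guesses are chosen, so everything below is an assumption made for illustration: the sketch fits a single weighted normal component w * N(mu, sigma^2) by minimizing an integrated squared error (L2E) criterion, uses a simple compass search, and draws starting locations from the data. All function names and parameters are hypothetical.

```python
import math
import random

SQRT_PI = math.sqrt(math.pi)


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def profiled_l2e(data, mu, log_sigma):
    """Assumed L2E criterion for the partial model w * N(mu, sigma^2).

    The criterion w^2 / (2 sqrt(pi) sigma) - (2 w / n) * sum_i phi(x_i)
    is quadratic in w, so w is replaced by its minimizer
    w = 2 sqrt(pi) sigma * mean_i phi(x_i).
    Returns (criterion value, profiled weight).
    """
    sigma = math.exp(log_sigma)
    mean_phi = sum(normal_pdf(x, mu, sigma) for x in data) / len(data)
    weight = 2.0 * SQRT_PI * sigma * mean_phi
    return -2.0 * SQRT_PI * sigma * mean_phi ** 2, weight


def fit_partial_component(data, mu0, sigma0, step=0.5, tol=1e-4, min_sigma=1e-3):
    """Compass search over (mu, log sigma), started at (mu0, sigma0)."""
    point = [mu0, math.log(sigma0)]
    best, weight = profiled_l2e(data, point[0], point[1])
    while step > tol:
        improved = False
        for axis in (0, 1):
            for delta in (step, -step):
                trial = list(point)
                trial[axis] += delta
                if math.exp(trial[1]) < min_sigma:
                    continue  # stay away from degenerate spikes on one point
                value, trial_weight = profiled_l2e(data, trial[0], trial[1])
                if value < best:
                    point, best, weight = trial, value, trial_weight
                    improved = True
        if not improved:
            step /= 2.0  # no better neighbor: refine the search grid
    return point[0], math.exp(point[1]), weight, best


def partial_mixture_modes(data, n_starts=20, seed=0, merge_tol=0.25):
    """Refit from random starts and keep one fit per distinct location."""
    rng = random.Random(seed)
    mean = sum(data) / len(data)
    spread = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))
    fits = []
    for _ in range(n_starts):
        # Assumed start rule: a random data point and a fraction of the spread.
        fit = fit_partial_component(data, rng.choice(data), spread / 3.0)
        if all(abs(fit[0] - other[0]) > merge_tol for other in fits):
            fits.append(fit)
    return sorted(fits)


if __name__ == "__main__":
    rng = random.Random(1)
    sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
    sample += [rng.gauss(6.0, 1.0) for _ in range(100)]
    for mu, sigma, weight, value in partial_mixture_modes(sample):
        print(f"mu={mu:.2f} sigma={sigma:.2f} weight={weight:.2f} L2E={value:.4f}")
```

Because the weight is not forced to equal one, each fit can settle on a single cluster and report roughly the fraction of the data it explains; under this reading, observations that no partial fit explains are candidates for outliers. A multivariate or multi-component version would need a more capable optimizer than the compass search used here.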

Author information

Authors and Affiliations

Department of Statistics, Rice University, MS-138, 1892, Houston, TX, 77251-1892, USA
David W. Scott

Editor information

Editors and Affiliations

Faculty of Mathematics and Physics, Department of Statistics and Probability, Charles University, Sokolovská 83, 18675, Prague 8 - Karlin, Czech Republic
Jaromir Antoch

About this paper

Cite this paper

Scott, D.W. (2004). Outlier Detection and Clustering by Partial Mixture Modeling. In: Antoch, J. (eds) COMPSTAT 2004 — Proceedings in Computational Statistics. Physica, Heidelberg. https://doi.org/10.1007/978-3-7908-2656-2_37

DOI: https://doi.org/10.1007/978-3-7908-2656-2_37
Publisher Name: Physica, Heidelberg
Print ISBN: 978-3-7908-1554-2
Online ISBN: 978-3-7908-2656-2
eBook Packages: Springer Book Archive
