Abstract
Clustering algorithms based on nonparametric or semiparametric density estimation are of more theoretical interest than some of the distance-based hierarchical or ad hoc algorithmic procedures. However, density estimation is subject to the curse of dimensionality, so care must be exercised. Clustering algorithms are sometimes described as biased because solutions may be highly influenced by initial configurations. Clusters may be associated with modes of a nonparametric density estimator or with components of a (normal) mixture estimator. Mode-finding algorithms are related to, but different from, Gaussian mixture models. In this paper, we describe a hybrid algorithm that finds modes by fitting incomplete mixture models, or partial mixture component models. Problems with bias are reduced because the partial mixture model is fitted many times using carefully chosen random starting guesses. Many of these partial fits offer unique diagnostic information about the structure and features hidden in the data. We describe the algorithms and present some case studies.
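The idea of fitting a partial mixture component many times from random starts can be illustrated with a minimal one-dimensional sketch. The abstract does not state the fitting criterion, so the code below assumes one possibility: minimising the integrated squared error of a single weighted normal component w·N(μ, σ²), whose criterion is w²/(2√π σ) − (2w/n) Σ φ(xᵢ | μ, σ). The function names, the rule for choosing starting values, and the simulated data are all hypothetical and are not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize


def partial_component_criterion(params, x):
    """Integrated squared error criterion for one weighted normal component.

    params = (w, mu, log_sigma); the weight w is left free, so the
    component need not account for all of the data (assumed criterion).
    """
    w, mu, log_sigma = params
    sigma = np.exp(log_sigma)  # log scale keeps sigma positive
    phi = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    # Integral of (w * phi)^2 minus twice the average fitted density.
    return w ** 2 / (2.0 * np.sqrt(np.pi) * sigma) - 2.0 * w * phi.mean()


def fit_partial_components(x, n_starts=20, seed=0):
    """Fit the one-component partial model from several random starts."""
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(n_starts):
        # Hypothetical starting rule: centre on a random observation,
        # with a scale well below the overall spread of the data.
        start = np.array([0.5, rng.choice(x), np.log(x.std() / 4.0)])
        res = minimize(partial_component_criterion, start, args=(x,),
                       method="Nelder-Mead")
        w, mu, log_sigma = res.x
        fits.append((w, mu, np.exp(log_sigma), res.fun))
    # Lowest criterion value first.
    return sorted(fits, key=lambda fit: fit[3])


# Simulated data (an assumption for illustration): two separated clusters.
data_rng = np.random.default_rng(1)
x = np.concatenate([data_rng.normal(0.0, 1.0, 300),
                    data_rng.normal(6.0, 1.0, 200)])

fits = fit_partial_components(x)
for w, mu, sigma, crit in fits:
    print(f"w={w:.2f}  mu={mu:.2f}  sigma={sigma:.2f}  criterion={crit:.4f}")
```

Under these assumptions, different starts settle on different local solutions, each describing one cluster together with an estimate of the fraction of data it covers. Collecting the distinct solutions is one way such partial fits could serve as diagnostics for modes and outlying points.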
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Scott, D.W. (2004). Outlier Detection and Clustering by Partial Mixture Modeling. In: Antoch, J. (ed.) COMPSTAT 2004 — Proceedings in Computational Statistics. Physica, Heidelberg. https://doi.org/10.1007/978-3-7908-2656-2_37
DOI: https://doi.org/10.1007/978-3-7908-2656-2_37
Publisher Name: Physica, Heidelberg
Print ISBN: 978-3-7908-1554-2
Online ISBN: 978-3-7908-2656-2
eBook Packages: Springer Book Archive