Scalable Probabilistic Clustering
The Expectation-Maximization (EM) algorithm is a popular approach to probabilistic database clustering. A database of observations is clustered by identifying k sub-populations and summarizing each sub-population with a model or probability density function. The EM algorithm iteratively estimates the memberships of the observations in each cluster and the parameters of the k density functions. Typical EM implementations require a full database scan at each iteration, and the number of iterations required to converge is not known in advance. For large databases, these repeated scans become prohibitively expensive. We present a scalable implementation of the EM algorithm based upon identifying regions of the data that are compressible and regions that must be maintained in memory. The approach operates within the confines of a limited main-memory buffer. Data resolution is preserved to the extent possible, given the size of the memory buffer and the fit of the current clustering model to the data. We extend the framework to update multiple cluster models simultaneously. Computational tests indicate that this scalable scheme outperforms sampling-based and incremental approaches, the straightforward alternatives for scaling existing traditional in-memory implementations to large databases.
Keywords: Mixture Model · Gaussian Mixture Model · Buffer Size · Mathematical Program With Equilibrium Constraint · Model Update
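As a point of reference, the sketch below shows one standard in-memory EM iteration for a spherical Gaussian mixture, the baseline computation that the scalable framework accelerates. This is a minimal illustration, not the paper's implementation; all names (`em_step`, `weights`, `means`, `variances`) are illustrative.

```python
# Minimal sketch of one in-memory EM iteration for a spherical Gaussian
# mixture. Illustrative only; not taken from the paper's implementation.
import numpy as np

def em_step(X, weights, means, variances):
    """One EM iteration over an n x d data matrix X for k clusters.

    weights:   (k,)   mixture weights, summing to 1
    means:     (k, d) cluster centers
    variances: (k,)   per-cluster spherical variances
    """
    n, d = X.shape
    k = weights.shape[0]

    # E-step: log of the unnormalized posterior membership p(cluster j | x_i).
    log_resp = np.empty((n, k))
    for j in range(k):
        sq_dist = np.sum((X - means[j]) ** 2, axis=1)
        log_resp[:, j] = (np.log(weights[j])
                          - 0.5 * d * np.log(2 * np.pi * variances[j])
                          - 0.5 * sq_dist / variances[j])
    log_resp -= log_resp.max(axis=1, keepdims=True)   # numerical stability
    resp = np.exp(log_resp)
    resp /= resp.sum(axis=1, keepdims=True)           # normalize per row

    # M-step: re-estimate the k density parameters from the memberships.
    nj = resp.sum(axis=0)                             # effective cluster sizes
    weights = nj / n
    means = (resp.T @ X) / nj[:, None]
    variances = np.array([
        np.sum(resp[:, j] * np.sum((X - means[j]) ** 2, axis=1)) / (d * nj[j])
        for j in range(k)
    ])
    return weights, means, variances
```

Each call to `em_step` touches every row of `X`, which is exactly the per-iteration database scan that becomes prohibitive at scale. The compression framework described in the abstract avoids this by replacing regions that the current model fits well with compact summaries (for instance, a count, sum, and sum of squares per region) that the M-step can process as weighted pseudo-points, while observations in ambiguous regions are retained in the memory buffer at full resolution.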