Scalable Probabilistic Clustering

  • P. S. Bradley
  • U. M. Fayyad
  • C. A. Reina
Part of the Applied Optimization book series (APOP, volume 50)


The Expectation-Maximization (EM) algorithm is a popular approach to probabilistic database clustering. A database of observations is clustered by identifying k sub-populations and summarizing each sub-population with a model or probability density function. The EM algorithm iteratively estimates the memberships of the observations in each cluster and the parameters of the k density functions. Typical EM implementations require a full database scan at each iteration, and the number of iterations required to converge is not known in advance. For large databases, these scans become prohibitively expensive. We present a scalable implementation of the EM algorithm based upon identifying regions of the data that are compressible and regions that must be maintained in memory. The approach operates within the confines of a limited main-memory buffer. Data resolution is preserved to the extent possible, based upon the size of the memory buffer and the fit of the current clustering model to the data. We extend the framework to update multiple cluster models simultaneously. Computational tests indicate that this scalable scheme outperforms sampling-based and incremental approaches, the straightforward alternatives for “scaling” existing traditional in-memory implementations to large databases.


Keywords: Mixture Model, Gaussian Mixture Model, Buffer Size, Mathematical Program with Equilibrium Constraints, Model Update




Copyright information

© Springer Science+Business Media Dordrecht 2001

Authors and Affiliations

  • P. S. Bradley (1)
  • U. M. Fayyad (2)
  • C. A. Reina (1)
  1. Microsoft Research, Microsoft Corporation, Redmond, USA
  2. digiMine.com, Kirkland, USA
