
Clustering in Massive Data Sets

Handbook of Massive Data Sets

Part of the book series: Massive Computing ((MACO,volume 4))

Abstract

We review the time and storage costs of search and clustering algorithms, illustrating them with case studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Theoretical results developed as far back as the 1960s often remain topical. More recent work is also covered, including a solution to the statistical question of how many clusters there are in a dataset. We also look at one line of inquiry in the use of clustering for human-computer user interfaces. Finally, the visualization of data leads to the consideration of data arrays as images, and we speculate on future results to be expected here.



Copyright information

© 2002 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Murtagh, F. (2002). Clustering in Massive Data Sets. In: Abello, J., Pardalos, P.M., Resende, M.G.C. (eds) Handbook of Massive Data Sets. Massive Computing, vol 4. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0005-6_14

  • DOI: https://doi.org/10.1007/978-1-4615-0005-6_14

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-4882-5

  • Online ISBN: 978-1-4615-0005-6

  • eBook Packages: Springer Book Archive
