Clustering in Massive Data Sets

Murtagh, Fionn

doi:10.1007/978-1-4615-0005-6_14

Fionn Murtagh³

Part of the book series: Massive Computing ((MACO,volume 4))

519 Accesses
14 Citations

Abstract

We review the time and storage costs of search and clustering algorithms. We exemplify these, based on case-studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Theoretical results developed as far back as the 1960s still very often remain topical. More recent work is also covered in this article. This includes a solution for the statistical question of how many clusters there are in a dataset. We also look at one line of inquiry in the use of clustering for human-computer user interfaces. Finally, the visualization of data leads to the consideration of data arrays as images, and we speculate on future results to be expected here.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 629.00; Price excludes VAT (USA)

Softcover Book: USD 799.99; Price excludes VAT (USA)

Hardcover Book: USD 799.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Bibliography

D. Allard and C. Fraley: Non-parametric maximum likelihood estimation of features in spatial point processes using Voronoi tessellation. Journal of the American Statistical Association, 92: 1485–1493, 1997.
MATH Google Scholar
P. Arabie and L. J. Hubert: An overview of combinatorial data analysis, pages 5–63. In, Arabie et al. (1996), 1996.
Google Scholar
P. Arabie, L. J. Hubert, and G. De Soete, editors. Clustering and Classification. Singapore: World Scientific, 1996.
MATH Google Scholar
S. Banerjee and A. Rosenfeld: Model-based cluster analysis. Pattern Recognition, 26: 963–974, 1993.
Article Google Scholar
J. D. Banfield and A. E. Raftery: Model-based Gaussian and non-Gaussian clustering. Biometrics, 49: 803–821, 1993.
Article MathSciNet MATH Google Scholar
K. P. Bennett, U. Fayyad, and D. Geiger: Density-based indexing for approximate nearest neighbor queries. Technical report, Microsoft, 1999. Microsoft Research Technical Report MSR-TR-98–58.
Google Scholar
J. L. Bentley and J. H. Friedman: Fast algorithms for constructing minimal spanning trees in coordinate spaces. IEEE Transactions on Computers, C-27: 97–105, 1978.
Google Scholar
J. L. Bentley, B. W. Weide, and A. C. Yao: Optimal expected time algorithms for closest point problems. ACM Transactions on Mathematical Software, 6: 563–580, 1980.
Article MathSciNet MATH Google Scholar
M. W. Berry, Z. Drmac, and E. R. Jessup: Matrices, vector spaces, and information retrieval. SIAM Review, 41: 335–362, 1999.
Article MathSciNet MATH Google Scholar
M. W. Berry, B. Hendrickson, and P. Raghavan: Sparse matrix reordering schemes for browsing hypertext. In J. Renegar, M. Shub, and S. Smale, editors, Lectures in Applied Mathematics (LAM) Vol. 32: The Mathematics of Numerical Analysis, pages 99–123. American Mathematical Society, 1996.
Google Scholar
K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft: When is nearest neighbor meaningful? In Proceedings of the 7th International Conference on Database Theory (ICDT), Jerusalem, Israel, 1999. in press.
Google Scholar
A. Borodin, R. Ostrovsky, and Y. Rabani: Subquadratic approximation algorithms for clustering problems in high dimensional spaces. In Proc. 31st ACM Symposium on Theory of Computing (STOC-99), 1999.
Google Scholar
A. J. Broder: Strategies for efficient incremental nearest neighbor search. Pattern Recognition, 23: 171–178, 1990.
Article Google Scholar
A. Z. Broder: On the resemblance and containment of documents, pages 21–29. IEEE Computer Society, 1998.
Google Scholar
A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig: Syntactic clustering of the web. In Proc. Sixth International World Wide Web Conference, pages 391–404, 1997.
Google Scholar
M. Bruynooghe: Méthodes nouvelles en classification automatique des données taxinomiques nombreuses. Statistique et Analyse des Données, (3):24–42, 1977.
Google Scholar
W. A. Burkhard and R. M. Keller: Some approaches to best-match file searching. Communications of the ACM, 16: 230–236, 1973.
Article MATH Google Scholar
S. D. Byers and A. E. Raftery: Nearest neighbor clutter removal for estimating features in spatial point processes. Journal of the American Statistical Association, 93: 577–584, 1998.
Article MATH Google Scholar
J. G. Campbell, C. Fraley, D. Stanford, F. Murtagh, and A. E. Raftery: Model-based methods for textile fault detection. International Journal of Imaging Science and Technology, 10: 339–346, 1999.
Article Google Scholar
J. G. Campbell, C. Fraley, D. Stanford, F. Murtagh, and A. E. Raftery: Cartia. Mapping the information landscape, client-server software system, 1999. Caria, Inc.
Google Scholar
D. Cheriton and D. E. Tarjan: Finding minimum spanning trees. SIAM Journal on Computing, 5: 724–742, 1976.
Article MathSciNet MATH Google Scholar
K. W. Church and J. I. Helfman: Dotplot: a program for exploring self-similarity in millions of lines of text and code. Journal of Computational and Graphical Statistics, 2: 153–174, 1993.
Google Scholar
W. B. Croft: Clustering large files of documents using the single-link method. Journal of the American Society for Information Science, 28: 341–344, 1977.
Article Google Scholar
C. Darken, J. Chang, and J. Moody: Learning rate schedules for faster stochastic gradient search. In Neural Networks for Signal Processing 2, Proceedings of the 1992 IEEE Workshop, Piscataway, 1992. IEEE Press.
Google Scholar
C. Darken and J. Moody: Note on learning rate schedules for stochastic optimization. In Lippmann, Moody, and Touretzky, editors, Advances in Neural Information Processing Systems 3. Morgan Kaufmann, Palo Alto, 1991.
Google Scholar
C. Darken and J. Moody: Towards faster stochastic gradient search. In Hanson Moody and Lippmann, editors, Advances in Neural Information Processing Systems 4, San Mateo, 1992. Morgan Kaufman.
Google Scholar
B. V. Dasarathy: Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, New York, 1991.
Google Scholar
A. Dasgupta and A. E. Raftery: Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association, 93: 294–302, 1998.
Article MATH Google Scholar
W. H. E. Day and H. Edelsbrunner: Efficient algorithms for agglomerative hierarchical clustering methods. Journal of Classification, 1: 7–24, 1984.
Article MATH Google Scholar
C. de Rham: La classification hiérarchique ascendante selon la méthode des voisins réciproques. Les Cahiers de l’Analyse des Données, V: 135–144, 1980.
Google Scholar
D. Defays: An efficient algorithm for a complete link method. Computer Journal, 20: 364–366, 1977.
Article MathSciNet MATH Google Scholar
C. Delannoy: Un algorithme rapide de recherche de plus proches voisins. RAIRO Informatique/Computer Science, 14: 275–286, 1980.
Google Scholar
A. R. Dempster, N. M. Laird, and D. B. Rubin: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39: 1–22, 1977.
MathSciNet MATH Google Scholar
A. Dobrzycki, H. Ebeling, K. Glotfelty, P. Freeman, F. Damiani, M. Elvis, and T. Calderwood: Chandra Detect 1.0 User Guide. Chandra X-Ray Center, 1999. Smithsonian Astrophysical Observatory, Version 0. 9.
Google Scholar
L. B. Doyle: Semantic road maps for literature searchers. Journal of the ACM, 8: 553–578, 1961.
Article MATH Google Scholar
C. M. Eastman and S. F. Weiss: Tree structures for high dimensionality nearest neighbor searching. Information Systems, 7: 115–122, 1982.
Article MATH Google Scholar
H. Ebeling and G. Wiedenmann: Detecting structure in two dimensions combining voronoi tessellation and percolation. Physical Review E, 47: 704–714, 1993.
Article Google Scholar
E. Forgy: Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics, 21: 768, 1965.
Google Scholar
C. Fraley: Algorithms for model-based Gaussian hierarchical clustering. SIAM Journal of Scientific Computing, 20: 270–281, 1999.
Article MathSciNet MATH Google Scholar
C. Fraley and A. E. Raftery: How many clusters? which clustering method? answers via model-based cluster analysis. The Computer Journal, 41: 578–588, 1999.
Article MATH Google Scholar
J. H. Friedman, F. Baskett, and L. J. Shustek: An algorithm for finding nearest neighbors. IEEE Transactions on Computers, C-24: 1000–1006, 1975.
Google Scholar
J. H. Friedman, J. L. Bentley, and R. A. Finkel: An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3: 209–226, 1977.
Article MATH Google Scholar
B. Fritzke: Some competitive learning methods, 1997.
Google Scholar
K. Fukunaga and P. M. Narendra: A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, C-24: 750–753, 1975.
Google Scholar
V. J. Gillet, D. J. Wild, P. Willett, and J. Bradshaw: Similarity and dissimilarity methods for processing chemical structure databases. The Computer Journal, 41: 547–558, 1998.
Article MATH Google Scholar
A. D. Gordon: Classification. Champman and Hall, 2 edition, 1999.
Google Scholar
A. Griffiths, L. A. Robinson, and P. Willett: Hierarchic agglomerative clustering methods for automatic document classification. Journal of Documentation, 40: 175–205, 1984.
Article Google Scholar
D. Guillaume and F. Murtagh: Clustering of XML documents. Computer Physics Communications, 1999. submitted.
Google Scholar
M. E. Hodgson: Reducing the computational requirements of the minimum-distance classifier. Remote Sensing of Environment, 25: 117–128, 1988.
Article Google Scholar
E. Horowitz and S. Sahni: Fundamentals of Computer Algorithms. London: Pitman, 1979.
MATH Google Scholar
A. K. Jain and R. C. Dubes: Algorithms for Clustering Data. Englewood Cliffs: Prentice-Hall, 1988.
MATH Google Scholar
G. Jammal and A. Bijaoui: Multiscale image restoration for photon imaging systems. In SPIE Conference on Signal and Image Processing: Wavelet Applications in Signal and Image Processing, V II, July 1999.
Google Scholar
J. Juan: Programme de classification hiérarchique par l’algorithme de la recherche en chaîne des voisins réciproques. Les Cahiers de l’Analyse des Données, VII: 219–225, 1982.
MATH Google Scholar
R. E. Kass and A. E. Raftery: Bayes factors. Journal of the American Statistical Association, 90: 773–795, 1995.
Article MathSciNet MATH Google Scholar
J. Kittler: A method for determining k-nearest neighbors. Kybernetes, 7: 313–315, 1978.
Article MATH Google Scholar
E. D. Kolaczyk: Nonparametric estimation of gamma-ray burst intensi- ties using haar wavelets. Astrophysical Journal, 483: 340–349, 1997.
Article Google Scholar
E. Kushilevitz, R. Ostrovsky, and Y. Rabani: Efficient search for approximate nearest neighbors in high-dimensional spaces. In Proc. of 30th ACM Symposium on Theory of Computing (STOC-30), 1998.
Google Scholar
P. Lloyd: Least squares quantization in pcm. IEEE Transactions on Information Theory, 1982. Technical note, Bell Laboratories, 1957.
Google Scholar
J. MacQueen: Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297, Berkeley, 1976. University of California Press.
Google Scholar
R. B. Marimont and M. B. Shapiro: Nearest neighbor searches and the curse of dimensionality. Journal of the Institute of Mathematics and Its Applications, 24: 59–70, 1979.
Article MATH Google Scholar
L. Micó, J. Oncina, and E. Vidal: An algorithm for finding nearest neighbors in constant average time with a linear space complexity. In The 11th International Conference on Pattern Recognition, volume II, pages 557–560, New York, 1992. IEEE Computer Science Press.
Google Scholar
A. Moore: Very fast EM-based mixture model clustering using multiresolution kd-trees. Neural Information Processing Systems, December 1998.
Google Scholar
R. Motwani and P. Raghavan: Randomized Algorithms. Cambridge University Press, 1995.
Google Scholar
S. Mukherjee, E. D. Feigelson, G. J. Babu, F. Murtagh, C. Fraley, and A. Raftery: Three types of gamma-ray bursts. The Astrophysical Journal, 508: 314–327, 1998.
Article Google Scholar
F. Murtagh: A very fast, exact nearest neighbor algorithm for use in information retrieval. Information Technology, 1: 275–283, 1982.
Google Scholar
F. Murtagh: Expected time complexity results for hierarchic clustering algorithms which use cluster centers. Information Processing Letters, 16: 237–241, 1983.
Article MathSciNet MATH Google Scholar
F. Murtagh: Complexities of hierarchic clustering algorithms: state of the art. Computational Statistics Quarterly, 1: 101–113, 1984.
MATH Google Scholar
F. Murtagh: Multidimensional Clustering Algorithms. Würzburg: Physica-Verlag, 1985.
MATH Google Scholar
F. Murtagh: Comments on `parallel algorithms for hierarchical clustering and cluster validity’. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14: 1056–1057, 1992.
Article Google Scholar
F. Murtagh: Multivariate methods for data analysis. In A. Sandqvist and T.P. Ray, editors, Central Activity in Galaxies: From Observational Data to Astrophysical Diagnostics, pages 209–235, Berlin, 1993a. Springer-Verlag.
Chapter Google Scholar
F. Murtagh: Search algorithms for numeric and quantitative data. In A. Heck and F. Murtagh, editors, Intelligent Information Retrieval: The Case of Astronomy and Related Space Sciences, pages 49–80, Dordrecht, 1993b. Kluwer Academic.
Google Scholar
F. Murtagh: Foreword to the special issue on clustering and classification. The Computer Journal, 41: 517, 1998.
Article MATH Google Scholar
F. Murtagh and A. Heck: Multivariate Data Analysis. Dordrecht: Kluwer Academic, 1987.
Book MATH Google Scholar
F. Murtagh and M. H. Pajares: The Kohonen self-organizing map method: an assessment. Journal of Classification, 12: 165–190, 1995.
Article MATH Google Scholar
F. Murtagh and A.E. Raftery: Fitting straight lines to point patterns. Pattern Recognition, 17: 479–483, 1984.
Article Google Scholar
F. Murtagh and J. L. Starck: Pattern clustering based on noise modeling in wavelet space. Pattern Recognition, 31: 847–855, 1998.
Article Google Scholar
F. Murtagh, J. L. Starck, and M. Berry: Overcoming the curse of dimensionality in clustering by means of the wavelet transform. The Computer Journal, 43: 107–120, 2000.
Article Google Scholar
R. Neal and G. Hinton: A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. Jordan, editor, Learning in Graphical Models, pages 355–371, Dordrecht, 1998. Kluwer Academic Publisher.
Chapter Google Scholar
H. Niemann and R. Goppert: An efficient branch-and-bound nearest neighbor classifier. Pattern Recognition Letters, 7: 67–72, 1988.
Article Google Scholar
B. K. Parsi and L. N. Kanal: An improved branch and bound algorithm for computing k-nearest neighbors. Pattern Recognition Letters, 3: 7–12, 1985.
Article Google Scholar
D. Pelleg and A. Moore: Accelerating exact k-means algorithms with geometric reasoning. In Proceedings KDD-99, Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, August 1999.
Google Scholar
S. A. Perry and P. Willett: A review of the use of inverted files for best match searching in information retrieval systems. Journal of Information Science, 6: 59–66, 1983.
Article Google Scholar
P. Poinçot, S. Lesteven, and F. Murtagh: A spatial user interface to the astronomical literature. Astronomy and Astrophysics Supplement, 130: 183–191, 1998.
Article Google Scholar
P. Poinçot, S. Lesteven, and F. Murtagh: Maps of information spaces: assessments from astronomy. Journal of the American Society for Information Science, 1999. submitted.
Google Scholar
V. Ramasubramanian and K. K. Paliwal: An efficient approximation-algorithm for fast nearest-neighbor search based on a spherical distance coordinate formulation. Pattern Recognition Letters, 13: 47 1480, 1992.
Google Scholar
M. Richetin, G. Rives, and M. Naranjo: Algorithme rapide pour la détérmination des k plus proches voisins. RAIRO Informatique/Computer Science, 14: 369–378, 1980.
MATH Google Scholar
F. J. Rohlf: Algorithm 76: Hierarchical clustering using the minimum spanning tree. The Computer Journal, 16: 93–95, 1973.
Google Scholar
F. J. Rohlf: A probabilistic minimum spanning tree algorithm. Information Processing Letters, 7: 44–48, 1978.
Article MathSciNet MATH Google Scholar
F. J. Rohlf: Single link clustering algorithms. In P. R. Krishnaiah and L.N. Kanal, editors, Handbook of Statistics, volume 2, pages 267–284, Amsterdam, 1982. North-Holland.
Google Scholar
E. V. Ruiz: An algorithm for finding nearest neighbors in (approximately) constant average time. Pattern Recognition Letters, 4: 145–157, 1986.
Article Google Scholar
G. Salton and M. J. McGill: Introduction to Modern Information Retrieval. New York: McGraw-Hill, 1983.
MATH Google Scholar
M. Sato and S. Ishii: Reinforcement learning based on on-line EM algorithm. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 1052–1058, Cambridge, 1999. MIT Press.
Google Scholar
T. Schreiber: Efficient search for nearest neighbors. In A.S. Weigend and N.A Gershenfeld, editors, Predicting the Future and Understanding the Past: A Comparison of Approaches, New York, 1993. Addison-Wesley.
Google Scholar
G. Schwarz: Estimating the dimension of a model. The Annals of Statistics, 6: 461–464, 1978.
Article MathSciNet MATH Google Scholar
SDSS. Sloan digital sky survey, 1999.
Google Scholar
M. Shapiro: The choice of reference points in best-match file searching. Communications of the ACM, 20: 339–343, 1977.
Article Google Scholar
R. Sibson: Slink: an optimally efficient algorithm for the single link cluster method. The Computer Journal, 16: 30–34, 1973.
Article MathSciNet Google Scholar
A. F. Smeaton and C. J. van Rijsbergen: The nearest neighbor problem in information retrieval: an algorithm using upperbounds. ACM SIGIR Forum, 16: 83–87, 1981.
Article Google Scholar
P. H. A. Sneath and R. R. Sokal: Numerical Taxonomy. San Francisco: W. H. Freeman, 1973.
MATH Google Scholar
H. Späth: Cluster Dissection and Analysis: Theory, Fortran Programs, Examples. Chichester: Ellis Horwood, 1985.
MATH Google Scholar
J. L. Starck, F. Murtagh, and A. Bijaoui: Image and Data Analysis: The Multiscale Approach. New York: Cambridge University Press, 1998.
MATH Google Scholar
R. E. Tarjan: An improved algorithm for hierarchical clustering using strong components. Information Processing Letters, 17: 37–41, 1983.
Article MathSciNet MATH Google Scholar
B. Thiesson, C. Meek, and D. Heckerman: Accelerating EM for large databases. Technical report, Microsoft, 1999. Microsoft Research Technical Report MST-TR-99–31.
Google Scholar
S. F. Weiss: A probabilistic algorithm for nearest neighbor searching. In R. N. Oddy and et al., editors, Information Retrieval Research, pages 325–333, London, 1981. Butterworths.
Google Scholar
H. D. White and K. W. McCain: Visualization of literatures. Annual Review of Information Science and Technology (ARIST), 32: 99–168, 1997.
Google Scholar
P. Willett: Efficiency of hierarchic agglomerative clustering using the icl distributed array processor. Journal of Documentation, 45: 1–45, 1989.
Article Google Scholar
A. C. Yao: An o(lel log log ivi) algorithm for finding minimum spanning trees. Information Processing Letters, 4: 21–23, 1975.
Article Google Scholar
T. P. Yunck: A technique to identify nearest neighbors. IEEE Transactions on Systems, Man, and Cybernetics, 6: 678–683, 1976.
Article MathSciNet MATH Google Scholar
C. T. Zahn: Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, 20: 68–86, 1971.
Article MATH Google Scholar
G. Zheng, J. L. Starck, J.G. Campbell, and F. Murtagh: Multiscale transforms for filtering financial data streams, Journal of Computational Intelligence in Finance, 7, 10 1999.
Google Scholar

Download references

Author information

Authors and Affiliations

School of Computer Science, The Queen’s University of Belfast, Belfast, BT7 1NN, Northern Ireland
Fionn Murtagh

Authors

Fionn Murtagh
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

AT&T Labs Research, USA
James Abello & Mauricio G. C. Resende &
University of Florida, USA
Panos M. Pardalos

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Murtagh, F. (2002). Clustering in Massive Data Sets. In: Abello, J., Pardalos, P.M., Resende, M.G.C. (eds) Handbook of Massive Data Sets. Massive Computing, vol 4. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0005-6_14

Download citation

DOI: https://doi.org/10.1007/978-1-4615-0005-6_14
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-4882-5
Online ISBN: 978-1-4615-0005-6
eBook Packages: Springer Book Archive

Publish with us

Policies and ethics