Skip to main content

The Challenges of Clustering High Dimensional Data

  • Chapter
New Directions in Statistical Physics

Abstract

Cluster analysis divides data into groups (clusters) for the purposes of summarization or improved understanding. For example, cluster analysis has been used to group related documents for browsing, to find genes and proteins that have similar functionality, or as a means of data compression. While clustering has a long history and a large number of clustering techniques have been developed in statistics, pattern recognition, data mining, and other fields, significant challenges still remain. In this chapter we provide a short introduction to cluster analysis, and then focus on the challenge of clustering high dimensional data. We present a brief overview of several recent techniques, including a more detailed description of recent work of our own which uses a concept-based clustering approach.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 129.00
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 169.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 169.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan: `Automatic subspace clustering of high-dimensional data for data mining applications’, In: ACM SIG-MOD Conference on Management of Data ( ACM Press, New York 1998 )

    Google Scholar 

  2. R. Agrawal, R. Srikant• `Fast Algorithms for Mining Association Rules’, In: Proceedings of the 20 th VLDB Conference, ( Santiago, Chile 1997 ) pp. 487–499

    Google Scholar 

  3. C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, Jong Park: `Fast algorithms for projected clustering’, In: ACM SIGMOD Conference, ( ACM Press, New York 1999 )

    Google Scholar 

  4. M.R. Anderberg: Cluster Analysis for Applications ( Academic Press, New York and London 1973 )

    MATH  Google Scholar 

  5. R. Bellman: Adaptive Control Processes: A Guided Tour, ( Princeton University Press, Princeton 1961 )

    MATH  Google Scholar 

  6. S. Brin: `Near Neighbor Search in Large Metric Spaces’, Proceedings of the 21st International Conference on Very Large Databases (VLDB-1995), ( Morgan Kaufmann, Los Gatos 1995 ) pp. 574–584

    Google Scholar 

  7. K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft: `When is `nearest neighbor’ meaningful?’, In Proceedings of 7th International Conference on Database Theory (ICDT-1999) ( Jerusalem, Israel 1999 ) pp. 217–235

    Google Scholar 

  8. T.H. Cormen, C.E. Leiserson, R.L. Rivest: Introduction to Algorithms ( Prentice Hall, Englewood Cliffs 1990 )

    MATH  Google Scholar 

  9. R.O. Duda, P.E. Hart, D.G. Stork: Pattern Recognition ( Wiley, New York 2000 )

    Google Scholar 

  10. I.S. Dhillon, D.S. Modha: Machine Learning 42 143 (2001)

    Article  MATH  Google Scholar 

  11. D.L. Donoho: `High Dimensional Data Analysis: The Curses and Blessings of Dimensionality’ American Math. Society Conference: Mathematical Challenges of the 21st Century Los Angeles CA August 6–11 (2000). (Currently only available on the Web at http://www-stat.stanford.edu/ donoho/Lectures/AMS2000/AMS2000.html)

    Google Scholar 

  12. M. Ester, H.P. Kriegel, J. Sander, X. Xu: ‘A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise’, In Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD 96) (Portland, Oregon 1996) pp. 226–231

    Google Scholar 

  13. L. Ertöz, M. Steinbach, V. Kumar: `Finding Topics in Collections of Documents: A Shared Nearest Neighbor Approach’, In Proceeding of Text Mining Workshop First International SIAM Data Mining Conference (Chicago, IL 2001)

    Google Scholar 

  14. A. EI-Hamdouchi, P. Willet: The Computer Journal 32 (3) (1989)

    Google Scholar 

  15. C. Fraley, A.E. Raferty: ‘How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis’, Technical Report No. 329, Department of Statistics, University of Washington, Seattle, Washington (1998)

    Google Scholar 

  16. K.C. Gowda, G. Krishna: Pattern Recognition 10, 105 (1978)

    Article  MATH  Google Scholar 

  17. S. Guha, R. Rastogi, K. Shim: `ROCK: A Robust Clustering Algorithm for Categorical Attributes’, In Proceedings of the 15th International Conference on Data Engineering (ICDE ‘89) (1999) pp. 512–521

    Google Scholar 

  18. A. Hinneburg, C. Aggarwal, D.A. Keim: `What is the nearest neighbor in high dimensional spaces?’ In Proceedings 26th International Conference on Very Large Data Bases (VLDB-2000)(Morgan Kaufmann, San Francisco 2000) pp. 506–515

    Google Scholar 

  19. A. Hinneburg, D.A. Keim: `An Efficient Approach to Clustering in Large Multimedia Databases with Noise’, In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining(New York 1998) pp. 58–65

    Google Scholar 

  20. A. Hinneburg, D.A. Keim: `Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering’, In Proceedings of 25th International Conference on Very Large Data Bases (VLDB-1999)(Morgan Kaufmann, San Francisco 1999) pp. 506–517

    Google Scholar 

  21. E.-H. Han, G. Karypis, V. Kumar, B. Mobasher: `Clustering In a High-Dimensional Space Using Hypergraph Models’, Technical Report TR-97–063, Department of Computer Science, University of Minnesota, Minneapolis, Minnesota (1997)

    Google Scholar 

  22. F. Hoppner, F. Klawonn, R. Kruse, T. Runkler Fuzzy Cluster Analysis: Methods for Classification Data Analysis and Image Recognition John Wiley and Sons, New York 1999

    Google Scholar 

  23. A.K. Jain, R.C. Dubes: Algorithms for Clustering Data ( Prentice Hall, Englewood Cliffs 1988 )

    MATH  Google Scholar 

  24. A.K. Jain, M.N. Murty, P.J. Flynn: ACM Computing Surveys 31 264 (1999)

    Article  Google Scholar 

  25. R.A. Jarvis, E.A. Patrick: IEEE Transactions on Computers, C-22, 1025 (1973)

    Google Scholar 

  26. G. Karypis, E.-H. Han: `Concept Indexing: A Fast Dimensionality Reduction Algorithm with Applications to Document Retrieval and Categorization’, In: Ninth International Conference on Information and Knowledge Management (CIKM 2000) (McLean 2000) 7. G. Karypis, E.-H. Han, V. Kumar: IEEE Computer 32, 68 (1999)

    Google Scholar 

  27. G. Karypis, V. Kumar: `hMETIS 1.5: A hypergraph partitioning package’, Technical report, Department of Computer Science, University of Minnesota (1998)

    Google Scholar 

  28. L. Kaufman, P.J. Rousseeuw: Finding Groups in Data: An Introduction to Cluster Analysis ( John Wiley and Sons, New York 1990 )

    Book  Google Scholar 

  29. T. Mitchell: Machine Learning ( McGraw Hill, New York 1997 )

    MATH  Google Scholar 

  30. F. Murtagh, J.-L. Starck, M.W. Berry: The Computer Journal 43, 107 (2000)

    Article  Google Scholar 

  31. H. Nagesh, S. Goil, Alok Choudhary: `MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets’, Technical Report Number CPDC-TR9906–019, Center for Parallel and Distributed Computing, Northwestern University (1999)

    Google Scholar 

  32. C.J. Van Rijsbergen: Information Retrieval 2nd Ed. ( Butterworth, London 1979 )

    Google Scholar 

  33. S.M. Savaresi, D.L. Boley: `On the Performance of Bisecting K-Means and PDDP’, In Proceedings of the First International SIAM Data Mining Conference, ( Chicago, IL 2001 )

    Google Scholar 

  34. G. Strang: Linear Algebra and its Applications third edition ( Harcourt Brace Jovanovich, New York 1986 )

    Google Scholar 

  35. G. Sheikholeslami, S. Chatterjee, Aidong Zhang: `Wavecluster: A multi-resolution clustering approach for very large spatial databases’, In Proceedings of the 24th VLDB Conference (1998)

    Google Scholar 

  36. M. Steinbach, G. Karypis, V. Kumar: A Comparison of Document Clustering Algorithms’, In Proceedings of the Text Mining Workshop for The Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (KDD 2000) ( Boston, MA 2000 )

    Google Scholar 

Download references

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Steinbach, M., Ertöz, L., Kumar, V. (2004). The Challenges of Clustering High Dimensional Data. In: Wille, L.T. (eds) New Directions in Statistical Physics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-08968-2_16

Download citation

  • DOI: https://doi.org/10.1007/978-3-662-08968-2_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-07739-5

  • Online ISBN: 978-3-662-08968-2

  • eBook Packages: Springer Book Archive

Publish with us

Policies and ethics