The Challenges of Clustering High Dimensional Data

  • Michael Steinbach
  • Levent Ertöz
  • Vipin Kumar

Abstract

Cluster analysis divides data into groups (clusters) for the purposes of summarization or improved understanding. For example, cluster analysis has been used to group related documents for browsing, to find genes and proteins that have similar functionality, and as a means of data compression. Although clustering has a long history, and a large number of clustering techniques have been developed in statistics, pattern recognition, data mining, and other fields, significant challenges remain. In this chapter we provide a short introduction to cluster analysis and then focus on the challenge of clustering high dimensional data. We present a brief overview of several recent techniques, including a more detailed description of our own recent work, which uses a concept-based clustering approach.

Copyright information

© Springer-Verlag Berlin Heidelberg 2004
