Clustering High-Dimensional Data

  • Francesco Masulli
  • Stefano Rovetta
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7627)

Abstract

This chapter introduces the task of clustering, which concerns defining a structure that aggregates the data, and the challenges of applying it to the unsupervised analysis of high-dimensional data. Many approaches to this problem have been proposed in the recent literature. Developing efficient clustering methods for high-dimensional data is a great challenge for Machine Learning, and it is vital for obtaining safer decision-making processes and better decisions from the Big Data now available, which can mean greater operational efficiency, cost reduction, and risk reduction.

Keywords

Cluster Algorithm · Subspace Cluster · Project Cluster · Brute Force Search · Subspace Cluster Algorithm
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi (DIBRIS), Università di Genova, Genova, Italy
  2. Center for Biotechnology, Temple University, Philadelphia, USA
