Clustering Very Large Dissimilarity Data Sets

  • Barbara Hammer
  • Alexander Hasenfuss
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5998)

Abstract

Clustering and visualization are key issues in computer-supported data inspection, and a variety of promising tools exist for these tasks, such as the self-organizing map (SOM) and its variants. Real-life data, however, pose severe problems for standard data inspection. On the one hand, data are often represented by complex non-vectorial objects, so standard methods for finite-dimensional vectors in Euclidean space cannot be applied. On the other hand, very large data sets have to be dealt with: the data do not fit into main memory, and more than one pass over the data is not affordable, so standard methods cannot be applied due to the sheer amount of data. We present two recent extensions of topographic mappings: relational clustering, which can deal with general proximity data given by pairwise distances, and patch processing, which can process streaming data of arbitrary size in patches. Together, they yield an efficient linear-time data inspection method for general dissimilarity data structures. We present the theoretical background as well as applications to text and multimedia processing based on the generalized compression distance.
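As a rough illustration of the kind of pairwise dissimilarity the abstract refers to, the following sketch computes the normalized compression distance of Cilibrasi and Vitanyi, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The choice of zlib as compressor, the function names, and the convention of a zero diagonal are assumptions made for this example; the abstract does not state which compressor or which variant of the distance the authors use.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def dissimilarity_matrix(items: list[bytes]) -> list[list[float]]:
    """Pairwise NCD matrix, symmetric by construction.

    Each pair is evaluated once and mirrored, because real compressors do
    not guarantee C(xy) == C(yx). The diagonal is set to 0 by convention.
    """
    n = len(items)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = ncd(items[i], items[j])
    return d


if __name__ == "__main__":
    texts = [
        b"the quick brown fox jumps over the lazy dog " * 20,
        b"the quick brown fox jumps over the lazy cat " * 20,
        bytes(range(256)) * 4,
    ]
    for row in dissimilarity_matrix(texts):
        print(["%.3f" % value for value in row])
```

Building the full matrix this way takes a number of compressor calls that grows quadratically with the number of items, which is presumably the cost that processing the data in patches is intended to avoid.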

Keywords

Dissimilarity Measure; Symmetric Bilinear Form; Dissimilarity Matrix; Data Inspection; Normalize Compression Distance
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Barbara Hammer (1)
  • Alexander Hasenfuss (2)
  1. CITEC, University of Bielefeld, Germany
  2. Department of Computer Science, Clausthal University of Technology, Germany
