Skip to main content

Part of the book series: Computational Methods in Applied Sciences ((COMPUTMETHODS,volume 27))

  • 1925 Accesses

Abstract

Many modern systems record various types of parameter values. Numerical values are relatively convenient for data analysis tools because there are many methods to measure distances and similarities between them. The application of dimensionality reduction techniques for data sets with such values is also a well known practice. Nominal (i.e., categorical) values, on the other hand, encompass some problems for current methods. Most of all, there is no meaningful distance between possible nominal values, which are either equal or unequal to each other. Since many dimensionality reduction methods rely on preserving some form of similarity or distance measure, their application to such data sets is not straightforward. We propose a method to achieve clustering of such data sets by applying the diffusion maps methodology to it. Our method is based on a distance metric that utilizes the effect of the boolean nature of similarities between nominal values (i.e., equal or unequal) on the diffusion kernel and, in turn, on the embedded space resulting from its principal components. We use a multi-view approach by analyzing small, closely related, sets of parameters at a time instead of the whole data set. This way, we achieve a comprehensive understanding of the data set from many points of view.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 129.00
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 169.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 169.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    The dominant components are clear when points that are too close to the central area are not considered. The dominant components in this case have various interrelated functions specific to the analyzed system.

References

  1. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: SIGMOD ’98: proceedings of the 1998 ACM SIGMOD international conference on management of data. ACM, New York, pp 94–105

    Chapter  Google Scholar 

  2. Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: SIGMOD ’99: proceedings of the 1999 ACM SIGMOD international conference on management of data. ACM, New York, pp 49–60

    Chapter  Google Scholar 

  3. Babuška R (1998) Fuzzy modeling for control. Kluwer, Norwell

    Book  Google Scholar 

  4. Berkhin P (2006) A survey of clustering data mining techniques. Grouping Multidimensional Data Cl(c):25–71

    Article  Google Scholar 

  5. Bickel S, Scheffer T (2004) Multi-view clustering. In: ICDM ’04: proceedings of the fourth IEEE international conference on data mining. IEEE, Washington, pp 19–26

    Chapter  Google Scholar 

  6. Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of the eleventh annual conference on computational learning theory, Madison, WI, 1998. ACM, New York, pp 92–100

    Chapter  Google Scholar 

  7. Chung F (1997) Spectral graph theory. CBMS regional conference series in mathematics, vol 92. AMS, Providence

    MATH  Google Scholar 

  8. Coifman RR, Lafon S (2006) Diffusion maps. Appl Comput Harmon Anal 21(1):5–30

    Article  MathSciNet  MATH  Google Scholar 

  9. Dasgupta S, Littman ML, McAllester D (2001) PAC generalization bounds for co-training. Technical report, AT&T Labs-Research

    Google Scholar 

  10. David G (2009) Anomaly detection and classification via diffusion processes in hyper-networks. PhD thesis, School of Computer Science, Tel Aviv University

    Google Scholar 

  11. David G, Averbuch A (2012) Hierarchical data organization, clustering and denoising via localized diffusion folders. Appl Comput Harmon Anal 33(1):1–23

    Article  MathSciNet  MATH  Google Scholar 

  12. David G, Averbuch A (2011) Localized diffusion. Part II: Coarse-grained process (submitted)

    Google Scholar 

  13. David G, Averbuch A (2012) SpectralCAT: categorical spectral clustering of numerical and nominal data. Pattern Recognit 45(1):416–433

    Article  MathSciNet  MATH  Google Scholar 

  14. de Diego IM, Munoz A, Moguerza J (2010) Methods for the combination of kernel matrices within a support vector framework. Mach Learn 78:137–174

    Article  Google Scholar 

  15. de Sa VR, Gallagher PW, Lewis JM, Malave VL (2010) Multi-view kernel construction. Mach Learn 79(1):47–71

    Article  Google Scholar 

  16. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD ’96: proceedings of the 2nd international conference on knowledge discovery and data mining. AAAI, New York, pp 226–231

    Google Scholar 

  17. Everitt B, Landau S, Leese M (2001) Cluster analysis, 4th edn. Arnold, London

    MATH  Google Scholar 

  18. Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: SIGMOD ’98: proceedings of the 1998 ACM SIGMOD international conference on management of data. ACM, New York, pp 73–84

    Chapter  Google Scholar 

  19. Guha S, Rastogi R, Shim K (2000) ROCK: a robust clustering algorithm for categorical attributes. Inf Syst (Oxf) 25(5):345–366

    Article  Google Scholar 

  20. Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: KDD ’98: proceedings of the 4th international conference on knowledge discovery and data mining, pp 58–65

    Google Scholar 

  21. Huang Z (1997) A fast clustering algorithm to cluster very large categorical data sets in data mining. In: SIGMOD-DMKD ’97: workshop on research issues on data mining and knowledge discovery

    Google Scholar 

  22. Huang Z (1998) Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min Knowl Discov 2(3):283–304

    Article  Google Scholar 

  23. Jaccard P (1901) Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaud Sci Nat 37:547–579

    Google Scholar 

  24. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323

    Article  Google Scholar 

  25. Karypis G, Han EH, Kumar V (1999) Chameleon: hierarchical clustering using dynamic modeling. Computer 32(8):68–75

    Article  Google Scholar 

  26. Lafon S (2004) Diffusion maps and geometric harmonics. PhD thesis, Yale University

    Google Scholar 

  27. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability. Statistics, vol I. Univ California Press, Berkeley, pp 281–297

    Google Scholar 

  28. Rabin N (2010) Data mining dynamically evolving systems via diffusion methodologies. PhD thesis, School of Computer Science, Tel Aviv University

    Google Scholar 

  29. Rogers DJ, Tanimoto TT (1960) A computer program for classifying plants. Science 132(3434):1115–1118

    Article  Google Scholar 

  30. Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manag 24(5):513–523

    Article  Google Scholar 

  31. Sebban M, Nock R (2002) A hybrid filter/wrapper approach of feature selection using information theory. Pattern Recognit 35(4):835–846

    Article  MATH  Google Scholar 

  32. Sheikholeslami G, Chatterjee S, Zhang A (2000) WaveCluster: A wavelet-based clustering approach for spatial data in very large databases. VLDB J 8(3–4):289–304

    Article  Google Scholar 

  33. Stanfill C, Waltz D (1986) Toward memory-based reasoning. Commun ACM 29(12):1213–1228

    Article  Google Scholar 

  34. Strehl A, Ghosh J (2000) A scalable approach to balanced, high-dimensional clustering of market-baskets. In: HiPC ’00: proceedings of the 7th international conference on high performance computing. Springer, London, pp 525–536

    Google Scholar 

  35. Wang K, Xu C, Liu B (1999) Clustering transactions using large items. In: CIKM ’99: proceedings of the 8th international conference on information and knowledge management. ACM, New York, pp 483–490

    Google Scholar 

  36. Wang P (2008) Clustering and classification techniques for nominal data application. PhD thesis, City University of Hong Kong

    Google Scholar 

  37. Wang W, Yang J, Muntz R (1997) STING: a statistical information grid approach to spatial data mining. In: VLDB ’97: proceedings of the 23rd international conference on very large data bases. Morgan Kaufmann, San Francisco, pp 186–195

    Google Scholar 

  38. Wang W, Yang J, Muntz R (1999) STING+: an approach to active spatial data mining. In: ICDE ’99: proceedings of the 15th international conference on data engineering. IEEE, Los Alamitos, pp 116–125

    Google Scholar 

  39. Yang Y, Guan X, You J (2002) CLOPE: a fast and effective clustering algorithm for transactional data. In: KDD ’02: proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 682–687

    Chapter  Google Scholar 

  40. Yarowsky D (1995) Unsupervised word sense disambiguation rivaling supervised methods. In: ACL ’95: proceedings of the 33rd annual meeting on association for computational linguistics. Association for Computational Linguistics, Stroudsburg, pp 189–196

    Google Scholar 

  41. Yun CH, Chuang KT, Chen MS (2001) An efficient clustering algorithm for market basket data based on small large ratios. In: COMPSAC ’01: proceedings of the 25th international computer software and applications conference on invigorating software development. IEEE, Washington, pp 505–510

    Google Scholar 

  42. Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: SIGMOD ’96: proceedings of the 1996 ACM SIGMOD international conference on management of data. ACM, New York, pp 103–114

    Chapter  Google Scholar 

  43. Zhao Y, Song J (2001) GDILC: a grid-based density-isoline clustering algorithm. In: ICII ’01: proceedings of the international conferences on info-tech and info-net, vol 3. IEEE, New York, pp 140–145

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Guy Wolf .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2013 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Wolf, G., Harussi, S., Shmueli, Y., Averbuch, A. (2013). Polar Classification of Nominal Data. In: Repin, S., Tiihonen, T., Tuovinen, T. (eds) Numerical Methods for Differential Equations, Optimization, and Technological Problems. Computational Methods in Applied Sciences, vol 27. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-5288-7_14

Download citation

  • DOI: https://doi.org/10.1007/978-94-007-5288-7_14

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-007-5287-0

  • Online ISBN: 978-94-007-5288-7

  • eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics