Polar Classification of Nominal Data

Wolf, Guy; Harussi, Shachar; Shmueli, Yaniv; Averbuch, Amir

doi:10.1007/978-94-007-5288-7_14

Part of the book series: Computational Methods in Applied Sciences (COMPUTMETHODS, volume 27)

Abstract

Many modern systems record parameter values of various types. Numerical values are relatively convenient for data analysis tools because many methods exist for measuring distances and similarities between them. Applying dimensionality reduction techniques to data sets with such values is also well-established practice. Nominal (i.e., categorical) values, on the other hand, pose problems for current methods. Above all, there is no meaningful distance between nominal values, which are simply either equal or unequal to each other. Since many dimensionality reduction methods rely on preserving some form of similarity or distance measure, applying them to such data sets is not straightforward. We propose a method for clustering such data sets by applying the diffusion maps methodology to them. Our method is based on a distance metric that exploits the effect of the boolean nature of similarities between nominal values (i.e., equal or unequal) on the diffusion kernel and, in turn, on the embedded space resulting from its principal components. We use a multi-view approach, analyzing small sets of closely related parameters at a time instead of the whole data set. In this way, we achieve a comprehensive understanding of the data set from many points of view.
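
To make the abstract's general idea concrete, the following is a minimal sketch of a diffusion process on nominal records. It is not the chapter's polar method. It only illustrates how an equal/unequal comparison can feed a diffusion kernel. The exponential kernel on mismatch counts, the scale `eps`, the number of steps `t`, and the sample records are all assumptions made for illustration. Instead of computing the spectral embedding, the sketch evaluates diffusion distances directly from powers of the transition matrix; with all embedding coordinates retained, Euclidean distances in a diffusion map equal these diffusion distances.

```python
import math

def hamming(a, b):
    """Count the attributes on which two nominal records differ."""
    return sum(x != y for x, y in zip(a, b))

def markov_matrix(records, eps):
    """Build a kernel from mismatch counts and row-normalize it.

    Returns the transition matrix P and the row sums (degrees) of the kernel.
    The kernel exp(-mismatches / eps) is an illustrative choice.
    """
    kernel = [[math.exp(-hamming(a, b) / eps) for b in records] for a in records]
    degrees = [sum(row) for row in kernel]
    P = [[k / d for k in row] for row, d in zip(kernel, degrees)]
    return P, degrees

def matmul(A, B):
    """Multiply two square matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in A]

def diffusion_distance(P, degrees, t, i, j):
    """Diffusion distance between records i and j after t steps (t >= 1)."""
    Pt = P
    for _ in range(t - 1):
        Pt = matmul(Pt, P)
    total = sum(degrees)
    # Stationary distribution of the walk: degree of z over the total degree.
    return math.sqrt(sum(
        (Pt[i][z] - Pt[j][z]) ** 2 / (degrees[z] / total)
        for z in range(len(P))
    ))

# Hypothetical records with three nominal attributes each.
records = [
    ("tcp", "http", "ok"),
    ("tcp", "http", "ok"),
    ("tcp", "http", "fail"),
    ("udp", "dns", "ok"),
    ("udp", "dns", "fail"),
    ("udp", "dns", "ok"),
]

P, degrees = markov_matrix(records, eps=1.0)
d_same = diffusion_distance(P, degrees, 2, 0, 1)     # identical records
d_near = diffusion_distance(P, degrees, 2, 0, 2)     # one attribute differs
d_far = diffusion_distance(P, degrees, 2, 0, 3)      # two attributes differ
```

In this toy example, records that agree on most attributes end up closer in diffusion distance than records from the other group, which is the kind of structure a clustering step could then use.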

Notes

1. The dominant components are clear when points that are too close to the central area are not considered. The dominant components in this case have various interrelated functions specific to the analyzed system.

Author information

Authors and Affiliations

School of Computer Science, Tel Aviv University, P.O. Box 39040, Tel Aviv, 69978, Israel
Guy Wolf, Shachar Harussi, Yaniv Shmueli & Amir Averbuch
Department of Mathematical Information Technology, University of Jyväskylä, P.O. Box 35 (Agora), 40014, Jyväskylä, Finland
Guy Wolf, Shachar Harussi & Amir Averbuch

Corresponding author

Correspondence to Guy Wolf.

Editor information

Editors and Affiliations

Mathematical Information Technology, University of Jyväskylä, Mattilanniemi 2, Jyväskylä, 40100, Finland
Sergey Repin, Timo Tiihonen & Tero Tuovinen

About this chapter

Cite this chapter

Wolf, G., Harussi, S., Shmueli, Y., Averbuch, A. (2013). Polar Classification of Nominal Data. In: Repin, S., Tiihonen, T., Tuovinen, T. (eds) Numerical Methods for Differential Equations, Optimization, and Technological Problems. Computational Methods in Applied Sciences, vol 27. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-5288-7_14

DOI: https://doi.org/10.1007/978-94-007-5288-7_14
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-007-5287-0
Online ISBN: 978-94-007-5288-7
eBook Packages: Engineering, Engineering (R0)
