Abstract
Semi-supervised clustering reconciles clustering (unsupervised learning) and classification (supervised learning, which uses prior information on the data). The two modes of data analysis are combined in a parameterized model whose parameter θ ∈ [0, 1] is the weight attributed to the prior information: θ = 0 corresponds to pure clustering, and θ = 1 to pure classification. The results (cluster centers, classification rule) depend on θ. Insensitivity to θ indicates that the prior information agrees with the intrinsic cluster structure and is therefore redundant; this explains why some data sets (such as the Wisconsin breast cancer data; Merz and Murphy, UCI repository of machine learning databases, University of California, Irvine, CA) give good results for all reasonable classification methods. The uncertainty of classification is represented here by the geometric mean of the membership probabilities, shown to be an entropic distance related to the Kullback–Leibler divergence.
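The uncertainty measure named in the abstract can be sketched numerically. The snippet below is an illustration, not the chapter's implementation: the inverse-distance membership rule is an assumption standing in for the probabilistic distance clustering model of Ben-Israel and Iyigun (2008), and the function names are hypothetical. It shows the key property of the geometric mean of membership probabilities: it is maximal (equal to 1/K for K clusters) when membership is uniform, i.e. classification is most uncertain, and tends to zero as one membership probability approaches 1.

```python
import numpy as np

def membership_probabilities(x, centers):
    """Membership probabilities of point x with respect to given centers.

    Assumption: probability inversely proportional to the distance from
    each center (a common probabilistic-distance-clustering heuristic;
    the chapter's exact model may differ).
    """
    d = np.linalg.norm(np.asarray(centers, dtype=float) - np.asarray(x, dtype=float), axis=1)
    inv = 1.0 / np.maximum(d, 1e-12)   # guard against a point sitting on a center
    return inv / inv.sum()

def classification_uncertainty(p):
    """Geometric mean of the membership probabilities p_1, ..., p_K.

    Computed as exp(mean(log p_k)); the log form connects it to entropy
    and, as the abstract notes, to the Kullback-Leibler divergence.
    """
    p = np.asarray(p, dtype=float)
    return float(np.exp(np.mean(np.log(np.maximum(p, 1e-300)))))
```

For example, uniform membership over two clusters, p = (0.5, 0.5), gives uncertainty 0.5 (the maximum for K = 2), while a confident assignment such as p = (0.9, 0.1) gives the smaller value 0.3.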
References
Aczél, J. (1984). Measuring information beyond communication theory – Why some generalized information measures may be useful, others not. Aequationes Mathematicae, 27, 1–19.
Arav, M. (2008). Contour approximation of data and the harmonic mean. Journal of Mathematical Inequalities, 2, 161–167.
Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2005). Learning a Mahalanobis metric from equivalence constraints. Journal of Machine Learning Research, 6, 937–965.
Ben-Israel, A., & Iyigun, C. (2008). Probabilistic distance clustering. Journal of Classification, 25, 5–26.
Ben-Tal, A., Ben-Israel, A., & Teboulle, M. (1991). Certainty equivalents and information measures: Duality and extremal principles. Journal of Mathematical Analysis and Applications, 157, 211–236.
Ben-Tal, A., & Teboulle, M. (1987). Penalty functions and duality in stochastic programming via ϕ-divergence functionals. Mathematics of Operations Research, 12, 224–240.
Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function algorithms. New York: Plenum.
Chapelle, O., Schölkopf, B., & Zien, A. (Eds.) (2006). Semi-supervised learning. Cambridge MA: MIT Press.
Csiszár, I. (1978). Information measures: A critical survey. In Trans. 7th Prague Conf. on Info. Th., Statist., Decis. Funct., Random Processes and 8th European Meeting of Statist. (Vol. B, pp. 73–86). Prague: Academia.
Dixon, K. R., & Chapman, J. A. (1980). Harmonic mean measure of animal activity areas. Ecology, 61, 1040–1044.
Grira, N., Crucianu, M., & Boujemaa, N. (2005). Unsupervised and semi-supervised clustering: A brief survey. In A Review of Machine Learning Techniques for Processing Multimedia Content. Report of the MUSCLE European Network of Excellence.
Höppner, F., Klawonn, F., Kruse, R., & Runkler, T. (1999). Fuzzy cluster analysis. New York: Wiley.
Iyigun, C., & Ben-Israel, A. (2008). Probabilistic distance clustering adjusted for cluster size. Probability in the Engineering and Informational Sciences, 22, 1–19.
Iyigun, C., & Ben-Israel, A. (2009). Contour approximation of data: The dual problem. Linear Algebra and Its Applications, 430, 2771–2780.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31, 264–323.
Kuhn, H. W. (1967). On a pair of dual nonlinear programs. In J. Abadie (Ed.), Methods of nonlinear programming (pp. 38–54). Amsterdam: North-Holland.
Kuhn, H. W. (1973). A note on Fermat’s problem. Mathematical Programming, 4, 98–107.
Kullback, S. (1959). Information theory and statistics. New York: Wiley.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.
Lim, T.-S., Loh, W.-Y., & Shih, Y.-S. (2000). A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40, 203–228.
Luce, R. D. (1959). Individual choice behavior. New York: Wiley.
Mangasarian, O. L., Setiono, R., & Wolberg, W. H. (1999). Pattern recognition via linear programming: Theory and application to medical diagnosis. In T. Coleman, & Y. Li (Eds.), Large-scale numerical optimization (pp. 22–30). Philadelphia: SIAM Publications.
Merz, C., & Murphy, P. (1996). UCI repository of machine learning databases. Irvine, CA: Department of Information and Computer Science, University of California. Retrieved from http://www.ics.uci.edu/mlearn/MLRepository.html.
Teboulle, M. (2007). A unified continuous optimization framework for center-based clustering methods. Journal of Machine Learning Research, 8, 65–102.
Weiszfeld, E. (1937). Sur le point par lequel la somme des distances de n points donnés est minimum. Tohoku Mathematical Journal, 43, 355–386.
Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences of the USA, 87, 9193–9196.
Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. (2003). Distance metric learning with application to clustering with side-information. In Advances in neural information processing systems (Vol. 15). Cambridge MA: MIT Press.
Yellott, J. I. Jr. (2001). Luce’s Choice Axiom. In N. J. Smelser, & P. B. Baltes (Eds.), International Encyclopedia of the Social and Behavioral Sciences (pp. 9094–9097). Oxford: Elsevier. ISBN 0-08-043076-7.
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
Cite this paper
Iyigun, C., Ben-Israel, A. (2009). Semi-supervised Probabilistic Distance Clustering and the Uncertainty of Classification. In: Fink, A., Lausen, B., Seidel, W., Ultsch, A. (eds) Advances in Data Analysis, Data Handling and Business Intelligence. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01044-6_1
Print ISBN: 978-3-642-01043-9
Online ISBN: 978-3-642-01044-6
eBook Packages: Mathematics and Statistics