# Optimization Approaches to Semi-Supervised Learning

## Abstract

We examine mathematical models for semi-supervised support vector machines (S^{3}VM). Given a training set of labeled data and a working set of unlabeled data, S^{3}VM constructs a support vector machine using both the training and working sets. We use S^{3}VM to solve the transductive inference problem posed by Vapnik. In transduction, the task is to estimate the value of a classification function at the given points of the working set; this contrasts with inductive inference, which estimates the classification function at all possible values. We propose a general S^{3}VM model that minimizes both the misclassification error and the function capacity based on all the available data. Depending on how poorly estimated unlabeled data are penalized, different mathematical models result. We examine several practical algorithms for solving these models. The first approach converts the S^{3}VM model for 1-norm linear support vector machines into a mixed-integer program (MIP); a global solution of the MIP is found using a commercial integer programming solver. The second approach uses a nonconvex quadratic program, for which variations of block-coordinate-descent algorithms find local solutions. Using the MIP within a local learning algorithm produced the best results. Our experimental study of these statistical learning methods indicates that incorporating working-set data can improve generalization.
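The block-coordinate-descent approach mentioned above alternates between fitting the classifier with the current label estimates and re-estimating labels for the working set, stopping when the labels stabilize at a local solution. A minimal sketch of this alternation for a linear SVM follows; the subgradient training routine, the learning rate, and the weight `C_unl` on unlabeled points are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fit_linear_svm(X, y, sample_weight, C=1.0, lr=0.01, epochs=300):
    """Train a weighted linear SVM (hinge loss + L2 penalty) by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                              # margin violators drive the hinge subgradient
        coef = sample_weight[viol] * y[viol]
        w -= lr * (w - C * (coef[:, None] * X[viol]).sum(axis=0) / n)
        b -= lr * (-C * coef.sum() / n)
    return w, b

def s3vm_local_descent(X_lab, y_lab, X_unl, C_unl=0.1, max_iter=20):
    """Alternate between refitting the SVM and relabeling the working set
    until the estimated labels stop changing (a local solution)."""
    w, b = fit_linear_svm(X_lab, y_lab, np.ones(len(y_lab)))
    y_unl = np.where(X_unl @ w + b >= 0, 1.0, -1.0)    # initial labels from labeled-only fit
    sw = np.concatenate([np.ones(len(y_lab)), C_unl * np.ones(len(X_unl))])
    for _ in range(max_iter):
        X = np.vstack([X_lab, X_unl])
        y = np.concatenate([y_lab, y_unl])
        w, b = fit_linear_svm(X, y, sw)                # unlabeled errors down-weighted by C_unl
        y_new = np.where(X_unl @ w + b >= 0, 1.0, -1.0)
        if np.array_equal(y_new, y_unl):               # labels stable: local optimum reached
            break
        y_unl = y_new
    return w, b, y_unl
```

In this sketch each alternation is a block-coordinate step: the label block `y_unl` is held fixed while `(w, b)` is optimized, then the classifier is held fixed while the labels are reassigned.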

## Keywords

Support Vector Machine, Mixed Integer Program, Unlabeled Data, Misclassification Error, Local Learning

## References

- [1] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. *Artificial Intelligence Review*, 11:11–73, 1997.
- [2] K. P. Bennett. Global tree optimization: a non-greedy decision tree algorithm. *Computing Science and Statistics*, 26:156–160, 1994.
- [3] K. P. Bennett. Combining support vector and mathematical programming methods for classification. In B. Schölkopf, C. Burges, and A. Smola, editors, *Advances in Kernel Methods — Support Vector Machines*, pages 307–326, Cambridge, MA, 1999. MIT Press.
- [4] K. P. Bennett and E. J. Bredensteiner. Geometry in learning. Web manuscript, Rensselaer Polytechnic Institute, http://www.rpi.edu/~bennek/geometry2.ps, 1996. Accepted for publication in *Geometry at Work*, C. Gorini et al., editors, MAA Press.
- [5] K. P. Bennett and A. Demiriz. Semi-supervised support vector machines. In M. Kearns, S. Solla, and D. Cohn, editors, *Advances in Neural Information Processing Systems*, pages 368–374, Cambridge, MA, 1999. MIT Press.
- [6] K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. *Optimization Methods and Software*, 1:23–34, 1992.
- [7] K. P. Bennett and O. L. Mangasarian. Bilinear separation in n-space. *Computational Optimization and Applications*, 4(4):207–227, 1993.
- [8] D. P. Bertsekas. *Nonlinear Programming*. Athena Scientific, Cambridge, MA, 1996.
- [9] J. Blue. *A Hybrid of Tabu Search and Local Descent Algorithms with Applications in Artificial Intelligence*. PhD thesis, Rensselaer Polytechnic Institute, Troy, NY, 1998.
- [10] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In *Proceedings of the 1998 Conference on Computational Learning Theory*, Madison, WI, 1998. ACM.
- [11] E. J. Bredensteiner and K. P. Bennett. Feature minimization within decision trees. *Computational Optimization and Applications*, 10:110–126, 1997.
- [12] V. Castelli and T. M. Cover. On the exponential value of labeled samples. *Pattern Recognition Letters*, 16:105–111, 1995.
- [13] Z. Cataltepe and M. Magdon-Ismail. Incorporating test inputs into learning. In *Advances in Neural Information Processing Systems 10*, Cambridge, MA, 1997. MIT Press.
- [14] C. Cortes and V. N. Vapnik. Support vector networks. *Machine Learning*, 20:273–297, 1995.
- [15] CPLEX Optimization Incorporated, Incline Village, Nevada. *Using the CPLEX Callable Library*, 1994.
- [16] R. Fourer, D. Gay, and B. Kernighan. *AMPL: A Modeling Language for Mathematical Programming*. Boyd and Fraser, Danvers, MA, 1993.
- [17] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. *IEEE PAMI*, 18:607–616, 1996.
- [18] T. Joachims. Text categorization with support vector machines: learning with many relevant features. In *European Conference on Machine Learning (ECML)*, 1998.
- [19] T. Joachims. Transductive inference for text classification using support vector machines. In *International Conference on Machine Learning*, 1999.
- [20] S. Lawrence, A. C. Tsoi, and A. D. Back. Function approximation with neural networks and local methods: bias, variance and smoothness. In P. Bartlett, A. Burkitt, and R. Williamson, editors, *Australian Conference on Neural Networks, ACNN 96*, pages 16–21. Australian National University, 1996.
- [21] O. L. Mangasarian. Arbitrary-norm separating plane. *Operations Research Letters*, 24(1–2), 1999.
- [22] O. L. Mangasarian. Generalized support vector machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, *Advances in Large Margin Classifiers*, pages 135–146, Cambridge, MA, 2000. MIT Press. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-14.ps.
- [23] A. McCallum and K. Nigam. Employing EM and pool-based active learning for text classification. In *Proceedings of the 15th International Conference on Machine Learning (ICML-98)*, 1998.
- [24] P. M. Murphy and D. W. Aha. *UCI Repository of Machine Learning Databases*. Department of Information and Computer Science, University of California, Irvine, California, 1992.
- [25] D. R. Musser and A. Saini. *STL Tutorial and Reference Guide: C++ Programming with the Standard Template Library*. Addison-Wesley, 1996.
- [26] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Learning to classify text from labeled and unlabeled documents. In *Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98)*, 1998.
- [27] S. Odewahn, E. Stockwell, R. Pennington, R. Humphreys, and W. Zumach. Automated star/galaxy discrimination with neural networks. *Astronomical Journal*, 103(1):318–331, 1992.
- [28] V. N. Vapnik. *Estimation of Dependencies Based on Empirical Data*. Springer, New York, 1982. English translation; Russian version 1979.
- [29] V. N. Vapnik. *The Nature of Statistical Learning Theory*. Springer Verlag, New York, 1995.
- [30] V. N. Vapnik. *Statistical Learning Theory*. Wiley Inter-Science, 1998.
- [31] V. N. Vapnik and A. Ja. Chervonenkis. *Theory of Pattern Recognition*. Nauka, Moscow, 1974. In Russian.