Abstract
In this paper, a classification framework for incomplete data based on an electrostatic field model is proposed. An original approach to exploiting incomplete training data with missing features, making extensive use of an electrostatic charge analogy, has been developed. The framework supports a hybrid supervised and unsupervised training scenario, enabling simultaneous learning from labelled and unlabelled data using a single set of rules and adaptation mechanisms. Classification of incomplete patterns is facilitated by a local dimensionality reduction technique, which exploits all available information by using the data ‘as is’ rather than attempting to estimate the missing values. The performance of all proposed methods has been extensively tested across a wide range of missing-data scenarios on a number of standard benchmark datasets, so that the results are comparable with those in the current and future literature. Several modifications to the original Electrostatic Field Classifier, aimed at improving speed and robustness in higher-dimensional spaces, are also introduced and discussed.
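To make the ‘as is’ idea concrete, the following is a minimal sketch (not the paper's actual method) of an electrostatic-style classifier that handles NaN-coded missing features by restricting each charge-to-query interaction to the jointly observed dimensions. It assumes unit charges and a Coulomb-like inverse-square potential; the framework's actual field definition, charge assignment and adaptation mechanisms may differ.

import numpy as np

def efc_predict(X_train, y_train, x_query, eps=1e-12):
    """Classify x_query by the total Coulomb-like field exerted by the
    training 'charges' of each class, using the data 'as is': every
    pairwise interaction is computed only over the features observed
    in both points (NaN marks a missing value), with no imputation.
    A hypothetical illustration, not the published algorithm."""
    q_obs = ~np.isnan(x_query)
    scores = {}
    for x, y in zip(X_train, y_train):
        obs = q_obs & ~np.isnan(x)   # jointly observed features
        if not obs.any():
            continue                 # no shared information with the query
        # Mean squared difference keeps distances comparable across
        # subspaces of different (locally reduced) dimensionality.
        d2 = float(((x[obs] - x_query[obs]) ** 2).mean())
        scores[y] = scores.get(y, 0.0) + 1.0 / (d2 + eps)
    return max(scores, key=scores.get)

For example, efc_predict(X, y, np.array([0.3, np.nan, 1.2])) classifies a three-feature query whose second feature is missing, using only the first and third dimensions of each training point that has them observed.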
Notes
We use the term ‘sample’ to refer to a single object/instance, not to the whole dataset, as is common in the statistics literature.
Deficiency level is the level of missingness of a dataset, ranging from 0 for complete data to 1 for maximally incomplete data, taking the given constraints into account.
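The note does not give a formula; setting aside the constraint-dependent rescaling, one straightforward reading is the fraction of missing entries in the data matrix. The helper below is a hypothetical illustration, not the paper's definition.

import numpy as np

def deficiency_level(X):
    """Fraction of NaN entries in the data matrix: 0.0 for complete
    data, rising towards 1.0 as values are removed. The paper's
    notion additionally accounts for the imposed constraints, e.g.
    a maximum attainable missingness below 100% (an assumption here).
    """
    return float(np.isnan(X).mean())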
Cite this article
Budka, M., Gabrys, B. Electrostatic field framework for supervised and semi-supervised learning from incomplete data. Nat Comput 10, 921–945 (2011). https://doi.org/10.1007/s11047-010-9182-4