Detecting Mislabeled Data Using Supervised Machine Learning Techniques

  • Mannes Poel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10284)


Many data sets, for instance those gathered during user experiments, are contaminated with noise. Some noise in the measured features is not much of a problem; it even increases the performance of many Machine Learning (ML) techniques. For noise in the labels (mislabeled data) the situation is quite different: label noise deteriorates the performance of all ML techniques. The research question addressed in this paper is to what extent one can detect mislabeled data using a committee of supervised Machine Learning models. The committee under consideration consists of a Bayesian model, a Random Forest, a Logistic classifier, a Neural Network and a Support Vector Machine. This committee is applied to a given data set in several iterations of 5-fold cross-validation. If a data sample is misclassified by all committee members in all iterations (consensus), it is tagged as mislabeled. This approach was tested on the Iris plant data set, artificially contaminated with mislabeled data. For this data set the precision of detecting mislabeled samples is 100% and the recall is approximately 5%. The approach was also tested on the Touch data set, a data set of naturalistic social touch gestures. This data set is known to contain mislabeled data, but the amount is unknown. For this data set the proposed method achieved a precision of 70%, and for almost all other tagged samples the corresponding touch gesture deviated considerably from the prototypical touch gesture. Overall, the proposed method shows high potential for detecting mislabeled samples, but its precision on other data sets needs to be investigated.
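The consensus rule described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the layout of the `predictions` dictionary, and the toy numbers are all assumptions made for illustration. The sketch takes already-computed out-of-fold predictions (one list per committee member per cross-validation iteration), tags a sample only if every member misclassifies it in every iteration, and then computes precision and recall against a known set of artificially flipped labels.

```python
from typing import Dict, List, Sequence, Set, Tuple


def consensus_mislabeled(
    predictions: Dict[str, List[Sequence[int]]],
    labels: Sequence[int],
) -> Set[int]:
    """Return indices misclassified by every member in every iteration.

    predictions[name][r][i] is assumed to be the out-of-fold prediction of
    committee member `name` for sample i in cross-validation iteration r.
    """
    if not predictions or not all(predictions.values()):
        raise ValueError("each committee member needs at least one iteration")
    flagged = set(range(len(labels)))
    for runs in predictions.values():
        for preds in runs:
            # Keep only samples that this member also got wrong in this run.
            flagged = {i for i in flagged if preds[i] != labels[i]}
    return flagged


def precision_recall(
    flagged: Set[int], truly_mislabeled: Set[int]
) -> Tuple[float, float]:
    """Precision and recall of the flagged set against the known noisy set."""
    hits = len(flagged & truly_mislabeled)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(truly_mislabeled) if truly_mislabeled else 0.0
    return precision, recall


# Toy data, made up for illustration: 6 samples with their (noisy) labels.
labels = [0, 0, 1, 1, 2, 2]
truly_mislabeled = {1, 4}  # known only because the noise is artificial

# Hypothetical out-of-fold predictions: 2 members, 2 iterations each.
predictions = {
    "member_a": [[0, 1, 1, 1, 1, 2], [0, 1, 1, 1, 2, 2]],
    "member_b": [[0, 1, 1, 0, 1, 2], [0, 1, 1, 1, 1, 2]],
}

flagged = consensus_mislabeled(predictions, labels)
precision, recall = precision_recall(flagged, truly_mislabeled)
print(flagged, precision, recall)
```

In this toy case only sample 1 is misclassified by both members in both iterations, so it is the only one tagged. Sample 4 is also mislabeled but escapes because one member agrees with its label in one iteration. This mirrors the trade-off reported in the abstract: requiring full consensus favors precision at the cost of recall.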


Keywords: Mislabeled data; Supervised Machine Learning



This work was partially supported by the Dutch national program COMMIT.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Human Media Interaction, University of Twente, Enschede, The Netherlands
