Identifying and Mitigating Labelling Errors in Active Learning

  • Mohamed-Rafik Bouguelia
  • Yolande Belaïd
  • Abdel Belaïd
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9493)


Most existing active learning methods for classification assume that the observed labels (i.e., those given by a human labeller) are perfectly correct. However, in real-world applications, the labeller is usually subject to labelling errors that reduce the classification accuracy of the learned model. In this paper, we address this issue for active learning in the streaming setting and try to answer the following questions: (1) which labelled instances are most likely to be mislabelled? (2) is it always beneficial to abstain from learning when data is suspected to be mislabelled? (3) which mislabelled instances require relabelling? We propose a hybrid active learning strategy based on two measures. The first measure filters potentially mislabelled instances based on the degree of disagreement between the manually given label and the predicted class label. The second measure selects, for relabelling, only the most informative instances that deserve to be corrected. An instance is worth relabelling if it shows highly conflicting information between the predicted and the queried labels. Experiments on several real-world datasets show that filtering mislabelled instances according to the first measure, and relabelling a few instances selected according to the second measure, greatly improves the classification accuracy of stream-based active learning.
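The two-measure strategy described above can be sketched in code. This is a minimal illustrative sketch, not the paper's exact formulation: the function names, the thresholds, and the use of a top-2 probability margin as the informativeness criterion are our assumptions, standing in for whatever concrete measures the paper defines.

```python
import numpy as np

def disagreement(proba, classes, queried_label):
    """Measure 1 (sketch): probability mass the classifier assigns to
    classes other than the queried label. A high value suggests the
    manually given label disagrees with the model and may be an error."""
    return 1.0 - proba[classes.index(queried_label)]

def worth_relabelling(proba, classes, queried_label,
                      disagree_thr=0.5, margin_thr=0.4):
    """Measure 2 (sketch): among suspected mislabelled instances, select
    for relabelling only those that are also informative, here proxied by
    a small margin between the two most probable classes. The thresholds
    are illustrative, not values from the paper."""
    top2 = np.sort(proba)[::-1][:2]          # two largest class probabilities
    conflict = disagreement(proba, classes, queried_label)
    return bool(conflict > disagree_thr and (top2[0] - top2[1]) < margin_thr)
```

For example, with class probabilities `[0.1, 0.6, 0.3]` over classes `['a', 'b', 'c']`, a queried label of `'a'` is both strongly contradicted by the model and close between the top two classes, so it would be selected for relabelling, while a queried label of `'b'` would simply be accepted.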


Keywords: Label noise · Active learning · Classification · Data stream



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Mohamed-Rafik Bouguelia (1)
  • Yolande Belaïd (1)
  • Abdel Belaïd (1)

  1. Université de Lorraine - LORIA, UMR 7503, Vandoeuvre-lès-Nancy, France
