Taming the Cross Entropy Loss

  • Manuel Martinez
  • Rainer Stiefelhagen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11269)


We present the Tamed Cross Entropy (TCE) loss function, a robust derivative of the standard Cross Entropy (CE) loss used in deep learning for classification tasks. Unlike other robust losses, however, the TCE loss is designed to exhibit the same training properties as the CE loss in noiseless scenarios. The TCE loss therefore requires no modification to the training regime compared to the CE loss and, as a consequence, can be applied wherever the CE loss is currently used. We evaluate the TCE loss using the ResNet architecture on four image datasets that we artificially contaminated with various levels of label noise. The TCE loss outperforms the CE loss in every tested scenario.
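To make the evaluation setup concrete, the following is a minimal sketch of one common way to contaminate a dataset with label noise. It assumes uniform (symmetric) noise, where a fixed fraction of the labels is replaced by a different class chosen at random. The abstract does not state the exact contamination protocol, so the function name `contaminate_labels`, its parameters, and the noise model are illustrative assumptions, not the authors' procedure.

```python
import random


def contaminate_labels(labels, num_classes, noise_rate, seed=0):
    """Return a copy of `labels` with a `noise_rate` fraction flipped.

    Assumption (not from the paper): noise is symmetric, i.e. each
    selected label is replaced by a different class drawn uniformly.
    Requires num_classes >= 2 so that a different class exists.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must lie in [0, 1]")
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")

    rng = random.Random(seed)          # local generator, reproducible
    noisy = list(labels)               # leave the input untouched
    n_flip = round(noise_rate * len(noisy))

    # Pick distinct sample indices, then assign each a wrong class.
    for i in rng.sample(range(len(noisy)), n_flip):
        wrong_classes = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(wrong_classes)
    return noisy


# Hypothetical usage: 1000 samples, 10 classes, 20% label noise.
clean = [i % 10 for i in range(1000)]
noisy = contaminate_labels(clean, num_classes=10, noise_rate=0.2)
changed = sum(a != b for a, b in zip(clean, noisy))
```

Because every selected label is forced to a different class, the fraction of changed labels equals the requested noise rate up to rounding. A variant that samples the replacement from all classes, including the original one, would yield a lower effective noise rate.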



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Karlsruhe Institute of Technology, Karlsruhe, Germany
