
Optimizing facial feature extraction and localization using YOLOv5: An empirical analysis of backbone architectures with data augmentation for precise facial region detection

  • 1232: Human-centric Multimedia Analysis
  • Published in Multimedia Tools and Applications (2024)

Abstract

Object detection in computer vision is the task of identifying objects within images or videos. A specific subtask of object detection is face detection, which focuses on detecting human faces. Within face detection, an important research area is facial feature detection, which has diverse applications ranging from facial recognition to emotion detection and facial expression analysis. The crucial step in facial feature detection is the identification and localization of key facial features such as the eyes, eyebrows, nose, mouth, and chin, a task also called facial region detection. Facial region detection can be done in two ways: landmark detection and bounding-box-based detection. Bounding boxes offer computational benefits such as increased speed and efficiency, and they are preferable when the objective is to accurately detect and locate the presence of an object or face in an image or video frame. Although most existing bounding-box-based algorithms for facial feature detection treat the eyes as a single entity, our approach using YOLOv5 detects the left and right eyes separately. In this study, we conducted experiments with YOLOv5, which provides bounding-box predictions. We used a subset of the LFW (Labeled Faces in the Wild) dataset, which we augmented using GFP-GAN, Gaussian noise, image sharpening, and CLAHE. We explored the effectiveness of different backbone architectures applied to YOLOv5 for facial region detection, evaluating three popular backbone networks: EfficientNet-b0, GhostNet, and CSP-Darknet53. Our objective was to identify the backbone architecture that yields the most accurate detection of facial features, including the left eye, right eye, nose, and lips. Our experiments show that when GhostNet is used as the backbone in the YOLOv5 architecture, it produces superior detection and classification results compared to the other backbones. We present a detailed evaluation of our findings, including discussions of the experimental results across different IoU thresholds and backbone combinations. Our methodology and findings contribute to the field of facial feature extraction and provide meaningful insights into the potential and performance of YOLOv5 for detecting and localizing key facial elements.
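
As a rough illustration of the preprocessing and evaluation steps described in the abstract (not the authors' code), the following Python sketch applies three of the four augmentations mentioned above (Gaussian noise, image sharpening, and CLAHE) with OpenCV and NumPy, and computes the IoU metric used when evaluating bounding-box detections at different thresholds. GFP-GAN restoration is omitted because it requires an external pretrained model; the parameter values, kernel, and file paths are illustrative assumptions only.

    # Minimal sketch of the augmentation and IoU steps; values are assumed, not from the paper.
    import cv2
    import numpy as np

    def add_gaussian_noise(img, mean=0.0, sigma=15.0):
        """Add zero-mean Gaussian noise; sigma is an assumed value."""
        noise = np.random.normal(mean, sigma, img.shape).astype(np.float32)
        noisy = img.astype(np.float32) + noise
        return np.clip(noisy, 0, 255).astype(np.uint8)

    def sharpen(img):
        """Sharpen with a common 3x3 sharpening kernel."""
        kernel = np.array([[0, -1, 0],
                           [-1, 5, -1],
                           [0, -1, 0]], dtype=np.float32)
        return cv2.filter2D(img, -1, kernel)

    def apply_clahe(img, clip_limit=2.0, tile_grid_size=(8, 8)):
        """Apply CLAHE to the luminance channel of a BGR image."""
        lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
        l, a, b = cv2.split(lab)
        clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
        l = clahe.apply(l)
        return cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)

    def iou(box_a, box_b):
        """Intersection over Union for two [x1, y1, x2, y2] boxes."""
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / float(area_a + area_b - inter + 1e-9)

    if __name__ == "__main__":
        img = cv2.imread("lfw_sample.jpg")  # placeholder path to an LFW image
        cv2.imwrite("lfw_sample_noise.jpg", add_gaussian_noise(img))
        cv2.imwrite("lfw_sample_sharp.jpg", sharpen(img))
        cv2.imwrite("lfw_sample_clahe.jpg", apply_clahe(img))
        print(iou([10, 10, 50, 50], [20, 20, 60, 60]))  # example IoU of two boxes

In practice, a predicted facial-region box would count as a true positive only when its IoU with the ground-truth box exceeds the chosen threshold, which is why results are reported across several IoU thresholds.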


Data availability

The datasets analyzed during the current study are available from the authors and may be provided on request.


Author information


Corresponding author

Correspondence to Ritu Rani.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chanda, S., Kumar, Y.N., Srivastava, S. et al. Optimizing facial feature extraction and localization using YOLOv5: An empirical analysis of backbone architectures with data augmentation for precise facial region detection. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19284-8


  • DOI: https://doi.org/10.1007/s11042-024-19284-8
