Abstract
Object detection in computer vision is the task of identifying objects within images or videos. Face detection is a subtask that focuses on human faces, and within it facial feature detection is an important research area, with applications ranging from facial recognition to emotion detection and facial expression analysis. The crucial step in facial feature detection is identifying and localizing key facial features such as the eyes, eyebrows, nose, mouth, and chin; this step is also called facial region detection. Facial region detection can be done in two ways: landmark detection and bounding-box-based detection. Bounding boxes are computationally cheaper and faster, and they are preferable when the objective is to detect and locate an object or face in an image or video frame. Most existing bounding-box-based algorithms for facial feature detection treat the eyes as a single entity, whereas our approach using YOLOv5 detects the left and right eyes separately. In this study, we conducted experiments using YOLOv5, which provides bounding box predictions. We used a subset of the LFW (Labeled Faces in the Wild) dataset, which we augmented using GFP-GAN, Gaussian noise, image sharpening, and CLAHE (contrast-limited adaptive histogram equalization). We explored the effectiveness of different backbone architectures in YOLOv5 for facial region detection, evaluating three popular backbone networks: EfficientNet-b0, GhostNet, and CSP-Darknet53. Our objective was to identify the backbone that yields the most accurate detection of four facial features: the left eye, right eye, nose, and lips.
Our experiments show that YOLOv5 with a GhostNet backbone detects and classifies facial features better than YOLOv5 with either of the other two backbones. We present a detailed evaluation of our findings, including a discussion of the experimental results for different IoU (intersection over union) thresholds and backbone combinations. Our methodology and findings contribute to facial feature extraction and offer insight into the potential and performance of YOLOv5 for detecting and localizing key facial elements.
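As an illustration of the IoU thresholds mentioned above, the following is a minimal sketch of how the overlap between a predicted and a ground-truth bounding box can be computed and compared against a threshold. The corner-coordinate box format, the function names `iou` and `is_match`, and the default threshold of 0.5 are assumptions made for this example; they are not taken from the study.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Return the intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """Count a prediction as correct when its IoU meets the threshold (hypothetical default)."""
    return iou(pred, truth) >= threshold
```

Raising the threshold makes the match criterion stricter, so the same set of predictions will generally score lower at a higher IoU threshold.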
Data availability
The datasets analyzed during the current study are available from the authors on request.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
About this article
Cite this article
Chanda, S., Kumar, Y.N., Srivastava, S. et al. Optimizing facial feature extraction and localization using YOLOv5: An empirical analysis of backbone architectures with data augmentation for precise facial region detection. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19284-8
DOI: https://doi.org/10.1007/s11042-024-19284-8