1 Introduction

The advent of facial recognition within the field of pattern recognition has found a wide range of applications, especially in cyber investigations. This has been made possible by advances in analysis and modelling techniques. The increased demand for secure systems has led researchers to seek solutions for access control, identity verification, securing cyber-physical systems, internet communications, computer entertainment, and the establishment of surveillance systems that are sturdy and hard to penetrate [1,2,3].

Alongside automatic facial recognition systems, automatic processing of digitized content (such as videos and images) has also become feasible thanks to low-cost computing systems. As tampering with identity cards and encroachments into virtual and physical spaces became a growing nuisance, it was realized that there was a dire need for reliable systems that could recognize individuals accurately. A number of advances followed, including biometric authentication, human-computer interaction, machine learning, and surveillance, leading to a natural course of research and development in the field of automatic face recognition.

Biometric identification processes such as iris recognition are highly advanced and highly discriminative, but they are intrusive in nature. With the evolution of digital technologies and the challenges posed by human identification and surveillance systems, research in the area of face recognition has become imperative, as it is convenient, natural, and non-intrusive [4]. Various facial recognition systems are available to discern facial features correctly; however, they do not provide the precision needed for reliable recognition. Better face detection algorithms are therefore a prerequisite for pattern recognition and computer vision applications.

Further, variations in illumination have greatly reduced the potential of facial recognition systems, because facial images of the same individual obtained under different illumination differ markedly. Existing systems are highly sensitive to light variations [5]. Facial images taken under uncontrolled illumination suffer from non-uniform lighting. To cope with this, adaptive normalization techniques are used; these correct the illumination and restore the features of an image to a canonical form. Examples of illumination normalization techniques include the Logarithmic Transform (LT), Gamma Correction (GC), and Histogram Equalization [5, 6]. The Extreme Learning Machine (ELM) and its kernelized variant, K-ELM, have previously been employed for facial recognition by Zong et al. [7, 8]. An attempt to enhance facial classification using an ELM operating on facial views has been demonstrated by Iosifidis et al. [9]. Rujirakul et al. [10] demonstrate the use of histogram equalization coupled with principal component analysis (PCA) in hybrid with an ELM for facial recognition. Independent Component Analysis (ICA) has also been used in conjunction with a hybrid of Standard Particle Swarm Optimization (SPSO) and ELM to demonstrate recognition rates of up to 93% [11]. Similar schemes employing Linear Discriminant Analysis (LDA) and multi-class support vector machines have also been reported [12,13,14]. Efforts have also been made to use local difference binary (LDB) descriptors and fuzzy logic with histograms of oriented gradients (HOG) for efficient facial recognition systems [15, 16].

In this paper, the Viola-Jones algorithm is applied to identify the region corresponding to a subject's face, from which Histogram of Oriented Gradients (HOG) features are then extracted. HOG feature extraction serves as a pre-processing step: its local normalization maps the face image onto a canonical representation, which largely discards the effect of varying lighting. For each image in the dataset, HOG extracts the crucial features on which the neural machine is trained.

The paper is structured as follows. Section 2 describes the basics of ELM. Section 3 gives an outline of Viola-Jones Algorithm. Section 4 gives a brief introduction of HOG. Section 5 gives an insight into the face recognition approach that has been adopted in this paper. The results have been summarized in Sect. 6 and finally, the paper has been concluded in Sect. 7.

2 Extreme Learning Machine

The Extreme Learning Machine (ELM) is a single-hidden-layer feedforward neural network [17,18,19]. Unlike traditional neural networks, an ELM is simple to apply: it does not require tuning parameters such as the learning rate or stopping criteria, which are technically complex to set. It works by randomly assigning the input weights and hidden-layer biases, drawn from any continuous probability distribution. The output weights are then determined analytically using the Moore-Penrose generalized pseudo-inverse [18].

2.1 The ELM Model

Let us consider a set of \( N \) training samples \( (x_i, y_i) \), where \( x_i \in \mathbb{R}^n \), \( y_i \in \mathbb{R}^m \) and \( i = 1, 2, \ldots, N \). Let the number of hidden neurons be denoted by \( \hat{N} \).

If \( g \) is the activation function, then \( g: \mathbb{R} \to \mathbb{R} \). The output of the system [17] can then be given as:

$$ \sum_{k=1}^{\hat{N}} \beta_k \, g(w_k \cdot x_i + b_k) = o_i, \quad i = 1, 2, \ldots, N $$
(1)

Here \( w_k \) is the weight vector connecting the \( k^{\text{th}} \) hidden neuron with the input nodes, \( \beta_k \) is the weight vector connecting the \( k^{\text{th}} \) hidden neuron to the output nodes, and \( b_k \) is the threshold bias of the \( k^{\text{th}} \) hidden neuron.

As mentioned before, the weight vectors are chosen randomly from a continuous probability distribution. A network with \( \hat{N} \) hidden neurons and activation function \( g: \mathbb{R} \to \mathbb{R} \) can approximate the \( N \) samples with zero error, so Eq. (1) can be written as:

$$ \sum_{k=1}^{\hat{N}} \beta_k \, g(w_k \cdot x_i + b_k) = y_i, \quad i = 1, 2, \ldots, N $$
(2)

Thus, we have,

$$ H\beta = Y $$
(3)

where,

$$ H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_{\hat{N}} \cdot x_1 + b_{\hat{N}}) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_{\hat{N}} \cdot x_N + b_{\hat{N}}) \end{bmatrix}_{N \times \hat{N}} $$
(4)
$$ \beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_{\hat{N}}^T \end{bmatrix}_{\hat{N} \times m} \quad \text{and} \quad Y = \begin{bmatrix} y_1^T \\ \vdots \\ y_N^T \end{bmatrix}_{N \times m} $$
(5)

The matrix H is the hidden-layer output matrix. The solution of the above system, as given by Huang et al. [17], is:

$$ \hat{\beta} = H^{\dagger} Y $$
(6)

where \( H^{\dagger} \) is the Moore-Penrose generalized inverse of the hidden-layer output matrix \( H \).
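Equations (1)-(6) translate into only a few lines of code. The following is a minimal sketch in Python/NumPy (the paper's own experiments use MATLAB); the sigmoid activation and the function names are our illustrative choices, not part of the original formulation:

```python
import numpy as np

def elm_train(X, Y, n_hidden, rng=None):
    """Minimal ELM training: random input weights/biases, then the
    analytic Moore-Penrose solution for the output weights (Eq. 6)."""
    rng = np.random.default_rng(rng)
    W = rng.standard_normal((X.shape[1], n_hidden))  # random input weights w_k
    b = rng.standard_normal(n_hidden)                # random biases b_k
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))           # hidden-layer output matrix (Eq. 4)
    beta = np.linalg.pinv(H) @ Y                     # beta_hat = H_dagger * Y (Eq. 6)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Forward pass: H * beta (Eq. 3)."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Because the only learned parameters are solved in closed form, training cost is dominated by one pseudo-inverse, which is what gives the ELM its millisecond-scale training times reported later in the paper.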

3 Viola-Jones Algorithm

The Viola-Jones algorithm is an object detection framework put forward by Paul Viola and Michael Jones in 2001 [20]. The Viola-Jones object detector is based on a binary classifier that produces a positive output when the search window contains the desired object and a negative output otherwise. The classifier is applied many times as the window slides over the image under test.
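The window-sliding scan described above can be sketched as follows; the window size and stride are illustrative values of our own choosing, not parameters taken from [20]:

```python
def sliding_windows(img_h, img_w, win, stride):
    """Yield the top-left corner (row, col) of every square window
    the detector would evaluate at one fixed scale."""
    for r in range(0, img_h - win + 1, stride):
        for c in range(0, img_w - win + 1, stride):
            yield r, c
```

In the full detector this scan is repeated over multiple scales, and the binary classifier is invoked once per yielded window.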

The binary classifier used in the algorithm is realized as several layers of a hierarchy, forming an ensemble classifier [21]. This classifier operates by classifying images based on the values of simple features, which is observed to be much faster than a system that bases classification on individual pixels [20]. The Viola-Jones algorithm employs three kinds of features, as described by Viola et al. in [20]: the two-rectangle feature, the three-rectangle feature, and the four-rectangle feature. The framework put forward by the authors has the following stages: (i) Haar feature selection, (ii) integral image generation, (iii) AdaBoost training, and (iv) cascading classifiers. This is represented as a flowchart in Fig. 1.

Fig. 1.

A depiction of the Viola-Jones Algorithm involving four important stages

Haar feature selection is computed through Haar basis functions built on the three feature types listed above: the pixels in adjacent rectangular areas are summed, and the difference between these sums is taken as the feature value. A depiction of Haar features relative to the corresponding detection window is shown in Fig. 2.

Fig. 2.

Depiction of rectangle features shown relative to the detection window.

The integral image is then created, which allows the rectangular features to be evaluated in constant time. Since the number of candidate features is very large, the AdaBoost (Adaptive Boosting) algorithm is used to select the best features and to train the classifiers that use them. This produces a "strong" classifier, viewed as a weighted linear combination of simple "weak" classifiers. Finally, in cascading, the strong classifiers are grouped into several stages, each of which determines whether a sub-window contains a face or not, as depicted in Fig. 3. The algorithm described here is implemented using a MATLAB built-in routine as described in [22].

Fig. 3.

Depiction of the working flow of classifiers in Viola-Jones algorithm
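Among the four stages shown in Fig. 1, integral image generation is what makes the rectangle features cheap to evaluate: any rectangle sum reduces to four table lookups. A small illustrative sketch in Python (the function names are ours, not from [20] or the MATLAB routine [22]):

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column prepended, so any
    rectangle sum needs exactly four lookups."""
    ii = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, r, c, h, w):
    """Sum of img[r:r+h, c:c+w] in O(1) via the integral image."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def two_rect_feature(ii, r, c, h, w):
    """A hypothetical two-rectangle Haar feature: left half minus right half."""
    half = w // 2
    return rect_sum(ii, r, c, h, half) - rect_sum(ii, r, c + half, h, half)
```

The three- and four-rectangle features follow the same pattern, combining three or four `rect_sum` calls with alternating signs.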

4 Histograms of Oriented Gradients

Histograms of oriented gradients (HOG) find application in the object and pattern recognition domain, as they are capable of extracting crucial information even from images obtained in cluttered environments [23]. HOG is therefore well suited to the facial recognition problem. Its feature extraction process is based on extracting information about the edges in local regions of a target image [23]. Simply put, HOG feature extraction is primarily the characterization of the orientation and magnitude values of the pixels in an image [24]; that is, it describes an image in terms of groups of local histograms computed over local regions of the image.

The HOG features can be visualized on a uniformly spaced grid of rose plots, whose dimensions depend on the cell and image sizes. Each rose plot depicts the distribution of gradient orientations within one HOG cell, and the length of each petal reflects the contribution of the corresponding orientation to the cell histogram. For the gradient directions, the plot indicates the directions normal to the edges. HOG feature extraction is implemented using a MATLAB built-in routine [25].

Thus, for a portion of an image with 9 cells (Fig. 4), the HOG feature extraction routine takes as input a block of (m × n) cells and arranges their histograms in a vector, as depicted in Fig. 5.

Fig. 4.

A portion of an image realized as equal-sized cells. Each cell consists of the pixel values of that portion of the image.

Fig. 5.

Depiction of the composition of the HOG feature vector. \( H(C_{ij}) \) represents the cell histogram at position \( (i, j) \)
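The per-cell histograms that make up the feature vector of Fig. 5 can be sketched as follows. This illustrates only the cell-histogram stage (no block normalization), written in Python/NumPy rather than as the MATLAB routine [25], with parameter names of our own choosing:

```python
import numpy as np

def hog_cell_histograms(img, cell=8, bins=9):
    """Gradients, then a 9-bin histogram of unsigned orientations
    (0-180 degrees) per cell, weighted by gradient magnitude."""
    img = img.astype(float)
    gy, gx = np.gradient(img)                        # per-pixel gradients
    mag = np.hypot(gx, gy)                           # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0     # unsigned orientation
    n_r, n_c = img.shape[0] // cell, img.shape[1] // cell
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    hist = np.zeros((n_r, n_c, bins))
    for i in range(n_r):
        for j in range(n_c):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            b = bin_idx[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            hist[i, j] = np.bincount(b.ravel(), weights=m.ravel(),
                                     minlength=bins)
    return hist
```

A full HOG implementation additionally groups neighbouring cells into overlapping blocks and L2-normalizes each block, which is the step that gives HOG its robustness to illumination changes.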

5 Algorithmic Description of the Proposed Scheme

The facial recognition methodology adopted in this work is depicted in Fig. 6. Zong et al. [7] compare the performance of one-against-all (OAA) and one-against-one (OAO) multi-class classification using ELM. Given a multi-label dataset consisting of \( \alpha \) different classes, the OAA methodology trains \( \alpha \) binary classifiers, each distinguishing one class from the remaining classes. In OAO, by contrast, one binary classifier is trained to distinguish each pair of classes, resulting in a total of \( \alpha(\alpha - 1)/2 \) binary classifiers. As per the results presented in [7, 8] and related work using Linear Discriminant Analysis (LDA) and multi-class Support Vector Machines (SVM) [12,13,14], OAA has been observed to outperform OAO. Hence, for the current work, we adopt the OAA ELM methodology for the facial recognition problem.

Fig. 6.

Block diagram depicting the face recognition approach employing the ELM in OAA mode along with the Viola-Jones object detection framework and HOG feature selection.
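The classifier counts discussed above can be checked with a short sketch. The +1/−1 target encoding shown is a common choice for OAA ELM classifiers; it is assumed here for illustration rather than taken from [7]:

```python
import numpy as np

def oaa_targets(labels, n_classes):
    """One-against-all target matrix: row i is +1 at the true class
    and -1 elsewhere, so one multi-output ELM covers all alpha classes."""
    Y = -np.ones((len(labels), n_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def oao_classifier_count(n_classes):
    """OAO trains one binary classifier per pair: alpha*(alpha-1)/2."""
    return n_classes * (n_classes - 1) // 2
```

For the 40-subject AT&T dataset, OAA therefore needs 40 binary decisions while OAO would need 780 pairwise classifiers, which is one practical reason to prefer OAA here.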

The face dataset is first split randomly into training and testing datasets, as shown in Fig. 6. To each image in the dataset, the Viola-Jones algorithm is applied to detect the region containing useful information about the subject's facial features; the implementation follows the MATLAB routine given in [22]. The extracted regions are resized to a uniform size of 64 × 64 pixels. This ensures that the number of HOG features extracted in the subsequent stage is the same for every image and that the processing that follows is uniform across all subjects. The HOG features returned by the MATLAB routine [25] form a row vector of 1764 features for the methodology adopted. All such row vectors corresponding to the images in the training dataset are stacked one over the other to form the final dataset fed to the ELM for multi-class classification, as depicted in Fig. 7. Once the ELM is trained, the images in the testing dataset are subjected to the same processing and fed to the trained ELM model for classification. The recognition rate is then determined as the ratio of correct hits to the total number of images in the testing dataset.

Fig. 7.

Depiction of the final dataset to be fed to an ELM, considering the AT&T database with an 80:20 split, giving 160 face images for the training dataset. [Note: the images are partitioned randomly into testing and training datasets; the depiction is for illustration only.]
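The figure of 1764 features quoted above is consistent with the default parameters of MATLAB's extractHOGFeatures (8 × 8-pixel cells, 2 × 2-cell blocks overlapping by one cell, 9 orientation bins) applied to a 64 × 64 crop; we assume those defaults here as a back-of-the-envelope check:

```python
def hog_feature_count(img_h, img_w, cell=8, block=2, bins=9):
    """Feature-vector length for HOG with overlapping blocks and a
    one-cell block stride (the defaults assumed for extractHOGFeatures)."""
    cells_r, cells_c = img_h // cell, img_w // cell                 # cells per axis
    blocks_r, blocks_c = cells_r - block + 1, cells_c - block + 1   # overlapping blocks
    return blocks_r * blocks_c * block * block * bins               # blocks * cells/block * bins

print(hog_feature_count(64, 64))  # 1764 = 7 * 7 * 4 * 9
```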

6 Experimental Results and Comparisons

The facial recognition scheme presented here has been tested on two standard face recognition datasets, AT&T [26] and YALE [27]. AT&T consists of ten different images of each of 40 distinct subjects, with varying lighting conditions and facial expressions, taken at different times. Each image has a dimension of 92 × 112 with 256 grey levels per pixel and is available in portable gray map (PGM) format. The YALE dataset, on the other hand, consists of 11 images of each of 15 distinct individuals, with different facial expressions, varying lighting conditions, and miscellaneous eyewear. Each image in the YALE dataset has a dimension of 243 × 320 with 256 grey levels per pixel; these are available in graphics interchange format (GIF). Sample face images from both datasets are shown in Fig. 8.

Fig. 8.

The first row depicts face samples of a particular subject from the AT&T database, while the second and third rows depict face samples of another subject from the YALE dataset

For each dataset, all images were split into training and testing datasets using 80:20 and 70:30 splitting ratios. The simulations were carried out using MathWorks MATLAB 9.4 running on Windows 10 Home Edition with 8 GB of memory and an Intel i5-7300HQ (2.50 GHz) processor. The results are tabulated in Tables 1 and 2 and depicted as curves in Fig. 9.

Table 1. Effect of the number of hidden neurons on the recognition rate and testing times for the AT&T dataset under 80:20 and 70:30 splitting ratios.
Table 2. Effect of the number of hidden neurons on the recognition rate and testing times for the YALE dataset under 80:20 and 70:30 splitting ratios
Fig. 9.

Curves depicting the effect of hidden neurons on recognition rate, and testing time for the two datasets under 80:20 and 70:30 splitting ratios

The recognition rates in Tables 1 and 2, and their depiction in Fig. 9, correspond to the average recognition rate obtained over 20 iterations. This averaging is done to smooth out the error that arises from the limited generalization capability of the ELM: the ELM may operate on a millisecond scale, as Tables 1 and 2 indicate, but its generalization suffers because the weights between the input and hidden layers are allocated randomly. The results show that as the number of hidden neurons increases, the recognition rate crosses the 90% mark; the price is paid in training time, which increases but still remains on the scale of a few milliseconds.

It is clear from the results compiled in Tables 1 and 2 that the 80:20 split performs better than the 70:30 split. A recognition rate of more than 90%, together with better training and testing times, is obtained for the YALE dataset. On this basis, it is suggested that images captured at different orientations and under varying illumination be stored in the GIF file format. In both cases, the optimal number of hidden neurons comes out to be slightly more than 250: the recognition rate and the computed testing time first dip before L = 250 and then rise after this value, a pattern observed for both datasets. The testing time is observed to vary inversely with the recognition rate. Around L = 250, the testing times (in milliseconds) peak and then dip to a lower value for both datasets. In our view, therefore, L = 250 is the optimal value, at which both the recognition rate (%) and the testing time (milliseconds) are better for the YALE dataset than for the AT&T dataset.

In order to evaluate the performance of the facial recognition scheme presented here, we compare our results with some state-of-the-art methods for the two datasets in Tables 3 and 4.

Table 3. Comparison of recognition rates and running times for some state-of-the-art facial recognition techniques on the AT&T dataset with 1000 hidden neurons
Table 4. Comparison of recognition rates and running times for some state-of-the-art facial recognition techniques on the YALE dataset with 1000 hidden neurons

A close observation of the data compiled in Tables 3 and 4 yields a similar pattern. Our results are better placed than the existing methods cited in this paper, particularly for the YALE dataset, which we attribute primarily to the use of the GIF image file format. Additionally, the computed testing times are also found to be better than those reported by other research groups. We therefore conclude that our facial recognition technique not only gives better results in terms of recognition rate (%), but also has testing times in the millisecond domain, suggesting that all the necessary procedures (pre-processing, feature extraction, and classification of images) are carried out in real time. This is possible due to the combination of several existing algorithms in this work: the Viola-Jones algorithm for object detection, HOG-based feature selection, and the Extreme Learning Machine (ELM) for pattern classification. This combination brings in the desirable novelty of the proposed facial recognition technique.

7 Conclusions

A novel facial recognition technique working in the real-time domain has been proposed in this work. The technique combines the existing Viola-Jones algorithm for object detection, Histogram of Oriented Gradients (HOG) based feature extraction, and a single-hidden-layer feedforward neural network commonly known as the Extreme Learning Machine (ELM). Two image datasets, AT&T and YALE, each containing several hundred images at different orientations and under varying illumination levels, were considered. The ELM was found to carry out successful classification on both datasets. Our technique, however, gives better results on the YALE dataset than other similar techniques cited in this paper. We conclude that the better results are primarily due to the GIF image file format used in YALE and to the fast processing carried out by the Viola-Jones algorithm together with the HOG feature selection procedure, further supplemented by the extremely fast (millisecond-scale) classification carried out by the ELM. Overall, a very high recognition rate (%) is achieved on a millisecond time scale. We therefore conclude that the proposed facial recognition technique outperforms several other similar schemes, particularly on the YALE dataset.