1 Introduction

A face detection system locates human face regions, and their sizes, in a given digital image. A face recognition system relies heavily on a good face detection system: it can recognize faces in an image only after the face detection system has identified the face regions in it. Recently, Jun et al. [5] developed a face detection system based on the cascaded face detector framework. They used the ADABOOST algorithm [3] to select the best LBP features and their locations on the face images. For each detector of the cascade, LBP features were selected and added to the detector until a detection rate of nearly 100% and a false alarm rate of around 50% were achieved on the test face/non face images. They then repeated their work with the LBP features replaced by LGP features, which improved on their LBP based face detector. Finally, they repeated the experiment using a hybrid of LBP, LGP and BHOG descriptors: in each stage of the ADABOOST feature selection based face detector, they selected a highly discriminative mixture of LBP, LGP and BHOG features on the face images. LBP is invariant to global illumination changes, LGP is invariant to local illumination changes, and BHOG captures larger facial parts such as the nose and eyes. Combining these features let them build a better face detection system than their previous ones.

Fig. 1.

(right) The last row shows a synthetic signal obtained as the sum of the five sinusoids above it. (left) The Stockwell transform based time frequency distribution characterizes the synthetic signal accurately along both the time and frequency axes.

Fig. 2.

(top plot) A 1D signal with sharp edges. (bottom plot) The edges are detected accurately by convolving the signal with the 1D LDWT kernel.

2 The Stockwell Transform and the Log Dyadic Wavelet Transform

The Stockwell transform (ST) [1] yields a time frequency distribution (TFD) of a signal that is more accurate than the TFDs obtained through traditional transforms such as the short time Fourier transform (STFT) and the Gabor transform (Fig. 1 shows the results of our experiments demonstrating this). Likewise, the log dyadic wavelet transform (LDWT) based representation of image signals [4] is more accurate than the representation obtained through the traditional discrete wavelet transform (Fig. 2 shows the results of our experiments demonstrating this). Hence we use the ST and the LDWT to represent image features in our face detection system.
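To make the ST concrete, the following is a minimal NumPy sketch of the discrete Stockwell transform of a 1D signal, using its standard frequency-domain formulation; the signal length and window parameters here are generic and are not the paper's image-region settings.

```python
import numpy as np

def stockwell_transform(x):
    """Discrete Stockwell transform of a real 1D signal.

    Frequency-domain formulation: for each voice n, the spectrum shifted
    by n is weighted by a frequency-scaled Gaussian and inverse
    transformed. Returns an (N//2 + 1, N) complex matrix whose rows are
    frequencies 0..N/2 and whose columns are time samples.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    X = np.fft.fft(x)
    m = np.arange(N)
    S = np.zeros((N // 2 + 1, N), dtype=complex)
    S[0, :] = x.mean()  # zero-frequency voice is the signal mean
    for n in range(1, N // 2 + 1):
        # Gaussian window in the frequency domain, with circular wrap-around
        G = (np.exp(-2.0 * np.pi ** 2 * m ** 2 / n ** 2)
             + np.exp(-2.0 * np.pi ** 2 * (m - N) ** 2 / n ** 2))
        S[n, :] = np.fft.ifft(np.roll(X, -n) * G)
    return S
```

For a pure sinusoid, the magnitude of the returned matrix peaks along the row matching the sinusoid's frequency, which is the joint time-frequency localization that Fig. 1 illustrates.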

3 Proposed Method

Our face detection system consists of two stages. The first stage is a cascade of 4 face detectors, each constructed from highly discriminative Stockwell transform based feature classifiers; the second stage is a cascade of 4 more face detectors, each an SVM classifier trained on LDWT coefficients of face/non face training images. Each face detector is constructed to have a 99.5% face detection rate and a 50% false alarm rate. Given a sample image at the input of the face detection system, each detector rejects non face regions in it and forwards the probable face regions to the next face detector in the cascade, so that at the output the system localizes the face regions in the input image.
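The cascade logic described above can be sketched as follows; the detectors here are placeholder threshold functions, not the paper's ST or SVM classifiers. The sketch also shows why the per-detector rates compound favorably: with 8 detectors at a 99.5% detection rate and 50% false alarm rate each, the full cascade retains about 0.995^8 ≈ 96% of true faces while passing only 0.5^8 ≈ 0.4% of non-face windows.

```python
def cascade_detect(windows, detectors):
    """Run candidate windows through a cascade of detectors.

    `detectors` is a list of callables returning True for "probably a
    face". A window reaches the output only if every detector in the
    cascade accepts it; each stage discards some non-face windows.
    """
    survivors = list(windows)
    for detect in detectors:
        survivors = [w for w in survivors if detect(w)]
        if not survivors:
            break  # everything rejected; stop early
    return survivors

# Toy usage: each "detector" thresholds a score attached to a window.
windows = [0.1, 0.4, 0.55, 0.8, 0.95]
detectors = [lambda w, t=t: w > t for t in (0.2, 0.5, 0.7)]
print(cascade_detect(windows, detectors))  # -> [0.8, 0.95]
```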

3.1 Construction of Stage 1 Face Detectors

Highly discriminative Stockwell transform based features (Stockwell transforms of 3\(\,\times \,\)3 and 5\(\,\times \,\)5 size regions of the face/non face training image samples) are selected as classifiers using the ADABOOST feature selection method when constructing the face detectors of stage 1. The parameters of the Stockwell transform are set such that a 3\(\,\times \,\)3 size image signal yields a 3\(\,\times \,\)9 size TFD plot and a 5\(\,\times \,\)5 size image signal yields a 7\(\,\times \,\)25 size TFD plot. The following are the Stockwell transform based features from which the highly discriminative ones are selected as feature classifiers of the face detector.

  1.

    The Stockwell transform TFD \(ST3_{Fi}(x,y)\) of the 3\(\,\times \,\)3 size image region around every pixel (x,y) of a face image sample is computed (at x = 2, 3, ..., 21 and y = 2, 3, ..., 23) for all face image samples i = 1, 2, ..., 16000. Figure 3 shows some of the 3\(\,\times \,\)3 size image regions (at locations (2,2), (2,7), (2,21), (23,2) and (23,21) of an illustrative face image Fi) whose Stockwell transform TFDs \(ST3_{Fi}(2,2)\), \(ST3_{Fi}(2,7)\), \(ST3_{Fi}(2,21)\), \(ST3_{Fi}(23,2)\) and \(ST3_{Fi}(23,21)\) are computed.

  2.

    Similarly, the Stockwell transform \(ST5_{Fi}(x,y)\) of the 5\(\,\times \,\)5 size image region around every pixel (x,y) of a face image sample is computed (at x = 3, 4, ..., 20 and y = 3, 4, ..., 22) for all face image samples i = 1, 2, ..., 16000.

  3.

    The Stockwell transform \(ST3_{NFi}(x,y)\) of the 3\(\,\times \,\)3 size image region around every pixel (x,y) of a non face image sample is computed at x = 2, 3, ..., 21 and y = 2, 3, ..., 23, for non face image samples i = 1, 2, ..., 16000.

  4.

    The Stockwell transform \(ST5_{NFi}(x,y)\) of the 5\(\,\times \,\)5 size image region around every pixel (x,y) of a non face image sample is computed at x = 3, 4, ..., 20 and y = 3, 4, ..., 22, for non face image samples i = 1, 2, ..., 16000. Note that \(ST3_{Fi}(x,y)\), \(ST5_{Fi}(x,y)\), \(ST3_{NFi}(x,y)\) and \(ST5_{NFi}(x,y)\) are all vectors (the flattened TFDs of size 3\(\,\times \,\)9 and 7\(\,\times \,\)25, respectively).
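The per-pixel feature computation in steps 1-4 amounts to sweeping a small window over each training sample; the sketch below illustrates that sweep. A placeholder FFT magnitude stands in for the actual Stockwell TFD of a patch, and the generic border handling here does not reproduce the paper's exact x and y ranges.

```python
import numpy as np

def extract_patches(img, size):
    """Collect every size x size patch whose center keeps it inside img.

    Returns a dict mapping the center pixel (x, y) to a feature vector
    for that patch. The per-patch transform (FFT magnitude of the
    flattened patch) is an illustrative stand-in for the Stockwell TFD,
    not the paper's exact operator.
    """
    r = size // 2
    h, w = img.shape
    feats = {}
    for x in range(r, h - r):
        for y in range(r, w - r):
            patch = img[x - r:x + r + 1, y - r:y + r + 1]
            feats[(x, y)] = np.abs(np.fft.fft(patch.ravel()))
    return feats
```

Running this once per training image with size 3 and once with size 5 yields the \(ST3\) and \(ST5\) feature maps described above.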

Fig. 3.

The 3\(\,\times \,\)3 size image regions used in computing ST based features.

Apart from these, our method uses the following features during face detector construction.

  1.

    The mean Stockwell transform feature \(mean3_{F}(x,y)\) of the 3\(\,\times \,\)3 size image regions of the face image samples at location (x,y) is computed as \({\sum ST3_{Fi}(x,y)}\)/16000, where the summation runs over i = 1, 2, ..., 16000. This mean is computed at all locations x = 2, 3, ..., 21 and y = 2, 3, ..., 23, using the face image samples. Figure 3 shows some of the mean features computed at locations (2,2), (2,7), (2,21), (23,2) and (23,21), i.e. \(mean3_{F}(2,2)\), \(mean3_{F}(2,7)\), \(mean3_{F}(2,21)\), \(mean3_{F}(23,2)\) and \(mean3_{F}(23,21)\).

  2.

    The mean Stockwell transform feature \(mean3_{NF}(x,y)\) of the 3\(\,\times \,\)3 size image regions of the non face image samples is computed at location (x,y) as \({\sum ST3_{NFi}(x,y)}\)/16000, where i = 1, 2, ..., 16000. This mean is computed at all locations x = 2, 3, ..., 21 and y = 2, 3, ..., 23, using the non face image samples.

  3.

    Similarly, \(mean5_{F}(x,y)\) and \(mean5_{NF}(x,y)\) are computed using the \(ST5_{Fi}(x,y)\) and \(ST5_{NFi}(x,y)\) features respectively (at x = 3, 4, ..., 20 and y = 3, 4, ..., 22). Note that \(mean3_{F}(x,y)\), \(mean5_{F}(x,y)\), \(mean3_{NF}(x,y)\) and \(mean5_{NF}(x,y)\) are all vectors.

  4.

    \( Dist3_{Fi}(x,y)= \frac{chi\, square\, distance\, of\, ST3_{Fi}(x,y) \,from \,mean3_{F}(x,y)}{chi\, square\, distance\, of\, ST3_{Fi}(x,y)\, from\, mean3_{NF}(x,y)}\) is the ratio of the chi square distance between the Stockwell transform feature \(ST3_{Fi}(x,y)\) of face image Fi and the mean facial Stockwell transform feature \(mean3_{F}(x,y)\), to the chi square distance of \(ST3_{Fi}(x,y)\) from the mean non facial Stockwell transform feature \(mean3_{NF}(x,y)\).

  5.

    \(Dist3_{NFi}(x,y)= \frac{chi\, square\, distance\, of\, ST3_{NFi}(x,y)\, from\, mean3_{F}(x,y)}{chi\, square\, distance\, of\, ST3_{NFi}(x,y)\, from\, mean3_{NF}(x,y)}\) is the corresponding ratio for non face image NFi: the chi square distance of its Stockwell transform feature \(ST3_{NFi}(x,y)\) from the mean facial feature \(mean3_{F}(x,y)\), over its chi square distance from the mean non facial feature \(mean3_{NF}(x,y)\). \(Dist5_{Fi}(x,y)\) and \(Dist5_{NFi}(x,y)\) are defined likewise for the 5\(\,\times \,\)5 features. Note that \(Dist3_{Fi}(x,y)\), \(Dist3_{NFi}(x,y)\), \(Dist5_{Fi}(x,y)\) and \(Dist5_{NFi}(x,y)\) are all scalars.

A low value of \(Dist3_{Fi}(x,y)\) indicates that the Stockwell transform feature at (x,y) is a dominant feature of face images and a recessive feature of non face images. A large value of \(Dist3_{NFi}(x,y)\) indicates that the feature at (x,y) is a dominant feature of non face images and a recessive feature of face images. Locations (x,y) with a small \(Dist3_{Fi}(x,y)\) and a large \(Dist3_{NFi}(x,y)\) (over all i = 1, 2, ..., 16000) are therefore the best candidates for feature classifiers, as they are capable of distinguishing face image regions from non face image regions. This is exactly what our ADABOOST feature selection method exploits during face detector construction.
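The distance ratios of steps 4 and 5 can be sketched as below. The chi square form used here, 0.5·Σ(a−b)²/(a+b), is the common histogram-comparison variant and is an assumption, since the paper does not spell out its exact formula.

```python
import numpy as np

def chi_square(a, b, eps=1e-12):
    """Chi-square distance between two non-negative feature vectors.

    The 0.5 * sum((a-b)^2 / (a+b)) form is the usual histogram variant;
    eps guards against division by zero for all-zero bins.
    """
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps))

def dist_ratio(feat, mean_face, mean_nonface):
    """Ratio used in the paper: distance to the face mean over distance
    to the non-face mean. Small values mean the feature is face-like."""
    return chi_square(feat, mean_face) / chi_square(feat, mean_nonface)
```

A feature vector close to the face mean and far from the non-face mean yields a ratio well below 1, which is exactly the "small \(Dist3_{Fi}\)" condition that marks a good candidate classifier.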

Construction of the First Face Detector of Stage 1: Our method uses the cascade of face detectors framework based on the ADABOOST feature selection method, as used by Viola and Jones [2]. Each face detector in the cascade performs face detection with around a 99.5% detection rate and a 50% false alarm rate. Here we explain the procedure followed in constructing the first face detector of the cascade. Algorithm 1, calling Algorithms 2 and 3 in each iteration, constructs the face detector: it goes through several iterations, selecting the next most discriminative Stockwell transform feature classifier in each one, and builds the face detector from the selected classifiers. Enough feature classifiers are selected for the detector to achieve a 99.5% face detection rate and a 50% false alarm rate. Using the Stockwell transform features explained above, Algorithms 1, 2 and 3 construct the face detector as follows.

Algorithm 1 (figure a)
Algorithm 2 (figure b)
Algorithm 3 (figure c)
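Since Algorithms 1-3 appear only as figures here, the following is a generic sketch of the stump-based ADABOOST selection loop they implement in spirit. Thresholds over scalar feature scores (e.g. the Dist ratios above) stand in for the paper's exact weak classifiers, so this is an illustration of the technique rather than a transcription of the algorithms.

```python
import numpy as np

def adaboost_select(features, labels, n_rounds):
    """Simplified AdaBoost stump selection (Viola-Jones style).

    features: (n_samples, n_features) scalar feature scores.
    labels:   +1 for face, -1 for non-face samples.
    Each round exhaustively picks the (feature, threshold, polarity)
    stump with the lowest weighted error, then reweights the samples so
    the next round focuses on the mistakes.
    """
    n, d = features.shape
    w = np.full(n, 1.0 / n)       # uniform sample weights to start
    chosen = []
    for _ in range(n_rounds):
        best = None
        for j in range(d):
            for thr in np.unique(features[:, j]):
                for sign in (1, -1):
                    pred = np.where(sign * (features[:, j] - thr) > 0, 1, -1)
                    err = w[pred != labels].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)     # avoid log(0)
        alpha = 0.5 * np.log((1 - err) / err)     # stump weight
        pred = np.where(sign * (features[:, j] - thr) > 0, 1, -1)
        w = w * np.exp(-alpha * labels * pred)    # upweight mistakes
        w = w / w.sum()
        chosen.append((j, thr, sign, alpha))
    return chosen
```

In the paper's setting, the loop would keep adding selected feature classifiers until the resulting detector reaches the 99.5% detection / 50% false alarm operating point on the training samples.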
Fig. 4.

The detection performance on (a) the FDDB dataset (top two rows) and (b) the CMU-MIT dataset (middle row). (c) The output of each of the 8 face detectors of the cascade on a rotated CMU-MIT image, together with the detected image regions.

Fig. 5.

The ROC curves of the proposed method on the CMU-MIT (left) and FDDB (right) datasets.

After the construction of the first face detector, the non face training samples are classified (using Algorithm 3) and the misclassified samples are collected. The second face detector of the stage 1 cascade is then constructed using the full set of face training samples together with these misclassified non face training samples. In this way the 4 detectors of the first stage are constructed.

3.2 Stage 2

Log dyadic wavelet transform (LDWT) features of the training samples are used to construct the 4 face detectors of stage 2. An SVM classifier is trained on the LDWT features until it achieves a 99.5% detection rate and a 50% false alarm rate on the training samples. While estimating this performance, the misclassified non face training samples are collected, and the construction of the second face detector of this stage proceeds using them together with the full set of face training samples. This procedure is followed until all 4 face detectors of the stage are constructed.
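The stage 2 bootstrapping loop can be sketched as follows, with a simple perceptron standing in for the SVM trainer (an assumption made purely to keep the sketch dependency-free; any max-margin linear trainer slots in the same way). The key point is that each detector trains on all face samples plus only the non face samples misclassified by its predecessor.

```python
import numpy as np

def train_stage(X_face, X_nonface, n_detectors=4, epochs=50, lr=0.1):
    """Hard-negative-mining cascade sketch.

    A perceptron (stand-in for the paper's SVM) is trained per detector;
    the non-face samples it still scores positive are the "hard"
    negatives passed to the next detector's training set.
    """
    detectors = []
    negs = X_nonface
    for _ in range(n_detectors):
        X = np.vstack([X_face, negs])
        y = np.concatenate([np.ones(len(X_face)), -np.ones(len(negs))])
        Xb = np.hstack([X, np.ones((len(X), 1))])  # append bias column
        w = np.zeros(Xb.shape[1])
        for _ in range(epochs):                    # perceptron updates
            for xi, yi in zip(Xb, y):
                if yi * (w @ xi) <= 0:
                    w += lr * yi * xi
        detectors.append(w)
        scores = np.hstack([negs, np.ones((len(negs), 1))]) @ w
        negs = negs[scores > 0]   # misclassified non-faces survive
        if len(negs) == 0:
            break                 # no hard negatives left to mine
    return detectors
```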

4 Experiments and Data Set Preparation

Using samples from the ORL, LFW, FERET, ABERDEEN, PIE and MIT face recognition datasets, we formed our training and testing face samples (the cropped face images include variations in scale, rotation, illumination and resolution). We obtained 16000 training and 16000 testing face samples. 32000 non face samples were collected from the MIT dataset, of which 16000 were reserved for training and the rest for testing. We conducted face detection experiments on the CMU-MIT dataset (both the rotated and the normal versions) and the FDDB dataset; the qualitative and quantitative results are shown in Figs. 4 and 5. Given a test sample, the four face detectors of stage 1 removed the easier non face regions from the sample and forwarded the remaining regions to the next detector of the cascade. The tougher image regions (those containing face-like non face patterns) were removed from the test sample by the 4 face detectors of stage 2. Each face detector of the cascade checked for the presence of face regions at a given location of the test sample at 8 different scales.
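The multi-scale scan at each location can be sketched as below; the base window size, scale factor and stride are assumptions, since the paper only states that each detector checks 8 scales per location.

```python
def sliding_windows(img_h, img_w, base=24, n_scales=8, factor=1.25, step=4):
    """Enumerate candidate square windows (x, y, size) at n_scales scales.

    The base size (24), scale factor (1.25) and stride (4) are
    illustrative choices, not values given in the paper. Scales whose
    window no longer fits inside the image contribute no windows.
    """
    windows = []
    size = float(base)
    for _ in range(n_scales):
        s = int(round(size))
        for x in range(0, img_h - s + 1, step):
            for y in range(0, img_w - s + 1, step):
                windows.append((x, y, s))
        size *= factor
    return windows
```

Every window this produces would be classified by the full 8-detector cascade; the surviving windows are reported as detected face regions.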

5 Conclusion

We have developed a face detection system following the cascade of detectors model, using Stockwell transform and log dyadic wavelet transform feature representations of images. Using these features, the system classifies face and non face samples. Our experiments on the FDDB and CMU-MIT face detection datasets show performance comparable with state of the art methods.