Evaluation of different chrominance models in the detection and reconstruction of faces and hands using the growing neural gas network
Abstract
Physical traits such as the shape of the hand and face can be used for human recognition and identification in video surveillance systems and in biometric authentication smart card systems, as well as in personal health care. However, the accuracy of such systems suffers from illumination changes, unpredictability, and variability in appearance (e.g. occluded faces or hands, cluttered backgrounds, etc.). This work evaluates different statistical and chrominance models in different environments with increasingly cluttered backgrounds, where changes in lighting are common and no occlusions are applied, in order to obtain a reliable neural network reconstruction of faces and hands, without taking into account the structural and temporal kinematics of the hands. First, a statistical model is used for skin colour segmentation to roughly locate hands and faces. Then, a neural network is used to reconstruct the hands and faces in 3D. For the filtering and the reconstruction we have used the growing neural gas algorithm, which can preserve the topology of an object without restarting the learning process. Experiments were conducted on our own database as well as on four benchmark databases (Stirling's, Alicante, Essex, and Stegmann's) and on normal 2D videos of deaf individuals that are freely available from the BSL SignBank dataset. Results demonstrate the validity of our system in solving problems of face and hand segmentation and reconstruction under different environmental conditions.
Keywords
Expectation maximisation (EM) algorithm · Colour models · Self-organising networks · Shape modelling

1 Introduction
Over the last decades, there has been increasing interest in using neural networks and computer vision techniques to allow users to directly explore and manipulate objects in a natural and intuitive environment without the use of electromagnetic tracking systems. Such a sensor-free human–machine interaction system is simpler, more flexible and therefore more adaptable for a broad range of applications in all aspects of life in a modern society: from gaming and robotics to medical tasks. Considering recent progress in the computer vision field, there has been increasing interest in the medical domain [55], especially in relation to screening or assessment of acquired neurological impairments associated with motor changes in older individuals, such as dementia, stroke and Parkinson's disease. Deploying hand gesture recognition or hand trajectory tracking systems is one of the most practical approaches, owing to their natural and intuitive quality. Moreover, with the recent rise of non-intrusive sensors (e.g. Microsoft Kinect, Leap Motion), gesture recognition and face detection have added an extra dimension to human–machine interaction. However, the images captured of hand gestures, which are effectively a 2D projection of a 3D object, can become highly complex for any recognition system. Systems that follow a model-based method [1, 46] require an accurate 3D model that efficiently captures the hand's articulation in terms of its high degrees of freedom (DOF) and elasticity. The main drawback of a model-based method is that it requires massive calculations, making it unrealistic for real-time implementation. Since this method is too complicated to implement, the most widespread alternative is the feature-based method [26], where features such as the geometric properties of the hand or face are analysed using either neural networks (NNs) [47, 52] or stochastic models such as hidden Markov models (HMMs) [11, 49]. This is feasible because of the emergence of cheap 3D sensors capable of providing a real-time data stream, enabling feature-based computation of three-dimensional environment properties such as curvature, an approach closer to human learning procedures.
Another approach for both faces and hands is to use a skin colour classifier [27]. Colour processing is much faster than processing other facial features, and under certain lighting conditions colour is orientation invariant. It reduces the search space for human targets by segmenting images into skin and non-skin regions based on pixel colour. However, tracking human faces using colour as a feature has several problems: the colour representation of a face obtained by a camera is influenced by many factors (luminance, object movement, etc.); different cameras produce significantly different colour values, even for the same person under the same lighting conditions; and skin colour differs from person to person. Nevertheless, many researchers have worked with skin colour segmentation. Ghazali et al. [17] proposed a Gaussian skin model in YCgCr colour space for detecting human faces. However, the technique still produces false positives in complex backgrounds. Subasic et al. [45] used mean shift and AdaBoost to segment and label human faces in complex background images. However, the image database they used for evaluation consists of images containing just a single frontal face. Khan et al. [24] noted that the detection rate depends on skin colour selection, which can be improved by using an appropriate lighting correction algorithm. Zakaria and Suandi [54] reported that skin colour detection failure due to illumination effects increases the false positive rate. Additionally, the processing time also increases, because many face candidates are sent to the classifier for verification purposes. Yan et al. [51] proposed a hierarchy based on a structure model and a structural support vector machine (SVM) in learning to handle global variation in the appearance of human faces. However, a hierarchical structure approach in face detection architecture needs integration between one or more classifiers, and this increases the overall processing time.
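As a concrete illustration of such a pixel-wise skin classifier (a sketch only; the thresholds below are a commonly quoted heuristic range, not values taken from any of the works cited above), candidate skin regions can be obtained by thresholding in a luminance–chrominance space such as YCrCb:

```python
import cv2
import numpy as np

def skin_mask_ycrcb(bgr_image):
    """Rough skin/non-skin segmentation by thresholding in YCrCb.

    The Cr/Cb bounds are a widely used heuristic range, given only to
    illustrate how a pixel-wise colour classifier reduces the search
    space to candidate skin regions before any further processing.
    """
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 133, 77], dtype=np.uint8)    # Y, Cr, Cb lower bounds
    upper = np.array([255, 173, 127], dtype=np.uint8)  # Y, Cr, Cb upper bounds
    return cv2.inRange(ycrcb, lower, upper)            # 255 = skin candidate

# Example: mask = skin_mask_ycrcb(cv2.imread('frame.png'))
```

Fixed thresholds of this kind are fast but brittle under illumination change, which is precisely why the statistical models evaluated in this paper are preferable.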
In order to use colour as a feature for face or hand tracking, we have to solve these problems. In the learning framework, the initialisation of the object is crucial. The main approach is to find a suitable means of segmentation that separates the object of interest from the background. While a great deal of research has focused on efficient detectors and classifiers, little attention has been paid to efficiently acquiring and labelling suitable training data. One method is to partition the image into regions, where each region of interest is spatially contiguous and its pixels are homogeneous with respect to predefined criteria. However, the segmentation process itself may be time-consuming, as it is usually performed manually [3]. Obtaining a set of training examples automatically is a more difficult task. Existing approaches to minimising labelling effort [28, 33, 39, 40] use a classifier which is trained on a small number of examples. The classifier is then applied to a training sequence, and the detected patches are added to the previous set of examples. However, to learn a model of feature position and appearance, a large number of hand-labelled face images (e.g. 10,000) is needed. A further disadvantage of these approaches is that either manual initialisation [21] or a pre-trained classifier is needed to initialise the learning process. With a sequence of images, these disadvantages can be avoided by using an incremental model.
In this work, we are interested in an accurate initialisation of the neural network model on the first frame. If an accurate segmentation from the background is achieved, so that the regions representing the foreground (the objects of interest) can be classified by the learning model, then the network preserves the topology in consecutive frames and its adaptation is accelerated. The key to successful segmentation lies in reducing meaningless image data. We achieve automatic segmentation by taking into consideration that human skin has a relatively unique colour, and we apply appropriate parametric skin distribution modelling. Although the use of the SOM-based techniques of neural gas (NG) [31], growing cell structures (GCS) [13] and growing neural gas (GNG) [14] for various data inputs has already been studied and successful results have been reported [8, 19, 20, 37, 44, 46], some limitations still persist. Most of these studies have assumed noise-free environments and low-complexity distributions. Therefore, applying these methods to challenging real-world data obtained using noisy 2D^{1} and 3D^{2} sensors is the main focus of our study. These particular non-invasive sensors, which are typical contemporary technology, have been used in the associated experiments.
The remainder of this paper is organised as follows. Section 2 presents the initialisation of the object using probabilistic colour models. Section 3 provides a description of the original GNG algorithm and the modifications for topology preservation in 3D. In Sect. 4, a set of experimental results is presented for various datasets before conclusions are drawn in Sect. 5.
2 Approach and methodology
The growing neural gas (GNG) [14] is an incremental neural model able to learn the topological relations of a given set of input patterns by means of competitive Hebbian learning [30]. Unlike other methods, the incremental character of this model avoids the need to specify the network size in advance. Instead, starting from a minimal network size, a growth process takes place in which new neurons are inserted successively using a particular type of vector quantisation [14]. With this approach, however, the problem of background modelling takes centre stage: the goal is to segment the scene into the background, i.e. the irrelevant part of the scene, and the foreground. If the model is accurate, the regions that represent the foreground (the objects of interest) can then be extracted. This problem is central to our work, since we are interested in setting the initial frame for the GNG algorithm.
2.1 Background modelling
We subdivide background modelling methods into two categories: (1) background subtraction methods; and (2) statistical methods. In background subtraction methods, the background is modelled as a single image, and the segmentation is estimated by thresholding the difference between the background image and the current input image. Background subtraction can be performed either with a frame-differencing approach or with a pixel-wise average or median filter over n frames. In statistical methods, a statistical model describing the background is estimated for each pixel: the greater the variance of the pixel values, the more a multimodal estimate is needed. In the evaluation stage of the statistical models, the pixels in the input image are tested for consistency with the estimated model. The most well-known statistical models are the eigenbackgrounds [9, 34], the single Gaussian model (SGM) [6, 50] and Gaussian mixture models (GMM) [12, 42].
The methods based on background subtraction are limited in more complicated scenarios. For example, if the foreground objects have similar colour to the background, these objects cannot be detected by thresholding. Furthermore, these methods only adapt to minor changes in environmental conditions. Changes such as turning on the light cannot be captured by these models. In addition, these methods are limited to segmenting the whole object from the background, although for many tasks, such as face recognition, gesture tracking, etc., specific parts need to be detected. Since most image sources (i.e. cameras) provide colour images, we can use this additional information in our model for the segmentation of the first image. This information can then be stored in the network structure and used to detect changes between consecutive frames.
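For illustration, the following minimal numpy sketch shows the background-subtraction category described above, using a pixel-wise median background model and simple thresholding (the threshold value is an arbitrary example, not a value used in this paper):

```python
import numpy as np

def background_median(frames):
    """Model the background as the pixel-wise median of a list of
    greyscale frames (each a 2D uint8 array)."""
    return np.median(np.stack(frames, axis=0), axis=0)

def foreground_mask(frame, background, threshold=25):
    """Threshold the absolute difference between the current frame and
    the background image; True marks foreground pixels."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > threshold
```

As the surrounding text notes, such a model fails when foreground and background colours coincide and cannot follow sudden illumination changes, which motivates the statistical colour models below.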
2.1.1 Probabilistic colour models: single Gaussian
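The single Gaussian model represents the skin class by the sample mean and covariance of chrominance vectors taken from labelled skin pixels; a pixel is then classified by thresholding its likelihood under this Gaussian. The following minimal numpy sketch illustrates the idea (function names are ours, and the chrominance space is assumed to be, e.g., rg or CbCr):

```python
import numpy as np

def fit_single_gaussian(skin_pixels):
    """Fit an SGM: skin_pixels is an (N, d) array of chrominance
    vectors sampled from labelled skin regions."""
    mu = skin_pixels.mean(axis=0)
    cov = np.cov(skin_pixels, rowvar=False)
    return mu, cov

def skin_likelihood(pixels, mu, cov):
    """Gaussian density for each row of an (M, d) pixel array;
    thresholding this likelihood yields the skin mask."""
    d = mu.size
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    diff = pixels - mu
    mahal = np.einsum('ij,jk,ik->i', diff, inv, diff)  # Mahalanobis distances
    return norm * np.exp(-0.5 * mahal)
```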
2.1.2 Probabilistic colour models: Gaussian mixture model

Firstly, the variance caused by the intensity is removed. This is achieved by normalising the data or by transforming the original pixel values into a different colour space (e.g., rg colour space [38] or HSV colour space [35]).

Secondly, a colour histogram is computed, which is used to estimate an initial mixture model.

Finally, a Gaussian mixture model is estimated, which can be done efficiently by applying the iterative EM algorithm [10]. A sketch of the first two steps follows this list; the EM updates themselves are derived below.
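The first two steps can be sketched as follows (a minimal numpy illustration; the function names and the mode-based initialisation are ours, chosen as one simple way of seeding a mixture from a colour histogram):

```python
import numpy as np

def to_rg_chromaticity(rgb):
    """Step 1: remove intensity by projecting RGB onto the rg plane,
    r = R/(R+G+B), g = G/(R+G+B). Input: (..., 3) array."""
    rgb = rgb.astype(np.float64).reshape(-1, 3)
    s = rgb.sum(axis=1, keepdims=True)
    s[s == 0] = 1.0                      # guard against black pixels
    return (rgb / s)[:, :2]              # keep (r, g); b is redundant

def histogram_init(rg, n_components, bins=32):
    """Step 2: initialise the mixture from a 2D colour histogram.
    The n_components most populated bins give the initial means;
    mixing weights start uniform (a common, simple choice)."""
    hist, r_edges, g_edges = np.histogram2d(rg[:, 0], rg[:, 1], bins=bins)
    flat = np.argsort(hist, axis=None)[::-1][:n_components]
    r_idx, g_idx = np.unravel_index(flat, hist.shape)
    r_c = 0.5 * (r_edges[:-1] + r_edges[1:])   # bin centres
    g_c = 0.5 * (g_edges[:-1] + g_edges[1:])
    means = np.stack([r_c[r_idx], g_c[g_idx]], axis=1)
    weights = np.full(n_components, 1.0 / n_components)
    return means, weights
```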
1. E-step. As we do not know the class labels, but do know their probability distribution, we use the expected values of the class labels given the current parameters. For the nth iteration, we form the function \(Q(\theta \mid \theta_{n})\) as follows:

$$Q(\theta \mid \theta_{n}) = E\left\{ \log p(Z, X \mid \theta) \mid X, \theta_{n}\right\} = \sum^{T}_{t=1}\sum^{J}_{j=1} E\{\delta_{t}^{(j)} \mid x_{t},\theta_{n}\}\,\log \left[ p(x_{t} \mid \delta_{t}^{(j)} = 1, \varphi^{(j)})\,\pi^{(j)}\right] \quad (9)$$

and define

$$h_{n}^{(j)}(x_{t}) \equiv E\left\{ \delta_{t}^{(j)} \mid x_{t},\theta_{n}\right\} = P\left( \delta_{t}^{(j)} = 1 \mid x_{t}, \theta_{n}\right) \quad (10)$$

which is the expected posterior distribution of the class labels given the observed data; in other words, \(h_{n}^{(j)}(x_{t})\) is the probability that \(x_{t}\) belongs to group j given the current estimates \(\theta_{n}\). Using Bayes' theorem, we can calculate it as:

$$h_{n}^{(j)}(x_{t}) = \frac{p\left( x_{t} \mid \delta_{t}^{(j)} = 1, \varphi_{n}^{(j)}\right) \pi_{n}^{(j)}}{\sum^{J}_{k=1} p\left( x_{t} \mid \delta_{t}^{(k)} = 1, \varphi_{n}^{(k)}\right) \pi_{n}^{(k)}} \quad (11)$$

The calculation of Q is the E-step of the algorithm and determines the best guess of the membership function \(h_{n}^{(j)}(x_{t})\).

2. M-step. To compute the new set of parameter values of \(\theta\) (denoted \(\theta^{*}\)), we optimise \(Q(\theta \mid \theta_{n})\), i.e. \(\theta^{*} = \arg \max_{\theta} Q(\theta \mid \theta_{n})\). Specifically, the steps are:

To determine \(\mu^{(k)*}\), differentiate Q with respect to \(\mu^{(k)}\) and equate to zero \((\frac{\partial Q(\theta \mid \theta_{n})}{\partial \mu^{(k)}} = 0)\), which gives:

$$\mu^{(k)*} = \frac{\sum_{t=1}^{T} h_{n}^{(k)}(x_{t})\, x_{t}}{\sum_{t=1}^{T} h_{n}^{(k)}(x_{t})} \quad (12)$$

To determine \(\varSigma^{(k)*}\), differentiate Q with respect to \(\varSigma^{(k)}\) and equate to zero \((\frac{\partial Q(\theta \mid \theta_{n})}{\partial \varSigma^{(k)}} = 0)\), which gives:

$$\varSigma^{(k)*} = \frac{\sum_{t=1}^{T} h_{n}^{(k)}(x_{t})\,(x_{t} - \mu^{(k)*})(x_{t} - \mu^{(k)*})^{T}}{\sum_{t=1}^{T} h_{n}^{(k)}(x_{t})} \quad (13)$$

To determine \(\pi^{(k)*}\), maximise \(Q(\theta \mid \theta_{n})\) with respect to \(\pi^{(k)}\) subject to the constraint \(\sum_{j=1}^{J}\pi^{(j)} = 1\), which gives:

$$\pi^{(k)*} = \frac{1}{T}\sum_{t=1}^{T} h_{n}^{(k)}(x_{t}) \quad (14)$$

Then replace \(\theta_{n}\) by \(\theta^{*}\), increment n by 1, and repeat the E-step until convergence.
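For concreteness, the following numpy sketch (our own illustration, not the authors' implementation) performs one EM iteration over an array of pixel vectors, implementing Eqs. (11)–(14):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(x, means, covs, weights):
    """One EM iteration for a Gaussian mixture over pixel vectors x
    of shape (T, d), with J components."""
    T, J = x.shape[0], len(weights)
    # E-step, Eq. (11): posterior memberships h_n^(j)(x_t)
    h = np.empty((T, J))
    for j in range(J):
        h[:, j] = weights[j] * multivariate_normal.pdf(x, means[j], covs[j])
    h /= h.sum(axis=1, keepdims=True)
    # M-step, Eqs. (12)-(14)
    n_j = h.sum(axis=0)                           # effective counts per component
    new_means = (h.T @ x) / n_j[:, None]          # Eq. (12)
    new_covs = []
    for j in range(J):                            # Eq. (13)
        d = x - new_means[j]
        new_covs.append((h[:, j, None] * d).T @ d / n_j[j])
    new_weights = n_j / T                         # Eq. (14)
    return new_means, new_covs, new_weights
```

Iterating `em_step` until the parameters (or the log-likelihood) stop changing realises the convergence loop described above.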
The results were obtained from a database containing approximately half a million pixels.
Figure 5 shows another example of the probability map for skin colour in deaf individuals, using normal 2D videos freely available from the BSL SignBank dataset,^{5} which can then be used towards developing an automated dementia screening toolkit. In all three examples, the colour space used to represent the input image plays an important part in the segmentation. As we have seen above, some models are more perceptually uniform than others, and some separate out information such as luminance and chrominance.
3 Growing neural gas (GNG) algorithm in 2D and 3D
In order to determine where to insert new neurons, local error measures are gathered during the adaptation process, and each new unit is inserted near the neuron with the highest accumulated error. At each adaptation step, a connection between the winner and the second-nearest neuron is created, as dictated by the competitive Hebbian learning algorithm. This continues until a stopping condition is fulfilled, for example, evaluation of the optimal network topology based on the topographic product [16]. This measure is used to detect deviations between the dimensionality of the network and that of the input space, revealing folds in the network that indicate it is trying to approximate an input manifold of a different dimensionality. In addition, in a GNG network the learning parameters are constant in time, in contrast to other methods where learning is based on decaying parameters.

A set N of nodes (neurons). Each neuron \(c \in N\) has its associated reference vector \(w_c \in R^d\). The reference vectors can be regarded as positions in the input space of their corresponding neurons.

A set of edges (connections) between pairs of neurons. These connections are not weighted, and their purpose is to define the topological structure. The edges are determined using the competitive Hebbian learning algorithm. An edge ageing scheme is used to remove connections that are no longer valid because the neurons have moved during the adaptation process. (A single adaptation step is sketched after this list.)
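The following Python sketch shows one adaptation step of the original GNG (after Fritzke [14]); it is an illustration rather than the authors' code, it omits the periodic neuron insertion, error decay and removal of isolated nodes, and the parameter values are typical defaults from the literature, not necessarily those used in this paper:

```python
import numpy as np

def gng_adapt(xi, w, error, edges, age, eps_b=0.05, eps_n=0.006, age_max=50):
    """One GNG adaptation step for input signal xi (d,).

    w: (N, d) reference vectors; error: (N,) accumulated errors;
    edges: set of frozenset node-index pairs; age: dict edge -> age.
    """
    # Find the winner s1 and second-nearest neuron s2.
    dists = np.linalg.norm(w - xi, axis=1)
    s1, s2 = np.argsort(dists)[:2]
    # Accumulate squared error at the winner (guides future insertions).
    error[s1] += dists[s1] ** 2
    # Move the winner and its topological neighbours towards xi,
    # and age every edge emanating from the winner.
    w[s1] += eps_b * (xi - w[s1])
    for e in edges:
        if s1 in e:
            n = (set(e) - {s1}).pop()
            w[n] += eps_n * (xi - w[n])
            age[e] += 1
    # Competitive Hebbian learning: connect s1 and s2 with a fresh edge.
    e12 = frozenset((int(s1), int(s2)))
    edges.add(e12)
    age[e12] = 0
    # Remove edges that exceeded the maximum age.
    for e in [e for e in edges if age[e] > age_max]:
        edges.remove(e)
        del age[e]
```

In the full algorithm, a new neuron is additionally inserted every fixed number of signals, halfway between the neuron with the largest accumulated error and its worst neighbour, after which the accumulated errors are decayed.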
It can be seen how this extended algorithm is able to create a coloured 3D mesh that represents the input data. Since point clouds obtained using the Kinect are partial 3D views, the mesh obtained is not complete and therefore the model generated by the GNG is an open coloured mesh.
4 Experiments
In this section, different experiments are shown that validate the capabilities of our method (e.g. which statistical model is best suited to GNG reconstruction), which has also been applied to 3D datasets. We tested our system on our own data set (University of Alicante and University of Westminster) of faces and hands recorded from 15 participants. To create this data set, we recorded images over several days using a simple webcam with an image resolution of \(800 \times 600\). In total, we recorded over 112,500 frames and, for computational efficiency, resized the images to \(300 \times 225\), \(200 \times 160\), \(198 \times 234\), and \(124 \times 123\) pixels. In addition, experiments were conducted on two publicly available databases, Mikkel B. Stegmann's^{6} and Stirling's.^{7} All methods have been developed and tested on a desktop machine with a 2.26 GHz Pentium IV processor and have been implemented in MATLAB and C++.
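The tables below report segmentation quality as Dice and IoU (Jaccard) scores. Both are simple overlap ratios between the predicted and ground-truth skin masks, as the following sketch shows (assuming non-empty masks):

```python
import numpy as np

def dice_iou(pred_mask, gt_mask):
    """Dice and IoU (Jaccard) scores between two binary masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum())
    iou = inter / union
    return dice, iou
```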
IoU and Dice scores for synthetic image 1

Scene  Frame  Algorithm  Dice score  IoU score
1  1  CIELab single Gaussian  0.874067744  0.776282803
1  1  CIELab multi Gaussian  0.860429045  0.755023416
1  1  CIExyz single Gaussian  0.729085392  0.573655639
1  1  CIExyz multi Gaussian  0.763614975  0.617600418
1  1  HSV single Gaussian  0.74712024  0.596304817
1  1  HSV multi Gaussian  0.760118655  0.613039857
1  1  nRGB single Gaussian  0.773454371  0.630579375
1  1  nRGB multi Gaussian  0.766449414  0.621317352
IoU and Dice scores for synthetic image 2

Scene  Frame  Algorithm  Dice score  IoU score
1  2  CIELab single Gaussian  0.898387257  0.815509983
1  2  CIELab multi Gaussian  0.935972544  0.879640743
1  2  CIExyz single Gaussian  0.74361557  0.591862691
1  2  CIExyz multi Gaussian  0.910774728  0.836158347
1  2  HSV single Gaussian  0.930137277  0.86938926
1  2  HSV multi Gaussian  0.882018032  0.788929152
1  2  nRGB single Gaussian  0.901852199  0.821238972
1  2  nRGB multi Gaussian  0.891526587  0.804274396
For a further comparison between the SGM and the GMM, we used ROC curves for all the test images. To draw the ROC curves, we calculated the TPR and FPR for all images at different thresholds on the skin posterior probability \(P(S \mid x)\). Thus, by using K different thresholds, we obtain K points which, when plotted, trace the ROC curve of the specific model on that test image.
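A sketch of this computation (our own illustration): for each threshold on the skin posterior, the predicted mask is compared against the ground truth to yield one (FPR, TPR) point.

```python
import numpy as np

def roc_points(p_skin, gt_mask, thresholds):
    """(FPR, TPR) pairs for a skin probability map p_skin against a
    binary ground-truth mask, one point per threshold."""
    gt = gt_mask.astype(bool).ravel()
    p = p_skin.ravel()
    pts = []
    for t in thresholds:
        pred = p >= t
        tp = np.sum(pred & gt)
        fp = np.sum(pred & ~gt)
        tpr = tp / max(gt.sum(), 1)       # guard against empty masks
        fpr = fp / max((~gt).sum(), 1)
        pts.append((fpr, tpr))
    return pts
```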
TPR and FPR rates for all four colour spaces using the SGM (columns give TPR and FPR for each colour space)

HSV  CIE X, Y, Z  nRGB  CIE L*, a*, b*
TPR  FPR  TPR  FPR  TPR  FPR  TPR  FPR
0.5689  0.2395  0.8442  0.1341  0.9156  0.207  0.8902  0.1031 
0.8947  0.0995  0.9371  0.0596  0.9641  0.1183  0.9471  0.0485 
0.7538  0.2632  0.8655  0.1985  0.9708  0.2683  0.8015  0.1583 
0.8568  0.2706  0.8915  0.0969  0.9198  0.1581  0.9015  0.0889 
0.6378  0.161  0.9019  0.0709  0.9492  0.1263  0.9159  0.0719 
0.9337  0.2527  0.8587  0.1117  0.9217  0.1669  0.9587  0.1011 
0.6664  0.1598  0.8383  0.0628  0.8966  0.0813  0.8983  0.0528 
0.8742  0.0529  0.9247  0.0822  0.9544  0.1352  0.9201  0.0501 
0.9083  0.109  0.8353  0.0341  0.8943  0.0548  0.8853  0.0508 
0.512  0.297  0.7843  0.1163  0.8836  0.1787  0.8513  0.0963 
0.7496  0.0788  0.9172  0.0607  0.9527  0.1021  0.9561  0.0531 
0.7915  0.0395  0.969  0.0506  0.9834  0.0757  0.9598  0.0306 
0.8437  0.0789  0.8005  0.0535  0.8713  0.0806  0.8995  0.0555 
0.65  0.0373  0.7279  0.0401  0.8249  0.0594  0.8271  0.0324 
0.8503  0.0931  0.9284  0.0864  0.9606  0.1403  0.9684  0.0804 
0.2353  0.1242  0.7993  0.0483  0.9435  0.0254  0.7803  0.0382 
0.9598  0.0362  0.9364  0.0349  0.9641  0.054  0.9164  0.0312 
0.8857  0.0255  0.9501  0.0383  0.9753  0.0556  0.9511  0.0213 
0.5688  0.0254  0.8118  0.0471  0.9114  0.0984  0.9108  0.0206 
0.7346  0.0315  0.8503  0.0214  0.9173  0.0409  0.9403  0.0184 
TPR and FPR rates for all four colour spaces using the GMM (columns give TPR and FPR for each colour space)

HSV  CIE X, Y, Z  nRGB  CIE L*, a*, b*
TPR  FPR  TPR  FPR  TPR  FPR  TPR  FPR
0.9228  0.2374  0.8518  0.2256  0.7931  0.2654  0.8831  0.2211 
0.9516  0.0654  0.9699  0.0837  0.9638  0.0832  0.9238  0.0499 
0.8128  0.3118  0.9151  0.2794  0.9063  0.3241  0.8813  0.2683 
0.9726  0.165  0.9685  0.1486  0.9454  0.1939  0.9754  0.0912 
0.9009  0.1065  0.9357  0.0921  0.9388  0.1135  0.8988  0.0989 
0.9895  0.1777  0.994  0.1649  0.992  0.2053  0.9882  0.1541 
0.989  0.1263  0.985  0.1273  0.9745  0.0949  0.9275  0.0928 
0.9644  0.0878  0.9728  0.0995  0.9637  0.1042  0.9644  0.0801 
0.9662  0.0579  0.9699  0.0646  0.9647  0.0705  0.9641  0.0508 
0.9793  0.2022  0.7551  0.1755  0.7528  0.23  0.8328  0.0963 
0.9938  0.0639  0.9395  0.0785  0.9359  0.076  0.9959  0.0531 
0.9857  0.0628  0.9906  0.0643  0.9874  0.0693  0.9751  0.0306 
0.9001  0.0627  0.9335  0.0708  0.8903  0.0694  0.9003  0.0535 
0.9714  0.0445  0.8289  0.0501  0.8141  0.0494  0.9541  0.0324 
0.9936  0.0897  0.9985  0.1050  0.9952  0.1078  0.9952  0.0804 
0.9578  0.0753  0.9596  0.0703  0.9691  0.0894  0.9990  0.0382 
0.9911  0.0363  0.9925  0.0409  0.9893  0.0429  0.9589  0.0312 
0.9888  0.0438  0.9948  0.0495  0.9894  0.0462  0.9899  0.0213 
0.7673  0.0446  0.8534  0.0525  0.8515  0.0574  0.8001  0.0206 
0.9329  0.0267  0.9448  0.0243  0.9451  0.0283  0.9479  0.0184 
While 3D downsampling and reconstruction methods such as Poisson reconstruction or voxel-grid filtering are not able to deal with noisy data, the GNG method is able to avoid outliers and obtain an accurate representation in the presence of noise. This ability is due to the Hebbian learning rule used and the stochastic nature of the adaptation, which updates vertex locations based on the average influence of a large number of input patterns.
5 Conclusions and future work
In this paper, we have compared the performance of different probabilistic colour models and colour spaces for skin segmentation as an initialisation stage for the GNG algorithm. Based on the capability of GNG to readjust to new input patterns without restarting the learning process, we are interested in reducing meaningless image data by taking into consideration that human skin has a relatively unique colour and applying appropriate parametric skin distribution modelling. We conclude that the GMM is superior to the SGM, with lower FPR rates, and that the CIE L*, a*, b* colour space outperforms the other three colour spaces, since it has the lowest FPR rate across the dataset. Preprocessing was also used as an initialisation stage in the 3D reconstruction of faces and hands, based on the work conducted in [2]. Further work will aim at improving system performance through GPU acceleration, which can then be used for robotic system recognition. Having obtained a clean segmentation, we are currently working on hand sign trajectories in order to analyse the sign space envelope (sign trajectories/depth/speed) and the facial expressions of deaf individuals. An automated screening toolkit will be beneficial not only for dementia screening of deaf individuals, but also for the assessment of other acquired neurological impairments associated with motor changes, for example, stroke and Parkinson's disease.
Footnotes
1. Webcam with image resolution \(800\times 600\).
2. Kinect for Xbox 360, Microsoft: http://www.xbox.com/kinect.
3.
4.
5. BSL SignBank, http://bslsignbank.ucl.ac.uk/.
6.
7.
Notes
Funding
This work has been supported by the Spanish Government grant TIN2016-76515-R, supported with FEDER funds, the University of Alicante project GRE16-19, the Valencian Government project GV/2018/022, and the UK Dunhill Medical Trust grant RPGF1802\37.
References
1. Albrecht I, Haber J, Seidel H (2003) Construction and animation of anatomically based human hand models. In: Proceedings of the 2003 ACM SIGGRAPH/Eurographics symposium on computer animation, pp 98–109
2. Angelopoulou A, Garcia-Rodriguez J, Orts-Escolano S, Gupta G, Psarrou A (2018) Fast 2D/3D object representation with growing neural gas. Neural Comput Appl 29(10):903–919
3. Athency A, Ancy BM, Fathima K, Dilin R, Binish M (2017) Brain tumor detection and classification in MRI images. Int J Innov Res Sci Eng Technol 6:84–89
4. Boehme H, Brakensiek A, Braumann U, Krabbes M, Gross H (1998) Neural networks for gesture-based remote control of a mobile robot. Proc IEEE World Congr Comput Intell 1:372–377
5. Cédras C, Shah M (1995) Motion-based recognition: a survey. Image Vis Comput 13(2):129–155
6. Caetano S, Olabarriaga S, Barone AC (2002) Performance evaluation of single and multiple-Gaussian models for skin color modeling. In: Proceedings of the Brazilian symposium on computer graphics and image processing (SIBGRAPI), pp 275–282
7. Cheng H, Jiang X, Sun Y, Wang J (2001) Color image segmentation: advances and prospects. Pattern Recognit 34(12):2259–2281
8. Cretu A, Petriu E, Payeur P (2008) Evaluation of growing neural gas networks for selective 3D scanning. In: Proceedings of the IEEE international workshop on robotics and sensors environments, pp 108–113
9. De la Torre F, Black M (2001) Probabilistic principal component analysis. In: Proceedings of the IEEE international conference on computer vision, vol I, pp 362–369
10. Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39(1):1–38
11. Eddy S (1996) Hidden Markov models. Curr Opin Struct Biol 6(3):361–365
12. Friedman N, Russel S (1997) Image segmentation in video sequences: a probabilistic approach. In: Proceedings of the conference on uncertainty in artificial intelligence, pp 175–181
13. Fritzke B (1994) Growing cell structures—a self-organising network for unsupervised and supervised learning. Neural Netw 7(9):1441–1460
14. Fritzke B (1995) A growing neural gas network learns topologies. In: Advances in neural information processing systems 7 (NIPS'94), pp 625–632
15. García-Rodríguez J, Angelopoulou A, Psarrou A (2006) Growing neural gas (GNG): a soft competitive learning method for 2D hand modelling. IEICE Trans Inf Syst E89-D(7):2124–2131
16. Goodhill GJ, Finch S, Sejnowski TJ (1997) A unifying measure for neighbourhood preservation in topographic mappings. In: Proceedings of the 2nd joint symposium on neural computation, vol 5, pp 191–202
17. Ghazali KHB, Ma J, Xiao R, Lubis SA (2012) An innovative face detection based on YCgCr color space. Phys Procedia 25:2116–2124
18. Gschwandtner M, Kwitt R, Uhl A, Pree W (2011) BlenSor: Blender sensor simulation toolbox. In: Advances in visual computing. Lecture notes in computer science, vol 6939. Springer, Berlin, pp 199–208
19. Gupta G, Psarrou A, Angelopoulou A, García J (2012) Region analysis through close contour transformation using growing neural gas. In: Proceedings of the international joint conference on neural networks, IJCNN 2012, pp 1–8
20. Holdstein Y, Fischer A (2008) Three-dimensional surface reconstruction using meshing growing neural gas (MGNG). Vis Comput 24(4):295–302
21. Kim H-R, Kim SJ, Lee I-K (2017) Building emotional machines: recognizing image emotions through deep neural networks. CoRR arXiv:1705.07543
22. Jones M, Rehg J (2002) Statistical color models with application to skin detection. Int J Comput Vis 46(1):81–96
23. Kakumanu P, Makrogiannis S, Bourbakis N (2007) A survey of skin-color modeling and detection methods. Pattern Recognit 40(3):1106–1122
24. Khan R, Hanbury A, Stöttinger J, Bais A (2012) Color based skin classification. Pattern Recognit Lett 33(2):157–163
25. Khoshelham K, Elberink SO (2012) Accuracy and resolution of Kinect depth data for indoor mapping applications. Sensors 12(2):1437–1454
26. Koike H, Sato Y, Kobayashi Y (2001) Integrating paper and digital information on enhanced desk: a method for real-time finger tracking on an augmented desk system. ACM Trans Comput Hum Interact 8(4):307–322
27. Kolkur S, Kalbande D, Shimpi P, Bapat C, Jatakia J (2017) Human skin detection using RGB, HSV and YCbCr color models. CoRR arXiv:1708.02694
28. Lee M, Nevatia R (2005) Integrating component cues for human pose estimation. In: Proceedings of the IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance, pp 41–48
29. Lucchese L, Mitra S (2001) Color image segmentation: a state-of-the-art survey. Proc Indian Natl Sci Acad (INSA-A) 67(2):207–221
30. Martinetz T (1993) Competitive Hebbian learning rule forms perfectly topology preserving maps. In: ICANN '93: international conference on artificial neural networks, pp 427–434
31. Martinetz TM, Schulten KJ (1991) A "neural-gas" network learns topologies. In: Kohonen T, Makisara K, Simula O, Kangas J (eds) Artificial neural networks. North-Holland, Amsterdam, pp 397–402
32. Martinez-Gonzalez P, Oprea S, Garcia-Garcia A, Jover-Alvarez A, Orts-Escolano S, Garcia-Rodriguez J (2018) UnrealROX: an extremely photorealistic virtual reality environment for robotics simulations and synthetic data generation. CoRR arXiv:1810.06936
33. Nair V, Clark J (2004) An unsupervised, online learning framework for moving object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, vol II, pp 317–324
34. Oliver N, Rosario B, Pentland A (2000) A Bayesian computer vision system for modelling human interactions. IEEE Trans Pattern Anal Mach Intell 22(8):831–843
35. Raja Y, McKenna S, Gong S (1998) Colour model selection and adaptation in dynamic scenes. In: Proceedings of the European conference on computer vision, pp 460–474
36. Redner R, Walker H (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26(2):195–239
37. Rêgo R, Araújo A, de Lima Neto F (2007) Growing self-organizing maps for surface reconstruction from unstructured point clouds. In: Proceedings of the international joint conference on neural networks, IJCNN'07, pp 1900–1905
38. Schwerdt K, Crowley J (2000) Robust face tracking using color. In: Proceedings of the international conference on automatic face and gesture recognition, pp 90–95
39. Sharifara A, Rahim MSM, Navabifar F, Ebert D, Ghaderi A, Papakostas M (2017) Enhanced facial recognition framework based on skin tone and false alarm rejection. In: Proceedings of the 10th international conference on pervasive technologies related to assistive environments, PETRA '17. ACM, pp 240–241
40. Sivic J, Everingham M, Zisserman A (2005) Person spotting: video shot retrieval for face sets. In: International conference on image and video retrieval, pp 226–236
41. Sonka M, Hlavac V, Boyle R (1998) Image processing, analysis, and machine vision, 2nd edn. CL Engineering, pp 513–524
42. Stauffer C, Grimson W (1999) Adaptive background mixture models for real-time tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, vol II, pp 246–252
43. Störring M (2004) Computer vision and human skin colour. Ph.D. thesis, Aalborg University
44. Stergiopoulou E, Papamarkos N (2009) Hand gesture recognition using a neural network shape fitting technique. Eng Appl Artif Intell 22(8):1141–1158
45. Subasic M, Loncaric S, Birchbauer J (2009) Expert system segmentation of face images. Expert Syst Appl 36(3, Part 1):4497–4507
46. Sui C (2011) Appearance-based hand gesture identification. Master of Engineering thesis, University of New South Wales
47. Vamplew P, Adams A (1998) Recognition of sign language gestures using neural networks. Aust J Intell Inf Process Syst 5(2):94–102
48. Vezhnevets V, Sazonov V, Andreeva A (2003) A survey on pixel-based skin color detection techniques. In: Proceedings of Graphicon 2003, pp 85–92
49. Ong SCW, Ranganath S (2005) Automatic sign language analysis: a survey and the future beyond lexical meaning. IEEE Trans Pattern Anal Mach Intell 27(6):873–891
50. Wren C, Azarbayejani A, Darrell T, Pentland A (1997) Pfinder: real-time tracking of the human body. IEEE Trans Pattern Anal Mach Intell 19(7):780–785
51. Yan J, Zhang X, Lei Z, Li SZ (2014) Face detection by structural models. Image Vis Comput 32(10):790–799
52. Yang J, Bang W, Choi E, Cho S, Oh J, Cho J, Kim S, Ki E, Kim D (2009) A 3D hand-drawn gesture input device using fuzzy ARTMAP-based recognizer. J Syst Cybern Inform 4(3):1–7
53. Yang M, Ahuja N (1999) Gaussian mixture model for human skin color and its applications in image and video databases. In: Proceedings of SPIE '99, pp 458–466
54. Zakaria Z, Suandi SA (2011) Combining skin color and cascade-like neural network for face detection. In: Proceedings of the IEEE international conference on intelligent computing and intelligent systems, pp 587–591
55. Zariffa J, Steeves J (2011) Computer vision-based classification of hand grip variations in neurorehabilitation. In: Proceedings of the 2011 IEEE international conference on rehabilitation robotics, pp 1–4
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.