
1 Introduction

We humans have the ability to quickly identify the face of a friend among many people, to distinguish oranges from apples in the market, and to remember the name of a song playing on the radio. All of those tasks can be considered instances of what is called the classification problem. Solving it implies assigning a certain label (the class) to a set of features. There are quite a few methods to solve classification tasks, and they can be broadly grouped into supervised and unsupervised approaches. The supervised approach tries to find an approximate model for the classes whose error is smaller than a certain threshold. In order to do that, a set of solved instances of the problem is required. Those instances are used to improve an initial model through a training algorithm [1]. One of the best-known supervised methods is the Artificial Neural Network (ANN). Current ANN models are used to solve complex problems, such as image processing in medical applications [2], the control of devices [3], and face [4] and speech recognition [5], to mention just a few [6]. There are different types of ANNs. One of them is the Radial-Basis Function (RBF) Neural Network.

Radial-Basis Function Neural Networks (RBFNN) require three layers of artificial neurons: the first layer connects the network with the environment; the second layer is the only hidden layer, usually implemented with radially activated functions; the third layer is the output, implemented as a weighted sum of the outputs of the hidden units (that is, a linear function) [7]. Figure 1 illustrates an RBFNN with three inputs, four hidden units and one output. RBFNNs are important given their capability to model non-linear systems with a single hidden layer. Under certain conditions on the shape of the activation functions, RBFNNs are universal approximators [8]: such networks can approximate any continuous function with arbitrary precision, provided enough hidden neurons. Other ANN models, like the Multi-Layer Perceptron, can also approximate non-linear functions, but they may require multiple intermediate layers [7].

Fig. 1. Architecture of a radial basis function network.

When RBF networks are used in classification problems, the inputs are associated with the features that characterize the problem, and the outputs represent the classes or labels. The hidden units correspond to subclasses. Several different functions have been considered as activation functions for the hidden neurons [9]; undoubtedly, Gaussians are the most common. Equation 1 presents the definition of a Gaussian function.

$$\begin{aligned} \varvec{\phi }_j (\mathbf X ) = \exp [ -( \mathbf X - \varvec{\mu }_j)^T \Sigma _j^{-1}( \mathbf X - \varvec{\mu }_j)] \end{aligned}$$
(1)

for \(j=1,\ldots , L\), where \(\mathbf X \) is the feature vector, L is the number of hidden units, and \(\mu _j\) and \(\Sigma _j\) are the mean and the covariance matrix of the j-th Gaussian function.

As for the output layer, it is computed through a weighted sum, defined in Eq. 2:

$$\begin{aligned} \varvec{\varPsi }_k (\mathbf X ) = \sum _{j=1}^L \lambda _{jk} \phi _j(\mathbf X ) \end{aligned}$$
(2)

for \(k=1, \ldots , M\), where \(\lambda _{jk}\) are the output weights, M is the number of output units, and \(\phi _j\) represents node j in the hidden layer. For pattern classification problems, the output is commonly limited to the interval (0, 1) by a sigmoidal function.
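To make the mapping of Eqs. 1 and 2 concrete, the following is a minimal NumPy sketch of the forward pass for a single feature vector. The function name, the argument layout, and the final sigmoid squashing are our own illustrative choices rather than part of the original formulation.

```python
import numpy as np

def rbf_forward(X, mus, Sigmas, lambdas):
    """Sketch of Eqs. 1-2: Gaussian hidden units followed by a weighted sum
    and a sigmoid squashing of the outputs for classification.
    X: (n,) feature vector, mus: list of L mean vectors,
    Sigmas: list of L covariance matrices, lambdas: (L, M) output weights."""
    # Eq. 1: phi_j(X) = exp[-(X - mu_j)^T Sigma_j^{-1} (X - mu_j)]
    phi = np.array([np.exp(-(X - mu) @ np.linalg.inv(S) @ (X - mu))
                    for mu, S in zip(mus, Sigmas)])
    # Eq. 2: Psi_k(X) = sum_j lambda_jk * phi_j(X)
    psi = phi @ lambdas
    # limit each output to (0, 1) with a sigmoid, as mentioned above
    return 1.0 / (1.0 + np.exp(-psi))
```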

The training of RBF neural networks is often performed in two steps: the first step is to find the number of neurons in the hidden layer, which means defining the corresponding radial-basis functions; the second step consists in finding the weights for the connections to the output layer. Despite the importance of RBF networks, creating one to solve a particular problem may imply a good amount of experimental work. One traditional method consists in first deciding on the number L of hidden neurons and then arbitrarily choosing L data points as centers. Such a procedure often causes ill-conditioning, but it has been shown that for large training sets this approach gives reasonable results [1]. Another way to determine the parameters of the hidden layer is to use clustering (K-means). Still, it is up to the person solving the problem to decide the number L of neurons. Once that has been done, the K-means algorithm finds the center of each cluster; this is an iterative process that implies solving an optimization problem. From that, it is possible to define the radial-basis functions \(\varvec{\phi }_j(\mathbf X ); \forall j \in \{1,2,\ldots , L\}\). Then, the weights for the output layer are computed, usually through Least Mean Squares (LMS) or Recursive Least Squares (RLS). Since the number of neurons (centers) is defined arbitrarily, several values are usually tested to determine which one gives the best results. A minimal sketch of this traditional procedure is given below.
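The sketch below is a hedged illustration of this traditional two-step procedure, assuming L > 1: K-means fixes the centers, a shared width is derived from the maximum distance between centers, and, for simplicity, an ordinary least-squares solve stands in for LMS/RLS. The function name and the use of scikit-learn are our own assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_rbfn_kmeans(X, y, L):
    """Illustrative traditional training: K-means picks the L centers,
    a common width comes from the maximum inter-center distance, and
    the output weights are obtained by ordinary least squares."""
    centers = KMeans(n_clusters=L, n_init=10).fit(X).cluster_centers_
    d_max = max(np.linalg.norm(a - b) for a in centers for b in centers)
    sigma = d_max / np.sqrt(2 * L)                    # common width heuristic
    # design matrix of Gaussian activations with a shared, isotropic width
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    Phi = np.exp(-dists ** 2 / (2 * sigma ** 2))
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)       # output-layer weights
    return centers, sigma, w
```

In practice, several values of L would be tried and the one yielding the best accuracy kept, which is precisely the trial-and-error step our method aims to remove.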

In this work, we propose a new method to automatically determine the number of neurons in the hidden layer. Moreover, our method adjusts the number of neurons so that we obtain the best performance of the RBFNN for a given training dataset. Our contribution is to provide a simple algorithm to construct the best RBFNN for a given problem, requiring no intervention from the user. The experimental results show that our approach is competitive in practice compared with RBFNNs generated with the traditional trial-and-error process. The remainder of the paper is organized as follows: Sect. 2 describes other works related to ours; Sect. 3 presents our method; Sect. 4 describes the experiments and results; and Sect. 5 discusses possible future work.

2 Related Works

There is a large body of research on neural networks and their applications. In particular, some researchers have attempted to develop methods to automatically choose some of the parameters that control the functioning of RBFNs. For instance, Moody et al. [10] analyzed three versions of a learning algorithm, using nearest-neighbor prediction, adaptive processing units with LMS, and self-organizing adaptive units. According to their results, their proposal trains faster than backpropagation.

In another approach, Poggio et al. [11] introduce a supervised method based on regularization to find the centers of the radial-basis functions and to update the weights. In a way, their work leads to a generalization of RBFNs. The authors focus on the mathematical formulation of their approach, and no information is given regarding its applicability to practical problems.

Chen et al. [9], on the other hand, consider the application of Orthogonal Least Squares (OLS). Under this approach, the functions are chosen one at a time until an appropriate network is built. The center of each Gaussian is given by one of the data points, and the points are chosen by constructing a set of orthogonal vectors. The authors show several applications and mention a comparison with Moody's algorithm [10] as further work.

Kurban et al. [12] introduce a comparison of training algorithms for radial basis function (RBF) neural networks for classification purposes. The authors compare gradient descent (GD), genetic algorithms (GA), Kalman filtering (KF) and the Artificial Bee Colony (ABC). They use datasets from UCI to test the algorithms. In their results, ABC offered good performance, so they used that implementation to solve a problem related to the classification of terrains given information from inertial sensors. The main issue with the approach is that, in general, its convergence time is longer than for methods such as GD and KF.

Chun-Tao et al. [13] introduce a methodology based on Particle Swarm Optimization (PSO) and simulation (resource allocation) to train RBF networks. However, their results suggest the method should be used with caution, since it got trapped in local minima. Oh et al. [14] extended this work and use fuzzy logic to help in the process. Their results are comparable to previous works.

In a very recent work, Reiner [15] proposes two algorithms to construct RBF networks. One of them is based on an Incremental Extreme Learning Machine combined with a simplex optimization. The second one uses a Levenberg-Marquardt algorithm to find the locations and configurations of the radial-basis functions. The author finds that this allows defining more compact networks with smaller errors, compared with ten different algorithms.

3 Methodology

The classic training method for RBFNs is hybrid and comprises two steps. In the first step, the parameters of the hidden nodes are set. Afterwards, the weights of the output layer are adjusted until a halting criterion is satisfied. For this method, the k-means algorithm is used to set the parameters of the Gaussian nodes. The k-means algorithm finds L clusters with minimum internal variation but maximum variation among clusters. The number of clusters (L) must be manually configured, and the method does not guarantee convergence to a global minimum. When k-means finishes, a network is created considering each cluster as a hidden-layer node, plus a single output-layer node.

The proposed method is also hybrid, and it is performed in two steps as well. However, we use the SLMP-P algorithm [16] to set the number of clusters L and the radial-basis function parameters. We adjust the weights of the output layer employing the Recursive Least Squares (RLS) algorithm. We call our method RBFN-SLMP.

3.1 Estimation of Hidden Layer Parameters

In the hidden layer, three kinds of parameters need to be set: (1) the number of Gaussian nodes L, (2) the center of each Gaussian function \(\mathbf x _j\), and (3) the amplitude \(\sigma \) of the Gaussian functions \(\varphi _j (\mathbf x _j, \sigma )\). We propose to use SLMP-P [16] to estimate all three. Originally, this algorithm is used to train Lattice Neural Networks with Dendritic Processing (LNNDP) [17]. With this approach, we can automatically determine the L Gaussian nodes and their parameters \(\mathbf x _j\) and \(\sigma \). Although the amplitude of the Gaussian nodes need not be the same for all of them, we calculate a single common value as Haykin suggests in [7]. The proposed training algorithm is guaranteed to stop at a global minimum.

Given p classes \(C^k\), \(k=1,2,\ldots,p\), each of them described by n features, the training algorithm is the following:

  1. A hypercube \(HC^n\) is created. It contains all the cases of the training set. Looseness is added to each of the sides of the HC in order to add tolerance to noise.

  2. \(HC^n\) is divided into \(2^n\) equal-sized, smaller hypercubes.

  3. For each HC, it is verified whether there exist at least two observations with different classes. If that is the case, the procedure returns to step 2; otherwise, it goes to step 4.

  4. For each HC, \(\mathbf x _j\) is calculated. Also, \(\sigma \) is calculated.

  5. An RBFN is created with L hidden nodes, where L is the number of HCs obtained.

The functions for \(\sigma \) and \(\text {center}(\mathbf x _j)\), based on [7], are:

$$\begin{aligned} \sigma = \frac{ d_{max} }{ \sqrt{2L} } \end{aligned}$$
(3)
$$\begin{aligned} \text {center } (\mathbf x _j) = \frac{1}{ |S_j| } \sum _i{ \mathbf x _{j_i} } \end{aligned}$$
(4)

where \(d_{max}\) is the maximum Euclidean distance among the centers of the clusters and \(| S_j |\) is the cardinality of cluster \(S_j\), which contains all the observations \(\mathbf x _{j_i}\).
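The following is a simplified NumPy sketch of steps 1-5 together with Eqs. 3 and 4, under our own reading of the procedure; the function names, the amount of looseness, the half-open interval handling, and the decision to discard empty sub-cubes are illustrative assumptions rather than details taken from [16]. It also assumes that no two identical samples carry different class labels, since the recursion of steps 2-3 would otherwise not terminate.

```python
import numpy as np

def subdivide(X, y, lo, hi):
    """Recursively split the hypercube [lo, hi) until it contains at most
    one class (steps 2-3); return the points inside each leaf hypercube."""
    inside = np.all((X >= lo) & (X < hi), axis=1)
    if not inside.any():
        return []                                 # empty cube: discarded here
    if len(np.unique(y[inside])) == 1:            # pure cube: stop (step 3)
        return [X[inside]]
    mid = (lo + hi) / 2.0
    leaves = []
    for corner in range(2 ** len(lo)):            # 2^n equal sub-cubes (step 2)
        upper = np.array([(corner >> d) & 1 for d in range(len(lo))], dtype=bool)
        leaves += subdivide(X, y, np.where(upper, mid, lo), np.where(upper, hi, mid))
    return leaves

def hidden_layer_params(X, y, slack=0.05):
    """Steps 1, 4 and 5: enclosing hypercube with some looseness, then one
    Gaussian node per leaf cube, centered on its points (Eq. 4), shared width (Eq. 3)."""
    span = X.max(axis=0) - X.min(axis=0)
    lo = X.min(axis=0) - slack * span
    hi = X.max(axis=0) + slack * span
    leaves = subdivide(X, y, lo, hi)
    centers = np.array([pts.mean(axis=0) for pts in leaves])              # Eq. 4
    L = len(centers)
    d_max = max(np.linalg.norm(a - b) for a in centers for b in centers) if L > 1 else 1.0
    sigma = d_max / np.sqrt(2 * L)                                        # Eq. 3
    return centers, sigma
```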

3.2 Estimation of Output Layer Weights

The training of the hidden layer is recursive; therefore, Haykin [7] suggests that the computation of the weights in the output layer should be recursive as well. Thus, we use Recursive Least Squares (RLS) to adjust the weights. The training algorithm is the following:

  1. Begin with \(\hat{\mathbf{w }}(0) = \mathbf 0 \) and \(\mathbf P (0) = \lambda ^{-1} \mathbf I \), where \(\mathbf I \) is the identity matrix. In this case, we initialize the weights to zero, as in [7, 18].

  2. Given a sample \(\{ \varvec{\phi }(i), d(i) \}_{i=1}^N\), calculate the following for \(n=1,2,\ldots,N\):

    • \(\mathbf P (n) = \mathbf P (n-1) - \frac{\mathbf{P }(n-1)\varvec{\phi }(n)\varvec{\phi }^T(n) \mathbf P (n-1) }{ 1 + \varvec{\phi }^T(n) \mathbf P (n-1)\varvec{\phi }(n) }\)

    • \(\mathbf g (n) = \mathbf P (n) \varvec{\phi }(n)\)

    • \(\alpha (n) = d(n) - \hat{\mathbf{w }}^T(n-1) \varvec{\phi }(n)\)

    • \(\hat{\mathbf{w }}(n) = \hat{\mathbf{w }}(n-1) + \mathbf g (n)\, \alpha (n)\)

  3. Halt when \( MSE < \theta \).

where \(\theta \) is manually defined by the user, \(\lambda \) is a positive constant chosen as small as possible, and d(i) is the class of sample i. The vector \(\varvec{\phi }(i)\) results from the transformation performed in the hidden layer. It is defined as

$$\begin{aligned} \varvec{\phi }(i) = [ \varphi ( \mathbf x _i, \varvec{\mu }_1) , \varphi ( \mathbf x _i, \varvec{\mu }_2), ..., \varphi ( \mathbf x _i, \varvec{\mu }_L) ]^T \end{aligned}$$
(5)

where \(\varvec{\mu }_j\) is the center of hidden node j. The Gaussian function is expressed as follows:

$$\begin{aligned} \varphi ( \mathbf x _i, \varvec{\mu }_j) = exp\bigg (- \frac{1}{2\sigma _j^2} || \mathbf x _i - \varvec{\mu }_j ||^2\bigg ), \, j = 1,2,...,L \end{aligned}$$
(6)
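A compact sketch of this RLS procedure is given below, operating on the matrix whose rows are the vectors \(\varvec{\phi }(i)\) of Eq. 5. The default values of \(\lambda \) and \(\theta \) and the epoch cap are our own illustrative assumptions, not values from the paper.

```python
import numpy as np

def train_output_weights_rls(Phi, d, lam=1e-3, theta=1e-4, max_epochs=100):
    """Sketch of the RLS update of Sect. 3.2.
    Phi: (N, L) matrix whose rows are the hidden-layer vectors phi(i) of Eq. 5,
    d:   (N,) desired outputs (class labels), lam: small positive constant lambda,
    theta: MSE threshold. The epoch cap is our own safeguard."""
    N, L = Phi.shape
    w = np.zeros(L)                        # step 1: w(0) = 0
    P = np.eye(L) / lam                    # step 1: P(0) = lambda^{-1} I
    for _ in range(max_epochs):
        for n in range(N):                 # step 2: sweep over the sample
            phi = Phi[n]
            P = P - np.outer(P @ phi, phi @ P) / (1.0 + phi @ P @ phi)
            g = P @ phi                    # gain vector g(n)
            alpha = d[n] - w @ phi         # a priori error alpha(n)
            w = w + g * alpha              # weight update
        mse = np.mean((Phi @ w - d) ** 2)
        if mse < theta:                    # step 3: halt when MSE < theta
            break
    return w
```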

4 Experiments and Results

The experiments were run on a server with Xeon E5-2650 processors (20M cache, 2.00 GHz, 8.00 GT/s Intel QPI) and 100 GB of shared memory. The OS was CentOS 6.5. The code was written in Python 3.3.3 with embedded code in R 3.0.3. We tried 25, 50, 75, 100 and 300 iterations for the RLS algorithm. The value of \(\lambda \) was selected as small as possible.

4.1 Databases

We used benchmarks published by the Center for Machine Learning and Intelligent Systems at UCI [19]. The benchmarks were selected following the recommendation of Babu and Suresh in [20]. Table 1 shows, for each dataset, the number of features, the number of classes, and the number of samples used for training and testing. In total, we applied our method to seven different datasets comprising up to a few dozen features and a few hundred samples.

Table 1. Description of the benchmarks used.

4.2 Comparison

We compare the performance of two classifiers: our proposed RBFN-SLMP and a Lattice Neural Network with Dendritic Processing (RNMPD). The metric used is the accuracy percentage, defined as:

$$Acc= \frac{N_{correct}}{N_{Total}} \times 100$$

where \(N_{correct}\) is the number of correct predictions and \(N_{Total}\) is the total number of data points.

Table 2 shows the results of the experiments. As can be seen, RBFN-SLMP outperforms RNMPD in all cases. Whenever the number of features exceeded 13, we extracted new features using Principal Component Analysis (PCA) and used the first 10 components; a sketch of this preprocessing is given below. The test conditions are the same for both methods. In addition, we tested the algorithms both with standardized databases and with the raw data. We ran the experiments 20 times and report the average of the results.
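The preprocessing sketch below standardizes the data and projects onto the first 10 principal components whenever a dataset has more than 13 features; the function name and the choice to fit both transforms on the training split only are our own assumptions.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def preprocess(X_train, X_test, max_features=13, n_components=10, standardize=True):
    """Optionally standardize, then reduce to 10 principal components
    when the dataset has more than 13 features (Sect. 4.2)."""
    if standardize:
        scaler = StandardScaler().fit(X_train)
        X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
    if X_train.shape[1] > max_features:
        pca = PCA(n_components=n_components).fit(X_train)
        X_train, X_test = pca.transform(X_train), pca.transform(X_test)
    return X_train, X_test
```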

Table 2. Comparison of the accuracy achieved by RBFN-SLMP and RNMPD.

Figure 2 shows that smaller values of sigma (the amplitude of the Gaussian) imply worse results for all databases. It also shows that changes in the number of iterations do not have a big impact on the performance.

Fig. 2. Impact on the performance of the RBFNN when modifying the number of iterations, the value of sigma, and normalization.

5 Conclusion and Future Work

One of the main purposes of this paper was to test whether the algorithm originally developed to train Lattice Neural Networks with Dendritic Processing (LNNDP) could be applied to the automatic selection of the parameters associated with the hidden layer of a Radial-Basis Function Neural Network (RBFNN). It was also our interest to compare the performance of the automatically tuned RBFNN with an LNNDP.

The results presented in Sect. 4 show that it is in fact possible to use the LNNDP training algorithm for the RBFNN. They also offer data to support the claim that such a network shows better performance than the LNNDP. According to the set of experiments we have performed so far, the selection of the centers of the Gaussians allows better accuracy in the classifier; compared with the RNMPD, our approach always achieves higher accuracy. Although it is not possible to generalize from only one set of experiments, we believe the approach can be considered an interesting option to alleviate the tuning of a radial-basis function neural network.

Our contribution is to provide a method that automatically adjusts an RBFN depending on the data, requiring no intervention from the user and giving good accuracy, even better than that of some other classifiers. There are two main advantages to the method: (1) no intervention is needed from the user; the method works automatically; (2) the number and location of the radial-basis functions yield performance better than or very similar to other approaches. The main disadvantage is the limitation on the number of attributes that can be handled, although alternatives such as Principal Component Analysis are possible.

Of course, there are a number of issues that could be explored in future work. We identify at least these:

  • Calculate the amplitude \(\sigma _j\) for each hidden layer node.

  • Include more classification algorithms in the comparison, e.g. SVM, ELM, MLP, Bayes.

  • Run a designed experiment to test whether the number of iterations, sigma and normalization have a statistically significant effect.

  • In an extended version of this work, we consider studying the influence of parameters such as the amplitude of the Gaussians and the number of hidden neurons for different problems.