
1 Introduction

Random forest (RF) is an ensemble-based, supervised machine learning algorithm proposed by Leo Breiman [6]. It consists of numerous randomized decision trees and can solve both classification and regression problems. In RF, the decision trees are constructed independently, so the forest can be built and executed as parallel threads, which makes it fast and easy to implement. It has been used in various domains such as brain tumor segmentation, Alzheimer's detection, face recognition, human pose detection, and object detection [7].

A decision tree in RF is built during the training phase using the bagging concept. A decision tree has several important parameters, such as the splitting criterion, the tree depth, and the number of elements at a leaf node. However, the best choice of these parameters has not yet been answered precisely [7, 10]. This has motivated various heuristic approaches to building the decision trees and hence the RF. The method proposed by Paul et al. [15] converges with a reduced set of important features and derives a bound on the number of trees. In addition, several researchers have worked on proving the consistency of RF and on leveraging its dependency on the data [4, 5, 9, 16]. Denil et al. [9] used a Poisson distribution for feature selection while growing a tree, whereas Wang et al. [16] proposed a Bernoulli Random Forest (BRF) framework that incorporates a Bernoulli distribution for feature and splitting-point selection.

The conventional RF assigns equal weights to the votes cast by each individual tree [6]; the prediction is then made by majority voting. However, in real-life scenarios a dataset may have a huge number of features while the proportion of truly informative features is small. Decision trees whose nodes are populated by less informative attributes therefore contribute less, and not all trees in a forest contribute equally to better classification [8]. Hence, instead of assigning a fixed weight to each decision tree, a dynamic weight should be assigned. Paul et al. [13] proposed a method that computes the weights during the training phase and assigns a fixed weight to each decision tree. The mechanisms proposed by Winham et al. [17] and Liu et al. [12] both compute the weight either from the performance of a tree on OOB samples or through a feature weighting scheme. Akash et al. [2] compute a confidence value, used as a weight in RF, from the entropy or Gini score calculated during tree construction. However, these methods do not consider the relationship between the weights and the test samples. Therefore, a dynamic weighting scheme is proposed in this paper. It computes the similarity between a test case and each decision tree using an exponential distribution. The resulting forest is named the Exponentially Weighted Random Forest (EWRF).

The remainder of this paper is organized as follows: Sect. 2 describes RF as a classifier and a regressor and the problem associated with conventional RF. Section 3 presents the proposed EWRF approach. Section 4 discusses the implementation details and performance. Section 5 concludes the paper.

2 Random Forest

Random forest is built upon decision trees as atomic units. Each decision tree behaves either as a classifier for classification or as a regressor that predicts the output for a regression task. Given a dataset \(\mathbbm {D} = \{(X_{1},C_{1}),(X_{2},C_{2}),\ldots ,(X_{M},C_{M})\}\) with M instances such that \(X_{i} \in \mathbbm {R}^{N}\) with N attributes, let the class labels be \({C}_{i} \in \{Y_{1},Y_{2},\ldots ,Y_{C}\}\). Initially, the dataset \(\mathbbm {D}\) is partitioned into a training set \(\mathbbm {D}_{1}\) with \({M'}\) instances \({(M' < M)}\) and a testing set \(\mathbbm {D}_{2}\) with the remaining instances. Decision trees are constructed from the training samples using bootstrap sampling (random sampling with replacement) as described in [6].
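
To make the bagging step concrete, the following is a minimal sketch in Python, assuming NumPy arrays for the data; the synthetic dataset and the function name are only illustrative, not part of the original method.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw as many instances as in X, with replacement, as in bagging [6]."""
    idx = rng.integers(0, len(X), size=len(X))  # random sampling with replacement
    return X[idx], y[idx]

# Usage: one bootstrap replicate per decision tree (toy data for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # M' = 100 training instances, N = 5 attributes
y = rng.integers(0, 3, size=100)     # class labels from {Y_1, Y_2, Y_3}
X_boot, y_boot = bootstrap_sample(X, y, rng)
```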

2.1 Random Forest as Classifier

Random forest assigns the class value based on the proportion of the individual class values present at the leaf node.

The class distribution for the \({j^{th}}\) class at the terminal node h, in the decision tree t, for the test case X, can be represented as:

$$\begin{aligned} p_{j,h}^{t} = \frac{1}{n_{h}}\sum _{{i}\in {h}}\mathbb {I}{(Y_{i}=j)} \end{aligned}$$
(1)

here, \({n_{h}}\) is the total number of instances in the terminal node h, and \(\mathbb {I}(\cdot )\) is an indicator function.

Based on the maximum class distribution, the class value j is assigned by the decision tree t for the test case X using the following equation:

$$\begin{aligned} \hat{Y}_{j}^{t} = \underset{1 \le j \le C}{\arg \max }\{p_{j,h}^{t}\} \end{aligned}$$
(2)

To assign the final class value by majority voting in conventional RF, the votes for class j cast by the individual decision trees for the test case X are first counted using the following equation:

$$\begin{aligned} C(Y = j) = \sum _{t = 1}^{n_{tree}} \mathbbm {1}\cdot \mathbb {J}{(\hat{Y}_{j}^{t})} \end{aligned}$$
(3)

here, \(\mathbb {J}(\cdot )\) is an indicator function. Finally, based on majority voting, RF assigns the final class value using Eq. (4).

$$\begin{aligned} \hat{Y} = \underset{1 \le j \le C}{\arg \max }\{C(Y = j)\} \end{aligned}$$
(4)
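
The aggregation described by Eqs. (1)–(4) can be sketched as follows, assuming that tree traversal has already been performed and that `leaf_labels_per_tree` (a hypothetical name) holds, for each tree, the training labels at the leaf reached by the test case X.

```python
from collections import Counter

def tree_prediction(leaf_labels):
    """Eqs. (1)-(2): class proportions at the reached leaf, then the most frequent class."""
    counts = Counter(leaf_labels)
    # p_{j,h}^t = counts[j] / len(leaf_labels); the argmax does not need the division
    return counts.most_common(1)[0][0]

def rf_majority_vote(leaf_labels_per_tree):
    """Eqs. (3)-(4): count the per-tree votes (each with weight 1) and take the mode."""
    votes = Counter(tree_prediction(labels) for labels in leaf_labels_per_tree)
    return votes.most_common(1)[0][0]

# Usage with three toy trees: per-tree predictions are 0, 1, 0, so the forest outputs 0.
print(rf_majority_vote([[0, 0, 1], [1, 1, 1], [0, 2, 0]]))
```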

2.2 Random Forest as Regressor

In a regression task, the decision trees have to predict a real-valued outcome: the value associated with each instance is a single real number, i.e. \(Y_{i}\in \mathbb {R}\). To construct RF as a regressor, the Mean Squared Error (MSE) is used as the splitting criterion. Once all the decision trees are constructed, the test instance is passed to each decision tree; based on the node values, it follows either the left or the right subtree until it reaches a leaf node. The value predicted for a test case \(\mathbf {X}\) at a terminal node h by the decision tree t is the mean of the instance values present within that leaf node. It can be calculated as:

$$\begin{aligned} \hat{Y}_{h}^{t} = \frac{1}{n_{h}}\sum _{{i}\in {h}}{Y_{i}} \end{aligned}$$
(5)

Finally, the value predicted by the RF is the average of the values predicted by the individual trees. Hence, the overall prediction made by the forest can be computed as:

$$\begin{aligned} \hat{Y} = \frac{1}{n_{tree}}\sum _{t = 1}^{n_{tree}} {{\mathbbm {1}}} \cdot \hat{Y}_{h}^{t} \end{aligned}$$
(6)
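
Analogously, a minimal sketch of Eqs. (5)–(6), again assuming hypothetical precomputed leaf contents per tree:

```python
def rf_regression(leaf_values_per_tree):
    """Eq. (5): mean outcome at each reached leaf; Eq. (6): average over the trees."""
    tree_preds = [sum(v) / len(v) for v in leaf_values_per_tree]   # Eq. (5)
    return sum(tree_preds) / len(tree_preds)                       # Eq. (6)

# Usage: three toy trees predicting 2.0, 2.0 and 2.0, so the forest outputs 2.0.
print(rf_regression([[1.0, 3.0], [2.0], [4.0, 0.0]]))
```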

2.3 Problem with Conventional Random Forest

For a random forest classifier to be effective, each decision tree must have reasonably good classification performance, and the trees must be diverse and weakly correlated [14]. Diversity is obtained by randomly choosing training instances and attributes for each tree. However, a decision tree cannot always contribute effectively to each and every test instance. For a dataset with a high proportion of less informative attributes, the performance of RF is significantly affected, because the decision trees contribute equally during majority voting. In such cases, performance can be improved by reducing the contribution of decision trees whose nodes are populated by non-informative attributes, i.e. by assigning a dynamic weight to the decision trees [3, 11].

3 Proposed Method

The proposed EWRF consists of two steps. In the first step, decision trees are constructed as in conventional RF [6]. In the second step, an exponential weight score is calculated as described in the following subsections.

3.1 Exponential Weight Score Calculation

During the testing phase, test samples are passed to each and every decision tree in the forest. Let \(F_{i}\) be the feature value used for splitting at an internal node of a decision tree t. A test sample \({X} = \{a_{1},a_{2},\ldots ,a_{j},\ldots ,a_{N}\}\) is passed to a decision tree; it is guided either to the left \((a_{j}^{X} \le \tau )\) or the right \((a_{j}^{X} > \tau )\) subtree, based on the threshold \(\tau \), and moves down until it reaches a leaf node of the decision tree t. The sum of the distances between the corresponding attribute values of the test sample X and the participating nodes \(F_{i}\) along the path of the decision tree t is calculated as follows:

$$\begin{aligned} {d} = \sum { ||F_{i}-a_{j}^{X}||_{2} }; \forall {F_{i} \in t; a_{j} \in X} \end{aligned}$$
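
A minimal sketch of this path-distance computation, assuming a simple hypothetical node representation (splitting feature index, splitting value \(F_i\), left/right children) and interpreting \(||\cdot ||_{2}\) as the absolute difference for scalar attribute values; this is an illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None       # index j of the splitting attribute
    threshold: Optional[float] = None   # splitting value F_i / threshold tau (None for a leaf)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def path_distance(root, x):
    """Accumulate |F_i - a_j^X| over the internal nodes visited by test sample x."""
    d, node = 0.0, root
    while node.threshold is not None:
        a_j = x[node.feature]
        d += abs(node.threshold - a_j)                    # contribution of this node
        node = node.left if a_j <= node.threshold else node.right
    return d

# Usage on a toy two-level tree: |0.5 - 0.8| + |2.0 - 1.5| = 0.8
tree = Node(0, 0.5, left=Node(), right=Node(1, 2.0, left=Node(), right=Node()))
print(path_distance(tree, [0.8, 1.5]))
```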

Thus we obtain distances \(\{d_{1},d_{2},\ldots ,d_{n_{tree}}\}\) for each test sample, one with respect to each decision tree. The smaller the value of d for a decision tree, the higher the similarity between the tree and the test case along that path, and hence the higher the corresponding weight value. This is illustrated in Fig. 1. In the proposed EWRF, the weight associated with each decision tree is directly proportional to the similarity between the test instance and the decision tree. Hence, the weight of a decision tree is computed using an exponential distribution measure, which maintains this relationship. In this way, the weight of each decision tree may vary for each incoming test case. The exponential tree weight score is calculated as follows:

$$\begin{aligned} W_\mathbf {X}^{t} = \frac{1}{Z}\exp \left\{ -\frac{\sum { ||F_{i}-a_{j}^{X}||_{2} }}{\alpha } \right\} \end{aligned}$$
(7)

where Z is a normalizing term, the sum of the weights of all decision trees, and \(\alpha \) is a hyper-parameter that controls the weight score. For classification, Eq. (3) becomes:

$$\begin{aligned} C(Y = j) = \sum _{t = 1}^{n_{tree}} (W_\mathbf {X}^{t})\cdot \mathbb {J}{(\hat{Y}_{j}^{t})} \end{aligned}$$
(8)

For regression, Eq. (6) becomes:

$$\begin{aligned} \hat{Y} = \frac{1}{n_{tree}}\sum _{t = 1}^{n_{tree}} (W_\mathbf {X}^{t}) \cdot \hat{Y}_{h}^{t} \end{aligned}$$
(9)

Finally, weighted voting is performed using Eqs. (8) and (9) for predicting the output in classification and regression tasks respectively, as shown in Fig. 2. The pseudo code for predicting the class or regression value is provided in Algorithm 1.
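
As a complement to Algorithm 1, the following is a minimal sketch of the weighting and aggregation steps in Eqs. (7)–(9), assuming the per-tree distances \(d_{t}\) and per-tree predictions have already been computed as described above; the function names and toy inputs are illustrative only.

```python
import math
from collections import defaultdict

def exponential_weights(distances, alpha):
    """Eq. (7): w_t proportional to exp(-d_t / alpha), normalized by Z = sum of raw weights."""
    raw = [math.exp(-d / alpha) for d in distances]
    z = sum(raw)
    return [w / z for w in raw]

def ewrf_classify(distances, tree_classes, alpha=0.45):
    """Eq. (8): weighted vote over the classes predicted by the individual trees."""
    weights = exponential_weights(distances, alpha)
    score = defaultdict(float)
    for w, c in zip(weights, tree_classes):
        score[c] += w
    return max(score, key=score.get)

def ewrf_regress(distances, tree_values, alpha=0.75):
    """Eq. (9): weighted combination of tree outputs, scaled by 1/n_tree as in the paper."""
    weights = exponential_weights(distances, alpha)
    n_tree = len(tree_values)
    return sum(w * v for w, v in zip(weights, tree_values)) / n_tree

# Usage: three trees; the closest tree (smallest distance) dominates the weighted vote.
print(ewrf_classify([0.2, 1.5, 3.0], ['a', 'b', 'b']))   # -> 'a'
print(ewrf_regress([0.2, 1.5, 3.0], [1.0, 2.0, 3.0]))
```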

Fig. 1.

An example showing the calculation of the distance during testing. In this example, a test instance X follows the path marked with bold blue lines \(( F_{e},\) \(F_{l}\) and \(F_{p} )\) to reach the leaf node. The distance is calculated at each corresponding node along the path followed by the test case. At the root node, all the distances are summed up to obtain the final distance between the test case and the decision tree. (Color figure online)

Fig. 2.

The proposed EWRF method, showing how the exponentially weighted score is calculated by the different decision trees for a given test instance. Weighted voting is then performed for the final prediction.

4 Experimental Results

This section describes the datasets, the implementation details, and the performance analysis of EWRF in comparison with conventional RF and state-of-the-art methods.

4.1 Datasets and Implementation Details

The experiments have been conducted on benchmark datasets that are publicly available in the UCI repository [1]. These datasets come from a variety of domains and have different combinations of numerical attribute values. They vary in terms of the number of classes, features, and instances, allowing rigorous testing of the proposed method.

There are five main parameters for the experiments: (1) the number of trees \({n_{tree}}\), (2) the minimum number of instances at a leaf node \({n_{min}}\), (3) the ratio in which the dataset is divided into training and test sets, (4) the maximum tree depth \({T_{depth}}\), and (5) the value of \(\alpha \) for computing the exponential weighting score. The value of \({n_{tree}}\) is decided empirically: experiments on the Vehicle, Wine, and Abalone datasets with \({n_{tree}}\) in the range 10 to 100 (step size 10) show that the accuracy saturates beyond \({n_{tree}} = 50\), so it is kept at 50 in all experiments. The \({n_{min}}\) is kept at 5 and the ratio for dividing the datasets into training and testing sets is kept at 0.5; these values are taken from the state-of-the-art methods for a fair comparison. Experiments have been run with different values of \({T_{depth}}\), and the results are reported for the depth that gives the best accuracy among the trials. The value of \(\alpha \) is chosen as 0.45 for classification and 0.75 for regression, based on experiments with \(\alpha \in \{0.15, 0.45, 0.75, 1.0\}\). Each experiment is repeated 10 times with a random selection of the training and testing subsets.
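
For reference, these settings can be collected in a small configuration block; the following sketch uses hypothetical variable names and is not taken from the authors' experimental scripts.

```python
# Illustrative experiment configuration, following the values reported above.
EWRF_CONFIG = {
    "n_tree": 50,          # trees per forest; accuracy saturates beyond 50
    "n_min": 5,            # minimum number of instances at a leaf node
    "train_ratio": 0.5,    # half the instances for training, half for testing
    "alpha": {"classification": 0.45, "regression": 0.75},
    "alpha_grid": [0.15, 0.45, 0.75, 1.0],   # values tried when tuning alpha
    "n_repeats": 10,       # random train/test splits per dataset
}
```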

Table 1. MSE comparison between state-of-the-art methods and the proposed EWRF, averaged over 10 iterations (the lowest value is the best)
Table 2. Classification accuracy comparison between state-of-the-art methods and the proposed EWRF, averaged over 10 iterations (the highest value is the best)

4.2 Performance Analysis

The results of the proposed EWRF are compared with the conventional RF [6] and with state-of-the-art methods, i.e. four variants of random forest, Biau08 [5], Biau12 [4], Denil [9], and BRF [16], on the regression and classification datasets. The best learning performance in each comparison is marked in boldface for each dataset.

For regression, it can be observed from Table 1 that EWRF achieves a significant reduction in MSE on seven out of ten datasets. In particular, for the Concrete dataset, Biau08 [5], Biau12 [4], Denil [9], and BRF [16] have almost the same MSE. Moreover, there is a reduction of more than \(50\%\) in MSE for the Yacht, Concrete, and Housing datasets. The proposed method also shows improvement for datasets with a large number of classes, such as Student and Automobile. From Table 1, it is clear that the proposed method offers a considerable improvement over the compared state-of-the-art methods.

For classification, the comparison between the existing state-of-the-art methods and the proposed EWRF is shown in Table 2. EWRF shows an improvement over Biau08 [5], Biau12 [4], and Denil [9] for all the classification datasets except Spambase. In comparison with BRF [16], the proposed method shows an improvement on seven out of eleven datasets, and in comparison with conventional RF, on nine out of eleven datasets.

5 Conclusion

The conventional Random Forest (RF) assigns equal weights to the votes cast by each individual tree, and the approaches proposed in the past assign weights to the decision trees only during the training phase. In this paper, we have explored the dynamic relationship between test samples and decision trees, based on which aggregation/weighted voting is performed. Thus, the weights derived in EWRF are dynamic in nature. The proposed method has been tested on various heterogeneous datasets and compared with state-of-the-art competitors, showing improvement for both regression and classification tasks.