
1 Introduction

Machine learning is a subdomain of artificial intelligence whose popularity and success are constantly growing [1, 2]. Its main goal is to extract high-level patterns, i.e. knowledge, from large amounts of raw information, patterns that can provide more abstract and useful insight into the data under study. Many problems in science and social science can be expressed as classification or regression problems, where one does not know an analytical model of some underlying phenomenon, but sampled data is available through experiments or observations, and the aim is to define a predictive model based on those samples. To date, many such algorithms have been proposed, which belong to different paradigms, e.g. neural networks, nearest neighbor, decision trees, support vector machines, Bayesian approaches, etc.

Unfortunately, there is no single best algorithm that can handle the large variety of situations encountered in practice. Each method has its own advantages and disadvantages. They are mainly related to the flexibility or complexity of the models and their generalization capabilities. For a non-trivial pattern, using a very simple model may result in poor performance, whereas using an overly complex model can result in overfitting, i.e. very good results for the training set and poor results for the test set or prediction, in general. Therefore, one must make several choices when dealing with such a problem: first, to establish the most appropriate learning method and, second, to control the complexity of the model generated with that learning method by changing its specific parameters.

In the present paper, we investigate the performance of some well-established algorithms in comparison to an original regression algorithm, namely the Large Margin Nearest Neighbor Regression (LMNNR), which combines the idea of nearest neighbors with that of a large separation margin, typical of support vector machines. The sublimation of naphthalene was chosen as a case study because the toxicity of the process makes experimentation difficult, so model-based predictions become recommended and useful.

We organize our paper as follows. Section 2 presents a selection of related work about regression algorithms applied for the modeling of chemical processes. Section 3 describes the dataset used for the experiments and Sect. 4 presents the algorithms employed to model it. Section 5 describes some experimental results, while Sect. 6 contains the conclusions of our work.

2 Related Work

There are many applications of artificial intelligence and soft computing methods in the domain of chemical engineering, especially for modeling and optimization. In this section, we review several applications of regression algorithms for chemical processes.

Article [3] proposes a combination of online support vector regression with an ensemble learning system in order to adapt to nonlinear and time-varying changes in process characteristics and to the various process states in a chemical plant. [4] uses a probabilistic combination of local independent component regression in order to assess the quality of chemical processes with multiple operation modes. [5] addresses a nonlinear, time-variant soft sensor modeling problem for process quality prediction using locally weighted kernel principal component regression. [6] uses multiple linear regression and least squares support vector regression to model and optimize the dependency of methyl orange removal on various influential adsorption parameters. [7] compares the performance of support vector regression, neural network and random forest models in predicting and mapping soil organic carbon stocks.

In [8], the authors make a thorough presentation of neural networks used for bioprocessing and chemical engineering, with applications in process forecasting, modeling, control of time-dependent systems, and the hybridization between neural networks and expert systems.

The prediction of sublimation thermodynamic properties, such as the enthalpy, entropy, and free energy of sublimation, using machine learning methods was addressed in [9]. Semi-empirical models were used to model systems of solids and supercritical fluids in order to determine sublimation pressures and sublimation enthalpies, and then to model different multiphase equilibria [10].

Some of the recent research of the authors of the present paper addressed a performance comparison of different regression methods for a polymerization process with adaptive sampling [11], a comparison between simulation and experiments for phase equilibrium and physical properties of aqueous mixtures [12], an experimental analysis and mathematical prediction of cadmium removal by biosorption [13] and the prediction of corrosion resistance of some dental metallic materials with an original adaptive regression model based on the k-nearest-neighbor regression technique [14].

3 The Naphthalene Sublimation Dataset

Our case study is naphthalene sublimation – a physical process in which a solid transitions directly into vapor. This technique is one of the most convenient methods for studying heat and mass transfer. In addition, the rate of sublimation, i.e. the amount of solid converted to vapor per unit of time and unit of solid surface, is used to study problems related to environmental protection, health protection, transportation safety and security, and meteorology, for instance by determining the concentration of various substances in the environment or the dynamical properties in a wind tunnel.

In a previous approach [15], a series of experiments was performed to investigate the sublimation of naphthalene samples under atmospheric pressure, in air as entrainer, without recycle. Our experimental data fulfill a necessary condition for empirical modeling: a sufficient number of data points was obtained, covering the investigated domain uniformly.

The sample weight was measured continuously as a function of time, at different air flow characteristics. The experimental data were then used to calculate the mass transfer rate, the degree of sublimation, and the position of the sublimation front; the influence of the air flow characteristics was also evaluated.

More details on the experiments and data processing can be found in [15], where neural network modeling was performed. In the current work, a more efficient algorithm, LMNNR, was applied and compared with other algorithms: linear regression, support vector regression, neural networks, k-Nearest Neighbors, K*, and Random Forest. In addition, a larger dataset was used here (1323 instances), including different shapes of the samples, while in [15] only spherical samples were considered (150 instances).

The data gathered from the experiments contain four input variables: the shape of the sample (i.e. pellets, small pills, large pills and rods), time, air (entrainer) speed and temperature, and one output: the rate of naphthalene sublimation.

Consequently, the modeling purpose was to evaluate the performance of the process, quantified by the sublimation rate depending on process time, entrainer temperature, and entrainer flow rate.

In order to apply the instance-based methods, the data is normalized between 0 and 1, independently for each numerical attribute.
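As a minimal sketch of this preprocessing step (assuming the numerical attributes are stored as columns of a NumPy array; the function and variable names are illustrative), the per-attribute scaling to [0, 1] could look as follows:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each numerical column of X to [0, 1] independently."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # keep constant columns from causing a division by zero
    return (X - col_min) / col_range
```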

Figure 1 presents some statistics regarding the distribution of the data before normalization: a histogram for the discrete input and a box plot for each numerical input, showing the minimum value, the first quartile, the median, the third quartile and the maximum value. For the output, two box plots are included, with a linear and a logarithmic scale. The output takes values between 0.003 and 832.98, with a mean of 34.95 and a median of 9.24. A few much larger values lie far from the median, but they are not outliers; they are important results of the process that are difficult to learn and need to be handled accordingly by the regression models.

Fig. 1. Statistics of the inputs and the output of the naphthalene sublimation dataset

4 Regression Algorithms

The goal of the paper was to find a good model for the naphthalene sublimation data. The first step was to apply classical methods, with known good performance, implemented in Weka [16]. This was intended to constitute a basis for comparison with the original LMNNR algorithm. From the large number of algorithms in Weka, a few were selected which, in previous studies, were noticed to yield good performance for a large number of regression problems. Thus, neural networks, support vector machines, nearest neighbor, K-Star and random forest were selected. The details about their structure and operation are given below.

It must be emphasized that these techniques have very different natures and assumptions, and, by comparing the LMNNR results with the best results obtained with any of these classical algorithms, we can show that the algorithm proposed by the authors is indeed a good alternative for regression.

4.1 Classical Algorithms

Neural networks in the form of multilayer perceptrons (MLP) are often used for classification and regression problems. The structure of an MLP contains an input layer, an output layer and one or more hidden layers of neurons, and each connection between neurons has an associated weight. Each neuron sums the weighted outputs of the neurons in the previous layer, adds another term (the bias), and passes the result through a nonlinear transformation called an activation function before sending it to the neurons in the next layer. In the training process, the weights and biases are adjusted so that the output of the network matches the desired output for the vectors in the training set. The most frequently used training algorithm is back-propagation [17], which minimizes the mean squared error between the desired output and the computed one using the gradient descent method.
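As a minimal illustration of this forward computation (a simplified sketch, not the Weka implementation used in the experiments), one hidden layer with sigmoid activations and a linear output neuron can be written as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer MLP: weighted sums plus biases,
    sigmoid hidden activations, linear output. The weights and biases are
    assumed to have been fitted beforehand by back-propagation."""
    hidden = sigmoid(W1 @ x + b1)     # hidden layer activations
    return (W2 @ hidden + b2).item()  # linear output neuron
```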

The Epsilon-Support Vector Regression (ε-SVR) algorithm tries to approximate the desired continuous output within a tolerated error ε while using the large-margin idea characteristic of support vector machines [18]. When the data is not linearly separable, the ε-SVR algorithm uses kernels to transform it into a higher-dimensional space. Several types of functions can be used as kernels, e.g. polynomial or radial basis functions (RBF). If some training instances still do not satisfy the constraints, slack variables are introduced to allow some errors (soft margin). The number of such erroneous instances can be controlled with a cost parameter C: if the value of C is decreased, more training errors are allowed, which can however lead to better generalization.
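The experiments in this paper were run in Weka; as a rough scikit-learn sketch of the same kind of model (using the parameter values later reported as best in Sect. 5.1, and normalizing the inputs first because SVR is sensitive to attribute scales):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# RBF kernel with gamma = 14, C = 10 and epsilon = 0.001 (values from Sect. 5.1).
svr = make_pipeline(
    MinMaxScaler(),
    SVR(kernel="rbf", C=10.0, gamma=14.0, epsilon=0.001),
)
# svr.fit(X_train, y_train); y_pred = svr.predict(X_test)
```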

The k-Nearest Neighbor (kNN) algorithm selects the k nearest neighbors of a query instance using a distance function as a criterion and computes the output by aggregating the outputs of those k training instances. As a distance function, one can use the Euclidean or the Manhattan distance, both particular cases of the Minkowski distance. Choosing the value of k is important: if k is too small, the prediction can be affected by noise in the training data, and if k is too large, distant neighbors can affect the correctness of the results. To avoid the difficulty of finding an optimal value for k, one can weight the influence of the neighbors: closer neighbors receive greater weights, while those farther away receive smaller ones.
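A distance-weighted kNN regressor of this kind can be sketched with scikit-learn as follows (k = 2 and the Manhattan metric correspond to the settings later reported as best in Sect. 5.1; the paper's own experiments used Weka):

```python
from sklearn.neighbors import KNeighborsRegressor

# weights="distance" gives closer neighbors a larger influence on the prediction;
# p=1 selects the Manhattan distance, p=2 the Euclidean one (both Minkowski cases).
knn = KNeighborsRegressor(n_neighbors=2, weights="distance", p=1)
# knn.fit(X_train, y_train); y_pred = knn.predict(X_test)
```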

The K-Star algorithm [19] is an instance-based classifier that very much resembles the k-Nearest Neighbor algorithm presented before. Its novelty comes from the usage of an entropy metric in its similarity function, rather than the usual distance metric. It has been shown in the literature that such an approach has beneficial outcomes for certain industry-related problems [20]. The K-Star algorithm can also be used for regression purposes, similarly to how k-Nearest Neighbor is used.

A random forest [21] is composed of a collection of classification or regression trees. Each tree is generated using randomized split tests on a slightly different training set obtained by bagging. The output for a new instance is computed by aggregating the outputs of the individual trees.
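A comparable scikit-learn sketch (the 200 trees match the value later reported as best in Sect. 5.1; the random seed is an illustrative assumption):

```python
from sklearn.ensemble import RandomForestRegressor

# Each tree is fitted on a bootstrap sample (bagging) with randomized split candidates;
# the forest prediction is the average of the individual tree predictions.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
# rf.fit(X_train, y_train); y_pred = rf.predict(X_test)
```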

4.2 The Large Margin Nearest Neighbor Regression Algorithm

The performance of the above algorithms was compared to that of an original algorithm, Large Margin Nearest Neighbor Regression (LMNNR) [22, 23].

Support vector machines, in a classification context, rely on the idea of finding a large margin between classes by solving an optimization problem. This idea was used in conjunction with the k-Nearest Neighbor method, also for classification [24]. Its main idea is to change the distance metric of the kNN space by using a matrix M:

$$ d_{M}(\mathbf{x}_{i}, \mathbf{x}_{j}) = \left( \mathbf{x}_{i} - \mathbf{x}_{j} \right)^{T} \mathbf{M} \left( \mathbf{x}_{i} - \mathbf{x}_{j} \right). $$
(1)

If M is a diagonal matrix, the weights of the neighbors are:

$$ w_{d_{M}}(\mathbf{x}, \mathbf{x}') = \frac{1}{d_{M}(\mathbf{x}, \mathbf{x}')} = \frac{1}{\sum\limits_{i=1}^{n} m_{ii} \cdot \left( x_{i} - x'_{i} \right)^{2}}. $$
(2)

Equation (2) involves a single, global matrix M for all the instances. However, it is possible to have different distance metrics for the different instances or groups of instances. Thus, prototypes can be used which are defined as special locations in the input space of the problem, and each prototype P has its own matrix \( \mathbf{M}^{P} \). When computing the distance weight to a new point, an instance uses the weights of its nearest prototype, i.e. \( m_{ii}^{P} \) instead of \( m_{ii} \) in Eq. (2).
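A minimal NumPy sketch of this per-prototype weighting (Eqs. (1)–(2) restricted to diagonal matrices, assuming the prototype positions and their diagonal weights have already been learned; the nearest prototype is found here by plain Euclidean distance, and the small eps term is only a numerical safeguard, not part of the formulation):

```python
import numpy as np

def nearest_prototype(x, prototypes):
    """Index of the prototype closest to x (plain Euclidean distance)."""
    return int(np.argmin(np.sum((prototypes - x) ** 2, axis=1)))

def metric_weight(x, x_other, prototypes, diag_metrics, eps=1e-12):
    """Eq. (2): inverse of the weighted squared distance of Eq. (1),
    using the diagonal metric of the prototype nearest to x."""
    m = diag_metrics[nearest_prototype(x, prototypes)]  # the m_ii^P of that prototype
    d = np.sum(m * (x - x_other) ** 2)                  # Eq. (1) with a diagonal M
    return 1.0 / (d + eps)
```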

Finding the appropriate matrices is achieved by solving an optimization problem. In a simplified formulation, the objective function F to be minimized takes into account two criteria with equal weights, F1 and F2, described below. In order to briefly explain the expressions of these functions, the following notation is used, where \( d_M \) denotes the weighted squared distance function with the weights being sought: \( d_{ij} = d_M(\mathbf{x}_i, \mathbf{x}_j) \), \( d_{il} = d_M(\mathbf{x}_i, \mathbf{x}_l) \), \( g_{ij} = |f(\mathbf{x}_i) - f(\mathbf{x}_j)| \), \( g_{il} = |f(\mathbf{x}_i) - f(\mathbf{x}_l)| \).

The first criterion is:

$$ F_{1} = \sum\limits_{i=1}^{n} \sum\limits_{j \in N(i)} d_{ij} \cdot \left( 1 - g_{ij} \right), $$
(3)

where N(i) is the set of the k nearest neighbors of instance i, e.g. k = 3. Intuitively, this criterion requires that neighbors whose outputs are similar to that of instance i should lie close to it, while neighbors with very different outputs may lie farther away.

The second criterion is expressed as follows:

$$ F_{2} = \sum\limits_{i=1}^{n} \sum\limits_{j \in N(i)} \sum\limits_{l \in N(i)} \max\left( 1 + d_{ij} \cdot \left( 1 - g_{ij} \right) - d_{il} \cdot \left( 1 - g_{il} \right),\; 0 \right). $$
(4)

Here, the distance to the neighbors with close values (the positive term) is minimized, while simultaneously trying to maximize the distance to the neighbors with distant values (the negative term). An arbitrary margin of at least 1 should be present between an instance with a close value and another with a distant value.
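The sketch below evaluates F1 and F2 for a candidate metric, under simplifying assumptions: a single global diagonal metric instead of per-prototype matrices, neighborhoods N(i) found by ordinary Euclidean search, NumPy arrays X and y, and outputs already normalized to [0, 1].

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lmnnr_objective(m_diag, X, y, k=3):
    """Simplified F = F1 + F2 (Eqs. (3)-(4)) for one global diagonal metric m_diag."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    f1 = f2 = 0.0
    for i in range(len(X)):
        neigh = idx[i, 1:]                                   # N(i), skipping i itself
        d = np.sum(m_diag * (X[neigh] - X[i]) ** 2, axis=1)  # weighted squared distances
        g = np.abs(y[neigh] - y[i])                          # output differences
        f1 += np.sum(d * (1.0 - g))                          # Eq. (3)
        for a in range(k):
            for b in range(k):                               # Eq. (4), margin of 1
                f2 += max(1.0 + d[a] * (1.0 - g[a]) - d[b] * (1.0 - g[b]), 0.0)
    return f1 + f2
```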

For the optimization, either an evolutionary algorithm or an approximate gradient method based on the central-difference definition of the derivative can be used.

The estimated output of a new query instance x q is computed as follows. Its k nearest neighbors are identified using the distance metric from Eq. (1). The weights of these neighbors are computed with Eq. (2) and then normalized:

$$ w_{d_{M}}^{n}(\mathbf{x}_{i}, \mathbf{x}_{q}) = \frac{w_{d_{M}}(\mathbf{x}_{i}, \mathbf{x}_{q})}{\sum\limits_{j=1}^{k} w_{d_{M}}(\mathbf{x}_{j}, \mathbf{x}_{q})}. $$
(5)

Finally, the output is computed as a weighted average of the neighbor outputs:

$$ \tilde{f}\left( \mathbf{x}_{q} \right) = \sum\limits_{i=1}^{k} w_{d_{M}}^{n}(\mathbf{x}_{i}, \mathbf{x}_{q}) \cdot f\left( \mathbf{x}_{i} \right). $$
(6)
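Putting Eqs. (5)–(6) together, the prediction step can be sketched as follows (again with a single global diagonal metric for brevity, and with a small eps term added only to avoid division by zero):

```python
import numpy as np

def lmnnr_predict(x_q, X_train, y_train, m_diag, k=5, eps=1e-12):
    """Weighted-average prediction of Eqs. (5)-(6) with a global diagonal metric."""
    d = np.sum(m_diag * (X_train - x_q) ** 2, axis=1)  # Eq. (1) with a diagonal M
    neigh = np.argsort(d)[:k]                          # k nearest neighbors under that metric
    w = 1.0 / (d[neigh] + eps)                         # Eq. (2)
    w /= w.sum()                                       # Eq. (5): normalize the weights
    return float(np.dot(w, y_train[neigh]))            # Eq. (6): weighted average of outputs
```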

5 Results and Discussion

In this section, the choice of parameters for different regression methods is explained. For each algorithm, multiple experiments with different parameter values were performed. The tables containing the results only display those with the best performance in terms of correlation coefficient (r) and root mean square error (RMSE).

In order to compare the performance of the various algorithms, 10-fold cross-validation was used. Since an objective comparison was intended, the dataset was randomly divided into 10 groups (each group used in turn for testing, with the rest used for training), and the same groups were used by all the algorithms. This methodology was considered particularly important for comparing the algorithms implemented in Weka with the original implementation of the LMNNR algorithm. The results obtained for the individual test groups, although interesting, were omitted from the results section, and only the aggregated results are displayed in the following tables.
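As a sketch of this protocol (assuming NumPy arrays X and y and any model exposing the scikit-learn fit/predict interface; the actual experiments used Weka and the authors' own LMNNR implementation), the fold assignment can be fixed once and reused for every algorithm:

```python
import numpy as np
from sklearn.model_selection import KFold

def evaluate(model, X, y, n_splits=10, seed=0):
    """10-fold cross-validation with a fixed fold assignment; returns (r, RMSE)."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)  # same folds for every model
    preds = np.empty_like(y, dtype=float)
    for train_idx, test_idx in folds.split(X):
        model.fit(X[train_idx], y[train_idx])
        preds[test_idx] = model.predict(X[test_idx])
    r = float(np.corrcoef(y, preds)[0, 1])
    rmse = float(np.sqrt(np.mean((y - preds) ** 2)))
    return r, rmse
```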

5.1 Parameters of Regression Methods

Multilayer Perceptron Neural Network (MLP).

For the problem at hand, repeated experiments showed that the neural network produces its best results with a low learning rate. The momentum parameter also has a great impact on learning; its optimal value tends to be around 0.4 or 0.5. The number of hidden layers was chosen automatically by Weka. This option yielded the best outcomes because the optimal number of hidden layers tends to vary between cross-validation sets, making it hard to achieve similar performance with manually chosen values. The discrete input is encoded as "one-hot", leading to 7 inputs and 1 output. The best network architecture had one hidden layer containing 4 neurons with sigmoid activation functions and an output neuron with a linear activation function. Training for 1000 epochs was found to be an acceptable compromise between the quality of the resulting model and the overall training time.
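The sketch below approximates this configuration with scikit-learn rather than Weka (one hidden layer of 4 sigmoid neurons, a linear output unit, stochastic gradient descent with momentum, 1000 epochs); it assumes the shape attribute has already been one-hot encoded into four binary columns, and the exact learning rate and random seed are illustrative values, not ones reported in the paper.

```python
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

mlp = make_pipeline(
    MinMaxScaler(),
    MLPRegressor(hidden_layer_sizes=(4,),    # one hidden layer with 4 neurons
                 activation="logistic",      # sigmoid hidden activations; output is linear
                 solver="sgd", momentum=0.45, learning_rate_init=0.01,
                 max_iter=1000, random_state=0),
)
# mlp.fit(X_train, y_train); y_pred = mlp.predict(X_test)
```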

Support Vector Regression (SVR).

The Epsilon-SVR algorithm achieved a very good overall fit when a radial basis function (RBF) kernel was used. The choice of kernel type had a major influence on the outcome: the RBF kernel yielded a correlation at least 20% better than all of the other options (linear, polynomial and sigmoid). The best results were obtained with relatively large values of the parameters, γ = 14 and C = 10, whereas ε was best kept at a low value, i.e. ε = 0.001. Fine-tuning these parameters improved the performance of the algorithm significantly, such that the final correlation was the best out of all the algorithms tested in Weka.

k-Nearest Neighbor (k-NN).

The optimal number of neighbors for this algorithm was 2 in this case; the correlation dropped significantly when the number of neighbors was increased above this value. The search method used was the linear nearest neighbor search. A slight improvement was achieved by using the Manhattan distance as the metric instead of the Euclidean distance.

K-Star (K*).

The only numeric parameter that this algorithm takes, the global blending index, was optimal at low values; in the experiments, the value 3 was used. This parameter, however, influenced the outcome only slightly (~5% improvement in correlation). The entropic auto-blend functionality provided by Weka was turned off for these experiments.

Random Forest (RF).

In the case of this algorithm, the number of trees plays an important role in the overall performance. Several tests were conducted to determine the optimal value of this parameter, and the best outcome was recorded with approximately 200 trees. Although the difference in performance obtained by optimizing this parameter was only around 3%, it allowed the Random Forest algorithm to yield one of the best correlations for the data.

Large Margin Nearest Neighbor Regression (LMNNR).

For this algorithm, the parameters are the number of prototypes, the number of optimization neighbors and the number of regression neighbors. Different combinations of values for these parameters were tried. Since the LMNNR results are not deterministic, because the matrices are initialized randomly before being optimized, the best results out of 100 runs of the algorithm were included for each configuration.

5.2 A Comparison Between Algorithm Performance

In Tables 1 and 2, one can see the best results achieved with the use of the regression algorithms presented in the previous section.

Table 1. The best results obtained for optimized configurations by algorithms in Weka
Table 2. The best results obtained with the original LMNNR algorithm

Five out of the six algorithms tested show a very good correlation with the data (~0.9) and lie within a very narrow range of one another. Linear regression, which is included only for comparison, achieves a low overall correlation; this emphasizes the nonlinearity of the problem at hand. ε-SVR and Random Forest yield the best, almost identical, predictions. kNN and K-Star give similar results, despite the different metrics they use in their similarity functions.

From Table 2, it can be seen that the LMNNR results are clearly better than the results obtained by other well-established regression algorithms.

Unlike the problems studied in previous works [22, 23], more prototypes are needed for this particular problem: 5 prototypes provide the best results in terms of the correlation coefficient. This shows that this dataset is more difficult to learn with a single distance metric and that different regions of its input space have different characteristics, which can be properly addressed by the use of prototypes.

Figure 2 shows a comparison between the predictions of the model and the desired data, for the configuration with 5 prototypes, 5 regression neighbors and 5 optimization neighbors from Table 2, which yields the highest correlation coefficient r. One can see that the two data series are quite close. An exception is, for example, the data point with the normalized value of 1. Since Fig. 2 presents the results for the 10 test sets of the cross-validation process put together, the data point with the maximum value in a test set cannot be correctly approximated by a model that relies only on the rest of the data in the training set. The LMNNR algorithm is based on the nearest neighbor paradigm and therefore cannot extrapolate to a value larger than any value in the training set. Furthermore, most of the data has small output values: only 0.8% of the normalized data has output values above 0.5. This contributes to the difficulty of approximating the higher output values.

Fig. 2. Comparison between the predictions of the model and the desired data

6 Conclusions

The results obtained by the LMNNR algorithm proposed by the authors are better than those provided by the other, classical regression algorithms. These predictions are important for the chosen process, avoiding or at least minimizing the number of experiments performed under toxic conditions and saving materials and energy. In addition, the developed modeling methodologies can be easily adapted and applied to other chemical engineering processes.

The promising results of LMNNR motivate the planning of further applications and methodologies that include this algorithm. As a future direction of investigation, one can consider its further refinement in order to automatically determine the optimal values of its parameters, namely the number of prototypes, the number of regression neighbors and the number of optimization neighbors.