1 Linear Support Vector Machine

The support vector machine [1], as mentioned in Chap. 6, provides a classification learning model and algorithm rather than a regression model and algorithm. It uses the simple mathematical model \(\mathbf{y} =\mathbf{ w}{\boldsymbol x'}+\gamma\) and manipulates it to allow linear domain division. The support vector machine can be divided into linear and nonlinear models [2]. It is called a linear support vector machine if the data domain can be divided linearly (e.g., by a straight line or hyperplane) to separate the classes in the original domain. If the data domain cannot be divided linearly, but it can be transformed to a space called the feature space where the domain can be divided linearly to separate the classes, then it is called a nonlinear support vector machine.

Therefore, the steps in the linear support vector machine are: the mapping of the data domain into a response set and the dividing of the data domain. The steps in the nonlinear support vector machine are: the mapping of the data domain to a feature space using a kernel function [3], the mapping of the feature space domain into the response set, and then the dividing of the data domain. Hence, mathematically, we can say that the modeling of a linear support vector machine adopts the linear equation \(\mathbf{y} =\mathbf{ w}{\boldsymbol x'}+\gamma\), and the modeling of a nonlinear support vector machine adopts the nonlinear equation \(\mathbf{y} =\mathbf{ w}\phi ({\boldsymbol x'})+\gamma\); the kernel function makes it nonlinear. The classification technique using a support vector machine includes the parametrization and the optimization objectives. These objectives mainly depend on the topological class structure of the data domain. That is, the classes may be linearly separable or linearly nonseparable. However, linearly nonseparable classes may still be nonlinearly separable. Therefore, the parametrization and optimization objectives that focus on the data domain must take these class properties into consideration.

1.1 Linear Classifier: Separable Linearly

This section mainly focuses on the two-class classification problem [4] using the support vector machine; however, a multiclass support vector machine can easily be derived from a combination of two-class support vector machines by integrating an ensemble approach [5]. Hence, this chapter considers only two-class classification using support vector machine learning models. Let us first consider the linear case. In Chap. 7, some preliminaries for the support vector machine were discussed, and a straight line equation was derived as:

$$\displaystyle{ \mathbf{w}{\boldsymbol x'}+\gamma = 0 }$$
(9.1)

Considering a data domain, this parameterized straight line divides the data domain into two subdomains, which we may call the left subdomain and the right subdomain (as we do with decision tree-based models), denote them by \(D_1\) and \(D_2\), and define them as follows:

$$\displaystyle\begin{array}{rcl} & & D_{1} =\{\mathbf{ x}:\mathbf{ w}{\boldsymbol x'}+\gamma \geq 0\} \\ & & D_{2} =\{\mathbf{ x}:\mathbf{ w}{\boldsymbol x'}+\gamma < 0\}{}\end{array}$$
(9.2)

The points falling in these subdomains may be distinguished with the labels 1 for the subdomain \(D_1\) and −1 for the subdomain \(D_2\). Therefore, the parametrization objective of the support vector machine can be defined as follows:

$$\displaystyle\begin{array}{rcl} & & \mathbf{w}{\boldsymbol x'}+\gamma = 1,\mathbf{x} \in D_{1} \\ & & \mathbf{w}{\boldsymbol x'}+\gamma = -1,\mathbf{x} \in D_{2}{}\end{array}$$
(9.3)

In the parametrization objectives, we have modeled two straight lines (or hyperplanes) that can help to define boundaries between the classes. The optimization objective is to define an objective function (in this case, the distance between the straight lines) and search for the parameter values that maximize the distance. These lines are parallel to each other; therefore, we can simply use the standard distance formula between two parallel lines \(y = mx + b_{1}\) and \(y = mx + b_{2}\) as follows [6]:

$$\displaystyle\begin{array}{rcl} d = \frac{(b_{2} - b_{1})} {\sqrt{m^{2 } + 1}}& &{}\end{array}$$
(9.4)

where the slopes of the straight lines are \(m =\mathbf{ w}\), and their intercepts are \(b_{1} =\gamma +1\) and \(b_{2} =\gamma -1\). By substituting these variables, we can establish the following:

$$\displaystyle\begin{array}{rcl} d = \frac{\pm 2} {\sqrt{\mathbf{w}{\boldsymbol w'} + 1}}& &{}\end{array}$$
(9.5)

Ultimately, this distance formula will be the measure for the optimization problem that we build; therefore, without loss of generality, we can rewrite it as follows:

$$\displaystyle\begin{array}{rcl} d = \frac{\pm 2} {\sqrt{\mathbf{w}{\boldsymbol w'}}}& &{}\end{array}$$
(9.6)

In practice, the support vector machine optimization problem is written using the mathematical norm notation; therefore, we rewrite the above equation as follows:

$$\displaystyle\begin{array}{rcl} d = \frac{\pm 2} {\vert \vert \mathbf{w}\vert \vert }& &{}\end{array}$$
(9.7)

By squaring both sides of the equation and then dividing both sides by 2, we can obtain the following simple mathematical relationship:

$$\displaystyle\begin{array}{rcl} \frac{d^{2}} {2} = \frac{1} {\frac{\vert \vert \mathbf{w}\vert \vert ^{2}} {2} }& &{}\end{array}$$
(9.8)

It states that instead of maximizing the distance function \(d^{2}/2\), we can minimize \(\vert \vert \mathbf{w}\vert \vert ^{2}/2\). In other words, we can minimize the prediction error with respect to the above classifier while maximizing the distance between the two parallel lines (this is the optimization objective). Therefore, the following mathematical expression can be defined for the prediction error between \(\mathbf{x} \in D\) and its response variable y:

$$\displaystyle\begin{array}{rcl} e = 1 - y(\mathbf{w}{\boldsymbol x'}+\gamma )& &{}\end{array}$$
(9.9)

This error function plays a major role in the development of an optimization problem for the support vector machine. Let us now understand its role through the following thinking with examples.

Thinking with Example 9.1

Suppose the actual response y is −1, and the predicted response based on the classifier \(\mathbf{w}{\boldsymbol x'}+\gamma = -1\) is −1; then we have \(e = 1 - (-1)(-1) = 1 - 1 = 0\). Similarly, suppose the actual response y is 1, and the predicted response based on the classifier \(\mathbf{w}{\boldsymbol x'}+\gamma = 1\) is 1; then \(e = 1 - (1)(1) = 1 - 1 = 0\). Therefore, it is clear that the classification error is 0. However, if the actual response y is 1 and the predicted response based on the classifier \(\mathbf{w}{\boldsymbol x'}+\gamma = -1\) is −1, then \(e = 1 - (1)(-1) = 1 + 1 = 2\). This indicates an error in the predicted response. Similarly, if the actual response y is −1, and the predicted response based on the classifier \(\mathbf{w}{\boldsymbol x'}+\gamma = 1\) is 1, then \(e = 1 - (-1)(1) = 1 + 1 = 2\); this also gives the error indicator 2.

Suppose the actual response y is −1.1, and the predicted response of the classifier \(\mathbf{w}{\boldsymbol x'}+\gamma = -1\) is −1; then what is the value of e? Well, \(e = 1 - (-1.1)(-1) = 1 - 1.1 = -0.1\). This means the variable x that corresponds to the response \(y = -1.1\) is on the correct side of the classifier. Now suppose the actual response y is −0.9, and the response of the classifier is \(\mathbf{w}{\boldsymbol x'}+\gamma = -1\); then what is the value of e? The answer is: \(e = 1 - (-0.9)(-1) = 1 - 0.9 = 0.1\). It means the variable x that corresponds to \(y = -0.9\) is on the wrong side of the classifier. Therefore, we can conclude that an error e ≤ 0 is preferred when we optimize the classification.
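These calculations can be reproduced directly from Eq. (9.9). The following lines are a minimal sketch in R; they are not part of the chapter's listings:

    # Error e = 1 - y * f from Eq. (9.9), where f is the classifier output w x' + gamma
    label.error <- function(y, f) 1 - y * f

    label.error(-1,   -1)   #  0   : correct side, on the margin boundary
    label.error( 1,    1)   #  0   : correct side, on the margin boundary
    label.error( 1,   -1)   #  2   : wrong side of the classifier
    label.error(-1.1, -1)   # -0.1 : correct side, beyond the margin
    label.error(-0.9, -1)   #  0.1 : inside the margin (the "wrong side" case above)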

1.1.1 The Learning Model

These examples show that the parameters \(\mathbf{w}\) and γ must be selected such that the error e ≤ 0. This leads to the following inequality:

$$\displaystyle\begin{array}{rcl} 1 - y(\mathbf{w}{\boldsymbol x'}+\gamma ) \leq 0.& &{}\end{array}$$
(9.10)

or, equivalently,

$$\displaystyle\begin{array}{rcl} y(\mathbf{w}{\boldsymbol x'}+\gamma ) \geq 1.& &{}\end{array}$$
(9.11)

By combining the minimization goals, we can create the following optimization problem, and it is the basis for the two-class support vector machine [7]:

$$\displaystyle\begin{array}{rcl} & & \mathop{\text{Minimize:}}\limits_{\mathbf{w},\gamma }\quad \frac{\vert \vert \mathbf{w}\vert \vert ^{2}} {2} \\ & & \text{subject to:}\quad y(\mathbf{w}{\boldsymbol x'}+\gamma ) \geq 1{}\end{array}$$
(9.12)
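As a quick numerical check of Eq. (9.12), an off-the-shelf solver can be used. The sketch below relies on the e1071 package, which is an assumption of this illustration (the chapter's own listings do not use it); a large cost value approximates the hard-margin (separable) case, and the data are the five labeled points used later in the coding examples of this section:

    # A numerical check of Eq. (9.12) using the e1071 package (assumed installed;
    # not used in the chapter's listings). A large cost approximates the hard margin.
    library(e1071)

    x <- cbind(c(5.5, 5.7, 6.1, 4.5, 4.3),     # the five labeled points used later
               c(1.5, 0.5, 1.1, -1.8, -1.5))   # in the coding examples
    y <- factor(c(1, 1, 1, -1, -1))

    model <- svm(x, y, kernel = "linear", cost = 1e4, scale = FALSE)

    w     <- t(model$coefs) %*% model$SV       # slope vector w
    gamma <- -model$rho                        # intercept gamma
    w; gamma                                   # the sign may be flipped, depending on label ordering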

We can now extend this optimization problem to a multidimensional data domain with a complete matrix representation as follows [8]:

$$\displaystyle{ \begin{array}{llll} &\mathop{\text{Minimize:}}\limits_{\mathbf{w},\gamma }&&\frac{\vert \vert \mathbf{w}\vert \vert ^{2}} {2} \\ & \text{subject to:} &&\mathbf{s}(\mathbf{w}{\boldsymbol x'} +\gamma \mathbf{ I}) \geq \mathbf{ I}\end{array} }$$
(9.13)

We can call the above “Minimize” term the svm-measure and the “subject to” term the label error. In this equation, \(\mathbf{x}\) represents the matrix of the n points in the data domain D, and \(\mathbf{s}\) is the set that represents the response variables of \(\mathbf{x}\). The matrix I is the identity matrix, and γ is the intercept of the straight line (or the hyperplane). Three coding examples are designed to help you understand the svm-based optimization problem presented in Eq. (9.13). These examples are based on: (1) two points and a single line, (2) two points and three lines, and (3) five points and three lines, which will help you extend the idea to a generalized svm-based domain division.

1.1.2 A Coding Example: Two Points, Single Line

The main objective of this coding example is to illustrate the first iterative step of the svm-based optimization problem presented in Eq. (9.13). In Listing 9.1, a coding example is given to illustrate the problem of dividing the data domain linearly without applying any optimization mechanism. It is written in the R programming language, and it is expected that this example will help you build the concepts of the support vector machine. In this example, two-class points \(\mathbf{x}_{1}\) = (5.5, 1.5) and \(\mathbf{x}_{2}\) = (4.5, −1.8) are considered, as illustrated in the first figure of Fig. 9.1 on a two-dimensional data domain. The goal is to find a straight line that separates the points.

Fig. 9.1
figure 1

The results of the “two point, straight line” coding example in Listing 9.1

Listing 9.1 An R programming example—a svm-based domain division

The block of code from line 4 to line 11 sets the parameters for a straight line, which could be the svm-based classifier. Two parameter values, −10 and 17, are selected for the intercept parameter of the straight line. The code in lines 14 and 18 selects the two points and assigns their labels, respectively. The code in line 26 selects the index of the minimum error and is used for selecting the weight and intercept values, as shown in lines 35 and 36. In line 22, the label error is determined based on the straight line defined earlier, and the svm-measure is calculated in line 29.

The slope and the intercept are calculated as shown in lines 35 and 36. The block of code in lines 38–46 produces the figures in Fig. 9.1. The first figure corresponds to the intercept value selected in line 9, and the second figure corresponds to line 10 when it is uncommented. The program statements are written sequentially to help you understand the mathematical processes for the support vector machine presented at the beginning of this chapter. Comparing the two choices of parameters, we can see that the first set provides a better domain division than the second set for svm-based classification; it also illustrates the effect of the intercept parameter.
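The following condensed sketch illustrates the same computation; it is an illustration only: its line numbers do not match those of Listing 9.1, and the slope (2, 3) is an assumed value (the text above fixes only the intercepts −10 and 17).

    # A condensed sketch of the "two points, single line" computation
    # (illustration only; line numbers do not match Listing 9.1).
    w     <- c(2, 3)    # assumed slope parameters
    gamma <- -10        # candidate intercept (the text also considers 17)

    x <- rbind(c(5.5, 1.5), c(4.5, -1.8))    # the two labeled points
    y <- c(1, -1)                            # their class labels

    label.error <- 1 - y * (x %*% w + gamma)    # Eq. (9.9) for each point
    svm.measure <- sum(w^2) / 2                 # the "Minimize" term of Eq. (9.13)
    print(label.error); print(svm.measure)

    # Plot the points and the candidate line w1*x1 + w2*x2 + gamma = 0
    plot(x, col = ifelse(y == 1, "blue", "red"), pch = 19,
         xlim = c(3, 7), ylim = c(-3, 3), xlab = "feature 1", ylab = "feature 2")
    abline(a = -gamma / w[2], b = -w[1] / w[2])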

1.1.3 A Coding Example: Two Points, Three Lines

The main objective of the coding example in Listing 9.2 is to illustrate the iterative steps that lead to the optimization of the svm-based classification problem presented in Eq. (9.13). However, the iterative steps are shown sequentially in the program so that you can understand the algorithm better. As an exercise, once you understand the algorithm, make the program efficient using loops and functions.

Listing 9.2 An R programming example—the svm-based optimization problem

The first iteration with the first set of weights is presented in the block of code from line 6 to line 47, and it reflects the code presented in Listing 9.1. Similarly, with the other sets of weights, iterations 2 and 3 are presented in the blocks of code from line 51 to line 90 and from line 94 to line 133, respectively. In the block of code from line 136 to 138, the svm-measures are displayed, and we can then select the one with the smallest value as the best classifier. Thus this program calculates the label errors for the straight line equations \(y = 2x_{1} + 3x_{2} - 10\), \(y = 2.1x_{1} + 4.1x_{2} - 10\), and \(y = 1.9x_{1} + 3.1x_{2} - 8\), and the svm-measures, to determine the straight line that minimizes the svm-measure. It also produces the graph in Fig. 9.2, in which we can see the three classifiers and the best among them.
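A loop-based version of these iterative steps (as suggested in the exercise above) might look like the following sketch; it illustrates the idea and does not reproduce Listing 9.2 line by line.

    # A loop-based sketch of the "two points, three lines" selection
    # (illustration only; it does not reproduce Listing 9.2 line by line).
    x <- rbind(c(5.5, 1.5), c(4.5, -1.8))
    y <- c(1, -1)

    lines.mat <- rbind(c(2.0, 3.0, -10),    # w1, w2, gamma of each candidate line
                       c(2.1, 4.1, -10),
                       c(1.9, 3.1,  -8))

    svm.measure <- numeric(nrow(lines.mat))
    feasible    <- logical(nrow(lines.mat))

    for (i in 1:nrow(lines.mat)) {
      w     <- lines.mat[i, 1:2]
      gamma <- lines.mat[i, 3]
      e     <- 1 - y * (x %*% w + gamma)    # label errors, Eq. (9.9)
      feasible[i]    <- all(e <= 0)         # constraint of Eq. (9.12)
      svm.measure[i] <- sum(w^2) / 2        # objective of Eq. (9.12)
    }

    best <- which.min(ifelse(feasible, svm.measure, Inf))
    lines.mat[best, ]                       # the line with the smallest svm-measure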

Fig. 9.2
figure 2

A possible classifier for two points

1.1.4 A Coding Example: Five Points, Three Lines

The main objective of the coding example in Listing 9.3 is to generalize the iterative steps that lead to optimization in the svm-based optimization problem presented in Eq. (9.13) using a matrix formulation. This example reads a file “file3.txt,” which contains the data points and class labels. The file contains three columns, where the first two columns represent the two features and the third column represents the class labels. It has two features \(f_1\) and \(f_2\) with values \(f_{1} =\{ 5.5,5.7,6.1,4.5,4.3\}\) and \(f_{2}=\{1.5,0.5,1.1,-1.8,-1.5\}\), and their corresponding label set \(L=\{1,1,1,-1,-1\}\).

Listing 9.3 An R programming example—linear svm-based classifiers

This program produces the figure presented in Fig. 9.3, in which we can see the five points in the data domain with two classes separated by the same three straight lines considered previously. The program also produces the svm-measures, which are calculated in line 37 and displayed in line 64. The difference between Listing 9.3 and Listings 9.1 and 9.2 is that the calculations use the matrix form rather than iterative steps; hence, the diagonal values of the matrix variable “measure.m” are the svm-measures for the three lines considered.
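In matrix form, the same selection over the five points can be sketched as follows; this is a simplified illustration that echoes, but does not reproduce, Listing 9.3.

    # A matrix-form sketch of the "five points, three lines" selection
    # (simplified; it echoes, but does not reproduce, Listing 9.3).
    x <- cbind(f1 = c(5.5, 5.7, 6.1, 4.5, 4.3),
               f2 = c(1.5, 0.5, 1.1, -1.8, -1.5))
    y <- c(1, 1, 1, -1, -1)

    W     <- rbind(c(2.0, 3.0), c(2.1, 4.1), c(1.9, 3.1))    # slopes of the three lines
    gamma <- c(-10, -10, -8)                                 # their intercepts

    E <- 1 - y * sweep(x %*% t(W), 2, gamma, "+")    # label errors for all points and lines

    measure.m   <- (W %*% t(W)) / 2    # its diagonal holds ||w||^2 / 2 for each line
    svm.measure <- diag(measure.m)

    feasible <- apply(E <= 0, 2, all)
    best     <- which.min(ifelse(feasible, svm.measure, Inf))
    c(W[best, ], gamma[best])          # parameters of the selected classifier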

Fig. 9.3
figure 3

A possible classifier for five points

1.2 Linear Classifier: Nonseparable Linearly

In the above section we studied the classification problem of separable classes. If the classes are nonseparable, but the overlap is at an acceptable level, then a slack variable that accounts for the misclassified points must be introduced into the optimization problem described in Eq. (9.12). This leads to the following formulation [7]:

$$\displaystyle{ \begin{array}{llll} &\mathop{\text{Minimize:}}\limits_{\mathbf{w},\gamma,\zeta \geq 0}&&\frac{\vert \vert \mathbf{w}\vert \vert ^{2}} {2} +\epsilon (\zeta ) \\ &\text{subject to:} &&\mathbf{s}(\mathbf{w}{\boldsymbol x'} +\gamma \mathbf{ I})+\zeta \geq \mathbf{ I}\end{array} }$$
(9.14)

where the new variable ζ is called the slack variable, and it describes the acceptance of false positive and false negative errors in the classification results. By incorporating this error variable into the optimization goal of the support vector machine, we can obtain an acceptable classifier even when the classes overlap.
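For a concrete reading of Eq. (9.14), assume the common penalty choice \(\epsilon (\zeta ) = C\sum _{i}\zeta _{i}\) (an assumption of this sketch); the slack of each point is then the amount by which it violates the margin constraint for a fixed candidate classifier:

    # Slack variables for a fixed candidate classifier (w, gamma), assuming the
    # common penalty choice epsilon(zeta) = C * sum(zeta); a sketch, not a solver.
    x <- rbind(c(5.5, 1.5), c(5.7, 0.5), c(6.1, 1.1), c(4.5, -1.8), c(4.3, -1.5))
    y <- c(1, 1, 1, -1, -1)
    w <- c(2, 3); gamma <- -10; C <- 1

    zeta      <- pmax(0, 1 - y * (x %*% w + gamma))    # zero for points outside the margin
    objective <- sum(w^2) / 2 + C * sum(zeta)          # penalized objective of Eq. (9.14)
    print(zeta); print(objective)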

2 Lagrangian Support Vector Machine

The Lagrangian support vector machine may be conceptualized as matrix expansions and matrix multiplications. The paper [9] by Mangasarian and Musicant provides the mathematical modeling of this approach with a detailed explanation. It is mathematically intensive; therefore, the approach is simplified in this section using matrix expansions and multiplications together with a conceptualized example.

2.1 Modeling of LSVM

The modeling of the Lagrangian support vector machine can be easily understood if you have a clear understanding of the support vector machine theory presented in Chap. 7 and in the earlier sections of this chapter. The Lagrangian support vector machine may be explained based on the details in [7, 9]; however, adopting the following optimization problem proposed by Dunbar [10] as a new formulation (called \(L_{1} + L_{2} -\mathrm{SVM}\)) can help with a better implementation:

$$\displaystyle{ \begin{array}{llll} &\mathop{\text{Minimize:}}\limits_{\mathbf{W},\gamma,\hat{\zeta }\geq 0}&&\frac{{\boldsymbol W'}{\boldsymbol L_{1}}\mathbf{W}} {2} + \frac{\gamma ^{2}} {2} + \frac{\lambda _{2}} {2}\hat{\zeta }'\hat{\zeta } \\ & \text{subject to:} &&\mathbf{S}(\mathbf{X}\mathbf{W} +\hat{ I}\gamma )+\hat{\zeta } \geq \hat{ I} \end{array} }$$
(9.15)

where

$$\displaystyle{ \mathbf{W} = \left [\begin{array}{*{10}c} \mathbf{w}\\ \mathbf{v} \end{array} \right ];{\boldsymbol L_{1}} =\lambda _{1}\left [\begin{array}{*{10}c} \mathbf{I}&{\boldsymbol 0}\\ {\boldsymbol 0} &{\boldsymbol 0} \end{array} \right ];\mathbf{S} = \left [\begin{array}{*{10}c} \mathbf{s}&{\boldsymbol 0}\\ {\boldsymbol 0} &\mathbf{I} \end{array} \right ];{\boldsymbol{\hat{I}}} = \left [\begin{array}{*{10}c} \mathbf{I}\\ {\boldsymbol 0} \end{array} \right ]; }$$
(9.16)
$$\displaystyle{ \mathbf{X} = \left [\begin{array}{*{10}c} \mathbf{x} &{\boldsymbol 0}\\ \mathbf{I} &{\boldsymbol 0} \\ {\boldsymbol -I}&{\boldsymbol 0} \end{array} \right ];{\boldsymbol{\hat{\zeta }}} = \left [\begin{array}{*{10}c} {\boldsymbol \zeta }\\ \mathbf{v} \\ 0 \end{array} \right ]. }$$
(9.17)

It can be considered the generalized model of the support vector machine presented in Eq. (9.14). For example, if you substitute the value 1 for \(\lambda_1\) and \(\lambda_2\) (with appropriate matrix dimensions), then you obtain the same optimization model as the one presented in Eq. (9.14).

2.2 Conceptualized Example

The optimization model \(L_{1} + L_{2} -\mathrm{SVM}\) proposed in [10] and presented above may be simplified for its implementation by the process diagram with data flow illustrated in Figs. 9.4 and 9.5. These figures illustrate the approach using a simple example with two classes (labeled 1 and −1), five data points {(5.5, 1.5), (5.7, 0.5), (6.1, 1.1), (4.5, −1.8), (4.3, −1.5)}, \(\lambda_1 = 0.95\), and \(\lambda_2 = 1\). Let us begin our explanation with the “Start” at the top of the diagram. The data table is first divided into the data domain and the response set. The data domain is then processed via the left side of the diagram, the response set is processed via the right side, and the results are combined to generate the input to the Mangasarian and Musicant code, which provides the results in the bottom right-hand corner of the figure. The step that produces the final weights for the slope and the intercept of the classifier is illustrated in Fig. 9.5. The process in Figs. 9.4 and 9.5 reflects the code in Listing 9.4, and this conceptualized example provides a simple visual tool for understanding the \(L_{1} + L_{2} -\mathrm{SVM}\) code in Listing 9.4.

Fig. 9.4
figure 4

The process diagram with data flow to develop the classifier based on \(L_{1} + L_{2} -\mathrm{SVM}\) and the Mangasarian and Musicant pseudo code in [9]

Fig. 9.5
figure 5

Calculation of the final output of the process diagram presented in Fig. 9.4

2.3 Algorithm and Coding of LSVM

The code in Listing 9.4 is written in R programming language, based on the process diagram with the data flow in Fig. 9.4, which was developed based on the \(L_{1} + L_{2} -\mathrm{SVM}\) formulation proposed by Dunbar [10] and the pseudo code presented by Mangasarian and Musicant in [9].

Listing 9.4 An R programming example—implementation of LSVM

The output of this program is presented in Fig. 9.6. It shows the scatter plot of the input data in file “file3.txt” and the support vector machine classifier calculated by this program. We can easily agree with this linear classification result.
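For orientation, the core iteration described by Mangasarian and Musicant in [9] can be sketched in a few lines of R. This is a simplified rendering of the standard LSVM iteration, not the book's Listing 9.4, and it omits the \(L_{1} + L_{2}\) extension of [10]:

    # A simplified sketch of the core LSVM iteration from [9]
    # (not Listing 9.4; the L1 + L2 extension of [10] is omitted).
    A  <- rbind(c(5.5, 1.5), c(5.7, 0.5), c(6.1, 1.1), c(4.5, -1.8), c(4.3, -1.5))
    d  <- c(1, 1, 1, -1, -1)
    nu <- 1; alpha <- 1.9 / nu; m <- nrow(A); e <- rep(1, m)

    H <- diag(d) %*% cbind(A, -e)        # H = D [A  -e]
    Q <- diag(m) / nu + H %*% t(H)
    u <- solve(Q, e)                     # initial multipliers

    for (i in 1:200) {                   # fixed iteration cap for the sketch
      u.old <- u
      u <- solve(Q, e + pmax((Q %*% u - e) - alpha * u, 0))
      if (sqrt(sum((u - u.old)^2)) < 1e-6) break
    }

    w     <- t(A) %*% (d * u)            # slope of the classifier
    gamma <- -sum(d * u)                 # intercept
    print(w); print(gamma)

Here \(Q = I/\nu + HH'\) with \(H = D[A\;\, -e]\), and the iteration converges for \(0 <\alpha < 2/\nu\), as described in [9].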

Fig. 9.6
figure 6

The implementation of SVM and the classifier

3 Nonlinear Support Vector Machine

We have seen that scatter plots play a major role in classification by facilitating the domain divisions. In a scatter plot, each dimension (i.e., each axis) is defined by a feature, and the space defined by the features is called the vector space. The scatter plot describes the relationship between the features, and thus the correlated and uncorrelated data points can be identified in the vector space. The classification (in other words, the domain division) may be carried out either in a vector space or in a feature space, where the vector space is defined as the space that contains the scatter plot of the original features, and the feature space is defined as the space that contains the scatter plot of the transformed features obtained using kernel functions [3].

3.1 Feature Space

Suppose there are p features \(X_{1},X_{2},\ldots,X_{p}\) and the ith observation is denoted by \(x_{i1},x_{i2},\ldots,x_{ip}\), where \(i = 1,\ldots,n\). Then we can plot these n data points in a p-dimensional space to form a multidimensional scatter plot. This space is the vector space, and it displays both the magnitude and the directional information of the data. This set of p features may be transformed to a new set of d features using a polynomial kernel [3]. The new space is called the feature space and, in general, each data point in the feature space carries information about a single data point in the vector space. The advantage of a feature space is that the nonseparable classes in the vector space may be turned into separable classes with the right choice of a kernel. However, finding a kernel and generating such a transformation are not simple. Formally, the transformation is a mapping

$$\displaystyle\begin{array}{rcl} \phi: R^{p} \rightarrow R^{d}& &{}\end{array}$$
(9.18)

where \(R^{p}\) is the vector space (original domain) and \(R^{d}\) is the feature space, which is high dimensional (generally \(d \gg p\)). It may be possible to find a ϕ that transforms the vector space to a feature space in which the classes are linearly separable. Hence, the support vector machine classifier in the feature space can be written as follows:

$$\displaystyle{ \begin{array}{llll} &\mathop{\text{Minimize:}}\limits_{\mathbf{w},\gamma }&&\frac{\vert \vert \mathbf{w}\vert \vert ^{2}} {2} \\ & \text{subject to:} &&\mathbf{s}(\mathbf{w}\phi ({\boldsymbol x'}) +\gamma \mathbf{ I}) \geq \mathbf{ I} \end{array} }$$
(9.19)

However, \(\phi (\mathbf{x})\) is high dimensional; thus the computation becomes very expensive. This can be tackled using an approach called the kernel trick. Very useful lecture notes on the kernel trick can be found at http://www.cs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf.

3.2 Kernel Trick

The usefulness of the kernel trick technique is explained in this section using the data points shown in Fig. 9.7. This figure shows classes that are not linearly separable. However, if we transform them to a three-dimensional feature space (higher dimensional) using the following transformation, then we can obtain linear separability as shown in Fig. 9.8:

$$\displaystyle{ \phi (u_{1},u_{2}) = (au_{1}^{2},bu_{ 2}^{2},cu_{ 1}u_{2}) }$$
(9.20)
Fig. 9.7
figure 7

It shows that nonlinear classifiers are required to classify these two classes; a circle or an ellipse is needed to separate them

Fig. 9.8
figure 8

This example shows that the classification of nonlinear separable classes is possible in a higher dimensional space called feature space

Therefore, the support vector machine technique can be applied to this higher dimensional space, and a hyperplane can be derived as a classifier. In many real applications, such linear separability may only be achieved in a very high-dimensional space, which makes it infeasible to apply the support vector machine techniques directly. This is where the kernel trick helps. What exactly is the kernel trick? It is explained with the following simple example: let us take two points \((u_{1},u_{2})\) and \((v_{1},v_{2})\) from the two-dimensional space presented in Fig. 9.7. Then we can have their transformed points as follows:

$$\displaystyle\begin{array}{rcl} & & \phi (u_{1},u_{2}) = (au_{1}^{2},bu_{ 2}^{2},cu_{ 1}u_{2}) \\ & & \ \ \phi (v_{1},v_{2}) = (av_{1}^{2},bv_{ 2}^{2},cv_{ 1}v_{2}){}\end{array}$$
(9.21)

Let us now define a new function, which is called the kernel function, as follows:

$$\displaystyle\begin{array}{rcl} k(u,v) =\phi (u_{1},u_{2})\cdot \phi (v_{1},v_{2})& &{}\end{array}$$
(9.22)

It gives us

$$\displaystyle\begin{array}{rcl} k(u,v) = a^{2}u_{ 1}^{2}v_{ 1}^{2} + b^{2}u_{ 2}^{2}v_{ 2}^{2} + c^{2}u_{ 1}v_{1}u_{2}v_{2}& &{}\end{array}$$
(9.23)

If we select c 2 = 2ab, then we can have

$$\displaystyle\begin{array}{rcl} k(u,v) = (au_{1}v_{1} + bu_{2}v_{2})^{2}& &{}\end{array}$$
(9.24)

This can be written in the following matrix form:

$$\displaystyle{ k(u,v) = \left (\left [\begin{array}{*{10}c} u_{1} & u_{2} \end{array} \right ] {\ast}\left [\begin{array}{*{10}c} a&0\\ 0 & b \end{array} \right ] {\ast}\left [\begin{array}{*{10}c} v_{1} \\ v_{2}\end{array} \right ]\right )^{2} }$$
(9.25)

It shows that even if the function ϕ transforms the original data domain to a higher dimensional domain, the product \(\phi \cdot \phi\) can be easily defined based on the data in the original domain. Therefore, we can conclude that the kernel function presented in Eq. (9.22) can be obtained by matrix operations inside the original vector space rather than in the higher dimensional feature space. With the dual form [10] and the kernel function, the support vector machine can be applied in the original space (which is the lower dimension) with the same effect as its application inside the higher dimensional feature space.
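The equality of Eqs. (9.22) and (9.25) can be verified numerically; the following sketch compares the explicit transformation with the kernel value computed in the original two-dimensional space:

    # Numerical check that the kernel of Eq. (9.22) equals the matrix form of Eq. (9.25).
    a <- 1; b <- 1; cc <- sqrt(2 * a * b)    # choose c so that c^2 = 2ab

    phi <- function(u) c(a * u[1]^2, b * u[2]^2, cc * u[1] * u[2])    # Eq. (9.20)

    u <- c(0.8, -0.3); v <- c(-0.5, 1.2)     # two arbitrary two-dimensional points

    k.feature  <- sum(phi(u) * phi(v))                 # dot product in the feature space
    k.original <- (t(u) %*% diag(c(a, b)) %*% v)^2     # Eq. (9.25), in the original space

    all.equal(as.numeric(k.feature), as.numeric(k.original))    # TRUE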

Listing 9.5 An R programming example—Kernel trick example

This R program reads the contents of “file4.txt” and first generates the scatter plot in Fig. 9.7. We can see the need for a nonlinear classifier. A kernel trick code in the rest of the program transforms the data to a higher dimension (in this case, 3D) and generates the plot as shown in Fig. 9.8. We can clearly see a linear separation in the transformed data.
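A self-contained variant of this idea is sketched below; instead of “file4.txt,” it generates similar two-class data (one class inside a circle, one outside) and applies the transformation of Eq. (9.20). Plotting two of the transformed coordinates already reveals the linear separability illustrated in Fig. 9.8:

    # A self-contained sketch of the transformation in Eq. (9.20); synthetic data
    # replace "file4.txt" (inner class inside a circle, outer class outside).
    set.seed(1)
    n     <- 50
    theta <- runif(2 * n, 0, 2 * pi)
    r     <- c(runif(n, 0, 0.8), runif(n, 1.2, 2.0))    # inner and outer radii
    u1 <- r * cos(theta); u2 <- r * sin(theta)
    cls <- rep(c(1, -1), each = n)

    a <- 1; b <- 1; cc <- sqrt(2 * a * b)
    z1 <- a * u1^2; z2 <- b * u2^2; z3 <- cc * u1 * u2    # transformed coordinates

    par(mfrow = c(1, 2))
    plot(u1, u2, col = ifelse(cls == 1, "blue", "red"), pch = 19,
         main = "original space")       # not linearly separable
    plot(z1, z2, col = ifelse(cls == 1, "blue", "red"), pch = 19,
         main = "transformed space")    # separable: inner class has z1 + z2 < 1
    abline(a = 1, b = -1, lty = 2)      # the separating line z1 + z2 = 1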

The explicit transformation generally increases the dimensionality and, in turn, the computational time; the kernel trick avoids much of this cost by computing the kernel values in the original space.

3.3 SVM Algorithms on Hadoop

Big data classification requires the support vector machine to be implemented on a system like RHadoop, which provides a distributed file system and the R programming framework. As we recall, this framework provides the mapper(), reducer(), and mapreduce() functions. Therefore, MapReduce programming on the Hadoop distributed file system allows the implementation of svm-based algorithms either inside the mapper() function or inside the reducer() function. Two examples are considered in this section: the first example adopts the five points, three lines svm-based example discussed earlier and implements it inside the reducer() function, and the second example implements the LSVM algorithm inside the mapper() function and illustrates the conceptualized example presented previously.

3.3.1 SVM: Reducer Implementation

Once again, the \(L_{1} + L_{2} -\mathrm{SVM}\) formulation proposed by Dunbar [10] and the pseudo code presented by Mangasarian and Musicant in [9] have been used in this implementation. The RHadoop system requires a number of environment variables [11]; therefore, they are included in the program from lines 4 to 6 in Listing 9.6. They provide the paths to the MapReduce home (to access the necessary libraries), to the Hadoop command (for program execution), and to the streaming jar file on the Linux system. In line 8, the data is uploaded into the R environment. The implementation on RHadoop requires two libraries [12, 13], rmr2 and rhdfs, and they are included in lines 10 and 11.

Listing 9.6 An RHadoop example—LSVM as a reducer() function

We should initialize the Hadoop environment and feed the data to it, and these tasks are presented in lines 13 and 14. Once these tasks are performed, we can define the mapper() and reducer() functions and then input them to the MapReduce model. The mapper() function is defined from line 16 to 21, and it creates a (key, value) pair from the input data. The integer value of 1 is used as the key (see line 17) because a single file is processed, and the features in the first and second columns (v[, 1], v[, 2]) of the file are used as the values (see line 18) in the key-value pair presented in line 20. The reducer() function accepts the (key, value) pair and uses the values to find the classifier (i.e., the slope and the intercept) that gives the minimum measure adopted in the svm's optimization approach [see Eq. (9.13)]. These steps are in the code from lines 23 to 86. Then these optimal parameters are tagged with the key (see line 85). Each block of code in this program is commented so that it is self-explanatory. The block of code from line 58 to line 78 produces the scatter plot and the straight lines (possible svm-based classifiers) presented in Fig. 9.9. Because this program is executed inside RHadoop, it saves this result as an Rplots.pdf file. The MapReduce model in lines 88–90 then assigns the parameters to the variable called “classify.” These data processing tasks occur inside the Hadoop environment, and the results must be transferred out of the Hadoop environment, as done with the command in line 92.
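The overall structure just described can be summarized with the following skeleton. It is a sketch only: its line numbers and details do not match Listing 9.6, the environment variable paths are placeholders that must be adjusted to your installation, and a working rmr2/rhdfs setup is assumed.

    # Structural skeleton of the reducer() implementation (a sketch, not Listing 9.6).
    Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")                       # placeholder paths;
    Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop/streaming.jar")   # adjust to your system

    library(rmr2); library(rhdfs)
    hdfs.init()

    data  <- as.matrix(read.table("file3.txt"))    # features in columns 1-2, labels in column 3
    input <- to.dfs(data)

    mapper <- function(k, v) keyval(1, v)          # single key: all rows go to one reducer

    reducer <- function(k, v) {
      x <- v[, 1:2]; y <- v[, 3]
      W     <- rbind(c(2.0, 3.0), c(2.1, 4.1), c(1.9, 3.1))    # candidate slopes
      gamma <- c(-10, -10, -8)                                 # candidate intercepts
      E <- 1 - y * sweep(x %*% t(W), 2, gamma, "+")            # label errors, Eq. (9.9)
      svm.measure <- rowSums(W^2) / 2
      svm.measure[!apply(E <= 0, 2, all)] <- Inf               # discard infeasible lines
      best <- which.min(svm.measure)
      keyval(k, c(W[best, ], gamma[best]))                     # slope and intercept of the best line
    }

    classify <- from.dfs(mapreduce(input = input, map = mapper, reduce = reducer))
    classify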

Fig. 9.9
figure 9

It shows the implementation of SVM and the classifier

3.3.2 LSVM: Mapper Implementation

The mapper implementation of the \(L_{1} + L_{2} -\mathrm{SVM}\) formulation proposed by Dunbar [10] is presented next, and it uses the pseudo code presented by Mangasarian and Musicant in [9]. In the reducer() implementation, the sorted data was obtained from the mapper(), and then the reducer() implemented the LSVM-based approach to derive the weights for the slope and intercept parameters of the svm classifier. In the mapper implementation, by contrast, the LSVM-based approach is implemented in the mapper() function, and the slope and intercept parameters are calculated there and passed to the reducer() function. In both cases, a single key is used. However, multiple keys can be used to take advantage of the parallelization and sorting features of the MapReduce framework.
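The structural difference from the reducer() version can be seen in a skeleton like the one below (again a sketch, not Listing 9.7, and it reuses the environment setup of the previous skeleton). The lsvm() helper is a hypothetical placeholder for the LSVM iteration sketched in the previous section; the classifier parameters are computed inside mapper(), and reducer() simply passes them through under the single key.

    # Structural skeleton of the mapper() implementation (a sketch, not Listing 9.7);
    # it reuses the environment setup of the previous skeleton.
    mapper <- function(k, v) {
      x <- v[, 1:2]; y <- v[, 3]
      params <- lsvm(x, y)      # lsvm() is a hypothetical helper implementing the LSVM iteration
      keyval(1, params)         # slope and intercept are computed inside the mapper
    }

    reducer <- function(k, v) keyval(k, v)    # the reducer only passes the parameters through

    result <- from.dfs(mapreduce(input = to.dfs(as.matrix(read.table("file3.txt"))),
                                 map = mapper, reduce = reducer))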

Listing 9.7 An RHadoop example—LSVM as a mapper() function

The output of this program is presented in Fig. 9.10. It is similar to the one in Fig. 9.6, except that this classifier is obtained using the RHadoop and MapReduce computing tools. The results show a significant similarity; the slopes and the intercepts may be compared numerically to determine their similarity.

Fig. 9.10
figure 10

The implementation of SVM on RHadoop and the classifier

3.4 Real Application

In this real application, the hardwood floor and carpet floor data sets are used. As you recall, these data sets each have 1024 observations with 64 features that correspond to the intensity values of the pixels. To show the performance of the support vector machine implemented in Listing 9.7 in a two-dimensional data domain, features 48 and 49 are selected. The scatter plots of the data sets corresponding to these two features and the support vector machine classifier obtained using the algorithm are presented in Fig. 9.11. We can clearly see the linear classification performance of the Lagrangian support vector machine.

Fig. 9.11
figure 11

The implementation of SVM on RHadoop, and the classifier with hardwood floor and carpet floor data sets

Problems

9.1. Code Revision

Revise the MapReduce programs presented in this chapter using the coding principles taught in Chap. 5.

9.2. Coding Efficiency

  (a)

    Study the programs presented in the listings and draw the structure diagrams, data flow diagrams, and process diagrams based on the software engineering principles.

  (b)

    Study the R programs in the listings and improve their efficiency using coding principles and modularization. Make the programs more efficient using arrays and input files as well.

9.3. Comparison

Discuss the advantages and disadvantages of the mapper() and the reducer() implementations of the svm-based approaches. You may also run these implementations and obtain the system times to support the discussion.

9.4. Split-Merge-Split

  (a)

    Assuming that you have completed the problem presented in “Problem 3.1,” perform the same steps using the RHadoop system with the R programming framework.

  (b)

    Compare the results that you obtained in (a) with the results that you obtained in Problem 3.1.