# Elliptical modeling and pattern analysis for perturbation models and classification


## Abstract

The characteristics of a feature vector in the transform domain of a perturbation model differ significantly from those of its corresponding feature vector in the input domain. These differences—caused by the perturbation techniques used for the transformation of feature patterns—degrade the performance of machine learning techniques in the transform domain. In this paper, we proposed a semi-parametric perturbation model that transforms the input feature patterns to a set of elliptical patterns and studied the performance degradation issues associated with the random forest classification technique using both the input and transform domain features. Compared with linear transformations such as principal component analysis (PCA), the proposed method requires fewer statistical assumptions and is highly suitable for applications such as data privacy and security due to the difficulty of inverting the elliptical patterns from the transform domain to the input domain. In addition, we adopted a flexible block-wise dimensionality reduction step in the proposed method to accommodate the possibly high-dimensional data of modern applications. We evaluated the empirical performance of the proposed method on a network intrusion data set and a biological data set, and compared the results with PCA in terms of classification performance and data privacy protection (measured by the blind source separation attack and signal interference ratio). Both results confirmed the superior performance of the proposed elliptical transformation.

## Keywords

Data privacy · Classification · Dimension reduction · Network intrusion · Perturbation model

## 1 Introduction

Feature vectors carry useful numerical patterns that characterize the original domain (or a sub-original domain—the input domain) formed by the feature vectors themselves. Machine learning algorithms generally utilize these patterns to generate classifiers that can help make decisions from data, using supervised or unsupervised learning techniques [30]. However, certain data science applications, such as data privacy and data security [34], require the alteration of these feature patterns to protect data privacy, so that it is difficult to recover the original patterns from the altered patterns [22]. Perturbation models have been studied and developed for this purpose [11, 24]. A perturbation model generally transforms the feature vectors from an original domain to a new set of feature vectors within a transform domain where data privacy can be protected. On the other hand, the performance of machine learning algorithms can be degraded in the transform domain due to the alterations of the patterns. Hence, significant research has been devoted to developing efficient perturbation models that minimize the degradation of machine learning performance while providing robust protection of data privacy.

Perturbation models may be categorized into two top-level groups: parametric models and nonparametric models. The parametric models may be further divided into two subgroups: vector space (or original domain) models and feature space (or transform domain) models. The vector space models include those proposed by [24], in which the authors showed that their models perform well in the original domain. Alternatively, Oliveira and Zaïane [26] proposed a feature space model constructed using a matrix rotation, and Lasko and Vinterbo [19] also developed a feature space model, but using a spectral analysis. They showed that their proposed techniques performed well in the transform domain.
These types of models make parametric statistical assumptions which in practice can easily be violated for different types of data. As a consequence, the current techniques may not perform as desired. A thorough review was presented in a recent paper by [27], in which the authors summarized the possible types of violations of parametric assumptions, including uncertainty in the marginal distributional properties of independent variables and possible nonlinear relationships that linear models cannot fully explore (e.g., an inverted-U shape [1]). They proposed a nonparametric model based on density ratios to address these problems and reported that nonparametric models can in general perform better than parametric models.

In this paper, we consider a semi-parametric perturbation model in the sense that no parametric assumptions are imposed on the marginal distribution of features, while the machine learning model that combines transformed variables is indeed parametric. The main idea is to construct a transform domain (or feature space) from the original domain using parametrized elliptical patterns, with the goals of making the restoration of the original patterns very difficult while maintaining a similar performance for the machine learning algorithms in both the original and the transform domains. Our proposed approach, elliptical pattern analysis (EPA), sets the criteria on privacy strength based on the blind source separation attack [35], because it uses the mutual interaction between variables to construct the transform domain.

Our key contribution includes the use of the mutual interaction between two variables (or features); however, this type of aggregation may jeopardize the performance of classification algorithms through the loss of some of the data characteristics (or patterns). To solve this problem, we propose an additional data aggregation step through random projection in the feature space before applying any machine learning algorithms. The main idea is to search over possible ways to combine pairs (or blocks) of variables to achieve efficient dimension reduction while maintaining useful predictive information for later-stage machine learning algorithms. In particular, we consider classification algorithms and use random forest classification on the reduced feature space. By aggregating feature variables, the proposed method significantly enhances the protection of data privacy and reduces computational cost.

## 2 A perturbation model

We define the proposed EPA approach as a model that transforms a sub-original domain (input domain) through a systematic perturbation process such that the feature vector is altered in the transform domain to achieve a set of specific recommended goals—the goals that lead to the protection of data privacy and the generation of classifiers. In this section, the perturbation model is defined using a mathematical transformation (*T*) and recommended quantitative measures for quantifying the strength of data privacy (\(\rho \)) and the misclassification error (\(\eta \) or \(\zeta \)).

### 2.1 Mathematical definition

Suppose \(\mathbf x \) is a feature vector with dimension *p* in the input domain *X*, and \(\mathbf y \) is its perturbed feature vector with dimension *q* (where \(q < p\)) in the transform domain *Y*; then we define the mathematical relationship between \(\mathbf x \) and \(\mathbf y \) as described in the following equation:

$$\mathbf y = T(\mathbf x ).$$

The transformation *T* defines the proposed perturbation model, and its intention is to satisfy the condition \(\rho (\mathbf x ,\mathbf y )> \epsilon _0\) for some quantitative measure \(\rho \). In other words, this condition describes the difficulty of recovering the feature vector \(\mathbf x \) from the feature vector \(\mathbf y \) given the transformation *T* and the quantitative measure \(\rho \). One of the real-world data science applications that fits this type of modeling is data privacy, where the owner of the data wants to share the data with an intended user while its privacy is protected, provided the transformation *T* and the measure \(\rho \) are chosen appropriately.

### 2.2 Problem definition

Suppose a machine learning technique *M*, with performance measure \(\eta \), is applied in both the input domain and the transform domain; then the performance degradation of the perturbation model *T* can be defined as follows:

$$\zeta _\mathrm{T} = \eta (\mathbf y ) - \eta (\mathbf x ).$$

The machine learning technique *M* that we consider in this paper is a classification technique—in particular the random forest technique—with the misclassification error (MCE) as the performance measure \(\eta \).

## 3 The proposed methodology

The proposed methodology consists of two major components: a perturbation model *T* with its condition measure \(\rho \), and an application *M* with its performance degradation measure \(\eta \). They are presented in this section with a detailed discussion.

### 3.1 Elliptical perturbation model

Suppose the feature vector \(\mathbf x \) is formed by *p* variables (or features), \(x_1, x_2, \dots , x_p \ge 0\). We also assume *p* is an even integer without loss of generality. We use the proposed perturbation on consecutive pairs of variables (tuples): \((x_1,x_2)\), \((x_3,x_4)\), \(\dots \), \((x_{p-1},x_p)\) to generate the feature vector \(\mathbf y \), represented by the new variables \(y_1,y_2,\ldots ,y_q\); \(q=p/2\), respectively. Taking the tuple \((x_1,x_2)\) as an example, we consider

$$y_1 = \sqrt{a x_1^2 + b x_2^2} + \alpha \varepsilon ,$$

where *a* and *b* are unknown parameters, \(\varepsilon \sim N(0, 1)\), and \(\alpha \) determines the strength of noise degradation. To further simplify the process, we can assume \(a+b=1\) and \(a,b \ge 0\). The model reduces to the standard linear model when \(a \rightarrow 0\) or \(a \rightarrow 1\). The nonlinear transformation \(\sqrt{a x_1^2 + bx_2^2}\) defines the elliptical perturbation model and describes the nonlinear mutual interaction between the feature variables \(x_1\) and \(x_2\).

On the one hand, we can choose the value for *a* such that the classification results using \(\mathbf y \) and \(\mathbf x \) are significantly close to each other (i.e., \(\zeta _\mathrm{T} \sim 0\)). On the other hand, we can choose *a* to minimize the absolute value of correlations between \(y_1\) and \((x_1,x_2)\). Meanwhile, noise strength \(\alpha \) will be tuned to achieve the intended goal (e.g., data privacy determined by \(\rho \)) of the perturbation model. In the process of building the model, we will use this correlation-minimization to tune the model parameter *a*.
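The pairwise transformation and the correlation-based tuning of *a* can be sketched as follows. This is a minimal illustration under the \(a+b=1\) simplification; the function names, the search grid, and the default values are ours, not the paper's (and \(\alpha \) is tuned separately for privacy strength, as described above).

```python
import numpy as np

def elliptical_transform(X, a, alpha, rng=None):
    """Apply y_i = sqrt(a_i*x_{2i-1}^2 + (1-a_i)*x_{2i}^2) + alpha_i*eps
    to consecutive feature pairs; X is (n, p) with p even, entries >= 0."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    a = np.asarray(a, dtype=float)
    pairs = X.reshape(n, p // 2, 2)                       # consecutive tuples
    core = np.sqrt(a * pairs[:, :, 0] ** 2 + (1.0 - a) * pairs[:, :, 1] ** 2)
    return core + np.asarray(alpha) * rng.standard_normal((n, p // 2))

def tune_a(x1, x2, grid=None):
    """Choose a in (0, 1) minimizing the larger absolute correlation
    between the noiseless core and each of x1, x2 (the tuning rule above)."""
    grid = np.linspace(0.01, 0.99, 99) if grid is None else grid
    scores = []
    for a in grid:
        y = np.sqrt(a * x1 ** 2 + (1.0 - a) * x2 ** 2)
        scores.append(max(abs(np.corrcoef(y, x1)[0, 1]),
                          abs(np.corrcoef(y, x2)[0, 1])))
    return float(grid[int(np.argmin(scores))])
```

A grid search suffices here because *a* lives on the unit interval; any one-dimensional optimizer would serve equally well.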

### 3.2 Elliptical patterns visualization

We can visualize the elliptical patterns generated by the proposed model by fixing *y* to a single value and varying the values of the parameters *a*, *b*, and \(\alpha \). For simplicity, we have selected \(y=1\), and a set of values (0.22, 0.78, 0.03), (0.32, 0.68, 0.04), and (0.1, 0.9, 0.05) for the parameters *a*, *b*, \(\alpha \), respectively. The model in Eq. (3), with these values, provides the three elliptical patterns with interference characteristics as illustrated in Fig. 1. In order to generate these elliptical patterns, we transform Eq. (3) as follows:

$$x_2 = \sqrt{\frac{(y - \alpha \varepsilon )^2 - a x_1^2}{b}},$$

where the interference between the patterns is governed by the noise strength \(\alpha \) and the parameter *a*. The measure of this interference will help to determine parameters of the model for the protection of data privacy. We treat this interference conceptually as signal interference, and then apply blind signal separation approaches [35] to determine the strength of data privacy.
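The patterns of Fig. 1 can be reproduced in outline from the inverted relation above. This is a sketch only: the sampling scheme (uniform \(x_1\) over the valid range) and the name `elliptical_pattern` are our illustrative choices, not the paper's code.

```python
import numpy as np

def elliptical_pattern(a, b, alpha, y=1.0, n=500, rng=None):
    """Sample (x1, x2) points satisfying a*x1^2 + b*x2^2 = (y - alpha*eps)^2."""
    rng = np.random.default_rng() if rng is None else rng
    r = y - alpha * rng.standard_normal(n)         # noisy ellipse "radius"
    x1 = rng.uniform(0.0, np.abs(r) / np.sqrt(a))  # keep x2 real: a*x1^2 <= r^2
    # clip guards against tiny negative values from floating-point rounding
    x2 = np.sqrt(np.clip((r ** 2 - a * x1 ** 2) / b, 0.0, None))
    return x1, x2

# The three parameter sets used for Fig. 1
patterns = [elliptical_pattern(a, b, al)
            for a, b, al in [(0.22, 0.78, 0.03), (0.32, 0.68, 0.04), (0.1, 0.9, 0.05)]]
```

Plotting the three `(x1, x2)` clouds on one set of axes shows the overlapping noisy arcs whose mutual interference the text discusses.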

### 3.3 Blind source separation

### 3.4 Random forest classification

Among the many classification techniques in machine learning, we have selected the random forest technique [4] for our research because of its ability to address multi-class classification problems better than many other machine learning techniques, including support vector machines [15, 31] and decision trees [25]. Random forest classifiers divide the data domain efficiently using a bootstrapping technique—to generate random decision trees—and the Gini index—to split the tree nodes. Hence the technique is highly suitable for the classification of a large and imbalanced data set with many features.

### 3.5 Misclassification and OOB errors

Several measures have been used to quantify the performance of classification techniques in machine learning; among them, the out-of-bag (OOB) error and the misclassification error are the most commonly used for random forest classifiers [3]. The OOB error is defined as the ratio between the total number of misclassified items in a set and the total number of items in the set. Similarly, the misclassification error of a class is defined as the ratio between the number of misclassified items in the class and the total number of items in the class. We have used both of these quantitative measures to evaluate the performance of the random forest classification algorithm in the input domain as well as in the transform domain with the proposed perturbation model, and compared the simulation results.
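Both measures can be computed from a fitted forest's out-of-bag votes. The sketch below is our illustration using scikit-learn (not the paper's code); it returns the set-level OOB error and, per class, the (correctly classified, misclassified) counts in the format used in the later tables.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_errors(X, y, n_trees=100, seed=0):
    """Fit a random forest; report the OOB error and, for each class,
    the (correct, misclassified) counts based on out-of-bag votes."""
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=seed).fit(X, y)
    oob_error = 1.0 - rf.oob_score_                  # single set-level measure
    oob_pred = rf.classes_[rf.oob_decision_function_.argmax(axis=1)]
    per_class = {}
    for c in np.unique(y):
        mask = y == c
        mis = int(np.sum(oob_pred[mask] != c))       # misclassified in class c
        per_class[c] = (int(mask.sum()) - mis, mis)
    return oob_error, per_class
```

The per-class misclassification error is then `mis / (correct + mis)` for each class, matching the definition above.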

### 3.6 Theory

To better understand the theoretical properties of the proposed random projection idea, we study the asymptotic property of the proposed method and show the following lemma.

## Lemma 1

Suppose that the number of observations *n* and the number of predictors *p* both go to \(\infty \). Assume that there are only a fixed number of predictors related to the output, denoted by \(p_0\), and that \(p = m B\), where *m* is the fixed block size and *B* is the number of blocks that goes to \(\infty \). Then with probability tending to one, our method will manage to select the true set of predictors. In other words, the proposed method achieves variable selection consistency.

Lemma 1 states that the proposed random projection idea enjoys the nice variable selection consistency property in an asymptotic sense. This result is desired as it provides a theoretical justification for the proposed method: the randomly generated block structure manages to capture the unknown true set of predictors with probability tending to one. The assumptions we make here are standard and commonly used in both the statistics and machine learning literature, e.g., [10]. The proof follows by standard calculation.

## Proof

Note that the proposed method essentially employs a quadratic transformation model within each block. If a block contains only one true predictor, then that predictor will naturally be recovered from the model. In other words, it suffices to show that, with probability tending to one, each block from the proposed random projection iteration contains at most one true predictor.

Since there are *p* predictors in total, there are *p*! possible permutations. Out of these permutations, the number of arrangements having at most one true predictor in each block is given by \(B (B-1) \cdots (B-p_0+1) m^{p_0} (p-p_0)!\). By taking the ratio, we obtain the probability as

$$\frac{B (B-1) \cdots (B-p_0+1)\, m^{p_0}\, (p-p_0)!}{p!} = \prod _{j=0}^{p_0-1} \frac{(B-j)\,m}{p-j} \longrightarrow 1$$

as \(B \rightarrow \infty \), since \(p = mB\) and \(p_0\) is fixed. This completes the proof. \(\square \)
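As a sanity check (ours, not from the paper), the product form of this combinatorial ratio can be evaluated numerically to confirm that it approaches one as the number of blocks grows, for fixed block size and a fixed number of true predictors:

```python
from math import prod

def prob_at_most_one_per_block(B, m, p0):
    """P(no block receives two of the p0 true predictors), for B blocks
    of size m and p = m*B predictors in total.

    Equals B(B-1)...(B-p0+1) * m**p0 * (p-p0)! / p!
         = prod over j of (B-j)*m / (p-j).
    """
    p = m * B
    return prod((B - j) * m / (p - j) for j in range(p0))

# Fixed block size m and number of true predictors p0; B (hence p) grows
vals = [prob_at_most_one_per_block(B, m=2, p0=5) for B in (10, 100, 1000)]
```

With `p0 = 1` the probability is exactly one for any `B`, as the proof's opening observation requires.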

## 4 Experimental results

We studied the performance degradation of random forest classifiers using the proposed elliptical perturbation model and the highly imbalanced NSL-KDD data set (http://www.unb.ca/cic/research/datasets/nsl.html), which we downloaded and used in a previous study [32]. This data set has 25,192 observations with 41 network traffic features and 22 network traffic classes. We labeled the entire feature vector as (\(f_1, f_2, \dots , f_{41}\)) and later reduced it to a lower-dimensional feature vector based on the importance of the features to random forest classification. This data set forms the original domain, and we represent it as “dataset-O”. In this data set, the normal traffic class and the Neptune attack class have large numbers of observations compared to the other attack classes; hence, it provides a highly imbalanced data set that is useful for our analysis.

The network traffic details of this data set presented in Table 1 clearly show the imbalanced nature of the data set, both between the normal and attack traffic classes and among the attack traffic classes. The first 11 traffic classes (labeled 0–10) in this table have at least 30 observations, and the next 11 traffic classes (labeled 11–21) have fewer than 30 observations. One of our goals is to study the effect of the proposed perturbation model on the performance of random forest classifiers using the first 11 traffic classes only; however, we will use the other 11 traffic classes to understand the imbalanced nature of the data and its significance to random forest classification.

### 4.1 Feature selection using random forest

There are 41 features in dataset-O—denoted by (\(f_1, f_2, \dots , f_{41}\)) earlier—and this feature vector determines the dimensionality of the original domain (41); however, not all of these features necessarily contribute to the classification performance of random forest. To prepare the data set for our experiments and select the important features for classification, we first removed the categorical variables (or features) along with the features that overshadow the other features due to outliers. We then applied random forest classification to determine the importance of the features by ordering them based on their misclassification errors.

Following the approach suggested by [36], we removed the least important feature from the feature vector one by one, performing random forest classification repeatedly, until a change in the misclassification error was observed. This process resulted in a lower-dimensional data set with 16 features, (\(f_{33}\), \(f_4\), \(f_{32}\), \(f_6\), \(f_{36}\), \(f_{20}\), \(f_{28}\), \(f_{19}\), \(f_{31}\), \(f_{27}\), \(f_9\), \(f_{29}\), \(f_8\), \(f_{23}\), \(f_{37}\), \(f_{30}\)), in decreasing order of importance. Hence, we reduced the data set to one (\(p=16\)) with the most important features that contribute to random forest classification. For simplicity, we represent these features by (\(x_1, x_2, \dots , x_{16}\)), respectively. Therefore, the dimension of the input domain of the proposed perturbation model is \(p=16\), with 25,192 observations, 16 network traffic features, and 22 network traffic classes. For convenience, let us represent this dimension-reduced data set for the input domain as “dataset-I.”
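The elimination loop can be sketched with scikit-learn's random forest, using the OOB error as the stopping signal. This is a minimal sketch: `backward_select` and the tolerance `tol` are our illustrative choices, not the paper's code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def backward_select(X, y, tol=0.005, seed=0):
    """Drop the least important feature one by one until the OOB error
    changes by more than tol relative to the full-feature baseline."""
    feats = list(range(X.shape[1]))

    def oob_err(cols):
        rf = RandomForestClassifier(oob_score=True, random_state=seed)
        rf.fit(X[:, cols], y)
        return 1.0 - rf.oob_score_, rf.feature_importances_

    base, imp = oob_err(feats)
    while len(feats) > 1:
        # candidate set without the currently least important feature
        trial = [f for f in feats if f != feats[int(np.argmin(imp))]]
        err, trial_imp = oob_err(trial)
        if err - base > tol:          # the removal degraded the error: stop
            break
        feats, imp = trial, trial_imp
    return feats
```

The surviving indices, ordered by their final importances, correspond to the reduced feature vector kept for dataset-I.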

### 4.2 Transform domain pattern analysis

The next step of this experiment is to build the perturbation model, using the dataset-I as the input domain and construct the transform domain so that the random forest classifiers can be evaluated. Due to the pairing of features, multiple elliptical perturbation models were generated by selecting suitable parameters for the model, and they are discussed in the subsections below.

#### 4.2.1 Multiple model generation

#### 4.2.2 Parameter selection for the models

We used a Monte Carlo approach with the JADE implementation of SIR computation to assess the BSS attack empirically. In this implementation, multiple copies of modulated source signals are generated using random weights, and then a SIR value is calculated to determine whether the source signals are recoverable (if the SIR is greater than 20 dB, then the source signals are recoverable; otherwise they are not) from the multiple modulated signals. In our implementation, each feature pair (\(x_{2i-1},x_{2i}\)), \(i = 1, \dots , 8\), is considered as the source signals, and \(y_i\) is considered as their modulated signal. To create multiple copies of the modulated signal \(y_i\) from (\(x_{2i-1},x_{2i}\)), we generated several values for \(a_i\) randomly from a uniform distribution, and then used the Monte Carlo approach to achieve the desired results.

The Monte Carlo approach, combined with the JADE implementation of SIR and the BSS attack, provided us with the three values 0.042, 0.021, and 0.096, which we selected for \(a_1\), \(a_2\), and \(a_3\). To cut down the computational cost of the Monte Carlo approach, we reused them for the remaining parameters as follows: \(a_1=0.042\), \(a_2=0.021\), \(a_3=0.096\), \(a_4=0.042\), \(a_5=0.021\), \(a_6=0.096\), \(a_7=0.042\), and \(a_8=0.021\) for the 8 models, respectively. We obtained the following SIR values for these parameters: 14.289, 10.983, 7.873, 11.483, 11.758, 12.608, 14.675, and 16.235, respectively—values less than 20 dB indicate that source signal separation is difficult; hence, a BSS attack is not feasible. We can also see that each model has a different privacy strength; for example, model \(M_3\) is much stronger than model \(M_8\) against a BSS attack. Therefore, in this step, we generated a data set for the transform domain with 25,192 observations, 8 newly defined traffic features (\(y_i\), \(i=1, \dots , 8\)), and 22 network traffic classes. Let us represent this transform domain data set as “dataset-T”.
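The SIR criterion itself can be illustrated with a simplified projection-based definition. This is a sketch of the metric only, under our own simplification; the paper's JADE implementation performs the full blind separation before the SIR is measured.

```python
import numpy as np

def sir_db(source, estimate):
    """Signal-to-interference ratio (dB): how well `estimate` recovers `source`.

    The estimate is split into a component aligned with the source and a
    residual treated as interference; above ~20 dB the source is considered
    recoverable (i.e., a BSS attack would succeed).
    """
    s = source - source.mean()
    e = estimate - estimate.mean()
    target = (np.dot(e, s) / np.dot(s, s)) * s   # projection onto the source
    interference = e - target                    # everything else
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(interference ** 2))
```

A near-perfect recovery yields a large positive SIR, while an unrelated estimate yields a strongly negative one, which is why 20 dB serves as a practical recoverability threshold.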

### 4.3 Performance degradation evaluation

Statistical information of different traffic types in the NSL-KDD data set—number of observations \(\ge \) 30

Label | Traffic | #Obs. |
---|---|---|

0 | Normal | 13,449 |

1 | Neptune | 8282 |

2 | back | 196 |

3 | Warezclient | 181 |

4 | ipsweep | 710 |

5 | portsweep | 587 |

6 | teardrop | 188 |

7 | nmap | 301 |

8 | satan | 691 |

9 | smurf | 529 |

10 | pod | 38 |

#### 4.3.1 Experiment with full-imbalanced data sets

Input domain: random forest classification results of NSL-KDD data with original features and full-imbalanced data

Label | OOB errors | Misclassification errors |
---|---|---|

Normal | 0.0098 | 0.005 (13,379, 70) |

Neptune | 0.0098 | 0.003 (8256, 26) |

back | 0.0098 | 0.025 (191, 5) |

warezclient | 0.0098 | 0.127 (158, 23) |

ipsweep | 0.0098 | 0.026 (691, 19) |

portsweep | 0.0098 | 0.017 (577, 10) |

teardrop | 0.0098 | 0.010 (186, 2) |

nmap | 0.0098 | 0.086 (275, 26) |

satan | 0.0098 | 0.041 (662, 29) |

smurf | 0.0098 | 0.015 (521, 8) |

pod | 0.0098 | 0.184 (31, 7) |

*OOB error* The OOB errors and misclassification errors are presented in the second and third columns of Tables 2 and 3, respectively. The tables also provide, for each class, the tuples of correctly classified and misclassified numbers of observations in the input domain—denoted by (*idcc*, *idmc*)—and in the transform domain—denoted by (*tdcc*, *tdmc*), respectively. In the tables, the OOB error is calculated as a single measure of the classification performance on the set; thus we have a single value of 0.0098 for the input variables (unprotected features) and 0.0169 for the transformed variables (protected features). If we round these values to two decimal places, we get OOB errors of 0.01 and 0.02—a 1% difference in performance between the input domain and the transform domain. We can see that the perturbation model increases the OOB error only slightly while protecting data privacy.

*Misclassification error* Similarly, comparing the misclassification errors presented in Tables 2 and 3, we observed that the perturbation model yields higher misclassification errors, as expected, showing the characteristics of a perturbation model. The misclassification errors increased for all traffic types except ipsweep, teardrop, and pod. However, the error differences are small; hence, the perturbation model helps achieve both the protection of data privacy and the classification performance of random forest.

Transform domain: random forest classification results of NSL-KDD data with EPA transformed features and full-imbalanced data

Label | OOB errors | Misclassification errors |
---|---|---|

Normal | 0.0169 | 0.009 (13,322, 127) |

Neptune | 0.0169 | 0.009 (8205, 77) |

back | 0.0169 | 0.041 (188, 8) |

warezclient | 0.0169 | 0.232 (139, 42) |

ipsweep | 0.0169 | 0.021 (695, 15) |

portsweep | 0.0169 | 0.063 (550, 37) |

teardrop | 0.0169 | 0.005 (187, 1) |

nmap | 0.0169 | 0.116 (266, 35) |

satan | 0.0169 | 0.063 (647, 44) |

smurf | 0.0169 | 0.045 (505, 24) |

pod | 0.0169 | 0.053 (36, 2) |

Input domain: random forest classification results of NSL-KDD data with original features and reduced-imbalanced data

Label | OOB errors | Misclassification errors |
---|---|---|

Normal | 0.0088 | 0.005 (13,381, 68) |

Neptune | 0.0088 | 0.003 (8253, 29) |

back | 0.0088 | 0.025 (191, 5) |

warezclient | 0.0088 | 0.127 (158, 23) |

ipsweep | 0.0088 | 0.025 (692, 18) |

portsweep | 0.0088 | 0.013 (579, 8) |

teardrop | 0.0088 | 0.010 (186, 2) |

nmap | 0.0088 | 0.093 (273, 28) |

satan | 0.0088 | 0.044 (660, 31) |

smurf | 0.0088 | 0.015 (521, 8) |

pod | 0.0088 | 0.210 (30, 8) |

Transform domain: random forest classification results of NSL-KDD data with EPA transformed features and reduced-imbalanced data

Label | OOB errors | Misclassification errors |
---|---|---|

Normal | 0.0156 | 0.009 (13,322, 127) |

Neptune | 0.0156 | 0.009 (8207, 75) |

back | 0.0156 | 0.040 (188, 8) |

warezclient | 0.0156 | 0.220 (141, 40) |

ipsweep | 0.0156 | 0.022 (694, 16) |

portsweep | 0.0156 | 0.061 (551, 36) |

teardrop | 0.0156 | 0.005 (187, 1) |

nmap | 0.0156 | 0.102 (270, 31) |

satan | 0.0156 | 0.059 (650, 41) |

smurf | 0.0156 | 0.039 (508, 21) |

pod | 0.0156 | 0.053 (36, 2) |

#### 4.3.2 Experiment with reduced-imbalanced data sets

We used dataset-IR and dataset-TR to compare the performance of random forest classifiers in the input and transform domains. That is, only the 11 traffic types with at least 30 observations were classified, to study whether there was any significant effect from eliminating the traffic types with significantly fewer observations. The results are presented in Tables 4 and 5, and we observe similar patterns between the input domain and transform domain results. Comparing the results in Tables 2 and 4, we see that the OOB error decreased slightly due to the reduced-imbalanced nature of the traffic types, as expected. Similarly, comparing the results in Tables 3 and 5, we see a reduction in the OOB error and an overall reduction in the misclassification errors.

### 4.4 Overall performance degradation

*Note that a positive value indicates degradation from the input domain to the transform domain, whereas a negative value indicates an improvement.* The average degradations over all the class types are 1.05% for the full-imbalanced data sets and 0.45% for the reduced-imbalanced data sets; the difference shows that including the additional imbalanced classes affects the performance negatively.

Performance degradation of random forest classifiers over input domain to EPA transformed domain using full-/reduced-imbalanced data

Label (t) | Full-Imb. (\(pd_\mathrm{t}\)) | Reduced-Imb. (\(pd_\mathrm{t}\)) |
---|---|---|

Normal | 0.4238233 | 0.4386943 |

Neptune | 0.6157933 | 0.5554214 |

back | 1.5306122 | 1.5306122 |

warezclient | 10.4972376 | 9.3922652 |

ipsweep | \(-\) 0.5633803 | \(-\) 0.2816901 |

portsweep | 4.5996593 | 4.7700170 |

teardrop | \(-\) 0.5319149 | \(-\) 0.5319149 |

nmap | 2.9900332 | 0.9966777 |

satan | 2.1707670 | 1.4471780 |

smurf | 3.0245747 | 2.4574669 |

pod | \(-\) 13.1578947 | \(-\) 15.7894737 |

Avg. Err. | 1.054483 | 0.4532049 |

Performance degradation of random forest classifiers over input domain to PCA-transformed domain using full-imbalanced data only

Label (t) | Full-Imb. 5PC (\(pd_\mathrm{t}\)) | Full-Imb. 6PC (\(pd_\mathrm{t}\)) |
---|---|---|

Normal | 0.3345974 | 0.1487099 |

Neptune | 0.4829751 | 0.4346776 |

back | 13.7755102 | 10.7142857 |

warezclient | 7.1823204 | 3.3149171 |

ipsweep | 0.4225352 | 0.7042254 |

portsweep | 1.8739353 | 1.7035775 |

teardrop | \(-\) 0.5319149 | \(-\) 0.5319149 |

nmap | 0.9966777 | 0.3322259 |

satan | 0.8683068 | 0.5788712 |

smurf | 5.6710775 | 6.4272212 |

pod | \(-\) 7.8947368 | \(-\) 13.1578947 |

Avg. Err. | 2.107389 | 0.9699002 |

## 5 Comparisons with competing methods

Performance evaluation of DPLR

(Class1, Class 2) | Misclassification error (MCE) | SIR (dB) |
---|---|---|

(Normal, Neptune) | (0.0081, 0.0397) | 22.59660 |

(Normal, Portsweep) | (0.0017, 0.0988) | 28.04944 |

(Neptune, Back) | (0.0024, 0.0000) | 26.24273 |

(Neptune, Smurf) | (0.0050, 0.0245) | 9.55109 |

(Back, Portseep) | (0.0714, 0.0034) | 37.75119 |

(Back, Smurf) | (0.1582, 0.0132) | 3.98875 |

(Teardrop, Satan) | (0.3936, 0.0449) | 18.111127 |

(Smurf, Pod) | (0.0056, 0.0526) | 1.59793 |

Performance evaluation of EPA

(Class1, Class 2) | Misclassification error (MCE) | SIR (dB) |
---|---|---|

(Normal, Neptune) | (0.0009, 0.0027) | 7.33663 |

(Normal, Portsweep) | (0.0008, 0.0102) | 8.87933 |

(Neptune, Back) | (0.0000, 0.0000) | 8.09299 |

(Neptune, Smurf) | (0.0000, 0.0000) | 7.80478 |

(Back, Portseep) | (0.0000, 0.0017) | 8.17580 |

(Back, Smurf) | (0.0000, 0.0000) | 4.61716 |

(Teardrop, Satan) | (0.0000, 0.0014) | 6.73531 |

(Smurf, Pod) | (0.0000, 0.0000) | 7.99792 |

### 5.1 Comparative analysis: PCA vs. EPA

The results of the PCA transformation—applied to the full-imbalanced NSL-KDD data—are presented in Table 7, and they can be compared with the results of the proposed EPA approach (applied to the same data) in the second column of Table 6. We adopted two criteria to select the number of PCs: the eigenvalue-greater-than-1 criterion (i.e., the Kaiser–Guttman criterion) as used in [14], and the 80% cumulative variance rule as stated in [5]. The numbers of PCs selected by these criteria are 5 and 6, respectively. The random forest results (\(pd_\mathrm{t}\)) using the first 5 PCs and 6 PCs of these data are presented in the second and third columns of Table 7.
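The two selection rules can be sketched generically as follows; this is our illustration operating on the correlation matrix, and the helper name `n_components` is hypothetical, not the paper's code.

```python
import numpy as np

def n_components(X):
    """Return the number of PCs chosen by (i) the Kaiser-Guttman rule
    (eigenvalue of the correlation matrix > 1) and (ii) the 80%
    cumulative-variance rule."""
    eig = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    kaiser = int(np.sum(eig > 1.0))                  # eigenvalue > 1 rule
    cumvar = np.cumsum(eig) / eig.sum()              # cumulative variance share
    pct80 = int(np.searchsorted(cumvar, 0.80)) + 1   # first PC count >= 80%
    return kaiser, pct80
```

The two rules need not agree; as in the text, the cumulative-variance rule often retains one more PC than the Kaiser–Guttman criterion.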

The results in the second columns of Tables 6 and 7 show that the average performance degradation caused by PCA with 5 PCs is higher (almost double) than the degradation caused by the proposed EPA approach. In contrast, the results in the third column suggest that a smaller degradation is possible if 6 PCs are used. These results show that PCA with a larger number of PCs can achieve better classification accuracy; even so, the proposed approach remains competitive.

OOB errors of three cases using IRIS plant data set

Class | OOB: RF | OOB: RF-PCA | OOB: RF-EPA |
---|---|---|---|

Setosa | 0.00 | 0.00 | 0.00 |

Versicolor | 0.08 | 0.12 | 0.12 |

Virginica | 0.06 | 0.10 | 0.06 |

Performance evaluation of DPLR and EPA using IRIS data as binary classifiers

(Class 1, Class 2) | MCE: DPLR | SIR: DPLR | SIR: EPA |
---|---|---|---|

(Setosa, Versicolor) | (0.00, 0.00) | 13.52917 | 9.49978 |

(Setosa, Virginica) | (0.00, 0.00) | 10.48724 | 9.69263 |

(Versicolor, Virginica) | (0.04, 0.06) | 18.74225 | 7.40357 |

### 5.2 Comparative analysis: DPLR vs. EPA

The results of the DPLR transformation—applied to subsets of the NSL-KDD data—are presented in Table 8. DPLR is a binary classification approach [8]; therefore, we divided the NSL-KDD data set into several subsets with two classes each and then applied the DPLR approach. To facilitate a fair comparison, the proposed EPA approach was also applied to these subsets of the NSL-KDD data, and the new results are presented in Table 9. In Table 8, the tuples of traffic types considered for the binary DPLR classifier are presented in the first column. We tested all the pairs of traffic types, and the results are similar to those of the candidate pairs presented in this table. The second and third columns of Table 8 provide the misclassification errors and the SIR values for the pairs of traffic types designated by the first column. The low misclassification errors indicate that DPLR performs very well at classifying traffic types in a two-class setting. At the same time, the SIR values above or close to 20 dB indicate that the source signals (variables) can be recovered from the results of the DPLR classifiers—hence the data privacy is not strong.

Comparing the MCE values in Tables 8 and 9, we can clearly see that the misclassification errors of the proposed EPA approach are much lower than those of DPLR. At the same time, the SIR values below 12 dB for the EPA approach indicate that the source signals cannot be recovered; thus, it provides very strong data privacy. Overall, we conclude that the proposed approach is much better than the DPLR and PCA approaches. In the next section, we present experimental results that support the same findings using the IRIS data set.

### 5.3 Evaluation using IRIS plant data set

We also used the iris plant data set to evaluate and compare the EPA and PCA transformations and the transformation in DPLR. This simple yet effective data set has been used extensively in machine learning for the last several decades [7]. We obtained the data from the UCI Machine Learning Repository [21]. Random forest was first applied to the original iris data, and the resulting OOB errors are presented in the second column of Table 10. The data were then transformed into PCs using PCA; random forest classification was applied using all the PCs, and the OOB results are presented in the third column of Table 10. We also transformed the data set using the proposed EPA transformation and then applied random forest classification; the OOB results of the proposed approach are presented in the fourth column of the table. Note that the first column of the table shows the three classes of the iris plant. Comparing the results in Table 10, we can say that the proposed transformation provides classification results closer to those of random forest applied to the original data than the principal components do.

The DPLR approach is then applied to the iris data set, and the results for all possible pairs of class types are presented in Table 11. Once again, we can see higher SIR values for DPLR than for the EPA approach. However, the SIR values are below 20 dB; hence both DPLR and EPA provide data privacy. The SIR values of EPA, which are below 12 dB, indicate that EPA provides much stronger data privacy than DPLR.

## 6 Conclusion

This study allowed us to understand the variations that perturbation models induce between the input-domain and transform-domain characteristics, or numerical patterns, of feature vectors. This knowledge helped us construct a semi-parametric perturbation model using an elliptical transformation along with additive Gaussian noise degradation. The degradation performance analysis using random forest classifiers, together with a blind source separation attack and quantitative measures (signal interference ratio, OOB error, and misclassification error), showed that the proposed elliptical perturbation model performed very well in the classification of network intrusion and biological data while protecting the privacy of the feature patterns of the data.

Compared with classical linear transformations such as PCA, the proposed method requires fewer statistical assumptions on the data and is highly suitable for applications such as data privacy and security, as a result of the difficulty of inverting the elliptical patterns from the transform domain back to the input domain. In addition, we adopted a flexible block-wise dimension reduction step in the proposed method to accommodate the possible high-dimensional data (\(p \gg n\)) in modern applications, in which PCA is not directly applicable. The empirical results also confirmed the superior performance of the proposed EPA approach over the widely used PCA.
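The general idea of the block-wise step can be sketched in a few lines of NumPy: partition the feature columns into contiguous blocks and reduce each block separately, so that no single decomposition ever operates on all \(p\) features at once. The sketch below is illustrative only; it keeps one leading principal direction per block, whereas the paper's actual reduction rule and block layout may differ.

```python
import numpy as np

def blockwise_reduce(X, block_size=4):
    """Block-wise dimension reduction (illustrative sketch).

    Splits the columns of X into contiguous blocks of `block_size`
    features, centres each block, and keeps the score along its leading
    principal direction, mapping p features to ceil(p / block_size).
    """
    n, p = X.shape
    scores = []
    for start in range(0, p, block_size):
        block = X[:, start:start + block_size]
        block = block - block.mean(axis=0)  # centre each block separately
        # Leading right singular vector = first principal direction of the block.
        _, _, vt = np.linalg.svd(block, full_matrices=False)
        scores.append(block @ vt[0])
    return np.column_stack(scores)
```

Because each block is reduced independently, the step remains well defined even when \(p \gg n\), where a global PCA would be rank-deficient; the block size is a tuning parameter, which is what motivates the sensitivity analysis discussed below.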

We also carried out a sensitivity analysis to evaluate the robustness of our model by changing the parameter values and the block size (from 2 to 4 and 8). The results (e.g., the misclassification rate) are quite stable under these changes. In future work, it will be of interest to develop theoretical results for the optimal block size. Another future research direction is to develop perturbation models for online streaming data, where data arrive in sequential order and their associated distribution may vary with time. The proposed perturbation model can then be modified by including an additional layer of latent structure that allows the model parameters and block size to change over time. It will then be of interest to evaluate and compare the proposed methodology with other popular approaches in this area, including active learning [28], data stream mining [12], and transfer learning [33].

### Acknowledgements

This research of the first author was partially supported by the Department of Statistics, University of California at Irvine, and by the University of North Carolina at Greensboro. This material was based upon work partially supported by the National Science Foundation under Grant DMS-1638521 to the Statistical and Applied Mathematical Sciences Institute. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Shen’s research is partially supported by Simons Foundation Award 512620. The authors thank the Editor, the Associate Editor, and the referees for their valuable comments.

## References

- 1. Aghion, P., Bloom, N., Blundell, R., Griffith, R., Howitt, P.: Competition and innovation: an inverted-U relationship. Q. J. Econ. 120(2), 701–728 (2005)
- 2. Boscolo, R., Pan, H., Roychowdhury, V.P.: Independent component analysis based on nonparametric density estimation. IEEE Trans. Neural Netw. 15(1), 55–65 (2004)
- 3. Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
- 4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
- 5. Bruce, P., Bruce, A.: Practical Statistics for Data Scientists: 50 Essential Concepts. O'Reilly Media, Inc., Sebastopol (2017)
- 6. Caiafa, C.F., Proto, A.N.: A non-Gaussianity measure for blind source separation. In: Proceedings of SPARS05 (2005)
- 7. Chaudhary, A., Kolhe, S., Kamal, R.: A hybrid ensemble for classification in multiclass datasets: an application to oilseed disease dataset. Comput. Electron. Agric. 124, 65–72 (2016)
- 8. Chaudhuri, K., Monteleoni, C., Sarwate, A.D.: Differentially private empirical risk minimization. J. Mach. Learn. Res. 12(Mar), 1069–1109 (2011)
- 9. Du, K.L., Swamy, M.: Principal component analysis. In: Neural Networks and Statistical Learning, pp. 355–405. Springer, London (2014)
- 10. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001)
- 11. Fienberg, S.E., Steele, R.J.: Disclosure limitation using perturbation and related methods for categorical data. J. Off. Stat. 14(4), 485–502 (1998)
- 12. Gaber, M.M., Zaslavsky, A., Krishnaswamy, S.: Mining data streams: a review. SIGMOD Rec. 34(2), 18–26 (2005). https://doi.org/10.1145/1083784.1083789
- 13. Geiger, B.C.: Information loss in deterministic systems. Ph.D. thesis, Graz University of Technology, Graz, Austria (2014)
- 14. Hung, C.C., Liu, H.C., Lin, C.C., Lee, B.O.: Development and validation of the simulation-based learning evaluation scale. Nurse Educ. Today 40, 72–77 (2016)
- 15. Jeyakumar, V., Li, G., Suthaharan, S.: Support vector machine classifiers with uncertain knowledge sets via robust optimization. Optimization 63(7), 1099–1116 (2014)
- 16. Jin, S., Yeung, D.S., Wang, X.: Network intrusion detection in covariance feature space. Pattern Recogn. 40(8), 2185–2197 (2007)
- 17. Jolliffe, I.T., Cadima, J.: Principal component analysis: a review and recent developments. Philos. Trans. R. Soc. A 374(2065), 20150202 (2016)
- 18. Jones, D.G., Beston, B.R., Murphy, K.M.: Novel application of principal component analysis to understanding visual cortical development. BMC Neurosci. 8(S2), P188 (2007)
- 19. Lasko, T.A., Vinterbo, S.A.: Spectral anonymization of data. IEEE Trans. Knowl. Data Eng. 22(3), 437–446 (2010)
- 20. Lee, S., Habeck, C., Razlighi, Q., Salthouse, T., Stern, Y.: Selective association between cortical thickness and reference abilities in normal aging. NeuroImage 142, 293–300 (2016)
- 21. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml. Accessed 1 Nov 2017
- 22. Little, R.J.: Statistical analysis of masked data. J. Off. Stat. 9(2), 407–426 (1993)
- 23. Liu, K., Giannella, C., Kargupta, H.: A survey of attack techniques on privacy-preserving data perturbation methods. In: Aggarwal, C.C., Yu, P.S. (eds.) Privacy-Preserving Data Mining, pp. 359–381. Springer, US (2008)
- 24. Muralidhar, K., Sarathy, R.: A theoretical basis for perturbation methods. Stat. Comput. 13(4), 329–335 (2003)
- 25. Murthy, S.K.: Automatic construction of decision trees from data: a multi-disciplinary survey. Data Min. Knowl. Discov. 2(4), 345–389 (1998)
- 26. Oliveira, S.R., Zaïane, O.R.: Achieving privacy preservation when sharing data for clustering. In: Jonker, W., Petković, M. (eds.) Workshop on Secure Data Management, pp. 67–82. Springer, Berlin Heidelberg (2004)
- 27. Qian, Y., Xie, H.: Drive more effective data-based innovations: enhancing the utility of secure databases. Manag. Sci. 61(3), 520–541 (2015)
- 28. Rubens, N., Elahi, M., Sugiyama, M., Kaplan, D.: Active learning in recommender systems. In: Ricci, F., Rokach, L., Shapira, B. (eds.) Recommender Systems Handbook, pp. 809–846. Springer, Boston (2016)
- 29. Sørensen, M., De Lathauwer, L.: Blind signal separation via tensor decomposition with Vandermonde factor: canonical polyadic decomposition. IEEE Trans. Signal Process. 61(22), 5507–5519 (2013)
- 30. Suthaharan, S.: Machine Learning Models and Algorithms for Big Data Classification: Thinking with Examples for Effective Learning, vol. 36. Springer, New York (2015)
- 31. Suthaharan, S.: Support vector machine. In: Machine Learning Models and Algorithms for Big Data Classification, pp. 207–235. Springer, US (2016)
- 32. Suthaharan, S., Panchagnula, T.: Relevance feature selection with data cleaning for intrusion detection system. In: Southeastcon, 2012 Proceedings of IEEE, pp. 1–6. IEEE (2012)
- 33. Thrun, S., Pratt, L.: Learning to Learn. Springer, New York (2012)
- 34. Whitworth, J., Suthaharan, S.: Security problems and challenges in a machine learning-based hybrid big data processing network systems. ACM SIGMETRICS Perform. Eval. Rev. 41(4), 82–85 (2014)
- 35. Zarzoso, V., Nandi, A.: Blind source separation. In: Nandi, A. (ed.) Blind Estimation Using Higher-Order Statistics, pp. 167–252. Springer, US (1999)
- 36. Zumel, N., Mount, J., Porzak, J.: Practical Data Science with R, 1st edn. Manning, Shelter Island (2014)