1 Introduction

Building a model over a data set is often a straightforward task. However, when the data set contains sensitive information, special care has to be taken. Instead of having direct access to data, the users are provided with a sanitized view of the database containing private information, either through perturbed individual records or perturbed query responses.

From the users’ perspective, knowing that the responses to their queries are perturbed, they may not want to directly query the output of a complex model. Many statistical models are constructed from very basic statistics. Knowing the means, variances, and covariances, or equivalently the sums, the sums of squares, and the sums of cross products, users can build least squares regression models, conduct principal component analysis, construct hypothesis tests, build Bayesian classifiers under Gaussian mixture models, and so on. Although the basic statistics (e.g., means, variances, and covariances) satisfy the differential privacy guarantee, the complex models constructed from these basic statistics may no longer meet the differential privacy guarantee.

We note that it is up to the users to decide how to query a database and how to further utilize the query results. Building statistical models from the perturbed basic statistics provides quick initial estimates. If the results based on the perturbed query responses are promising, users can then proceed to improve the accuracy of the results.

In this article, our goal is to understand the impact of the differential privacy mechanism on Gaussian mixture models. Gaussian mixture models refer to the case where each class follows a multivariate Gaussian distribution. Hence users only need to obtain the mean vector and the variance-covariance matrix for each class. Out of all the statistical techniques that can be applied to Gaussian mixture models without further querying the database, we focus on building a classifier or performing a hypothesis test with the noisy responses. Through extensive experiments and theoretical discussion, we show when the classifiers and tests work reliably under a privacy protection mechanism, in particular differential privacy.

k-anonymity [17, 18, 20] and differential privacy [5] are two major privacy preserving models. Under the k-anonymity model, perturbed individual records are released to the users, while under the differential privacy model, perturbed query responses are released to the users. Recent work pointed out that the two privacy preserving models are complementary [3]. The main contributions of this article can be summarized as follows:

  1.

    We provide theoretical results on the type I and type II errors under differential privacy for several hypothesis tests. We also show when a hypothesis test returns a reliable result under the differential privacy mechanism.

  2.

    We propose a heuristic algorithm to repair the noise added variance-covariance matrix, which is no longer positive definite and cannot be directly used in building a Bayesian classifier.

  3.

    We examine the classification error for the multivariate Gaussian case through experiments. The experiments demonstrate when the impact of the added noise can be reduced.

The rest of the paper is organized as follows. Section 1.1 provides a brief overview of the differential privacy mechanism. Related work is discussed in Sect. 2. Section 3 provides theoretical results for hypothesis tests under differential privacy. In Sect. 4 we provide an algorithm to repair the noise added variance-covariance matrix, and study the classification error through extensive experiments. Section 5 concludes our discussion.

1.1 Differential Privacy

Let \(D=\{X_1,\cdots ,X_d\}\) be a d-dimensional database with n observations, where the domain of each attribute \(X_i\) is continuous and bounded. We are interested in building Gaussian mixture models over database D. One only needs to compute the expected values of each attribute \(X_i\) and the variance-covariance matrix, \(\varSigma _{ij}=cov(X_i,X_j)=E[(X_i-\mu _i)(X_j-\mu _j)]\), where \(\mu _i=E(X_i)\). More details follow in Sect. 4. Users obtain the values of \(\mu _i\)s and \(\varSigma _{ij}\)s by querying the database D. The query results are perturbed according to differential privacy mechanism. Next we briefly review differential privacy.

Given a set of queries \(Q=\{Q_1,...,Q_q \}\), the Laplace mechanism for differential privacy adds Laplace noise with parameter \(\lambda \) to the actual value. \(\lambda \) is determined by the privacy parameter \(\varepsilon \) and the sensitivity S(Q). Here, \(\varepsilon \) is a pre-determined parameter selected by the database curator, while the sensitivity S(Q) is a function of the query set Q. Hence the differential privacy mechanism minimizes the risk of identifying individual records in a database containing sensitive information.

Sensitivity is defined over sibling databases, which differ in only one observation.

$$\begin{aligned} S(Q)= \mathop {\mathrm {max}}_{\forall ~\mathrm {sibling~databases}~D_1,D_2} \sum _{i=1}^q|Q^{D_1}_i - Q^{D_2 }_i| \end{aligned}$$
(1)

That is, the sensitivity of Q is the maximum change, in the \(L_1\) norm, of the query responses caused by a single record update. Sensitivities for standard queries, such as sum, mean, and variance-covariance, are well established [6].

Once \(\varepsilon \) and S(Q) are known, \(\lambda \) is set such that \(\lambda \ge S(Q)/\varepsilon \). Then for each query Q, the database first computes the actual value \(Q^D\) in D, then adds Laplace noise to obtain the noisy response \(R^D\), and returns \(R^D\) to the users: \(R^D = Q^D + r\), where \(r \sim \) Laplace\((\lambda )\). There has been much work on sensitivity analysis. For the mean, variance, and covariance queries, we use the sensitivity results of [22] in this article. Later, in the experimental studies, the Laplace noises are added according to the results in [22]. Although there are other techniques to satisfy differential privacy (e.g., the exponential mechanism [16]), for the three basic queries needed to build Gaussian mixture models we leverage the Laplace mechanism discussed above.
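
As an illustration, the following sketch adds Laplace noise to a mean query. The helper names are ours, and the sensitivity bound in the example is the standard bound for a bounded attribute rather than the exact expressions of [22].

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return the noisy response R^D = Q^D + r with r ~ Laplace(lambda),
    where lambda = S(Q)/epsilon, the smallest value satisfying the constraint."""
    rng = np.random.default_rng() if rng is None else rng
    lam = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=lam)

# Example: a private mean query over an attribute bounded in [0, 1].
# Updating one record changes the mean by at most (upper - lower)/n.
rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=500)
noisy_mean = laplace_mechanism(data.mean(), sensitivity=1.0 / len(data), epsilon=0.5)
```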

2 Related Work

Gaussian mixture models are widely used in practice [4, 9]. The differential privacy mechanism [5] models the database as a statistical database. Random noises are added to the responses to user queries. The magnitude of the random noise is determined by the privacy parameter \(\varepsilon \) and the sensitivity of the query set. Different formulations of differential privacy have been proposed. One definition of sensitivity considers sibling data sets that have the same size but differ in one record [5, 7]. Other studies define sibling data sets through the insertion of a new record [6]. We follow the formulation in [5] in this article.

Classification under differential privacy has received some attention. In [8], Friedman et al. built ID3 decision trees through recursive queries that retrieve the information gain. Jagannathan et al. [10] built multiple random decision trees using sum queries. [1] proposed perturbing the objective function before optimization for empirical risk minimization, and studied lower bounds for the noisy versions of convex optimization algorithms. Privacy preserving optimization is an important component of some classifiers, such as regularized logistic regression and support vector machines (SVMs). [12] extended the results in [1] and also proposed differentially private optimization algorithms for convex empirical risk minimization. [19] proposed a privacy preserving mechanism for SVMs.

In [14], every component of the mixture population follows a Gaussian distribution. A perturbation matrix was generated based on a gamma distribution. The gamma perturbations were included in the objective function as multipliers, and a classifier was learned by maximizing the perturbed objective function. In contrast, we consider classifiers that can be constructed from very basic statistics, i.e., means, variances, and covariances, and show how their performance is affected by the added noises. In this article, we present Bayes classifiers based on Gaussian mixture models built by querying the mean vector and the variance-covariance matrix for each class.

[2] proposed an algorithm using a Markov Chain Monte Carlo procedure to produce principal components that satisfy differential privacy. It is a differentially private low-rank approximation to a positive semi-definite matrix. Typically the rank k is much smaller than the dimension d. [11] also proposed an algorithm to produce a differentially private low-rank approximation to a positive definite matrix. [15] focused on producing recommendations under differential privacy. In [15], the true ratings were perturbed; a variance-covariance matrix was computed using the perturbed ratings; noises were added to the resulting matrix; and a low-rank approximation to the noise added matrix was then computed. Compared to the existing work, we focus on the scenario where all the variables are used to learn a variance-covariance matrix and the subsequent classifier, and dimension reduction is not needed.

[13] proposed differentially private M-estimators, such as sample quantiles and maximum likelihood estimators, based on the perturbed histogram. Our work has a different focus. We examine the classifiers and hypothesis tests constructed using the differentially private sample means, variances, and covariances. [21] derived rules for adjusting sample sizes to achieve a pre-specified power for Pearson’s \(\chi ^2\) test of independence and the test of a sample proportion. For the second test, when the sample size is reasonably large, the sample proportion is approximately normally distributed, and [21] developed sample size adjustment results based on this approximate normal distribution. Our work provides theoretical results to compute the exact type I and type II errors for the one sample z test, the one sample t test, and the two sample t test. Both the type I and type II errors are functions of \(\varepsilon \) and n. Hence, with a known \(\varepsilon \) value, users can obtain the minimum sample size required to achieve a pre-specified power while the exact type I error is controlled by a certain upper bound.

3 Hypothesis Tests Under Differential Privacy

The differential privacy mechanism has a large impact on hypothesis tests because the test statistic is now computed from the noise added query results, and hypothesis tests often apply to data with smaller sample sizes than classification. Next we provide the distributions of the noise added test statistic under the null value and an alternative value.

Only when we know the true \(\lambda \)s for the Laplace noises can we numerically compute the exact p-value for a noise added test statistic. The true \(\lambda \)s are unknown to the users querying a database. Hence in this section we examine a more realistic scenario: a rejection region is constructed using the critical values from a Gaussian distribution or a t distribution as usual, and users compute a test statistic using the noise added mean and variance and make a decision. The exact type I and type II errors can be computed numerically for likely \(\varepsilon \) values, which provides a reliability check of the test for users. Here we show for what sample size the exact type I and type II errors are close to those without the added noises. We consider the most commonly used hypothesis tests: the one sample z test, the one sample t test, and the two sample t test with equal variance.

For the two sample t test with unequal variances, the degrees of freedom of the standard test are also affected by the added Laplace noises. Constructing a rejection region and computing the exact type I and type II errors merit more effort in this case; this is part of our future work.

3.1 One Sample z Test

Assume n samples \(Y_1,Y_2,...,Y_n\) i.i.d \(\sim N(\mu , \sigma ^2)\), where \(\sigma ^2\) is known. The null hypothesis is \(H_0: \mu = \mu _0\). We consider the common two-sided alternative hypothesis \(H_a: \mu \ne \mu _0\) or the one-sided \(H_a: \mu > \mu _0\) and \(H_a: \mu < \mu _0\).

The test statistic is based on the noise added sample mean \(\bar{Y}^a =\bar{Y} + r\), where \(r \sim \text {Laplace}(\lambda )\). The test statistic under differential privacy is

$$\begin{aligned} Z=\frac{\bar{Y}^a - \mu _0}{\sigma /\sqrt{n}}. \end{aligned}$$

\(\bar{Y}^a\) follows a Gaussian-Laplace mixture distribution, \(\text {GL}(\mu , \sigma ^2, n, \lambda )\). It has the cumulative distribution function (CDF) \(F_{a}(y| \mu )\) as follows.

$$\begin{aligned} F_{a}(y| \mu )= & {} \varPhi \left( \frac{y-\mu }{\sigma /\sqrt{n}}\right) + \frac{1}{2}\exp \{\frac{y - \mu }{\lambda } + \frac{\sigma ^2}{2n \lambda ^2}\} \varPhi \left( - \frac{y-\mu }{\sigma /\sqrt{n}} - \frac{\sigma }{\lambda \sqrt{n}}\right) \nonumber \\- & {} \frac{1}{2}\exp \{\frac{-y + \mu }{\lambda } + \frac{\sigma ^2}{2n \lambda ^2}\}\varPhi \left( \frac{y-\mu }{\sigma /\sqrt{n}} - \frac{\sigma }{\lambda \sqrt{n}}\right) , \end{aligned}$$
(2)

where \(\varPhi (\cdot )\) is the CDF of the unit Gaussian distribution.

We can easily derive the distribution of the test statistic under the null value and an alternative value by re-scaling \(\bar{Y}^a\). However for the one sample z test the computation of the exact type I and type II errors can be done in a simpler fashion. Here and for the rest of this section we show the exact type I and type II errors for the two-sided alternative \(H_a: \mu \ne \mu _0\). The results for the one-sided alternatives can be derived similarly.

Let \(\alpha \) be the significance level of the test. Let \(z_{\frac{\alpha }{2}}\) be the \((1-\frac{\alpha }{2})\) quantile of the unit Gaussian distribution (i.e., the upper quantile). \(\alpha \) and \(\beta \) are the type I and type II errors for the standard test, without the added Laplace noise. For the test under differential privacy, we have the exact type I error, \(\alpha ^a\), and type II error, \(\beta ^a\), as follows.

$$\begin{aligned} \alpha ^a= & {} P\left( \left| \frac{\bar{Y}^a - \mu _0}{\sigma /\sqrt{n}} \right| > z_{\frac{\alpha }{2}} | H_0 \right) = 1-F_{a}\left( \mu _0 + z_{\frac{\alpha }{2}}\frac{\sigma }{\sqrt{n}} ~|~\mu _0 \right) + F_{a}\left( \mu _0 - z_{\frac{\alpha }{2}}\frac{\sigma }{\sqrt{n}} ~|~\mu _0 \right) \\= & {} \alpha + e^{\frac{-z_{\frac{\alpha }{2}} \sigma }{\lambda \sqrt{n}} + \frac{\sigma ^2}{2n \lambda ^2}} \varPhi \left( z_{\frac{\alpha }{2}} - \frac{\sigma }{\lambda \sqrt{n}}\right) -e^{\frac{z_{\frac{\alpha }{2}} \sigma }{\lambda \sqrt{n}} + \frac{\sigma ^2}{2n \lambda ^2}} \varPhi \left( -z_{\frac{\alpha }{2}} - \frac{\sigma }{\lambda \sqrt{n}}\right) , \end{aligned}$$
$$\begin{aligned} \beta ^a= & {} P\left( \left| \frac{\bar{Y}^a - \mu _0}{\sigma /\sqrt{n}} \right| < z_{\frac{\alpha }{2}} | H_a \right) = F_{a}\left( \mu _0 + z_{\frac{\alpha }{2}}\frac{\sigma }{\sqrt{n}} ~|~\mu _a \right) - F_{a}\left( \mu _0 - z_{\frac{\alpha }{2}}\frac{\sigma }{\sqrt{n}}~|~\mu _a \right) \\= & {} \beta + \frac{1}{2}\exp \{\frac{-z_{\frac{\alpha }{2}} \sigma }{\lambda \sqrt{n}} + \frac{\mu _0-\mu _a}{\lambda } + \frac{\sigma ^2}{2n \lambda ^2}\} \times \varPhi \left( -z_{\frac{\alpha }{2}}+ \frac{\mu _0-\mu _a}{\sigma /\sqrt{n}} + \frac{\sigma }{\lambda \sqrt{n}}\right) \\+ & {} \frac{1}{2}\exp \{ \frac{z_{\frac{\alpha }{2}} \sigma }{\lambda \sqrt{n}} -\frac{\mu _0-\mu _a}{\lambda } + \frac{\sigma ^2}{2n \lambda ^2}\} \times \varPhi \left( -z_{\frac{\alpha }{2}} + \frac{\mu _0-\mu _a}{\sigma /\sqrt{n}} - \frac{\sigma }{\lambda \sqrt{n}} \right) \\- & {} \frac{1}{2}\exp \{ \frac{\sigma ^2}{2n \lambda ^2} + \frac{\mu _0-\mu _a}{\lambda } - \frac{z_{\frac{\alpha }{2}} \sigma }{\lambda \sqrt{n}}\} +\frac{1}{2}\exp \{ \frac{\sigma ^2}{2n \lambda ^2} + \frac{\mu _0-\mu _a}{\lambda }+ \frac{z_{\frac{\alpha }{2}}\sigma }{\lambda \sqrt{n}} \} \\- & {} \frac{1}{2}\exp \{\frac{z_{\frac{\alpha }{2}}\sigma }{\lambda \sqrt{n}} + \frac{\mu _0-\mu _a}{\lambda } +\frac{\sigma ^2}{2n \lambda ^2}\} \times \varPhi \left( z_{\frac{\alpha }{2}}+ \frac{\mu _0-\mu _a}{\sigma /\sqrt{n}} + \frac{\sigma }{\lambda \sqrt{n}}\right) \\- & {} \frac{1}{2}\exp \{\frac{-z_{\frac{\alpha }{2}} \sigma }{\lambda \sqrt{n}} - \frac{\mu _0-\mu _a}{\lambda } + \frac{\sigma ^2}{2n \lambda ^2}\} \times \varPhi \left( z_{\frac{\alpha }{2}}+ \frac{\mu _0-\mu _a}{\sigma /\sqrt{n}} - \frac{\sigma }{\lambda \sqrt{n}} \right) . \end{aligned}$$
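
To make these expressions concrete, the following sketch (our own helper functions, not part of the original derivation) evaluates the Gaussian-Laplace CDF of Eq. 2 and the resulting exact errors numerically; for very large \(n\varepsilon \) the exponential terms may require evaluation on the log scale.

```python
import numpy as np
from scipy.stats import norm

def gl_cdf(y, mu, sigma, n, lam):
    """CDF of the Gaussian-Laplace mixture GL(mu, sigma^2, n, lambda), Eq. (2)."""
    s = sigma / np.sqrt(n)                     # standard deviation of the sample mean
    z = (y - mu) / s
    c = sigma**2 / (2 * n * lam**2)
    return (norm.cdf(z)
            + 0.5 * np.exp((y - mu) / lam + c) * norm.cdf(-z - s / lam)
            - 0.5 * np.exp((mu - y) / lam + c) * norm.cdf(z - s / lam))

def z_test_exact_errors(mu0, mu_a, sigma, n, lam, alpha=0.05):
    """Exact type I and type II errors of the two-sided one sample z test
    when the statistic is built from the noise added sample mean."""
    s = sigma / np.sqrt(n)
    z_crit = norm.ppf(1 - alpha / 2)
    alpha_a = (1 - gl_cdf(mu0 + z_crit * s, mu0, sigma, n, lam)
               + gl_cdf(mu0 - z_crit * s, mu0, sigma, n, lam))
    beta_a = (gl_cdf(mu0 + z_crit * s, mu_a, sigma, n, lam)
              - gl_cdf(mu0 - z_crit * s, mu_a, sigma, n, lam))
    return alpha_a, beta_a

# e.g. sigma = 0.5, n = 100, lambda = 1/(n * epsilon) with epsilon = 0.6 (cf. Sect. 3.4)
print(z_test_exact_errors(mu0=0.0, mu_a=0.2, sigma=0.5, n=100, lam=1 / (100 * 0.6)))
```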

3.2 One Sample t Test

Assume n samples \(Y_1,Y_2,...,Y_n\) i.i.d \(\sim N(\mu , \sigma ^2)\), where \(\sigma ^2\) is unknown. The null hypothesis is \(H_0: \mu = \mu _0\). The common alternative hypotheses are \(H_a: \mu \ne \mu _0\), \(H_a: \mu > \mu _0\), or \(H_a: \mu < \mu _0\). Suppose users query the sample mean and the sample variance. Then the test statistic involves two noise added sample statistics,

$$\begin{aligned} T^a=\frac{\bar{Y}^a - \mu _0}{S^a/\sqrt{n}}, \end{aligned}$$

where \(\bar{Y}^a =\bar{Y} + r_1\) with \(r_1 \sim \text {Laplace}(\lambda _1)\), and \(S^a = \sqrt{S^2+r_2}\) with \(r_2 \sim \text {Laplace}(\lambda _2)\).

To obtain the distribution of the test statistic under either the null value or an alternative value, we re-write the test statistic as

$$\begin{aligned} T^a=\frac{Z^a}{X^a}, ~\text { where } Z^a=\frac{\bar{Y}^a -\mu }{\sigma /\sqrt{n}}+\frac{\mu - \mu _0}{\sigma /\sqrt{n}} \text { and } X^a=\sqrt{(S^a)^2/\sigma ^2 }. \end{aligned}$$

We obtain the distribution of \(Z^a\) by rescaling a Gaussian-Laplace mixture distribution. Similarly we obtain the distribution of \(X^a\) based on a Chi-Square-Laplace mixture distribution. Let \(F_Z(z)\) be the CDF of \(Z^a\) and \(f_X(x)\) be the PDF of \(X^a\).

$$\begin{aligned} F_{Z}(z|\mu )= & {} \varPhi \left( z-\delta \right) +\frac{1}{2}\exp \{ \frac{\sigma (z-\delta )}{\lambda _1 \sqrt{n}} + \frac{\sigma ^2}{2n \lambda _1^2}\} \times \varPhi \left( - (z- \delta ) - \frac{\sigma }{\lambda _1 \sqrt{n}} \right) \nonumber \\- & {} \frac{1}{2}\exp \{-\frac{\sigma (z-\delta )}{\lambda _1 \sqrt{n}} + \frac{\sigma ^2}{2n \lambda _1^2}\} \times \varPhi \left( (z-\delta ) - \frac{\sigma }{\lambda _1 \sqrt{n}}\right) , \end{aligned}$$
(3)

where \(\delta =\frac{\mu _0-\mu }{\sigma /\sqrt{n}}\). \(\delta \) equals 0 under the null and is nonzero under the alternative. The distribution of \(X^a\) does not depend on the mean.

$$\begin{aligned} f_{X}(x)= & {} [~2x f_{g}(x^2|\frac{n-1}{2},\theta _0) + \frac{\sigma ^2 x}{\lambda _2} e^{-\frac{\sigma ^2 x^2}{\lambda _2}}(\frac{\theta _1}{\theta _0})^{\frac{n-1}{2}} F_{g}(x^2|\frac{n-1}{2},\theta _1) \nonumber \\- & {} x e^{-\frac{\sigma ^2 x^2}{\lambda _2}} (\frac{\theta _1}{\theta _0})^{\frac{n-1}{2}} f_{g}(x^2|\frac{n-1}{2},\theta _1) + \frac{\sigma ^2 x}{\lambda _2} e^{\frac{\sigma ^2 x^2}{\lambda _2}} (\frac{\theta _2}{\theta _0})^{\frac{n-1}{2}} (1 - F_{g}(x^2|\frac{n-1}{2},\theta _2)) \nonumber \\- & {} x e^{\frac{\sigma ^2 x^2}{\lambda _2}} (\frac{\theta _2}{\theta _0})^{\frac{n-1}{2}} f_{g}(x^2|\frac{n-1}{2},\theta _2)~]~/~ [ 1-\frac{1}{2}(\frac{\theta _2}{\theta _0})^{\frac{n-1}{2}} ] \end{aligned}$$
(4)

where \(\theta _0=\frac{2}{n-1}\), \(\theta _1=\frac{2}{n-1-2\sigma ^2/\lambda _2}\), \(\theta _2= \frac{2}{n-1 + 2\sigma ^2/\lambda _2}\), and \(F_{g}\) and \(f_{g}\) are the CDF and PDF of a gamma distribution respectively.

The distribution of the test statistic \(T^a\) given mean \(\mu \) is

$$\begin{aligned} F_{T}(t|\mu ) = \left\{ \begin{array}{l l} \int _0^{\infty }F_{Z}(tx|\mu )f_{X}(x)\;\mathrm {d}x &{} t\ge 0\\ \int _0^{\infty }\left( 1- F_{Z}(tx|\mu ) \right) f_{X}(x)\; \mathrm {d}x &{} t<0 \end{array} \right. \end{aligned}$$
(5)

Let \(t_{\frac{\alpha }{2},n-1}\) be the \((1-\frac{\alpha }{2})\) quantile of a t distribution with \(n-1\) degrees of freedom. The exact type I and type II errors can then be computed numerically from Eq. 5. Again we consider \(\alpha ^a\) and \(\beta ^a\) under the two sided alternative; the errors for the one sided alternatives can be obtained similarly.
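
A minimal numerical sketch of this computation follows (our own code; it assumes the denominator \(X^a\) takes positive values, so the exact errors reduce to one-dimensional quadrature once the CDF of \(Z^a\) in Eq. 3 and the PDF of \(X^a\) in Eq. 4 are implemented as callables). The same routine applies to the two sample t test of Sect. 3.3 with its own CDF and PDF plugged in.

```python
from scipy import integrate
from scipy.stats import t as t_dist

def exact_errors_t(F_Z_null, F_Z_alt, f_X, df, alpha=0.05, upper=50.0):
    """Exact type I / type II errors of the two sided t test with noise added statistics.
    F_Z_null, F_Z_alt: CDF of Z^a under the null and an alternative mean (Eq. 3).
    f_X: PDF of X^a (Eq. 4). 'upper' truncates the numerical integration range."""
    t_crit = t_dist.ppf(1 - alpha / 2, df)
    # alpha^a = P(|T^a| > t_crit | H_0), obtained by conditioning on X^a = x > 0
    alpha_a, _ = integrate.quad(
        lambda x: (1 - F_Z_null(t_crit * x) + F_Z_null(-t_crit * x)) * f_X(x), 0.0, upper)
    # beta^a = P(|T^a| < t_crit | H_a)
    beta_a, _ = integrate.quad(
        lambda x: (F_Z_alt(t_crit * x) - F_Z_alt(-t_crit * x)) * f_X(x), 0.0, upper)
    return alpha_a, beta_a
```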

3.3 Two Sample t Test with Equal Variance

Assume \(n_1\) samples \(Y^1_1,Y^1_2,...,Y^1_{n_1}\) i.i.d \(\sim N(\mu _1, \sigma ^2)\), \(n_2\) samples \(Y^2_1,Y^2_2,...,Y^2_{n_2}\) i.i.d \(\sim N(\mu _2, \sigma ^2)\), where \(\sigma ^2\) is unknown. The null hypothesis is \(H_0: \mu _1-\mu _2 = 0\). The common alternative hypotheses are \(H_a: \mu _1-\mu _2 \ne 0\), \(H_a: \mu _1-\mu _2 > 0\), or \(H_a: \mu _1-\mu _2 < 0\).

Suppose users query the sample means and the sample variances. Then the test statistic involves multiple noise added sample statistics.

$$\begin{aligned} T^a=\frac{\bar{Y}^a_1 - \bar{Y}^a_2}{S^a \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} , \end{aligned}$$

where \(\bar{Y}^a_1 =\bar{Y}_1 + r_1\), \(\bar{Y}^a_2 =\bar{Y}_2 + r_2\), and \(S^a = \sqrt{\frac{(n_1-1)(S_1^2+r_3)+(n_2-1)(S_2^2+r_4)}{n_1+n_2-2}}\), with \(r_i \sim \text {Laplace}(\lambda _i)\), \(i=1 \sim 4\). We re-write the test statistic as

$$\begin{aligned} T^a=\frac{Z^a}{X^a} ,\text { where } Z^a=\frac{\bar{Y}_1^a - \bar{Y}_2^a - (\mu _1-\mu _2)}{\sigma \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} + \frac{(\mu _1-\mu _2)}{\sigma \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \text { and } X^a=\frac{S^a}{\sigma }. \end{aligned}$$

Since the Laplace noises are added independently, we can obtain the distribution of the numerator by convolving Gaussian and Laplace distributions. The distribution of \(X^a\) is based on the convolution of chi-square and Laplace distributions. The distributions of \(Z^a\) and \(X^a\) depend on the Laplace noise parameters \(\lambda _i\), \(i=1\sim 4\). We obtain their distributions under two separate cases. Let \(\upsilon =n_1+n_2-2\) and \(\delta =\frac{\mu _1-\mu _2}{\sigma \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}\). \(\delta \) equals 0 under \(H_0\) and is nonzero under \(H_a\).

Distribution of \(Z^a\), \(\lambda _1\ne \lambda _2\) : We have the CDF

$$\begin{aligned}&F_{Z}(z|\mu _1-\mu _2) = \varPhi \left( z- \delta \right) - \frac{\lambda _2^2}{2(\lambda _1^2-\lambda _2^2)} e^{\tau _2(z-\delta ) +\frac{\tau _2^2}{2}}\left( 1 - \varPhi (z-\delta + \tau _2)\right) \nonumber \\+ & {} \frac{\lambda _2^2}{2(\lambda _1^2-\lambda _2^2)} e^{\frac{\tau _2^2}{2}-\tau _2(z-\delta )}\varPhi (z- \delta - \tau _2) + \frac{\lambda _1^2}{2(\lambda _1^2-\lambda _2^2)} e^{\frac{\tau _1^2}{2} + \tau _1(z-\delta )}(1-\varPhi (z-\delta + \tau _1)) \nonumber \\- & {} \frac{\lambda _1^2}{2(\lambda _1^2-\lambda _2^2)} e^{\frac{\tau _1^2}{2} - \tau _1(z-\delta )}\varPhi (z-\delta - \tau _1) \end{aligned}$$
(6)

where \(\tau _1=\sigma \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}/\lambda _1\), and \(\tau _2=\sigma \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}/\lambda _2\).

Distribution of \(Z^a\), \(\lambda _1 = \lambda _2\) : We have the CDF

$$\begin{aligned}&F_{Z}(z|\mu _1-\mu _2)= \varPhi \left( z - \delta \right) - \left( \frac{1}{2} + \frac{\tau (z-\delta )}{4} - \frac{\tau ^2}{4} \right) e^{\frac{\tau ^2}{2} - \tau (z - \delta )}\varPhi \left( z - \delta - \tau \right) \nonumber \\- & {} \frac{ \tau }{4\sqrt{2\pi }} e^{\frac{\tau ^2}{2} - \tau (z-\delta ) - \frac{(z-\delta - \tau )^2}{2}} + \frac{\tau }{4\sqrt{2\pi }} e^{\frac{\tau ^2}{2} + \tau (z - \delta ) - \frac{(z - \delta + \tau )^2}{2}} \nonumber \\+ & {} ( \frac{1}{2} - \frac{\tau (z-\delta )}{4} - \frac{\tau ^2}{4} ) e^{\frac{\tau ^2}{2} + \tau (z-\delta )} (1 - \varPhi (z-\delta + \tau )) \end{aligned}$$
(7)

where \(\tau =\sigma \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}/\lambda _1\).

Distribution of \(X^a\), \(\lambda _3 \ne \lambda _4\) : It does not depend on \(\mu _1-\mu _2\). Note \(\upsilon =n_1+n_2-2\). We have the PDF

$$\begin{aligned}&f_{X}(x) = [ 2xf_{G}(x^2;\frac{\upsilon }{2}, \theta _0) + \frac{b_2^2}{b_2^2 - b_1^2} e^{-b_1x^2}(b_1x)(\frac{\theta _1}{\theta _0})^{\frac{\upsilon }{2}}F_{G}(x^2;\frac{\upsilon }{2}, \theta _1) \\- & {} \frac{b_2^2 x}{b_2^2 - b_1^2} e^{-b_1x^2}(\frac{\theta _1}{\theta _0})^{\frac{\upsilon }{2}} f_{G}(x^2;\frac{\upsilon }{2}, \theta _1) - \frac{b_1^2}{b_2^2 - b_1^2} e^{-b_2x^2}(b_2x)(\frac{\theta _2}{\theta _0})^{\frac{\upsilon }{2}}F_{G}(x^2;\frac{\upsilon }{2}, \theta _2) \\+ & {} \frac{b_1^2 x}{b_2^2 - b_1^2} e^{-b_2x^2}(\frac{\theta _2}{\theta _0})^{\frac{\upsilon }{2}}f_{G}(x^2;\frac{\upsilon }{2}, \theta _2) + \frac{b_2^2 }{b_2^2 - b_1^2} e^{b_1x^2}(b_1x)(\frac{\theta _3}{\theta _0})^{\frac{\upsilon }{2}}(1 - F_{G}(x^2;\frac{\upsilon }{2}, \theta _3)) \\- & {} \frac{b_2^2 x}{b_2^2 - b_1^2} e^{b_1x^2}(\frac{\theta _3}{\theta _0})^{\frac{\upsilon }{2}}f_{G}(x^2;\frac{\upsilon }{2}, \theta _3) - \frac{b_1^2}{b_2^2 - b_1^2} e^{b_2x^2}(b_2x)(\frac{\theta _4}{\theta _0})^{\frac{\upsilon }{2}}(1 - F_{G}(x^2;\frac{\upsilon }{2}, \theta _4)) \\+ & {} \frac{b_1^2 x}{b_2^2 - b_1^2} e^{b_2x^2}(\frac{\theta _4}{\theta _0})^{\frac{\upsilon }{2}}f_{G}(x^2;\frac{\upsilon }{2}, \theta _4) ]/ [1 - \frac{b_2^2}{2(b_2^2 -b_1^2)}(\frac{\theta _3}{\theta _0})^{\frac{\upsilon }{2}} + \frac{b_1^2}{2(b_2^2-b_1^2)}(\frac{\theta _4}{\theta _0})^{\frac{\upsilon }{2}}] \end{aligned}$$

where \(\tau _1=\sigma \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}/\lambda _1\), \(\tau _2=\sigma \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}/\lambda _2\), \(b_1=\frac{(n_1+n_2-2) \sigma ^2}{(n_1-1)\lambda _3}\), \(b_2=\frac{(n_1+n_2-2)\sigma ^2}{(n_2-1)\lambda _4}\), \(\theta _0=\frac{2}{n_1+n_2-2}\), \(\theta _1=\frac{2}{n_1+n_2-2-2b_1}\), \(\theta _2=\frac{2}{n_1+n_2-2-2b_2}\), \(\theta _3=\frac{2}{n_1+n_2-2+2b_1}\), \(\theta _4=\frac{2}{n_1+n_2-2+2b_2}\).

Fig. 1. Exact type I errors for increasing sample size n and five \(\varepsilon \)s: 0.2, 0.4, 0.6, 0.8, and 1. Top left is one sample z test; top right is one sample t test; bottom left is two sample t test with equal sample size and equal variance; bottom right is two sample t test with unequal sample sizes and equal variance.

Fig. 2. Red line X is for one sample z test; blue line o is for one sample t test; pink line triangle is for two sample t test with equal sample size and equal variance; green line + is for two sample t test with unequal sample size and equal variance. \(n=50\). Left: \(\varepsilon =0.2\); Middle: \(\varepsilon =0.6\); Right: \(\varepsilon =1\). (Color figure online)

Fig. 3. Red line X is for one sample z test; blue line o is for one sample t test; pink line triangle is for two sample t test with equal sample size and equal variance; green line + is for two sample t test with unequal sample size and equal variance. \(n=100\). Left: \(\varepsilon =0.2\); Middle: \(\varepsilon =0.6\); Right: \(\varepsilon =1\). (Color figure online)

Distribution of \(X^a\), \(\lambda _3 = \lambda _4\) : Again, it does not depend on \(\mu _1-\mu _2\). We have the PDF

$$\begin{aligned}&f_{X}(x) = [~ 2xf_G(x^2;\frac{\upsilon }{2},\theta _0) + (\frac{b^2x^3+bx}{2}) e^{-bx^2}(\frac{\theta _1}{\theta _0})^{\frac{\upsilon }{2}} F_{G}(x^2;\frac{\upsilon }{2}, \theta _1) \\- & {} (\frac{2x + bx^3}{2}) e^{-bx^2}(\frac{\theta _1}{\theta _0})^{\frac{\upsilon }{2}} f_{G}(x^2;\frac{\upsilon }{2}, \theta _1) - (\frac{b^2x}{2}) e^{-bx^2}(\frac{\theta _1}{\theta _0})^{\frac{\upsilon +2}{2}} F_{G}(x^2;\frac{\upsilon +2}{2}, \theta _1) \\+ & {} ( \frac{bx}{2}) e^{-bx^2}(\frac{\theta _1}{\theta _0})^{\frac{\upsilon +2}{2}} f_{G}(x^2;\frac{\upsilon +2}{2}, \theta _1) + (\frac{bx-b^2x^3}{2}) e^{bx^2}(\frac{\theta _2}{\theta _0})^{\frac{\upsilon }{2}}( 1 - F_{G}(x^2;\frac{\upsilon }{2},\theta _2)) \\- & {} ( \frac{2x - bx^3}{2}) e^{bx^2}(\frac{\theta _2}{\theta _0} )^{\frac{\upsilon }{2}}f_{G}(x^2;\frac{\upsilon }{2}, \theta _2) + (\frac{b^2x}{2}) e^{bx^2}(\frac{\theta _2}{\theta _0} )^{\frac{\upsilon +2}{2}}( 1 -F_{G}(x^2;\frac{\upsilon +2}{2},\theta _2)) \\- & {} (\frac{bx}{2}) e^{bx^2}(\frac{\theta _2}{\theta _0} )^{\frac{\upsilon +2}{2}} f_{G}(x^2;\frac{\upsilon +2}{2},\theta _2) ~]/ [~1 - \frac{1}{2}(\frac{\theta _2}{\theta _0})^{\frac{\upsilon }{2}} - \frac{b}{4}( \frac{\theta _2}{\theta _0})^{\frac{\upsilon +2}{2}}] \end{aligned}$$

where \(b=2\sigma ^2/\lambda _3\), \(\theta _0=\frac{2}{n_1 + n_2 -2}\), \(\theta _1=\frac{2}{n_1+ n_2-2-b}\), and \(\theta _2=\frac{2}{n_1+n_2-2+b}\).

Given the Laplace noise parameters \(\lambda _i\), we select the appropriate CDF of \(Z^a\) and PDF of \(X^a\) from the cases above. The distribution of the test statistic \(T^a\) given the value of \(\mu _1-\mu _2\) follows Eq. 5. Let \(t_{\frac{\alpha }{2},\upsilon }\) be the \((1-\frac{\alpha }{2})\) quantile of a t distribution with \(\upsilon \) degrees of freedom. The exact type I and type II errors again can be computed numerically. We show \(\alpha ^a\) and \(\beta ^a\) under the two sided alternative; the errors for the one sided alternatives can be obtained similarly. Let \(\delta =\mu _1-\mu _2\).

$$ \alpha ^a = P\left( \left| T^a \right| > t_{\frac{\alpha }{2},\upsilon }~|~\delta =0 \right) = 1- F_{T}(t_{\frac{\alpha }{2},\upsilon }~|~\delta =0) + F_{T}(-t_{\frac{\alpha }{2},\upsilon }~|~\delta =0), $$
$$ \beta ^a = P\left( \left| T^a \right| < t_{\frac{\alpha }{2},\upsilon }~|~\delta \ne 0\right) = F_{T}(t_{\frac{\alpha }{2},\upsilon }~|~\delta \ne 0) - F_{T}(-t_{\frac{\alpha }{2},\upsilon }~|~\delta \ne 0). $$

3.4 Experimental Evaluation

To examine when the exact type I and type II errors deviate from their nominal counterparts, we run a set of experiments and provide the results in the following tables and figures. For all the experiments we set \(\alpha =0.05\), increase the sample size n from 50 to 400 in steps of 25, and examine five \(\varepsilon \) values: 0.2, 0.4, 0.6, 0.8, and 1. The Laplace parameter is set to \(\lambda =1/(n_i\varepsilon )\).

Fig. 4. Red line X is for one sample z test; blue line o is for one sample t test; pink line triangle is for two sample t test with equal sample size and equal variance; green line + is for two sample t test with unequal sample size and equal variance. \(n=200\). Left: \(\varepsilon =0.2\); Middle: \(\varepsilon =0.6\); Right: \(\varepsilon =1\). (Color figure online)

Fig. 5. Red line X is for one sample z test; blue line o is for one sample t test; pink line triangle is for two sample t test with equal sample size and equal variance; green line + is for two sample t test with unequal sample size and equal variance. \(n=400\). Left: \(\varepsilon =0.2\); Middle: \(\varepsilon =0.6\); Right: \(\varepsilon =1\). (Color figure online)

In Table 1, we show the exact type I errors for selected sample sizes n: 50, 100, 200, 300, and 400. Figure 1 shows the exact type I errors for the tests with increasing sample size n. As the sample size increases and \(\varepsilon \) becomes larger, the exact type I errors approach \(\alpha =0.05\). Considering the exact type I error only, when users construct a test statistic with the noise added mean and variance, the sample size needs to be 100 or larger to provide a reliable result for moderate to small noise. For large noise, i.e., \(\varepsilon \le 0.2\), the sample size needs to be 400 or larger for a reliable test.

Figures 2, 3, 4 and 5 show the type II errors with noise added query results for selected n: 50, 100, 200, 400 and \(\varepsilon \): 0.2, 0.6, 1. Hypothesis tests often operate with far fewer samples than classification, since a test is almost always significant for a very large data set. For the tests considered in this article, the type I errors based on noise added query results decrease sharply as the sample size increases. The type II error depends on the difference between the true value and the hypothesized value. The type II error under differential privacy also improves significantly as the sample size increases and \(\varepsilon \) becomes larger.

Notice that users cannot know how much noise is added to the query results. Even small noises can cause major distortion of the test results. We must apply differentially private query results with caution in hypothesis tests. Users often have only a handful or a few dozen samples in a test, and direct noise addition then makes the test result unreliable. With very small data sets, users need the clean query results or direct access to the raw data for a reliable outcome.

Table 1. (a) Z test type I error with added noises. \(\sigma =0.5\). (b) One sample t test type I error with added noises. \(\sigma =0.4\). (c) Two sample t test with equal sample size type I error with added noises. \(\sigma _1=\sigma _2=0.35\). \(n_1=n_2=n\). (d) Two sample t test with unequal sample size type I error with added noises. \(\sigma _1=\sigma _2=0.2\). \(n_1=n\) and \(n_2=1.1n\).

4 Differentially Private Bayesian Classifier for Gaussian Mixture Models

Let database \(D = \{X_1, \dots , X_d, W\}\), where W is a binary class label, \(Dom(W) = \{w_1, w_2\}\), and each \(X_i\), \(1 \le i \le d\) is a continuous attribute. A Bayesian classifier has the following decision rule:

$$\begin{aligned} \text {Assign a record } \mathbf {x} \text { to } w_1 \text { if } P(w_1| \mathbf {x}) > P(w_2 | \mathbf {x}); \text { otherwise assign it to } w_2. \end{aligned}$$

The probabilities \(P(w_i | \mathbf {x})\) can be calculated as \(P(w_i | \mathbf {x}) = p(\mathbf {x} | w_i)P(w_i)/p(\mathbf {x})\). If each \(p(\mathbf {x} | w_i)\) follows a multivariate Gaussian distribution, the model is known as a Gaussian mixture model [4]. For each class \(w_i\), the mean \(\mu _i\) and the variance-covariance matrix \(\varSigma _i\) of \(p(\mathbf {x} | w_i) \sim N(\mu _i, \varSigma _i)\) are estimated from the data set D. For the binary case, the Bayes error (i.e., the classification error) is calculated as [4]:

$$\begin{aligned} \mathrm {Bayes~Error} ~=~ \int _{\mathscr {R}_1} p(\mathbf {x} | w_2) P(w_2)d\mathbf {x} + \int _{\mathscr {R}_2} p(\mathbf {x} | w_1) P(w_1)d\mathbf {x}. \end{aligned}$$

\(\mathscr {R}_1\) is the region where records are labeled as class 1, and \(\mathscr {R}_2\) is the region where records are labeled as class 2.
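
For reference, a minimal sketch of this decision rule and a Monte Carlo estimate of the classification error is given below (helper names are ours). In the experiments of Sect. 4.2 the classifier is instead built from the noise added and repaired parameters, while the test data are generated from the true parameters.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_classify(X, mu1, cov1, p1, mu2, cov2, p2):
    """Assign each row of X to class 1 iff P(w1|x) > P(w2|x),
    i.e. iff p(x|w1) P(w1) > p(x|w2) P(w2); otherwise assign it to class 2."""
    g1 = multivariate_normal.logpdf(X, mean=mu1, cov=cov1) + np.log(p1)
    g2 = multivariate_normal.logpdf(X, mean=mu2, cov=cov2) + np.log(p2)
    return np.where(g1 > g2, 1, 2)

def bayes_error_mc(mu1, cov1, p1, mu2, cov2, p2, n=50_000, rng=None):
    """Monte Carlo estimate of the classification error: draw labeled test data
    from the true mixture and report the misclassification rate of the rule above."""
    rng = np.random.default_rng() if rng is None else rng
    labels = rng.choice([1, 2], size=n, p=[p1, p2])
    X = np.where((labels == 1)[:, None],
                 rng.multivariate_normal(mu1, cov1, size=n),
                 rng.multivariate_normal(mu2, cov2, size=n))
    return float(np.mean(bayes_classify(X, mu1, cov1, p1, mu2, cov2, p2) != labels))
```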

In this article we examine the Bayes error for Gaussian mixture models under differential privacy protection. The database D only needs to return the following for users to build a Bayesian classifier:

  • The sample size in D, which has sensitivity 0,

  • The proportions of the two classes, i.e., \(P(w_1)\) and \(P(w_2)\),

  • For each class, the mean \(\mu _i\) and variance-covariance matrix \(\varSigma _i\) of the multivariate Gaussian distribution for \(p(\mathbf {x} | w_i)\).

Bounded variables fit well into the differential privacy mechanism. With unbounded variables, a single very large or very small record can cause a significant increase in the sensitivity. Notice that the Gaussian distribution is unbounded. Hence we work with the truncated Gaussian distribution over the interval \([\mu -6 \sigma ,\mu +6 \sigma ]\), which covers a probability of 0.999999998. The truncated Gaussian has density \(I_{\{\mu - 6\sigma \le x \le \mu +6\sigma \}}(x)\frac{f(x)}{\varPhi (6)-\varPhi (-6)}\).
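
A small sketch of drawing from this truncated Gaussian by rejection (our own helper); since the acceptance probability is about 0.999999998, the rejection loop is essentially free:

```python
import numpy as np

def truncated_gaussian(mu, sigma, size, rng=None):
    """Sample from N(mu, sigma^2) truncated to [mu - 6*sigma, mu + 6*sigma]."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(mu, sigma, size)
    bad = np.abs(x - mu) > 6 * sigma           # rejected draws (extremely rare)
    while bad.any():
        x[bad] = rng.normal(mu, sigma, bad.sum())
        bad = np.abs(x - mu) > 6 * sigma
    return x
```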

4.1 Repair Noise Added Variance-Covariance Matrix

Let \(\hat{\varSigma } = (\hat{\sigma }_{ij})_{d\times d}\) be the sample variance-covariance matrix. When users query variances and covariances separately, independent Laplace noises are added to every element of \(\hat{\varSigma }\). Let \(A=(r_{ij})_{d\times d}\) be the matrix of independent Laplace noises, where \(r_{ij}=r_{ji}\). The returned query result is \(\varSigma _Q = \hat{\varSigma } + A\).

\(\varSigma _Q\) is the noise added variance-covariance matrix, which is the result that users can easily obtain to test their model. \(\varSigma _Q\) is still symmetric but generally ceases to be positive definite. In order to have a valid variance-covariance matrix, we repair the noise added matrix and obtain a positive definite matrix \(\varSigma _+\) close to \(\varSigma _Q\), since \(\hat{\varSigma }\) itself is not disclosed to users under differential privacy.

Let \((l_j,e_j)\), \(j=1,...,d\), be the eigenvalue and eigenvector pairs of \(\varSigma _Q\), where the eigenvalues are in decreasing order, \(l_1\ge l_2\ge ... \ge l_d\). The last several eigenvalues of \(\varSigma _Q\) are negative; let \(l_k, ..., l_d\) be the negative eigenvalues. The positive definite matrix \(\varSigma _+\) has the following eigenvalue and eigenvector pairs: \((l_1,e_1)\), ..., \((l_{k-1},e_{k-1})\), \((l_k^+,e_k)\), ..., \((l_d^+,e_d)\). We keep the eigenvectors and use an optimization algorithm to search over positive eigenvalues to find a \(\varSigma _+\) that minimizes \(|\varSigma _+ - \varSigma _Q|\).

$$ (l_k^+,...,l_d^+) = \text {argmin}~ |\varSigma _+ - \varSigma _Q|. $$

Let \(E_j = e_je'_j\), \(j=1,...,d\). We have

$$ \varSigma _+ - \varSigma _Q = \sum _{j=k}^d (l_j^+ -l_j)E_j. $$

Therefore we perform a fine grid search over wide intervals to obtain positive eigenvalues such that

$$ (l_k^+,...,l_d^+) = \text {argmin}_{\{w_k>0, ...,w_d>0\}} |\sum _{j=k}^d (w_j -l_j)E_j|. $$
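
A rough implementation of this repair step follows (our own code; for concreteness we read \(|\cdot |\) as the Frobenius norm, an assumption on our part, under which the grid search separates across the replaced eigenvalues):

```python
import numpy as np

def repair_covariance(sigma_q, grid=None):
    """Repair a noise added covariance matrix Sigma_Q (Sect. 4.1): keep its
    eigenvectors and replace the non-positive eigenvalues by positive values
    found by grid search, keeping Sigma_+ close to Sigma_Q."""
    if grid is None:
        grid = np.logspace(-6, 1, 200)         # wide interval of positive candidates
    vals, vecs = np.linalg.eigh(sigma_q)       # real symmetric eigendecomposition
    new_vals = vals.copy()
    for j in np.where(vals <= 0)[0]:           # only the non-positive l_j are replaced
        # Under the Frobenius objective each l_j^+ can be chosen independently
        # as the positive grid point closest to l_j.
        new_vals[j] = grid[np.argmin((grid - vals[j]) ** 2)]
    return (vecs * new_vals) @ vecs.T          # Sigma_+ = V diag(l^+) V'

# Usage: sigma_plus = repair_covariance(sigma_q) is symmetric and positive definite.
```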

Fig. 6. Small training sample LDA Bayes error. Left: 2 dimension; Middle: 5 dimension; Right: 10 dimension.

Fig. 7. Large training sample LDA Bayes error. Left: 2 dimension; Middle: 5 dimension; Right: 10 dimension.

Fig. 8. Small training sample QDA Bayes error. Left: 2 dimension; Middle: 5 dimension; Right: 10 dimension.

Fig. 9. Large training sample QDA Bayes error. Left: 2 dimension; Middle: 5 dimension; Right: 10 dimension.

4.2 Experimental Evaluation

We have conducted extensive experiments for this section. We consider the binary classification scenario. To understand how differential privacy affects the Bayes error, we do not want to introduce any other errors. Note that a Gaussian distribution may not represent the underlying data accurately. To avoid additional errors due to modeling the real data distribution inaccurately, we generate data sets from known Gaussian mixture parameters. The parameters are estimated from real-life data in two experiments and are synthetic in the rest.

Under the Bayes decision rule above, if the two Gaussian distributions have the same variance-covariance matrix, we perform a linear discriminant analysis (LDA). If the two Gaussian distributions have different variance-covariance matrices, we perform a quadratic discriminant analysis (QDA). Every experimental run has the following steps.

  1.

    Given the parameters of the Gaussian mixture models, we generate a training set of n samples. We truncate the training samples to the \(\mu \pm 6 \sigma \) interval, discarding samples that fall outside the interval.

  2.

    Using the truncated training set, which has fewer than n samples, and a pre-specified \(\varepsilon \), we compute the sensitivity values according to [22], the sample means, and the variance-covariance matrices. Then we add independent Laplace noises to the statistics of each Gaussian component.

  3.

    We repair the noise added variance-covariance matrices, and obtain positive definite matrices.

  4.

    We generate a separate test data set of size 50,000 using the original parameters without the noises, and report the effectiveness of the Gaussian mixture models built from the noise added sample means and the positive definite matrices from the previous step. A test data set of size 50,000 is chosen to make sure that the estimated Bayes errors are accurate.

Experiment 1. We set \(\mu _1 = 0.75\times 1_d\) and \(\mu _2 = 0.25 \times 1_d\), where \(1_d\) is a d-dimensional vector with all elements equal to 1. The two d-dimensional Gaussian distributions have the same variance-covariance matrix \(\varSigma \), where \(\sigma _{ii}=0.8^2\) and \(\sigma _{ij}=0.5\times 0.8^2\). The prior is \(p_1=p_2=0.5\). We pool the two classes to estimate the sample variance-covariance matrix. We compute the sensitivity for the variances and covariances adjusted to the range of the pooled data. The sample means and the sensitivity values for the sample means are also computed. We run the experiments in 2, 5, and 10 dimensions, \(d=2,5,10\), and consider four \(\varepsilon \) values, \(\varepsilon =0.05, 0.3, 0.6, 1\). Meanwhile we gradually increase the training set size.

Table 2. True LDA and QDA Bayes errors

Using the prespecified parameter values, we have the true LDA classification rule, following the Bayes decision rule above. We generate 5 million samples from the prespecified parameters without truncation and apply the true LDA classification rule to estimate the Bayes error. We take the average Bayes error of four such runs as the actual LDA Bayes error, shown in Table 2.

Figures 6 and 7 show the Bayes error under differential privacy for the LDA experiment in increasing dimensions. For each combination \((\varepsilon ,n,d)\), we perform five runs. The average Bayes error of the five runs is shown in the figures.

When two classes have the same variance-covariance matrix, the LDA Bayes error in general is not significantly affected by the noise added query results used in the classifier. For \(\varepsilon \) from 0.3 to 1, several thousand training samples are sufficient to return a preliminary Bayes error estimate which is very close to the actual LDA Bayes error. For this special case, we can obtain a fairly accurate idea about how well the LDA classifier performs using the noise added query results.

Experiment 2. We set \(\mu _1 = 0.75\times 1_d\) and \(\mu _2 = 0.25 \times 1_d\). We set \(\varSigma _1=I_d\), where \(I_d\) is the d-dimensional identity matrix, and set \(\varSigma _2\) as the matrix used in Experiment 1. We set the prior as \(p_1=p_2=0.5\). The sample means, variances, covariances, and the sensitivity values are computed. Again, we run the experiments in 2, 5, and 10 dimensions, \(d=2,5,10\), with four \(\varepsilon \) values, \(\varepsilon =0.05, 0.3, 0.6, 1\). Meanwhile we gradually increase the training set size.

Using the prespecified parameter values, we have the true QDA classification rule, following the Bayes decision rule above. We generate 5 million samples from the prespecified parameters without truncation and apply the true QDA classification rule to estimate the Bayes error. We take the average Bayes error of four such runs as the actual QDA Bayes error, shown in Table 2.

Figures 8 and 9 show the Bayes error for the QDA experiment in increasing dimensions. For each combination \((\varepsilon ,n,d)\), we perform five runs. The average Bayes error of the five runs is shown in the figures.

When two classes have different variance-covariance matrices, dimensionality has a large impact on the Bayes error estimates obtained under differential privacy. For \(\varepsilon \) from 0.3 to 1, the 2-dimensional experiment shows that three thousand training samples are sufficient to return a reasonable estimate of the actual Bayes error. The 5-dimensional experiment needs 40,000 training samples to eliminate the impact of the added noises. The 10-dimensional experiment needs even more training samples to return a reasonable estimate of the Bayes error under differential privacy.

Experiment 3. We used the Parkinson data set from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Parkinsons). We computed the mean and variance-covariance matrix of each class in the Parkinson data and used these parameters in our Gaussian mixture models. In all of these experiments, we set \(\varepsilon =0.6\). For the Parkinson data, the majority class accounts for 75.38% of the total. There are 197 observations and 21 numerical variables besides the class label. Without the differential privacy mechanism, directly using the sample estimates, the Bayes error is less than 0.01. On the other hand, the Gaussian mixture models with increasing sample sizes under differential privacy have Bayes errors decreasing from 0.246 to 0.198, where the Bayes error of 0.198 is obtained with 50,000 training samples. The above results confirm that direct noise addition to Gaussian mixture parameters can cause significant distortion in higher dimensional spaces when two classes have different variance-covariance matrices. As dimensionality increases, we need a very large number of training samples to reduce the impact of the added noises.

Experiment 4. We also used the Adult data set from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Adult). The Adult data set is much larger than the Parkinson data set, with 32,561 observations. We used all of the numerical variables in this experiment, i.e., 6 variables. We computed the mean and variance-covariance matrix of each class in the Adult data and used these parameters in our Gaussian mixture models. Again we set \(\varepsilon =0.6\). For the Adult data, the majority class accounts for 75.92% of the total, similar to the Parkinson data. Without the differential privacy mechanism, directly using the sample estimates, the Bayes error is 0.0309. With 50,000 training samples, the Gaussian mixture model under differential privacy has a Bayes error of 0.0747. The impact of the added noises is less severe for this lower dimensional data. A training sample size of around 50,000 provides a reasonable result.

5 Summary

In this article we examine the performance of Bayesian classifiers built from noise added means and variance-covariance matrices. We also study the exact type I and type II errors under differential privacy for several hypothesis tests. In the process we identify an interesting issue associated with random noise addition: the variance-covariance matrix without the added noise is positive definite, but simply adding noise returns a matrix that is still symmetric yet generally no longer positive definite. Consequently the query result cannot be used directly to construct a classifier. We implement a heuristic algorithm to repair the noise added matrix.

This is a general issue for random noise addition. Users may assemble basic query results instead of directly querying a complex statistic. Adding noise can then cause the assembled result to no longer satisfy certain constraints, and the query results need to be further modified before they can be used in subsequent studies.