
1 Introduction

Protection of private information is a key democratic value, and so-called ‘privacy by design’ is core to the new European General Data Protection Regulation (GDPR) [6].

Privacy by design has existed as a concept for years, but it only now becomes a legal requirement with the GDPR. At its core, privacy by design calls for data protection to be included from the outset of system design, rather than added afterwards. More specifically: ‘The controller shall..implement appropriate technical and organisational measures..in an effective way.. in order to meet the requirements of this Regulation and protect the rights of data subjects’.

Differential privacy is one such technical measure: it allows a ‘controller’ to train a machine learning model on inherently private data with mathematical bounds on the actual loss of privacy when the results are released [4, 8]. Differential privacy is based on randomized algorithms that use noise to reduce the probability of a breach of privacy. The key idea is to ensure that the randomized output does not depend in a significant way on any individual data subject’s data.

Educational technology is important to serve the increasing need for life-long learning [10]. Learning processes and tests are typically highly personal, yet significant gains are conceivable from integrating and sharing such information. Sharing could, e.g., be used to provide more detailed feedback on tests and hence enhance the learning process. The basic question addressed in the present work is whether differentially private machine learning methods can be used to provide more detailed feedback on students’ tests while still respecting the privacy of the individual students.

Fig. 1. Concept of the differentially private Rasch model and its use in enhanced feedback in teaching.

The concept is illustrated in Fig. 1. The use case concerns a class of students, each answering a set of tasks. The teacher (‘the controller’) can by conventional means estimate each student’s performance and release this information privately to the given student. Our aim here is, in addition, to share a difficulty score for each task and to investigate whether it is feasible to compute this score in a differentially private manner, hence with mathematical bounds on the amount of individual information leaked by releasing the difficulty scores. Given the privatized difficulty scores, every student can then use their sensitive data to estimate their own ability scores and probabilities of passing a subject.

The paper is organized as follows. We first present the differential privacy model in the educational technology context. Student performance and test scores are inferred using item response theory (the ‘Rasch model’). Next, we investigate the loss of accuracy when privacy is enforced at various privacy budgets. Finally, we demonstrate viability on real-world data. The proof of the differential privacy mechanism (so-called objective perturbation) is provided in an appendix. The original contributions can be summarized as follows: (1) We define a workflow and model for privacy-preserving machine learning of student performance and task difficulty. (2) We show by simulation that the student performance is well estimated for each student separately. (3) We give the first proof of differential privacy for the Rasch model based on so-called objective perturbation. (4) We derive the minimum noise level that allows us to release the task difficulty at a given privacy budget.

All code can be found in the accompanying GitHub repository.

2 Preliminaries

The concept of differential privacy is based on a privacy parameter or ‘budget’ \(\epsilon \). An algorithm \(\mathcal {A}\) is \(\epsilon \)-differentially private if, for all data sets \(D_1\) and \(D_2\) that differ in a single entry (data subject),

$$\begin{aligned} P[\mathcal {A}(D_1)=w] \le P[\mathcal {A}(D_2)=w]\exp (\epsilon ), \end{aligned}$$
(1)

where P is the probability taken over the randomness used by the algorithm \(\mathcal {A}\), and w is the output of the algorithm. The privacy budget quantifies how likely it is that a well-informed adversary can determine whether a specific data subject participated or not. Randomness is added to the algorithm to hinder this identification, e.g., through the addition of noise. This noise is scaled as \(\Delta f/ \epsilon \), where \(\Delta f\) is the sensitivity of a function f, defined as

$$\begin{aligned} \Delta f = \max \Vert f(D_1) - f(D_2) \Vert _1, \end{aligned}$$
(2)

where again \(D_1\) and \(D_2\) differ in a single entry [8].
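To make the scaling concrete, the following minimal Python sketch (ours, not from the paper, whose experiments use MATLAB) shows the generic Laplace mechanism, which releases a function value with noise of scale \(\Delta f/\epsilon \):

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=np.random.default_rng()):
    # Generic Laplace mechanism: add noise with scale Delta f / epsilon.
    # Illustrative sketch only; the paper's mechanism perturbs the objective instead.
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=np.shape(value))

# Example: releasing a count with sensitivity 1 at privacy budget epsilon = 1
noisy_count = laplace_mechanism(42, sensitivity=1.0, epsilon=1.0)
```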

The data sets we work with are arranged as a (number of students N) \(\times \) (number of test items I) matrix X, where every entry records a correct or incorrect answer of a student to an item. In this work, we consider differential privacy in the sense that the output of our model should not depend much on whether a particular student is in the data set or not. That is, for X and \(\tilde{X}\) with \(X_{n,i}=\tilde{X}_{n,i}\) for all \(n=1,\dots , N-1\) and \(i=1,\dots , I\), i.e., two data sets that differ in at most one row (corresponding to one student), we want to achieve

$$\begin{aligned} \frac{P(w|X)}{P(w|\tilde{X})}\le e^{\epsilon }, \end{aligned}$$
(3)

where w is the output of the algorithm.

The Rasch model [3] is a simple example of item response theory (IRT). IRT concerns performance testing and quantifies the probability that a student answers a specific test task correctly in terms of the difficulty of the task and the student’s general ability. The model is similar to logistic regression and is used to estimate the probability of passing a task

$$\begin{aligned} P(X_{n,i}=1|\beta _n,\delta _i) = \dfrac{\exp (\beta _n-\delta _i)}{1+\exp (\beta _n-\delta _i)}, \end{aligned}$$
(4)

where \(\beta _n\) models the ability of student n and \(\delta _i\) is the difficulty of task i. \(X_{n,i}\) is a dichotomous observation of student n’s correct or incorrect answer to task i, where 1 is a correct answer and 0 is an incorrect answer. The model is fitted by estimating \(\delta _i\) and \(\beta _n\) from the results of a particular test. The parameters are estimated by maximizing the likelihood

$$\begin{aligned} \Lambda = \prod _{n}\prod _{i}\frac{ e^{x_{ni}(\beta _n-\delta _i)}}{ 1+e^{(\beta _n-\delta _i)}}. \end{aligned}$$
(5)
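As a concrete reference for Eqs. (4) and (5), here is a minimal NumPy sketch (ours; the paper’s experiments use MATLAB) of the Rasch probabilities and the corresponding negative log-likelihood:

```python
import numpy as np

def rasch_probability(beta, delta):
    # P(X_{n,i} = 1 | beta_n, delta_i) for all student/item pairs, Eq. (4).
    logits = beta[:, None] - delta[None, :]       # N x I matrix of beta_n - delta_i
    return 1.0 / (1.0 + np.exp(-logits))

def negative_log_likelihood(beta, delta, X):
    # Negative logarithm of the likelihood in Eq. (5) for a 0/1 response matrix X.
    logits = beta[:, None] - delta[None, :]
    return np.sum(np.log1p(np.exp(logits)) - X * logits)
```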

In our work, we will introduce differentially private methods of estimating \(\beta \) and \(\delta \), but only the \(\delta \)-values will be released to the public. Every student can then, based on the public \(\delta \) and their personal results, estimate their own parameter \(\beta _n\).

3 Methods

We implement this workflow, i.e., release differentially private \(\delta \) parameters and then re-estimate the parameters \(\beta _n\) on \(X_{n,:}\), by first calculating both parameter sets with a differentially private algorithm, which assures a private \(\delta \), and then re-estimating each \(\beta _n\) given that \(\delta \). The re-estimation was also proposed by Choppin [3]. We investigate the impact of the re-estimation compared to the global parameter estimation in Sect. 4.
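The re-estimation step a student could run locally might look as follows; this is a sketch under the assumptions above (released \(\delta \), \(\ell _2\) regularization with parameter \(\lambda \)), with SciPy in place of the paper’s MATLAB code:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def reestimate_beta(x_n, delta, lam=0.01):
    # Re-estimate a single student's ability beta_n from their own 0/1 answer
    # vector x_n and the released difficulty vector delta, by minimizing the
    # regularized negative log-likelihood with delta held fixed.
    def objective(beta_n):
        logits = beta_n - delta
        return np.sum(np.log1p(np.exp(logits)) - x_n * logits) + 0.5 * lam * beta_n ** 2
    return minimize_scalar(objective).x
```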

We consider two different methods for constructing a differentially private Rasch Model. The first one is the objective perturbation, first introduced by Chaudhuri and Monteleoni for logistic regression [1], and analyzed in more detail by Chaudhuri et al. [2]. For the differentially private Rasch model, we use a slightly modified version of the objective perturbation, and prove that it is \(\epsilon \)-differentially private as defined in Eq. (3).

We also consider a simple reference method based on perturbing sufficient statistics, as discussed in [7]. By adding enough noise to the sufficient statistics, we may release them in a differentially private manner. Any algorithm based only on these statistics is then differentially private; this follows from the post-processing theorem [5]. For the Rasch model, the sufficient statistics are \(r_n=\sum _{i}^I X_{n,i}\) and \(s_i=\sum _{n}^N X_{n,i}\), since they are all that is needed for minimizing the regularized objective function. We add noise to the vectors r and s, scaled with their sensitivities: if student n in the data set is changed, \(r_n\) changes by at most I, so the \(L_1\) norm of r changes by at most I. Similarly, \(s_i\) can change by at most 1 for every i, so again the \(L_1\) norm of s changes by at most I. The noise we add to both vectors is therefore scaled with \(I/\epsilon \), as in [7]. Since perturbing the sufficient statistics is more general than objective perturbation and does not use the specific structure of the learning algorithm, we expect it to be weaker (adding more noise).
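A minimal sketch of this reference method (ours; how the privacy budget is accounted across the two released vectors is not detailed here and simply follows the scaling stated above):

```python
import numpy as np

def private_sufficient_statistics(X, epsilon, rng=np.random.default_rng()):
    # Perturb the Rasch sufficient statistics r (row sums) and s (column sums)
    # with Laplace noise of scale I/epsilon, the scaling stated in the text.
    N, I = X.shape
    r = X.sum(axis=1).astype(float)   # r_n: number of correct answers per student
    s = X.sum(axis=0).astype(float)   # s_i: number of correct answers per item
    scale = I / epsilon
    return r + rng.laplace(scale=scale, size=N), s + rng.laplace(scale=scale, size=I)
```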

As we note below, the objective perturbation approach effectively perturbs the sufficient statistics with noise scaling as \(\sqrt{I}/\epsilon \), while direct perturbation of the sufficient statistics requires noise scaling as \(I/\epsilon \) [7]. In the following we show the cost of this less favorable scaling.

An important aspect of learning the Rasch model parameters is to quantify the available prior information. In the educational technology context, substantial prior information could be available from earlier exams etc. Here, for simplicity, we assume that the test difficulties and student abilities both follow normal distributions; hence, we add a regularization term \(\frac{\lambda }{2}w^Tw\), where w is the entire parameter vector \(w=[\beta ~\delta ]\), to the negative log-likelihood obtained from Eq. (5). A discussion of how to estimate the parameter \(\lambda \) while preserving differential privacy is given below.

Algorithm 1 describes the details of the modified objective perturbation algorithm for the differentially private Rasch model.

Algorithm 1:

  • Draw a vector b of dimension I from the density function \(h(b)\propto \exp {(-\frac{\epsilon }{\sqrt{I}}\Vert b\Vert )}\). To do this, draw the direction uniformly at random from the I-dimensional unit sphere, and draw the norm from \(\Gamma (I, \frac{\sqrt{I}}{\epsilon })\).

  • Minimize

    $$\begin{aligned} F(\beta ,\delta )=&\sum _n^{N}\sum _i^{I}\log (1+\exp (\beta _n-\delta _i))+\sum _i^I\delta _i\sum _n^N X_{n,i} \nonumber \\&-\sum _n^N\beta _n\sum _i^I X_{n,i} + \frac{\lambda }{2}(\beta ^T\beta + \delta ^T\delta ) + \sum _i^I b_i\delta _i \end{aligned}$$
    (6)

    with respect to \(\beta \), \(\delta \) for \(\lambda >0\).
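A minimal Python sketch of Algorithm 1 (ours; the paper uses MATLAB’s fminunc, and the flattening of \([\beta ~\delta ]\) into one parameter vector is an implementation choice):

```python
import numpy as np
from scipy.optimize import minimize

def objective_perturbation_rasch(X, epsilon, lam=0.01, rng=np.random.default_rng()):
    # Sketch of Algorithm 1: draw the perturbation vector b and minimize the
    # perturbed objective F(beta, delta) of Eq. (6).
    N, I = X.shape

    # Draw b: uniform direction on the I-dimensional unit sphere,
    # norm from Gamma(shape=I, scale=sqrt(I)/epsilon).
    direction = rng.normal(size=I)
    direction /= np.linalg.norm(direction)
    b = direction * rng.gamma(shape=I, scale=np.sqrt(I) / epsilon)

    r = X.sum(axis=1)  # per-student scores
    s = X.sum(axis=0)  # per-item scores

    def F(w):
        beta, delta = w[:N], w[N:]
        logits = beta[:, None] - delta[None, :]
        return (np.sum(np.log1p(np.exp(logits)))
                + delta @ s - beta @ r
                + 0.5 * lam * (beta @ beta + delta @ delta)
                + b @ delta)

    res = minimize(F, np.zeros(N + I), method="BFGS")
    return res.x[:N], res.x[N:]   # ability estimates and the privatized difficulties
```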

Theorem 1

Algorithm 1 is \(\epsilon \)-differentially private.

The proof follows the strategy developed by Chaudhuri et al. [1, 2]; details are given in the appendix.

Naive approaches to estimating the regularization parameter, e.g. by cross-validation, can lead to a loss of privacy, since information is leaked by every evaluation of the model and this information accumulates. Chaudhuri et al. [2] propose two methods for handling this, which we can also apply. The first is to use a small, publicly available data set that follows roughly the same distribution for the estimation of \(\lambda \). The second is an algorithm that splits the data set into \((m+1)\) subsets, calculates the model for m different guesses of \(\lambda \) on the respective subsets, and evaluates the models on the last one. Then, based on the number \(z_i\) of errors made by the \(i^{ th }\) estimate, \(\lambda _i\) is chosen with probability

$$\begin{aligned} \frac{e^{-\epsilon z_i/2}}{\sum _{j=1}^m e^{-\epsilon z_j/2}}. \end{aligned}$$
(7)

For our model, we split the data into subsets of students. For the m different estimates of \(\delta \), we first compute the \(\beta \) values for the last subset and the corresponding probabilities. We then compare the rounded probabilities to the actual 0 or 1 entries in the \((m+1)^{ st }\) subset in the data.
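A sketch of the selection step in Eq. (7); the shift by \(\min _i z_i\) is our numerical choice and leaves the probabilities unchanged, and the candidate values below are purely hypothetical:

```python
import numpy as np

def choose_lambda_index(errors, epsilon, rng=np.random.default_rng()):
    # Pick index i with probability proportional to exp(-epsilon * z_i / 2), Eq. (7).
    z = np.asarray(errors, dtype=float)
    weights = np.exp(-epsilon * (z - z.min()) / 2.0)   # shift for numerical stability
    return rng.choice(len(z), p=weights / weights.sum())

# Example with hypothetical candidate lambdas and validation error counts z_i
lambdas = [0.001, 0.01, 0.1, 1.0]
errors = [12, 9, 10, 20]
chosen_lambda = lambdas[choose_lambda_index(errors, epsilon=5.0)]
```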

4 Experiments

Experiments were run on simulated data, on real data from Vahdat et al. [9], and on a data set from a course at the Technical University of Denmark (DTU).

The first experiment is run on simulated data and compares the results of calculating both \(\beta \) and \(\delta \) globally to re-estimating \(\beta \).

The next experiments compare the performance of the two methods for introducing privacy, objective perturbation and sufficient statistics. We report correlation coefficients, with \(95\%\) confidence intervals, between the estimated probabilities and either the true probabilities (i.e., the ones used to simulate the data) or the non-private estimates. Further, we show test misclassification rates (how well we predict whether a student passes a test) on a new data set drawn from the same distribution as the training data. Experiments were also run for the two real data sets. The first one had to be modified to fit the Rasch model, so the answers were rounded to 1 or 0 depending on whether half of the points for a question were scored. The DTU data set consists of the results of a multiple-choice test, so the original data could be used directly. For the real data sets, we use bootstrapping with 1000 samples to calculate confidence intervals.
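For the real data sets, the bootstrap confidence intervals can be computed along these lines; a percentile bootstrap over entries is our assumption, as the paper only states that 1000 bootstrap samples were used:

```python
import numpy as np

def bootstrap_correlation_ci(a, b, n_boot=1000, alpha=0.05, rng=np.random.default_rng()):
    # Percentile bootstrap CI for the correlation between two estimate vectors a and b.
    a, b = np.asarray(a), np.asarray(b)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(a), size=len(a))   # resample with replacement
        stats.append(np.corrcoef(a[idx], b[idx])[0, 1])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```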

We used MATLAB’s fminunc function for minimizing the objective function in all experiments - with the following settings:


The experiments on simulated data were run with 50 repetitions, a regularization parameter of 0.01, and privacy budgets of 1, 5 and 10. The number of students (N) varies from 40 to 200 in steps of 40, and the number of questions (I) is fixed at 20. The parameters \(\beta _n\) and \(\delta _i\) were drawn from normal distributions with mean 0 and standard deviations 1 and 2, respectively. The Rasch probabilities were then calculated from the given \(\beta _n\) and \(\delta _i\) and used to simulate a data set by drawing from a Bernoulli distribution with these probabilities.
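The simulation setup can be summarized by the following sketch (ours; the experiments themselves were run in MATLAB):

```python
import numpy as np

def simulate_rasch_data(N=200, I=20, rng=np.random.default_rng()):
    # Abilities beta_n ~ N(0, 1), difficulties delta_i ~ N(0, 2^2), and 0/1
    # responses drawn from a Bernoulli distribution with the Rasch probabilities.
    beta = rng.normal(loc=0.0, scale=1.0, size=N)
    delta = rng.normal(loc=0.0, scale=2.0, size=I)
    probs = 1.0 / (1.0 + np.exp(-(beta[:, None] - delta[None, :])))
    X = rng.binomial(1, probs)   # N x I matrix of simulated answers
    return X, beta, delta, probs
```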

The Vahdat et al. data set [9] has 62 students and 16 questions. The DTU data set has 212 students and 27 questions.

In pilot experiments we found that fine-tuning the regularization parameter \(\lambda \) was not essential. For simplicity, we therefore use a common regularization parameter in all experiments.

4.1 Results

To test the impact of introducing differential privacy in the Rasch model, we ran several experiments. In the first one we test the re-estimation of \(\beta \) in the non-private setting, to show that students can calculate their own abilities from a given \(\delta \). The second experiment shows the performance of the two methods on simulated data, while the third and fourth show the performance on the real data from Vahdat et al. and on the DTU data, respectively.

Experiment 1: Global vs. Re-estimated Rasch Parameters. In Fig. 2 we compare the Rasch model with global parameter estimation to the results obtained by re-estimating the student abilities. We plot correlation coefficients between the probabilities, with \(95\%\) confidence intervals, also with respect to the true model parameters, as well as misclassification rates on a new data set drawn from the same distribution as the training data.

Fig. 2. The plots show: (a) Correlation coefficient for the non-private global and re-estimation methods compared to the true Rasch model. (b) Misclassification for the non-private global and re-estimation methods compared to the true Rasch model.

From the correlation coefficients between the estimated probabilities and the ground truth probabilities used to simulate the data, we can see that there is practically no difference in accuracy between re-estimation and global estimation. From now on, we will only consider the re-estimation method, since this is the one defined by our workflow.

Experiment 2: Differential Privacy on Simulated Data. In Fig. 3 we compare the objective perturbation and sufficient statistics methods for three different values of epsilon: 1, 5 and 10. We compare their performance by calculating the correlation coefficients between the private estimates and the non-private estimates and the true probabilities, respectively. Again, we show \(95\%\) confidence intervals.

Fig. 3. Correlation coefficient of the objective perturbation and sufficient statistics methods with \(\epsilon \) = 1, 5 and 10 to: (a) the non-private estimates. (b) the true model.

In Fig. 4, we show the misclassification rates on a simulated data set of the same distribution as the training set.

Fig. 4. Misclassification rates of the objective perturbation and sufficient statistics methods with \(\epsilon \) = 1, 5 and 10: (a) objective perturbation estimates. (b) sufficient statistics estimates.

We make several observations. First, the objective perturbation performs better in general. This is due to the smaller amount of noise added in the objective perturbation, as discussed in Sect. 3. Next, we see that for lower epsilon values, i.e. higher privacy, the model generally performs worse, but converges with larger class sizes. This is what we would expect, and it is consistent with what is broadly observed in applications of differential privacy.

Experiments 3 and 4: Differential Privacy on Real Data. Experiment 2 illustrated the impact of the privacy budget, so for experiments 3 and 4 the budget was fixed at \(\epsilon =5\) while the data size varied. In experiment 3 we use the data set from Vahdat et al. [9]. Experiment 4 is run on the DTU data set.

Figure 5 shows the objective perturbation and sufficient statistics performance on real data with \(\epsilon = 5\). The misclassification rate here is calculated on the original data set (so it corresponds to a training error, not a test error).

Fig. 5. Objective perturbation and sufficient statistics methods on real data: (a) correlation coefficients on Vahdat et al.’s data [9]. (b) misclassification on Vahdat et al.’s data [9]. (c) correlation coefficients on DTU data. (d) misclassification on DTU data.

In experiment 3, Fig. 5 (a) and (b), we show that the impact of introducing privacy on real data sets is limited, even in relatively small data sets. For both the Vahdat et al. data and the DTU data we find useful correlation between the probabilities of passing the test as inferred in non-private and private models (\(\epsilon = 5\)).

In experiment 4, Fig. 5 (c) and (d), our results are comparable to those on the simulated data. In general, we see that the objective perturbation method performs better than the sufficient statistics method.

In Fig. 6, we illustrate the impact of data set size by comparing the probability estimates of the non-private model to those of the two private methods on the data set from Vahdat et al. [9], again with fixed \(\epsilon = 5\).

Fig. 6. Rasch estimates of the objective perturbation and sufficient statistics methods on Vahdat et al.’s data [9]: (a) \(\text {nStudent}=21\). (b) \(\text {nStudent}=42\). (c) \(\text {nStudent}=62\).

We see that the estimates are very noisy for small data set sizes, but correlate strongly with the non-private estimates for a data set size of 62. Again, the objective perturbation yields more accurate results.

5 Conclusion

We have demonstrated the viability of the proposed workflow for more detailed, yet differentially private, feedback for students. We proved analytically that objective perturbation for this model satisfies differential privacy and derived the minimum noise level necessary. Our experiments on simulated data suggest that the workflow provides estimates of similar quality to the non-private ones for medium-sized classes and industry-standard privacy budgets. These findings were confirmed on two real data sets. As expected, the objective perturbation mechanism performs better than the sufficient statistics method, as less noise is added.