Abstract
The Rasch model is used to estimate student performance and task difficulty in simple test scenarios. We design a workflow for enhancing student feedback by release of difficulty parameters in the Rasch model with privacy protection using differential privacy. We provide a first proof of differential privacy in Rasch models and derive the minimum noise level in objective perturbation to guarantee a given privacy budget. We test the workflow in simulations and in two real data sets.
This work was supported by the Danish Innovation Foundation through the Danish Center for Big Data Analytics and Innovation (DABAI).
1 Introduction
Protection of private information is a key democratic value and so-called ‘privacy by design’ is core to the new European General Data Protection Regulation (GDPR) [6].
Privacy by design as a concept has existed for years, but it is only now becoming a legal requirement with the GDPR. At its core, privacy by design calls for data protection to be included from the onset of system design, rather than added afterwards. More specifically: ‘The controller shall... implement appropriate technical and organisational measures... in an effective way... in order to meet the requirements of this Regulation and protect the rights of data subjects’.
Differential privacy is one such tool, allowing a ‘controller’ to train a machine learning model on inherently private data, but with mathematical bounds on the actual loss of privacy when results are released [4, 8]. Differential privacy is based on randomized algorithms that use noise to reduce the probability of a breach of privacy. The key idea is to ensure that the randomized output does not depend significantly on any individual data subject’s data.
Educational technology is important to serve the increasing need for life-long learning [10]. Learning processes and tests are typically highly personal; yet, significant gains are conceivable from integrating and sharing such information. Sharing could, e.g., be used to provide more detailed feedback on tests and hence enhance the learning process. The basic question addressed in the present work is whether differentially private machine learning methods can be used to provide more detailed feedback on students’ tests, while still respecting the privacy of the individual students.
The concept is illustrated in Fig. 1. The use case concerns a class of students, each answering a set of tasks. The teacher (‘the controller’) can by conventional means estimate each student’s performance and release this information in private to the given student. Our aim here is, in addition, to share a difficulty score for each task and investigate whether it is feasible to compute this score in a differentially private manner, hence with mathematical bounds on the amount of individual information leaked by releasing the difficulty scores. Given the privatized difficulty scores, every student can then use their sensitive data to estimate their own ability scores and probabilities of passing a subject. The paper is organized as follows. We first present the differential privacy model in the educational technology context. Student performance and test scores are inferred using item response theory (the ‘Rasch model’). Next, we investigate the loss of accuracy when privacy is enforced at various privacy budgets. Finally, we demonstrate viability on a real world data set. The proof of the differential privacy mechanism (so-called objective perturbation) is provided in an appendix. The original contributions can be summarized as follows: (1) We define a workflow and model for privacy preserving machine learning of student performance and task difficulty. (2) We show by simulation that the student performance is well estimated for each student separately. (3) We give the first proof of differential privacy for the Rasch model based on so-called objective perturbation. (4) We derive the minimum noise level that allows us to release the task difficulty at a given privacy budget.
All code can be found at the following github repository.
2 Preliminaries
The concept of differential privacy is based on a privacy parameter or ‘budget’ \(\epsilon \). The algorithm \(\mathcal {A}\) is \(\epsilon \)-differentially private if for all data sets \(D_1\) and \(D_2\) that differ by a single entry (data subject)

$$\begin{aligned} P(\mathcal {A}(D_1)=w)\le e^{\epsilon }\,P(\mathcal {A}(D_2)=w), \end{aligned}$$(1)
where P is the probability taken over the randomness used by the algorithm \(\mathcal {A}\), and w is the output of the algorithm. The privacy budget quantifies how likely it is that a well-informed adversary can determine whether a specific data subject participated or not. Randomness is added to the algorithm to hinder this identification, e.g., through the addition of noise. This noise is scaled as \(\Delta f/ \epsilon \), where \(\Delta f\) is the sensitivity of a function f, defined as

$$\begin{aligned} \Delta f = \max _{D_1,D_2}\Vert f(D_1)-f(D_2)\Vert _1, \end{aligned}$$(2)
where again \(D_1\) and \(D_2\) differ in a single entry [8].
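As a small illustration of how such noise is added in practice, the standard Laplace mechanism can be sketched as follows (our own illustrative code, not part of the paper):

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Release a scalar with Laplace noise of scale sensitivity/epsilon."""
    rng = np.random.default_rng() if rng is None else rng
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a counting query has sensitivity 1 (one person changes the count by 1).
true_count = 42
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=1.0)
```

A smaller \(\epsilon \) (stronger privacy) yields a larger noise scale, so the released value is less accurate.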
The data sets we work with are arranged as a (number of students N) \(\times \) (number of test items I) matrix X, and every entry stands for a right or wrong answer of a student to an item. In this work, we consider differential privacy in the sense that the output from our model should not depend much on whether a particular student is in the set or not. That is, for X and \(\tilde{X}\) with \(X_{n,i}=\tilde{X}_{n,i}\) for all \(n=1,\dots , N-1\) and \(i=1,\dots , I\), so two data sets that can differ in at most one row (corresponding to one student), we want to achieve

$$\begin{aligned} P(\mathcal {A}(X)=w)\le e^{\epsilon }\,P(\mathcal {A}(\tilde{X})=w), \end{aligned}$$(3)
where w is the output of the algorithm.
The Rasch model [3] is a simple example of item response theory (IRT). IRT concerns performance testing, quantifying the probability that students can answer a specific test task in terms of the difficulty of the task and their general ability. The model is similar to logistic regression and is used to estimate the probability of passing a task,

$$\begin{aligned} P(X_{ni}=1)=\frac{\exp (\beta _n-\delta _i)}{1+\exp (\beta _n-\delta _i)}, \end{aligned}$$(4)
where \(\beta _n\) models the ability of student n and \(\delta _i\) is the difficulty of task i. \(X_{ni}\) is a dichotomous observation of a student’s (n) correct or incorrect answer to a task (i), where 1 is a correct answer and 0 is an incorrect answer. The model is generated by estimating \(\delta _i\) and \(\beta _n\) from the results of a particular test. The parameters are estimated by maximizing the log likelihood

$$\begin{aligned} L(\beta ,\delta )=\sum _n^N\sum _i^I\left( X_{ni}(\beta _n-\delta _i)-\log (1+\exp (\beta _n-\delta _i))\right) . \end{aligned}$$(5)
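The Rasch probability and the (unregularized) log likelihood can be sketched as follows; this is an illustrative translation of the formulas above, not the authors' code:

```python
import numpy as np

def rasch_prob(beta, delta):
    """P(X_ni = 1): probability that a student of ability beta passes an item of difficulty delta."""
    return 1.0 / (1.0 + np.exp(-(beta - delta)))

def log_likelihood(X, beta, delta):
    """Log likelihood of a response matrix X (N students x I items)."""
    logits = beta[:, None] - delta[None, :]            # beta_n - delta_i
    return np.sum(X * logits - np.log1p(np.exp(logits)))

# A student whose ability equals the task difficulty passes with probability 1/2.
p = rasch_prob(0.5, 0.5)
```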
In our work, we will introduce differentially private methods of estimating \(\beta \) and \(\delta \), but only the \(\delta \)-values will be released to the public. Every student can then, based on the public \(\delta \) and their personal results, estimate their own parameter \(\beta _n\).
3 Methods
We implement this workflow, i.e., release differentially private \(\delta \) parameters and then re-estimate the parameters \(\beta _n\) on \(X_{n,:}\), by first calculating both parameters with differentially private algorithms, ensuring a private \(\delta \), and then re-estimating each \(\beta _n\) given that \(\delta \). Re-estimation was also proposed by Choppin [3]. We investigate the impact of re-estimation compared to global parameter estimation in Sect. 4.
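The per-student re-estimation step can be sketched as a one-dimensional optimization; the function name and regularization default are our own illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def reestimate_beta(x_n, delta, lam=0.01):
    """Re-estimate one student's ability beta_n from their own answers x_n
    (a 0/1 vector) and the released (private) difficulties delta."""
    def neg_log_post(beta):
        logits = beta - delta
        # negative log likelihood of this student's row, plus L2 regularization
        return np.sum(np.logaddexp(0.0, logits) - x_n * logits) + 0.5 * lam * beta**2
    return minimize_scalar(neg_log_post).x
```

Since only the student's own row \(X_{n,:}\) is used, this step consumes no additional privacy budget.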
We consider two different methods for constructing a differentially private Rasch Model. The first one is the objective perturbation, first introduced by Chaudhuri and Monteleoni for logistic regression [1], and analyzed in more detail by Chaudhuri et al. [2]. For the differentially private Rasch model, we use a slightly modified version of the objective perturbation, and prove that it is \(\epsilon \)-differentially private as defined in Eq. (3).
We also consider a simple reference method based on perturbing sufficient statistics, as discussed in [7]. By adding enough noise to the sufficient statistics, we may release them as differentially private; any algorithm based on these statistics will then be differentially private, which follows from the post-processing theorem [5]. For the Rasch model, the sufficient statistics are \(r_n=\sum _{i}^I X_{n,i}\) and \(s_i=\sum _{n}^N X_{n,i}\), since they are all that is needed for minimizing the regularized objective function. We add noise to the vectors r and s, scaled with their sensitivities: if student n in the data set is changed, \(r_n\) will change by at most I, and so the \(L_1\) norm of r will change by at most I. Similarly, \(s_i\) can change by at most 1 for every i, so again, the \(L_1\) norm of s changes by at most I. So the noise we add to both vectors is scaled with \(I/\epsilon \), as in [7]. Since making the sufficient statistics differentially private is more general than objective perturbation and does not use the specific structure of the learning algorithm, we expect it to be weaker (i.e., to require more noise).
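The reference method can be sketched as follows (an illustrative sketch using the \(I/\epsilon \) scale from the text; the function name is our own):

```python
import numpy as np

def private_sufficient_stats(X, epsilon, rng=None):
    """Release the Rasch sufficient statistics r and s with Laplace noise of
    scale I/epsilon, matching the L1 sensitivity of I for each vector."""
    rng = np.random.default_rng() if rng is None else rng
    N, I = X.shape
    r = X.sum(axis=1)            # r_n: number of correct answers of student n
    s = X.sum(axis=0)            # s_i: number of correct answers on item i
    scale = I / epsilon
    return (r + rng.laplace(0.0, scale, size=N),
            s + rng.laplace(0.0, scale, size=I))
```

Any downstream estimate computed from the noisy r and s inherits the privacy guarantee by post-processing.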
As we notice below, the objective perturbation approach effectively perturbs the sufficient statistics with noise scaling as \(\sqrt{I}/\epsilon \), while direct perturbation of the sufficient statistics requires noise scaling as \(I/\epsilon \) [7]. In the following we will show the costs of the less favorable scaling.
An important aspect of learning the Rasch model parameters is to quantify the available prior information. In the educational technology context, we could imagine substantial prior information being present from earlier exams etc. Here, for simplicity, we assume that the test difficulties and student abilities both follow normal distributions; hence, we add a regularization term \(\frac{\lambda }{2}w^Tw\), where w is the entire parameter vector \(w=[\beta ~\delta ]\), to the negative log-likelihood objective derived from equation (5). A discussion on how to estimate the parameter \(\lambda \) while preserving differential privacy can be found below.
Algorithm 1 describes the details of the modified objective perturbation algorithm for the differentially private Rasch model.
Algorithm 1:
-
Draw a vector b of dimension I from the density \(h(b)\propto \exp (-\frac{\epsilon }{\sqrt{I}}\Vert b\Vert )\). To do this, draw a direction uniformly at random from the I-dimensional unit sphere, and draw the norm from \(\Gamma (I, \frac{\sqrt{I}}{\epsilon })\).
-
Minimize
$$\begin{aligned} F(\beta ,\delta )=&\sum _n^{N}\sum _i^{I}\log (1+\exp (\beta _n-\delta _i))+\sum _i^I\delta _i\sum _n^N X_{n,i} \nonumber \\&-\sum _n^N\beta _n\sum _i^I X_{n,i} + \frac{\lambda }{2}(\beta ^T\beta + \delta ^T\delta ) + \sum _i^I b_i\delta _i \end{aligned}$$(6)with respect to \(\beta \), \(\delta \) for \(\lambda >0\).
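As an illustration (not the authors' MATLAB implementation), Algorithm 1 might be sketched in Python as follows; the optimizer choice and default settings are our own assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def private_rasch(X, epsilon, lam=0.01, rng=None):
    """Objective perturbation for the Rasch model (sketch of Algorithm 1)."""
    rng = np.random.default_rng() if rng is None else rng
    N, I = X.shape

    # Draw b: uniform direction on the I-sphere, norm from Gamma(I, sqrt(I)/epsilon).
    direction = rng.normal(size=I)
    direction /= np.linalg.norm(direction)
    b = direction * rng.gamma(shape=I, scale=np.sqrt(I) / epsilon)

    r, s = X.sum(axis=1), X.sum(axis=0)   # sufficient statistics

    def F(w):
        beta, delta = w[:N], w[N:]
        logits = beta[:, None] - delta[None, :]
        return (np.logaddexp(0.0, logits).sum()   # sum of log(1 + exp(beta_n - delta_i))
                + delta @ s - beta @ r
                + 0.5 * lam * (beta @ beta + delta @ delta)
                + b @ delta)                       # perturbation term

    res = minimize(F, np.zeros(N + I), method="L-BFGS-B")
    return res.x[:N], res.x[N:]
```

Only the \(\delta \) part of the output would be released; each student then re-estimates their own \(\beta _n\).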
Theorem 1
Algorithm 1 is \(\epsilon \)-differentially private.
The proof follows the strategy developed in Chaudhuri et al. [1, 2], details are found in the appendix.
Naive approaches to estimating the regularization parameter, e.g. by cross-validation, can lead to a loss of privacy, since information is leaked by every evaluation of the model, and this information accumulates. Chaudhuri et al. [2] propose two methods to handle this, which we can also apply. The first is to use a small publicly available data set that roughly follows the same distribution for the estimation of \(\lambda \). The second is an algorithm which splits the data set into \((m+1)\) subsets, computes the model for m different guesses of \(\lambda \) on the respective subsets, and evaluates each model on the last subset. Then, based on the number \(z_i\) of errors made by the \(i^{ th }\) estimate, \(\lambda _i\) is chosen with probability

$$\begin{aligned} P(\lambda _i)=\frac{e^{-\epsilon z_i/2}}{\sum _{j=1}^m e^{-\epsilon z_j/2}}. \end{aligned}$$(7)
For our model, we split the data into subsets of students. For the m different estimates of \(\delta \), we first compute the \(\beta \) values for the last subset and the corresponding probabilities. We then compare the rounded probabilities to the actual 0 or 1 entries in the \((m+1)^{ st }\) subset in the data.
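The selection step above can be sketched as follows, assuming the exponential-mechanism weights \(e^{-\epsilon z_i/2}\) of Chaudhuri et al. [2]; the function name is our own:

```python
import numpy as np

def choose_lambda(lambdas, errors, epsilon, rng=None):
    """Pick index i with probability proportional to exp(-epsilon * z_i / 2),
    where z_i is the number of validation errors of the i-th candidate."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(errors, dtype=float)
    weights = np.exp(-epsilon * (z - z.min()) / 2.0)   # shift by z.min() for numerical stability
    probs = weights / weights.sum()
    return rng.choice(len(lambdas), p=probs)

# Candidates with fewer validation errors are exponentially more likely to be chosen.
idx = choose_lambda([0.001, 0.01, 0.1], errors=[30, 12, 25], epsilon=1.0)
```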
4 Experiments
Experiments were run for simulated data, real data from Vahdat et al. [9], as well as a data set from a course at the Technical University of Denmark.
The first experiment is run on simulated data and compares the results of calculating both \(\beta \) and \(\delta \) globally to re-estimating \(\beta \).
The next experiments compare the performance of the two methods for introducing privacy, objective perturbation and sufficient statistics. We use correlation coefficients, with \(95\%\) confidence intervals, between the estimated probabilities and the true probabilities (i.e., the ones used to simulate the data) or the non-private estimates, respectively. Further, we show test misclassification rates (how well we predict whether a student passes a test) on a new data set drawn from the same distribution as the training data. Experiments were run for the two real data sets. The first one had to be modified to fit the Rasch model, so the answers were rounded to 1 or 0 depending on whether at least half of the points for a question were scored. The DTU data set contains the results of a multiple choice test, so the original data can be used. For the real data sets, we use bootstrapping with 1000 samples to calculate confidence intervals.
We used MATLAB’s fminunc function for minimizing the objective function in all experiments.
The experiments on simulated data were run with 50 repetitions, a regularization parameter of 0.01, and privacy budgets of 1, 5, and 10. The number of students (N) varies from 40 to 200 in steps of 40, and the number of questions (I) is fixed at 20. The parameters \(\beta _n\) and \(\delta _i\) were drawn from normal distributions with mean 0 and standard deviations 1 and 2, respectively. The Rasch probabilities were then calculated with the given \(\beta _n\) and \(\delta _i\) and used to simulate a data set by drawing from a Bernoulli distribution with the given probabilities.
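The simulation procedure just described can be sketched as follows (our own illustrative code with the stated distribution parameters):

```python
import numpy as np

def simulate_rasch_data(N, I, rng=None):
    """Simulate a response matrix: beta ~ N(0, 1), delta ~ N(0, 2^2),
    then Bernoulli draws with the Rasch probabilities."""
    rng = np.random.default_rng() if rng is None else rng
    beta = rng.normal(0.0, 1.0, size=N)
    delta = rng.normal(0.0, 2.0, size=I)
    probs = 1.0 / (1.0 + np.exp(-(beta[:, None] - delta[None, :])))
    X = rng.binomial(1, probs)   # one 0/1 answer per student-item pair
    return X, beta, delta

X, beta, delta = simulate_rasch_data(N=40, I=20)
```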
The Vahdat et al. data set has 62 students and 16 questions. The DTU data set has 212 students and 27 questions.
In pilot experiments we found that fine-tuning the regularization parameter \(\lambda \) was not essential, so for simplicity we use a common regularization parameter in all experiments.
4.1 Results
To test the impact of introducing differential privacy to the Rasch model, we ran several experiments. In the first we test the retraining of \(\beta \) in the non-private setting, to show that students can calculate their own abilities from a given, private \(\delta \). The second experiment shows the performance of the two methods on simulated data, while the third and fourth show the performance on the Vahdat et al. real data and the DTU data, respectively.
Experiment 1: Global vs. Re-estimated Rasch Parameters. In Fig. 2 we compare the Rasch model with global parameter estimation to the results obtained by re-estimating the student abilities. We plot correlation coefficients with \(95\%\) confidence intervals between the probabilities, also with respect to the true model parameters, as well as misclassification rates on a new data set drawn from the same distribution as the training data.
From the correlation coefficients between the estimated probabilities and the ground truth probabilities used to simulate the data, we can see that there is practically no difference in accuracy between re-estimation and global estimation. From now on, we will only consider the re-estimation method, since this is the one defined by our workflow.
Experiment 2: Differential Privacy on Simulated Data. In Fig. 3 we compare the objective perturbation and sufficient statistics methods for three different values of epsilon: 1, 5, and 10. We compare their performance by calculating the correlation coefficients between the private estimates and the non-private and true estimates, respectively. Again, we show \(95\%\) confidence intervals.
In Fig. 4, we show the misclassification rates on a simulated data set from the same distribution as the training set.
We make several observations. First, objective perturbation performs better in general. This is due to the smaller amount of noise added in objective perturbation, as mentioned in Sect. 3. Next, we see that for lower epsilon values, i.e. higher privacy, the model generally performs worse, but converges with larger class size. This is what we would expect and is consistent with what is broadly observed in applications of differential privacy.
Experiments 3 and 4: Differential Privacy on Real Data. Experiment 2 illustrated the impact of privacy, so for experiments 3 and 4 the privacy budget was fixed to \(\epsilon =5\) with changing data sizes. In experiment 3 we use the data set from Vahdat et al. [9]. Experiment 4 is run on the DTU data set.
Figure 5 shows the objective perturbation and sufficient statistics performance on real data with \(\epsilon = 5\). The misclassification rate here is calculated on the original data set (so it corresponds to a training error, not a test error).
In experiment 3, Fig. 5 (a) and (b), we show that the impact of introducing privacy on real data sets is limited, even in relatively small data sets. For both the Vahdat et al. data and the DTU data we find useful correlation between the probabilities of passing the test as inferred in non-private and private models (\(\epsilon = 5\)).
In experiment 4, Fig. 5 (c) and (d), our results are comparable to those on the simulated data. In general, we see that the objective perturbation method performs better than the sufficient statistics method.
In Fig. 6, we illustrate the impact of data set size by showing comparisons of the probability estimates of the non-private model to the two private methods on the data set from Vahdat et al. [9], again with fixed \(\epsilon = 5\).
We see that the estimates are very noisy for small data set sizes, but correlate strongly with the non-private estimates for a data set size of 62. Again, objective perturbation yields more accurate results.
5 Conclusion
We have demonstrated the viability of the proposed workflow for more detailed, yet differentially private, feedback for students. We proved analytically that objective perturbation for this model satisfies differential privacy and gave the minimum noise level necessary. Our experiments on simulated data suggest that the workflow provides estimates of similar quality to the non-private model for medium-sized classes and industry-standard privacy budgets. These findings were confirmed in two real data sets. As expected, the objective perturbation mechanism performs better than the sufficient statistics method, as less noise is added.
References
Chaudhuri, K., Monteleoni, C.: Privacy-preserving logistic regression. In: Proceedings of the 21st International Conference on Neural Information Processing Systems, NIPS 2008, pp. 289–296. Curran Associates Inc., USA (2008). http://dl.acm.org/citation.cfm?id=2981780.2981817
Chaudhuri, K., Monteleoni, C., Sarwate, A.D.: Differentially private empirical risk minimization. J. Mach. Learn. Res. 12(Mar), 1069–1109 (2011)
Choppin, B.: A fully conditional estimation procedure for Rasch model parameters (CSE report 196): University of California. Center for the Study of Evaluation (1983)
Dwork, C.: Differential privacy: a survey of results. In: Agrawal, M., Du, D., Duan, Z., Li, A. (eds.) TAMC 2008. LNCS, vol. 4978, pp. 1–19. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-79228-4_1
Dwork, C., Roth, A., et al.: The algorithmic foundations of differential privacy. Found. Trends® Theor. Comput. Sci. 9(3–4), 211–407 (2014)
EU GDPR Portal: GDPR key changes - an overview of the main changes under GDPR and how they differ from the previous directive. https://www.eugdpr.org/key-changes.html (2018). Accessed 19 May 2018
Foulds, J., Geumlek, J., Welling, M., Chaudhuri, K.: On the theory and practice of privacy-preserving bayesian data analysis. In: Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 192–201. AUAI Press (2016)
Ji, Z., Lipton, Z.C., Elkan, C.: Differential privacy and machine learning: a survey and review. arXiv preprint arXiv:1412.7584 (2014)
Vahdat, M., Oneto, L., Anguita, D., Funk, M., Rauterberg, M.: A learning analytics approach to correlate the academic achievements of students with interaction data from an educational simulator. In: Conole, G., Klobučar, T., Rensing, C., Konert, J., Lavoué, É. (eds.) EC-TEL 2015. LNCS, vol. 9307, pp. 352–366. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24258-3_26. https://archive.ics.uci.edu/ml/machine-learning-databases/00346/
Uden, L., Liberona, D., Welzer, T.: Learning technology for education in cloud. In: Third International Workshop. LTEC 2014. Springer (2014)
Acknowledgements
We would like to thank Martin Søren Engmann Djurhuus, who worked with us on the project in its early stages during the course “Advanced Machine Learning” at DTU. Further, we would like to thank Morten Mørup for access to the DTU data.
Appendix
Appendix
Proof
(of Theorem 1). Since the objective function is differentiable everywhere and a minimizing pair \((\beta ^*, \delta ^*)\) satisfies \(\nabla F(\beta ^*,\delta ^*)=0\), for every output \((\beta ^*,\delta ^*)\) there exists exactly one b which maps the input to the output. On the other hand, since the objective function (6) for \(\lambda >0\) is strongly convex (which can be seen by computing the Hessian matrix H and realizing that \(H-\lambda I\) is positive semidefinite), for any fixed b and X, there is exactly one pair \(\beta ^*,~\delta ^*\) which minimizes the function. As such there is a bijection between \((\beta ^*,\delta ^*)\) and b.
Now consider two data sets X and \(\tilde{X}\) that differ in exactly one student (w.l.o.g., the last one). For \(\beta ^*,~\delta ^*\) minimizing (6) for both X and \(\tilde{X}\), denote the corresponding noise vectors b and \(\tilde{b}\). By the transformation property of probability density functions, we get

$$\begin{aligned} \frac{P(\beta ^*,\delta ^*\mid X)}{P(\beta ^*,\delta ^*\mid \tilde{X})}=\frac{h(b)\,|\det J_{(\beta ^*,\delta ^*,X)}(b)|}{h(\tilde{b})\,|\det J_{(\beta ^*,\delta ^*,\tilde{X})}(\tilde{b})|}, \end{aligned}$$(8)
where \(J_{(\beta ^*,\delta ^*,X)}(b)\) denotes the Jacobian matrix of b as a function of \((\beta ^*,\delta ^*)\), given input set X.
We get \(b_i\) as a function of \((\beta ^*,\delta ^*)\) by setting the gradient of the objective with respect to \(\delta _i\) to zero:

$$\begin{aligned} b_i=\sum _n^N\frac{\exp (\beta _n^*-\delta _i^*)}{1+\exp (\beta _n^*-\delta _i^*)}-\sum _n^N X_{n,i}-\lambda \delta _i^*. \end{aligned}$$(9)
Since the term involving X in (9) is constant in \(\delta _i^*\) and \(\beta _n^*\), it disappears in every derivative; hence the Jacobian matrices in (8) are identical and the determinants cancel.
Furthermore, since \(X_{ni}=\tilde{X}_{ni}\) for all \(i=1,\dots ,I\) and all \(n=1,\dots ,(N-1)\), equation (9) also gives

$$\begin{aligned} b_i-\tilde{b}_i=\tilde{X}_{N,i}-X_{N,i}\quad \text {for all } i. \end{aligned}$$
By the reverse triangle inequality, and since each \(|b_i-\tilde{b}_i|\le 1\), we get

$$\begin{aligned} \big |\Vert b\Vert -\Vert \tilde{b}\Vert \big |\le \Vert b-\tilde{b}\Vert \le \sqrt{I}, \end{aligned}$$
and thus

$$\begin{aligned} \frac{h(b)}{h(\tilde{b})}=\exp \left( -\frac{\epsilon }{\sqrt{I}}\left( \Vert b\Vert -\Vert \tilde{b}\Vert \right) \right) \le e^{\epsilon }. \end{aligned}$$
\(\square \)
© 2019 Springer Nature Switzerland AG

Steiner, T.A., Nyrnberg, D.E., Hansen, L.K. (2019). A Differential Privacy Workflow for Inference of Parameters in the Rasch Model. In: Alzate, C., et al. (eds.) ECML PKDD 2018 Workshops, MIDAS/PAP 2018. Lecture Notes in Computer Science, vol. 11054. Springer, Cham. https://doi.org/10.1007/978-3-030-13463-1_9