A Differentially Private Kernel Two-Sample Test

  • Anant Raj
  • Ho Chung Leon Law
  • Dino Sejdinovic
  • Mijung Park
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11906)


Kernel two-sample testing is a useful statistical tool for determining whether data samples arise from different distributions without imposing any parametric assumptions on those distributions. However, raw data samples can expose sensitive information about individuals who participate in scientific studies, which makes the current tests vulnerable to privacy breaches. Hence, we design a new framework for kernel two-sample testing that conforms to differential privacy constraints, in order to guarantee the privacy of subjects in the data. Unlike existing differentially private parametric tests that simply add noise to data, kernel-based testing poses a challenge because its test statistics depend on the raw data in a complex way: these statistics correspond to estimators of distances between representations of probability measures in Hilbert spaces. Our approach considers finite-dimensional approximations to those representations. As a result, a simple chi-squared test is obtained, in which the test statistic depends on a mean and covariance of empirical differences between the samples, which we perturb for a privacy guarantee. We investigate the utility of our framework in two realistic settings and conclude that our method requires only a relatively modest increase in sample size to achieve a similar level of power to the non-private tests in both settings.
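The construction summarised above can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes random Fourier features as the finite-dimensional approximation, the classical Gaussian mechanism for the perturbation, and an even split of the privacy budget between the mean and the second moment. The function names and the parameters `D`, `bandwidth` and `gamma` are hypothetical, chosen only for illustration. The final chi-squared p-value ignores the variance of the added noise, so it is only a rough approximation.

```python
import numpy as np
from scipy import stats


def random_fourier_features(X, W, b):
    """Map rows of X to D cosine features; each feature vector has L2 norm <= sqrt(2)."""
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)


def gaussian_sigma(sensitivity, eps, delta):
    """Noise scale of the classical Gaussian mechanism (valid for 0 < eps < 1)."""
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps


def private_two_sample_test(X, Y, eps, delta, D=5, bandwidth=1.0, gamma=0.1, seed=0):
    """Return (statistic, approximate p-value) from a noisy mean and second moment.

    Assumes X and Y have the same shape (n, d) and are treated as n pairs.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Data-independent random projection (assumed Gaussian kernel features).
    W = rng.normal(scale=1.0 / bandwidth, size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)

    # Paired feature differences; each satisfies ||z_i||_2 <= 2 * sqrt(2).
    Z = random_fourier_features(X, W, b) - random_fourier_features(Y, W, b)
    mean = Z.mean(axis=0)
    second_moment = Z.T @ Z / n

    # Split the budget evenly between the two releases (basic composition).
    eps_half, delta_half = eps / 2.0, delta / 2.0

    # Replacing one pair moves the mean by at most 4 * sqrt(2) / n in L2 norm.
    sigma_mean = gaussian_sigma(4.0 * np.sqrt(2.0) / n, eps_half, delta_half)
    mean_priv = mean + rng.normal(scale=sigma_mean, size=D)

    # The same replacement moves the second moment by at most 16 / n (Frobenius).
    sigma_mom = gaussian_sigma(16.0 / n, eps_half, delta_half)
    noise = np.triu(rng.normal(scale=sigma_mom, size=(D, D)))
    noise = noise + np.triu(noise, k=1).T  # mirror to get a symmetric noise matrix
    mom_priv = second_moment + noise

    # Post-processing only: form the covariance, clip negative eigenvalues, regularise.
    cov_priv = mom_priv - np.outer(mean_priv, mean_priv)
    vals, vecs = np.linalg.eigh(cov_priv)
    cov_priv = (vecs * np.clip(vals, 0.0, None)) @ vecs.T + gamma * np.eye(D)

    stat = n * mean_priv @ np.linalg.solve(cov_priv, mean_priv)
    p_value = stats.chi2.sf(stat, df=D)  # naive calibration; ignores the noise variance
    return stat, p_value
```

Calling `private_two_sample_test(X, Y, eps=0.9, delta=1e-5)` on two arrays of equal shape returns the perturbed statistic and its approximate p-value. Perturbing the second moment rather than the covariance directly is a design choice of this sketch, made because its sensitivity to a single pair is easy to bound.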


Keywords: Differential privacy, Kernel two-sample test




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Max Planck Institute for Intelligent Systems, Tübingen, Germany
  2. Department of Statistics, University of Oxford, Oxford, UK
