Abstract
This paper presents a fundamentally new approach that allows learning algorithms to be applied to a dataset while keeping the records in the dataset confidential. Let D be the set of records to be kept private, and let E be a fixed, already-public set of records from a similar domain. The idea is to compute and publish a weight w(x) for each record x in E that measures how representative x is of the records in D. Data mining on E using these importance weights is then approximately equivalent to data mining directly on D. The dataset D is used by its owner to compute the weights but is not revealed in any other way.
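To make the idea concrete, here is a minimal sketch, not the paper's exact algorithm: one standard way to estimate an importance weight w(x) ≈ p_D(x)/p_E(x) is to train a probabilistic classifier to distinguish private records (D) from public records (E), convert its odds into density-ratio weights, and then mine E with those weights. The toy data, the logistic-regression fit, and the |E|/|D| prior correction are all illustrative assumptions.

```python
# Sketch: importance weighting via a discriminative classifier.
# Assumption: a simple 1-D toy setting where D and E differ in mean.
import numpy as np

rng = np.random.default_rng(0)

D = rng.normal(loc=1.0, scale=1.0, size=(500, 1))   # private records
E = rng.normal(loc=0.0, scale=1.5, size=(500, 1))   # public records

# Logistic regression by gradient descent: label 1 = "drawn from D".
X = np.vstack([D, E])
y = np.concatenate([np.ones(len(D)), np.zeros(len(E))])
Xb = np.hstack([X, np.ones((len(X), 1))])           # append bias column
theta = np.zeros(Xb.shape[1])
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Xb @ theta))           # P(D | x)
    theta -= 0.1 * Xb.T @ (p - y) / len(y)          # gradient step

# By Bayes' rule, p_D(x)/p_E(x) = [P(D|x)/P(E|x)] * (|E|/|D|),
# so the classifier's odds give the importance weight for each x in E.
pE = 1.0 / (1.0 + np.exp(-np.hstack([E, np.ones((len(E), 1))]) @ theta))
w = (pE / (1.0 - pE)) * (len(E) / len(D))

# Sanity check: the weighted mean of E should move toward the mean of D.
print(np.mean(E), np.sum(w * E[:, 0]) / np.sum(w), np.mean(D))
```

Only the weights w would be published alongside E; D itself is used solely to fit the classifier, which matches the access pattern the abstract describes.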
© 2011 Springer-Verlag Berlin Heidelberg
Elkan, C. (2011). Preserving Privacy in Data Mining via Importance Weighting. In: Dimitrakakis, C., Gkoulalas-Divanis, A., Mitrokotsa, A., Verykios, V.S., Saygin, Y. (eds) Privacy and Security Issues in Data Mining and Machine Learning. PSDML 2010. Lecture Notes in Computer Science(), vol 6549. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19896-0_2
Print ISBN: 978-3-642-19895-3
Online ISBN: 978-3-642-19896-0