Abstract
We study the clustering with diversity problem: given a set of colored points in a metric space, partition them into clusters such that each cluster contains at least ℓ points, all with distinct colors. For the objective of minimizing the maximum cluster radius, we give a 2-approximation algorithm that works for any ℓ. We show that this approximation ratio is optimal unless P = NP by proving a matching lower bound. We also develop several extensions of the algorithm that handle outliers. The problem is motivated mainly by applications in privacy-preserving data publication.
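To make the problem statement concrete, the following minimal Python sketch checks whether a given partition satisfies the diversity constraint and evaluates its objective value. It is only an illustration of the definition, not the 2-approximation algorithm of the paper. The function names (`is_feasible`, `cluster_radius`, `max_radius`) are hypothetical, points are assumed to lie in Euclidean space, and the radius of a cluster is assumed to be measured from the best center chosen among the cluster's own points; the abstract itself does not fix these details.

```python
from math import dist  # Euclidean distance, Python 3.8+


def is_feasible(clusters, colors, ell):
    """Check the diversity constraint for a candidate partition.

    clusters: list of lists of point indices (hypothetical representation)
    colors:   colors[i] is the color of point i
    ell:      minimum number of points per cluster
    """
    # The clusters must form a partition: every point appears exactly once.
    assigned = sorted(i for cluster in clusters for i in cluster)
    if assigned != list(range(len(colors))):
        return False
    for cluster in clusters:
        cluster_colors = [colors[i] for i in cluster]
        if len(cluster) < ell:
            return False  # too few points
        if len(set(cluster_colors)) != len(cluster_colors):
            return False  # two points share a color
    return True


def cluster_radius(cluster, points):
    """Radius of one cluster.

    Assumption: the center is the cluster member that minimizes the
    largest distance to the other members.
    """
    return min(
        max(dist(points[c], points[i]) for i in cluster)
        for c in cluster
    )


def max_radius(clusters, points):
    """Objective value: the largest radius over all clusters."""
    return max(cluster_radius(cluster, points) for cluster in clusters)


# Illustrative data (not from the paper): four points, two colors.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
colors = ["red", "blue", "red", "blue"]

good = [[0, 1], [2, 3]]  # each cluster has two points with distinct colors
bad = [[0, 2], [1, 3]]   # each cluster repeats a color

print(is_feasible(good, colors, ell=2), max_radius(good, points))
print(is_feasible(bad, colors, ell=2))
```

In this toy instance the partition `good` is feasible for ℓ = 2, whereas `bad` is rejected because each of its clusters contains two points of the same color.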
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, J., Yi, K., Zhang, Q. (2010). Clustering with Diversity. In: Abramsky, S., Gavoille, C., Kirchner, C., Meyer auf der Heide, F., Spirakis, P.G. (eds) Automata, Languages and Programming. ICALP 2010. Lecture Notes in Computer Science, vol 6198. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14165-2_17
DOI: https://doi.org/10.1007/978-3-642-14165-2_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14164-5
Online ISBN: 978-3-642-14165-2
eBook Packages: Computer Science (R0)