Abstract
We propose a new formulation of the clustering problem that differs from previous work in several aspects. First, the goal is to explicitly output a collection of simple and meaningful conjunctive descriptions of the clusters. Second, the clusters might overlap, i.e., a point can belong to multiple clusters. Third, the clusters might not cover all points, i.e., not every point is clustered. Finally, we allow a point to be assigned to a conjunctive cluster description even if it satisfies only most, rather than all, of the attributes in the description.
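The four properties above can be illustrated with a minimal sketch. It assumes that points are sets of binary attributes and that "most" means at least a (1 − ε) fraction of the attributes in the description. The function name `relaxed_members`, the parameter `epsilon`, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, FrozenSet, Set


def relaxed_members(
    points: Dict[str, Set[str]],
    description: FrozenSet[str],
    epsilon: float,
) -> Set[str]:
    """Return ids of points missing at most epsilon * |description| attributes.

    Assumption: this threshold is one simple reading of "satisfies most";
    the paper's formal definition may differ.
    """
    allowed_missing = epsilon * len(description)
    return {
        pid
        for pid, attrs in points.items()
        if len(description - attrs) <= allowed_missing
    }


# Toy data (hypothetical): each point is the set of attributes it has.
points = {
    "p1": {"a", "b", "c", "d"},
    "p2": {"a", "b", "c"},
    "p3": {"a", "b"},
    "p4": {"c", "d", "e"},
}

desc_1 = frozenset({"a", "b", "c", "d"})
desc_2 = frozenset({"c", "d"})

exact_1 = relaxed_members(points, desc_1, 0.0)     # only p1 satisfies all four
relaxed_1 = relaxed_members(points, desc_1, 0.25)  # p2 misses one of four
exact_2 = relaxed_members(points, desc_2, 0.0)     # p1 and p4
```

In this toy instance `p1` belongs to both clusters (overlap), `p3` belongs to neither (incomplete coverage), and `p2` joins the first cluster only under the relaxed threshold.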
A convenient way to view our clustering problem is that of finding a collection of large bicliques in a bipartite graph. Identifying one largest conjunctive cluster is equivalent to finding a maximum edge biclique. Since this problem is NP-hard [28] and there is evidence that it is difficult to approximate [12], we solve a relaxed version where the objective is to find a large subgraph that is close to being a biclique. We give a randomized algorithm that finds a relaxed biclique with almost as many edges as the maximum biclique. We then extend this algorithm to identify a good collection of large relaxed bicliques. A key property of these algorithms is that their running time is independent of the number of data points and linear in the number of attributes.
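The biclique view can be made concrete on a tiny instance. The sketch below is not the randomized algorithm described above: it is an exponential-time brute force over attribute subsets, included only to show that a conjunction W satisfied by a point set U corresponds to a biclique with |U|·|W| edges. The helper names and the use of edge density as a proxy for "close to being a biclique" are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple


def supporting_points(points: Dict[str, Set[str]], attrs: FrozenSet[str]) -> Set[str]:
    """Points adjacent to every attribute in attrs (they satisfy the conjunction)."""
    return {pid for pid, have in points.items() if attrs <= have}


def max_edge_biclique(points: Dict[str, Set[str]]) -> Tuple[int, Set[str], FrozenSet[str]]:
    """Brute-force maximum edge biclique; exponential in the number of attributes."""
    universe = sorted(set().union(*points.values()))
    best: Tuple[int, Set[str], FrozenSet[str]] = (0, set(), frozenset())
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            attrs = frozenset(combo)
            pts = supporting_points(points, attrs)
            edges = len(pts) * len(attrs)
            if edges > best[0]:
                best = (edges, pts, attrs)
    return best


def edge_density(points: Dict[str, Set[str]], pts: Set[str], attrs: FrozenSet[str]) -> float:
    """Fraction of the |pts| * |attrs| possible edges that are present."""
    if not pts or not attrs:
        return 0.0
    present = sum(len(points[p] & attrs) for p in pts)
    return present / (len(pts) * len(attrs))


# Toy bipartite graph (hypothetical): point -> attributes it is adjacent to.
points = {
    "p1": {"a", "b", "c"},
    "p2": {"a", "b", "c"},
    "p3": {"a", "b"},
    "p4": {"a", "b"},
    "p5": {"c", "d"},
}

edges, cluster, conjunction = max_edge_biclique(points)
# A larger, almost complete subgraph: the same four points with attribute "c" added.
relaxed_density = edge_density(points, cluster, frozenset({"a", "b", "c"}))
```

Here the maximum edge biclique is the conjunction {a, b} on four points (8 edges), while adding attribute c gives a subgraph with 10 of 12 possible edges. This is the kind of trade-off the relaxed formulation permits.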
References
Agrawal, R., Gehrke, J.E., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of SIGMOD, pp. 94–105 (1998)
Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. In: Proceedings of SIGMOD, pp. 207–216 (1993)
Alon, N., Fischer, E., Krivelevich, M., Szegedy, M.: Efficient testing of large graphs. Combinatorica 20, 451–476 (2000)
Arora, S., Karger, D., Karpinski, M.: Polynomial time approximation schemes for dense instances of NP-hard problems. Journal of Computer and System Sciences 58, 193–210 (1999)
Arya, Garg, Khandekar, Munagala, Pandit: Local search heuristic for k-median and facility location problems. In: Proceedings of STOC (2001)
Bansal, N., Blum, A., Chawla, S.: Correlation clustering. In: Proceedings of FOCS, pp. 238–247 (2002)
Charikar, M.: Greedy approximation algorithms for finding dense components in a graph. In: Proceedings of the 3rd International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, pp. 84–95 (2000)
Charikar, M., Guha, S.: Improved combinatorial algorithms for the facility location and k-median problems. In: Proceedings of FOCS, pp. 378–388 (1999)
Fernandez de la Vega, W.: MAX-CUT has a randomized approximation scheme in dense graphs. Random Structures and Algorithms 8, 187–198 (1996)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society series B 39, 1–38 (1977)
Feder, T., Greene, D.: Optimal algorithms for approximate clustering. In: Proceedings of STOC, pp. 434–444 (1988)
Feige, U.: Average case complexity and approximation complexity. In: Proceedings of STOC (2002)
Flake, G., Lawrence, S., Lee Giles, C.: Efficient identification of web communities. In: Proceedings of KDD, pp. 150–160 (2000)
Frieze, A., Kannan, R.: Quick approximation to matrices and applications. Combinatorica 19(2), 175–220 (1999)
Gibson, D., Kleinberg, J., Raghavan, P.: Inferring web communities from link topology. In: Proceedings of the 9th ACM Conference on Hypertext, Structural Queries, pp. 225–234 (1998)
Goldberg, A.V.: Finding a maximum density subgraph. UC Berkeley Tech Report, CSD-84-171 (1984)
Goldreich, O., Goldwasser, S., Ron, D.: Property testing and its connection to learning and approximation. Journal of the ACM 45(4), 653–750 (1998)
Gonzalez, T.F.: Clustering to minimize the maximum intercluster distance. Theoretical Computer Science 38(2-3), 293–306 (1985)
Gunopulos, D., Mannila, H., Khardon, R., Toivonen, H.: Data mining, hypergraph transversals, and machine learning (extended abstract). In: Proceedings of PODS, pp. 209–216 (1997)
Hochbaum, D., Shmoys, D.: A unified approach to approximate algorithms for bottleneck problems. Journal of the ACM 33(3), 533–550 (1986)
Jain, K., Vazirani, V.V.: Primal-dual approximation algorithms for metric facility location and k-median problems. In: Proceedings of FOCS, pp. 2–13 (1999)
Kannan, R., Vempala, S., Vetta, A.: On clusterings: good, bad and spectral. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pp. 367–377 (2000)
Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.: Trawling the Web for emerging cyber-communities. Computer Networks 31(11–16), 1481–1493 (1999)
Michalski, R.S.: Knowledge acquisition through conceptual clustering: A theoretical framework and an algorithm for partitioning data into conjunctive concepts. Technical Report 1026, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois (1980)
Mishra, N., Oblinger, D., Pitt, L.: Sublinear time approximate clustering. In: Proceedings of SODA, pp. 439–447 (2001)
Mishra, N., Ron, D., Swaminathan, R.: Large conjunctive clusters and bicliques (2002) (available from the authors)
Ostrovsky, R., Rabani, Y.: Polynomial time approximation schemes for geometric k-clustering. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pp. 349–358 (2000)
Peeters, R.: The maximum edge biclique problem is NP-complete (2000) (unpublished manuscript)
Pitt, L., Reinke, R.E.: Criteria for polynomial-time (conceptual) clustering. Machine Learning 2, 371 (1987)
Peleg, D., Feige, U., Kortsarz, G.: The dense-k-subgraph problem. Algorithmica 29(3), 410–421 (2001)
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mishra, N., Ron, D., Swaminathan, R. (2003). On Finding Large Conjunctive Clusters. In: Schölkopf, B., Warmuth, M.K. (eds) Learning Theory and Kernel Machines. Lecture Notes in Computer Science, vol 2777. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45167-9_33
DOI: https://doi.org/10.1007/978-3-540-45167-9_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40720-1
Online ISBN: 978-3-540-45167-9
eBook Packages: Springer Book Archive