On Finding Large Conjunctive Clusters

  • Conference paper
Learning Theory and Kernel Machines

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2777)

Abstract

We propose a new formulation of the clustering problem that differs from previous work in several respects. First, the goal is to explicitly output a collection of simple and meaningful conjunctive descriptions of the clusters. Second, the clusters may overlap, i.e., a point can belong to multiple clusters. Third, the clusters need not cover all points, i.e., not every point is clustered. Finally, a point may be assigned to a conjunctive cluster description even if it does not satisfy all of the attributes in the description, provided it satisfies most of them.
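The relaxed membership rule can be made concrete with a small sketch. The code below is only an illustration of the formulation described above, not the paper's algorithm; the attribute sets and the threshold parameter min_fraction are assumptions introduced for this example.

    # Illustrative sketch of the relaxed membership rule: a point is assigned to a
    # conjunctive cluster description if it satisfies most (here, at least a
    # `min_fraction` fraction) of the description's attributes. The threshold name
    # and value are assumptions made for this example, not taken from the paper.

    def satisfies_description(point, description, min_fraction=0.9):
        """point: set of attributes the point has; description: set of attributes
        forming a conjunctive cluster description."""
        if not description:
            return False
        hits = sum(1 for attr in description if attr in point)
        return hits / len(description) >= min_fraction

    def assign_clusters(point, descriptions, min_fraction=0.9):
        """Clusters may overlap and need not cover every point, so the result can
        contain several indices or be empty."""
        return [i for i, d in enumerate(descriptions)
                if satisfies_description(point, d, min_fraction)]

    # Example: a point matching two of the three attributes of the first description.
    descriptions = [{"fiction", "paperback", "bestseller"}, {"hardcover", "history"}]
    point = {"fiction", "paperback", "history"}
    print(assign_clusters(point, descriptions, min_fraction=0.6))  # -> [0]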

A convenient way to view our clustering problem is that of finding a collection of large bicliques in a bipartite graph. Identifying one largest conjunctive cluster is equivalent to finding a maximum edge biclique. Since this problem is NP-hard [28] and there is evidence that it is difficult to approximate [12], we solve a relaxed version where the objective is to find a large subgraph that is close to being a biclique. We give a randomized algorithm that finds a relaxed biclique with almost as many edges as the maximum biclique. We then extend this algorithm to identify a good collection of large relaxed bicliques. A key property of these algorithms is that their running time is independent of the number of data points and linear in the number of attributes.
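To make the sampling flavor of this result more tangible, the sketch below generates candidate conjunctions from the attributes shared by small random subsets of points and scores each candidate by the number of (point, attribute) edges in the relaxed biclique it induces. This is a hedged illustration only: the sample sizes, helper names, and scoring pass are assumptions, and unlike the paper's algorithm (whose running time is independent of the number of points because it estimates quality from samples), the scoring here visits every point for clarity.

    # Hedged sketch of a sampling-based search for a large relaxed biclique.
    # Not the authors' algorithm: sample sizes, thresholds, and the exhaustive
    # scoring loop are illustrative choices made for this example.

    import random
    from itertools import combinations

    def candidate_conjunctions(points, sample_size=8, subset_size=3, rng=random):
        """points: list of attribute sets. Intersect the attributes of small
        sampled subsets to obtain candidate conjunctive descriptions."""
        sample = rng.sample(points, min(sample_size, len(points)))
        candidates = []
        for subset in combinations(sample, subset_size):
            common = set.intersection(*subset)
            if common:
                candidates.append(common)
        return candidates

    def relaxed_biclique_edges(candidate, points, min_fraction=0.9):
        """Edges of the relaxed biclique induced by `candidate`: every point that
        satisfies most of the candidate's attributes contributes one edge per attribute."""
        members = [p for p in points
                   if len(candidate & p) >= min_fraction * len(candidate)]
        return len(members) * len(candidate)

    def best_relaxed_biclique(points, rng=random):
        """Return the candidate conjunction covering the most edges (empty set if none)."""
        candidates = candidate_conjunctions(points, rng=rng)
        return max(candidates, key=lambda c: relaxed_biclique_edges(c, points),
                   default=set())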

References

  1. Agrawal, R., Gehrke, J.E., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of SIGMOD, pp. 94–105 (1998)

  2. Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. In: Proceedings of SIGMOD, pp. 207–216 (1993)

  3. Alon, N., Fischer, E., Krivelevich, M., Szegedy, M.: Efficient testing of large graphs. Combinatorica 20, 451–476 (2000)

  4. Arora, S., Karger, D., Karpinski, M.: Polynomial time approximation schemes for dense instances of NP-hard problems. Journal of Computer and System Sciences 58, 193–210 (1999)

  5. Arya, V., Garg, N., Khandekar, R., Munagala, K., Pandit, V.: Local search heuristics for k-median and facility location problems. In: Proceedings of STOC (2001)

  6. Bansal, N., Blum, A., Chawla, S.: Correlation clustering. In: Proceedings of FOCS, pp. 238–247 (2002)

  7. Charikar, M.: Greedy approximation algorithms for finding dense components in a graph. In: Proceedings of the 3rd International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, pp. 84–95 (2000)

  8. Charikar, M., Guha, S.: Improved combinatorial algorithms for the facility location and k-median problems. In: Proceedings of FOCS, pp. 378–388 (1999)

  9. Fernandez de la Vega, W.: MAX-CUT has a randomized approximation scheme in dense graphs. Random Structures and Algorithms 8, 187–198 (1996)

  10. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B 39, 1–38 (1977)

  11. Feder, T., Greene, D.: Optimal algorithms for approximate clustering. In: Proceedings of STOC, pp. 434–444 (1988)

  12. Feige, U.: Relations between average case complexity and approximation complexity. In: Proceedings of STOC (2002)

  13. Flake, G., Lawrence, S., Giles, C.L.: Efficient identification of web communities. In: Proceedings of KDD, pp. 150–160 (2000)

  14. Frieze, A., Kannan, R.: Quick approximation to matrices and applications. Combinatorica 19(2), 175–220 (1999)

  15. Gibson, D., Kleinberg, J., Raghavan, P.: Inferring web communities from link topology. In: Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, pp. 225–234 (1998)

  16. Goldberg, A.V.: Finding a maximum density subgraph. Technical Report CSD-84-171, University of California, Berkeley (1984)

  17. Goldreich, O., Goldwasser, S., Ron, D.: Property testing and its connection to learning and approximation. Journal of the ACM 45(4), 653–750 (1998)

  18. Gonzalez, T.F.: Clustering to minimize the maximum intercluster distance. Theoretical Computer Science 38(2–3), 293–306 (1985)

  19. Gunopulos, D., Mannila, H., Khardon, R., Toivonen, H.: Data mining, hypergraph transversals, and machine learning (extended abstract). In: Proceedings of PODS, pp. 209–216 (1997)

  20. Hochbaum, D., Shmoys, D.: A unified approach to approximation algorithms for bottleneck problems. Journal of the ACM 33(3), 533–550 (1986)

  21. Jain, K., Vazirani, V.V.: Primal-dual approximation algorithms for metric facility location and k-median problems. In: Proceedings of FOCS, pp. 2–13 (1999)

  22. Kannan, R., Vempala, S., Vetta, A.: On clusterings — good, bad and spectral. In: Proceedings of FOCS, pp. 367–377 (2000)

  23. Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.: Trawling the Web for emerging cyber-communities. Computer Networks 31(11–16), 1481–1493 (1999)

  24. Michalski, R.S.: Knowledge acquisition through conceptual clustering: A theoretical framework and an algorithm for partitioning data into conjunctive concepts. Technical Report 1026, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois (1980)

  25. Mishra, N., Oblinger, D., Pitt, L.: Sublinear time approximate clustering. In: Proceedings of SODA, pp. 439–447 (2001)

  26. Mishra, N., Ron, D., Swaminathan, R.: Large conjunctive clusters and bicliques (2002) (available from the authors)

  27. Ostrovsky, R., Rabani, Y.: Polynomial time approximation schemes for geometric k-clustering. In: Proceedings of FOCS, pp. 349–358 (2000)

  28. Peeters, R.: The maximum edge biclique problem is NP-complete (2000) (unpublished manuscript)

  29. Pitt, L., Reinke, R.E.: Criteria for polynomial-time (conceptual) clustering. Machine Learning 2, 371 (1987)

  30. Feige, U., Kortsarz, G., Peleg, D.: The dense k-subgraph problem. Algorithmica 29(3), 410–421 (2001)



Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mishra, N., Ron, D., Swaminathan, R. (2003). On Finding Large Conjunctive Clusters. In: Schölkopf, B., Warmuth, M.K. (eds) Learning Theory and Kernel Machines. Lecture Notes in Computer Science, vol 2777. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45167-9_33

  • DOI: https://doi.org/10.1007/978-3-540-45167-9_33

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40720-1

  • Online ISBN: 978-3-540-45167-9

  • eBook Packages: Springer Book Archive
