On Finding Large Conjunctive Clusters

  • Conference paper
Learning Theory and Kernel Machines

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2777)

Abstract

We propose a new formulation of the clustering problem that differs from previous work in several respects. First, the goal is to explicitly output a collection of simple and meaningful conjunctive descriptions of the clusters. Second, the clusters may overlap, i.e., a point can belong to multiple clusters. Third, the clusters need not cover all points, i.e., not every point is clustered. Finally, a point may be assigned to a conjunctive cluster description even if it does not satisfy all of the attributes in the description, provided it satisfies most of them.
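The relaxed membership rule can be made concrete with a small sketch. The code below is only an illustration of the formulation described above, not the paper's algorithm; the attribute sets and the threshold parameter min_fraction are assumptions introduced for this example.

    # Illustrative sketch of the relaxed membership rule: a point is assigned to a
    # conjunctive cluster description if it satisfies most (here, at least a
    # `min_fraction` fraction) of the description's attributes. The threshold name
    # and value are assumptions made for this example, not taken from the paper.

    def satisfies_description(point, description, min_fraction=0.9):
        """point: set of attributes the point has; description: set of attributes
        forming a conjunctive cluster description."""
        if not description:
            return False
        hits = sum(1 for attr in description if attr in point)
        return hits / len(description) >= min_fraction

    def assign_clusters(point, descriptions, min_fraction=0.9):
        """Clusters may overlap and need not cover every point, so the result can
        contain several indices or be empty."""
        return [i for i, d in enumerate(descriptions)
                if satisfies_description(point, d, min_fraction)]

    # Example: a point matching two of the three attributes of the first description.
    descriptions = [{"fiction", "paperback", "bestseller"}, {"hardcover", "history"}]
    point = {"fiction", "paperback", "history"}
    print(assign_clusters(point, descriptions, min_fraction=0.6))  # -> [0]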

A convenient way to view our clustering problem is that of finding a collection of large bicliques in a bipartite graph. Identifying one largest conjunctive cluster is equivalent to finding a maximum edge biclique. Since this problem is NP-hard [28] and there is evidence that it is difficult to approximate [12], we solve a relaxed version where the objective is to find a large subgraph that is close to being a biclique. We give a randomized algorithm that finds a relaxed biclique with almost as many edges as the maximum biclique. We then extend this algorithm to identify a good collection of large relaxed bicliques. A key property of these algorithms is that their running time is independent of the number of data points and linear in the number of attributes.
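To make the sampling flavor of this result more tangible, the sketch below generates candidate conjunctions from the attributes shared by small random subsets of points and scores each candidate by the number of (point, attribute) edges in the relaxed biclique it induces. This is a hedged illustration only: the sample sizes, helper names, and scoring pass are assumptions, and unlike the paper's algorithm (whose running time is independent of the number of points because it estimates quality from samples), the scoring here visits every point for clarity.

    # Hedged sketch of a sampling-based search for a large relaxed biclique.
    # Not the authors' algorithm: sample sizes, thresholds, and the exhaustive
    # scoring loop are illustrative choices made for this example.

    import random
    from itertools import combinations

    def candidate_conjunctions(points, sample_size=8, subset_size=3, rng=random):
        """points: list of attribute sets. Intersect the attributes of small
        sampled subsets to obtain candidate conjunctive descriptions."""
        sample = rng.sample(points, min(sample_size, len(points)))
        candidates = []
        for subset in combinations(sample, subset_size):
            common = set.intersection(*subset)
            if common:
                candidates.append(common)
        return candidates

    def relaxed_biclique_edges(candidate, points, min_fraction=0.9):
        """Edges of the relaxed biclique induced by `candidate`: every point that
        satisfies most of the candidate's attributes contributes one edge per attribute."""
        members = [p for p in points
                   if len(candidate & p) >= min_fraction * len(candidate)]
        return len(members) * len(candidate)

    def best_relaxed_biclique(points, rng=random):
        """Return the candidate conjunction covering the most edges (empty set if none)."""
        candidates = candidate_conjunctions(points, rng=rng)
        return max(candidates, key=lambda c: relaxed_biclique_edges(c, points),
                   default=set())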

References

  1. Agrawal, R., Gehrke, J.E., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of SIGMOD, pp. 94–105 (1998)

  2. Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. In: Proceedings of SIGMOD, pp. 207–216 (1993)

  3. Alon, N., Fischer, E., Krivelevich, M., Szegedy, M.: Efficient testing of large graphs. Combinatorica 20, 451–476 (2000)

  4. Arora, S., Karger, D., Karpinski, M.: Polynomial time approximation schemes for dense instances of NP-hard problems. Journal of Computer and System Sciences 58, 193–210 (1999)

  5. Arya, V., Garg, N., Khandekar, R., Munagala, K., Pandit, V.: Local search heuristics for k-median and facility location problems. In: Proceedings of STOC (2001)

  6. Bansal, N., Blum, A., Chawla, S.: Correlation clustering. In: Proceedings of FOCS, pp. 238–247 (2002)

  7. Charikar, M.: Greedy approximation algorithms for finding dense components in a graph. In: Proceedings of the 3rd International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, pp. 84–95 (2000)

  8. Charikar, M., Guha, S.: Improved combinatorial algorithms for the facility location and k-median problems. In: Proceedings of FOCS, pp. 378–388 (1999)

  9. Fernandez de la Vega, W.: MAX-CUT has a randomized approximation scheme in dense graphs. Random Structures and Algorithms 8, 187–198 (1996)

  10. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B 39, 1–38 (1977)

  11. Feder, T., Greene, D.: Optimal algorithms for approximate clustering. In: Proceedings of STOC, pp. 434–444 (1988)

  12. Feige, U.: Relations between average case complexity and approximation complexity. In: Proceedings of STOC (2002)

  13. Flake, G., Lawrence, S., Giles, C.L.: Efficient identification of web communities. In: Proceedings of KDD, pp. 150–160 (2000)

  14. Frieze, A., Kannan, R.: Quick approximation to matrices and applications. Combinatorica 19(2), 175–220 (1999)

  15. Gibson, D., Kleinberg, J., Raghavan, P.: Inferring web communities from link topology. In: Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, pp. 225–234 (1998)

  16. Goldberg, A.V.: Finding a maximum density subgraph. Technical Report CSD-84-171, University of California, Berkeley (1984)

  17. Goldreich, O., Goldwasser, S., Ron, D.: Property testing and its connection to learning and approximation. Journal of the ACM 45(4), 653–750 (1998)

  18. Gonzalez, T.F.: Clustering to minimize the maximum intercluster distance. Theoretical Computer Science 38(2–3), 293–306 (1985)

  19. Gunopulos, D., Mannila, H., Khardon, R., Toivonen, H.: Data mining, hypergraph transversals, and machine learning (extended abstract). In: Proceedings of PODS, pp. 209–216 (1997)

  20. Hochbaum, D., Shmoys, D.: A unified approach to approximation algorithms for bottleneck problems. Journal of the ACM 33(3), 533–550 (1986)

  21. Jain, K., Vazirani, V.V.: Primal-dual approximation algorithms for metric facility location and k-median problems. In: Proceedings of FOCS, pp. 2–13 (1999)

  22. Kannan, R., Vempala, S., Vetta, A.: On clusterings — good, bad and spectral. In: Proceedings of FOCS, pp. 367–377 (2000)

  23. Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.: Trawling the Web for emerging cyber-communities. Computer Networks 31(11–16), 1481–1493 (1999)

  24. Michalski, R.S.: Knowledge acquisition through conceptual clustering: A theoretical framework and an algorithm for partitioning data into conjunctive concepts. Technical Report 1026, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois (1980)

  25. Mishra, N., Oblinger, D., Pitt, L.: Sublinear time approximate clustering. In: Proceedings of SODA, pp. 439–447 (2001)

  26. Mishra, N., Ron, D., Swaminathan, R.: Large conjunctive clusters and bicliques (2002) (available from the authors)

  27. Ostrovsky, R., Rabani, Y.: Polynomial time approximation schemes for geometric k-clustering. In: Proceedings of FOCS, pp. 349–358 (2000)

  28. Peeters, R.: The maximum edge biclique problem is NP-complete (2000) (unpublished manuscript)

  29. Pitt, L., Reinke, R.E.: Criteria for polynomial-time (conceptual) clustering. Machine Learning 2, 371 (1987)

  30. Feige, U., Kortsarz, G., Peleg, D.: The dense k-subgraph problem. Algorithmica 29(3), 410–421 (2001)



Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mishra, N., Ron, D., Swaminathan, R. (2003). On Finding Large Conjunctive Clusters. In: Schölkopf, B., Warmuth, M.K. (eds) Learning Theory and Kernel Machines. Lecture Notes in Computer Science, vol 2777. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45167-9_33

  • DOI: https://doi.org/10.1007/978-3-540-45167-9_33

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40720-1

  • Online ISBN: 978-3-540-45167-9

  • eBook Packages: Springer Book Archive
