Scalable Co-clustering Algorithms

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2010)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6081)

Abstract

Co-clustering has been used extensively in a variety of applications because it can discover latent local patterns that conventional unsupervised algorithms such as k-means do not reveal. Recently, a unified view of co-clustering algorithms, called Bregman co-clustering (BCC), has provided a general framework that subsumes several existing co-clustering algorithms, so we expect the framework to be applied to more varied data types. However, the amount of data collected from real-life application domains easily grows too large to fit in the main memory of a single-processor machine. Accordingly, enhancing the scalability of BCC can be a critical challenge in practice. To address this challenge, and ultimately to make BCC easier to deploy rapidly in a wider range of applications with larger data, we parallelize all twelve co-clustering algorithms in the BCC framework using the Message Passing Interface (MPI). In addition, we validate their scalability on eleven synthetic datasets and one real-life dataset, demonstrating their speedup under varied parameter settings.
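The abstract does not describe how the algorithms are parallelized, so the following is only an illustrative sketch of one common data-parallel pattern, not the authors' implementation. It assumes the rows of the data matrix are split across workers, each worker computes partial sums and counts for every (row cluster, column cluster) block, and the partial results are combined by an element-wise sum, which is the role a sum all-reduce would play in an MPI program. The workers are simulated in plain Python, and all function names (`partial_block_stats`, `add_stats`, `block_means`) are hypothetical.

```python
from functools import reduce


def partial_block_stats(rows, row_labels, col_labels, k, l):
    """Per-worker step: sum and count of the entries that fall in each
    (row cluster, column cluster) block of this worker's rows."""
    sums = [[0.0] * l for _ in range(k)]
    counts = [[0] * l for _ in range(k)]
    for row, r in zip(rows, row_labels):
        for value, c in zip(row, col_labels):
            sums[r][c] += value
            counts[r][c] += 1
    return sums, counts


def add_stats(a, b):
    """Element-wise sum of two (sums, counts) pairs.
    Stands in for a sum all-reduce across workers (an assumption)."""
    sums = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    counts = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[1], b[1])]
    return sums, counts


def block_means(matrix, row_labels, col_labels, k, l, n_workers):
    """Co-cluster block means for a non-empty matrix, computed from
    row-partitioned partial statistics."""
    chunk = (len(matrix) + n_workers - 1) // n_workers  # rows per worker
    partials = [
        partial_block_stats(matrix[s:s + chunk], row_labels[s:s + chunk],
                            col_labels, k, l)
        for s in range(0, len(matrix), chunk)
    ]
    sums, counts = reduce(add_stats, partials)
    # Empty blocks get a mean of 0.0 to avoid division by zero.
    return [[s / c if c else 0.0 for s, c in zip(rs, rc)]
            for rs, rc in zip(sums, counts)]


matrix = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
row_labels = [0, 0, 1, 1]   # row cluster of each row
col_labels = [0, 0, 1, 1]   # column cluster of each column
means = block_means(matrix, row_labels, col_labels, 2, 2, n_workers=2)
print(means)
```

Because only small k-by-l tables of sums and counts are exchanged, the result does not depend on how many workers share the rows, which is what makes this kind of step a natural candidate for message passing.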

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kwon, B., Cho, H. (2010). Scalable Co-clustering Algorithms. In: Hsu, CH., Yang, L.T., Park, J.H., Yeo, SS. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2010. Lecture Notes in Computer Science, vol 6081. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13119-6_3

  • DOI: https://doi.org/10.1007/978-3-642-13119-6_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13118-9

  • Online ISBN: 978-3-642-13119-6

  • eBook Packages: Computer Science, Computer Science (R0)