SUBic: A Scalable Unsupervised Framework for Discovering High Quality Biclusters

  • Regular Paper
  • Journal of Computer Science and Technology

Abstract

A biclustering algorithm extends conventional clustering techniques to extract all meaningful subgroups of genes and conditions from the expression matrix of a microarray dataset. However, such algorithms are highly sensitive to their input parameters and scale poorly. This paper proposes SUBic, a scalable unsupervised biclustering framework that finds high-quality constant-row biclusters in an expression matrix. A one-dimensional clustering algorithm partitions the attributes, that is, the columns of the expression matrix, into disjoint groups based on the similarity of their expression values. These groups form a set of short transactions, from which a set of frequent itemsets is discovered; each frequent itemset corresponds to a bicluster. Because a bicluster may still include attributes whose expression values are not similar enough to the others, a refinement step improves the quality of each bicluster by removing those attributes based on the bicluster's distribution of expression values. The performance of the proposed method is analyzed comparatively through a series of experiments on synthetic and real datasets.
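The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the SUBic implementation: it assumes that the one-dimensional clustering is applied to each row separately, substitutes a simple gap threshold for the paper's clustering algorithm, replaces a real frequent-itemset miner with brute-force enumeration, and omits the refinement step. The names `cluster_row`, `find_biclusters`, `max_gap`, `min_rows` and `min_cols` are hypothetical.

```python
from itertools import combinations


def cluster_row(values, max_gap):
    """Group the column indices of one row by value similarity.

    Assumption: a new group starts whenever two neighbouring sorted
    values differ by more than max_gap (a stand-in for the paper's
    one-dimensional clustering algorithm).
    """
    order = sorted(range(len(values)), key=lambda j: values[j])
    groups, current = [], [order[0]]
    for prev, col in zip(order, order[1:]):
        if values[col] - values[prev] > max_gap:
            groups.append(frozenset(current))
            current = []
        current.append(col)
    groups.append(frozenset(current))
    return groups


def find_biclusters(matrix, max_gap, min_rows, min_cols=2):
    """Return {column tuple: supporting row indices} for frequent column sets."""
    # Each group of similar columns within a row becomes one short transaction.
    transactions = [
        (i, group)
        for i, row in enumerate(matrix)
        for group in cluster_row(row, max_gap)
        if len(group) >= min_cols
    ]
    n_cols = len(matrix[0])
    biclusters = {}
    # Brute-force enumeration of column subsets; exponential in the number
    # of columns, so only suitable for toy inputs.
    for size in range(min_cols, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            itemset = frozenset(cols)
            rows = sorted({i for i, group in transactions if itemset <= group})
            if len(rows) >= min_rows:
                biclusters[cols] = rows
    return biclusters


if __name__ == "__main__":
    toy = [
        [1.0, 1.1, 5.0, 0.9],
        [3.0, 3.1, 9.0, 3.2],
        [7.0, 7.2, 2.0, 7.1],
        [1.0, 4.0, 8.0, 6.0],
    ]
    for cols, rows in find_biclusters(toy, max_gap=0.5, min_rows=3).items():
        print("columns", cols, "rows", rows)
```

In the toy matrix, columns 0, 1 and 3 carry similar values within each of the first three rows, even though the level differs from row to row, which is the constant-row pattern the abstract targets. A practical implementation would replace the enumeration with a frequent or closed itemset mining algorithm and add the refinement step.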

Author information

Corresponding author

Correspondence to Won Suk Lee.

Additional information

This work was supported by the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (MEST) of Korea under Grant No. 2011–0016648.

The preliminary version of the paper was published in the Proceedings of EDB2012.

About this article

Cite this article

Lee, J., Jin, Y. & Lee, W.S. SUBic: A Scalable Unsupervised Framework for Discovering High Quality Biclusters. J. Comput. Sci. Technol. 28, 636–646 (2013). https://doi.org/10.1007/s11390-013-1364-y

  • DOI: https://doi.org/10.1007/s11390-013-1364-y
