Abstract
Biclustering algorithms extend conventional clustering techniques to extract all meaningful subgroups of genes and conditions in the expression matrix of a microarray dataset. However, such algorithms are very sensitive to input parameters and scale poorly. This paper proposes a scalable unsupervised biclustering framework, SUBic, to find high-quality constant-row biclusters in an expression matrix effectively. A one-dimensional clustering algorithm is proposed to partition the attributes, that is, the columns of an expression matrix, into disjoint groups based on the similarity of their expression values. These groups form a set of short transactions, which are used to discover a set of frequent itemsets, each of which corresponds to a bicluster. However, a bicluster may include attributes whose expression values are not similar enough to the others, so a bicluster refinement step enhances the quality of a bicluster by removing those attributes based on its distribution of expression values. The performance of the proposed method is analyzed comparatively through a series of experiments on synthetic and real datasets.
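The pipeline outlined in the abstract (group similar columns, treat the groups as transactions, mine frequent itemsets) can be illustrated with a toy sketch. This is not the SUBic implementation. It assumes that the one-dimensional clustering is applied to each row separately and that a new group starts wherever the gap between neighbouring sorted values exceeds a threshold. The names `partition_row`, `find_biclusters`, `max_gap`, `min_rows` and `min_cols` are hypothetical, the itemset search is brute force rather than a scalable miner, and the refinement step is omitted.

```python
from itertools import combinations


def partition_row(values, max_gap):
    """Split one row's column indices into disjoint groups of similar values.

    Assumption (not from the paper): columns are sorted by value and a new
    group starts whenever the gap between neighbouring sorted values
    exceeds max_gap. The row is assumed to be non-empty.
    """
    order = sorted(range(len(values)), key=lambda j: values[j])
    groups, current = [], [order[0]]
    for prev, nxt in zip(order, order[1:]):
        if values[nxt] - values[prev] > max_gap:
            groups.append(frozenset(current))
            current = []
        current.append(nxt)
    groups.append(frozenset(current))
    return groups


def find_biclusters(matrix, max_gap, min_rows, min_cols):
    """Return (rows, columns) pairs whose column set is a frequent itemset."""
    # One transaction per (row, group): the set of columns in that group.
    transactions = [
        (i, group)
        for i, row in enumerate(matrix)
        for group in partition_row(row, max_gap)
    ]
    n_cols = len(matrix[0])
    biclusters = []
    # Brute-force enumeration of column sets; only viable for tiny inputs.
    for size in range(min_cols, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            itemset = frozenset(cols)
            # Groups within a row are disjoint, so each row counts at most once.
            rows = sorted(i for i, group in transactions if itemset <= group)
            if len(rows) >= min_rows:
                biclusters.append((rows, sorted(itemset)))
    return biclusters


if __name__ == "__main__":
    matrix = [
        [1.0, 1.1, 5.0, 1.05],
        [3.0, 3.1, 0.2, 3.05],
        [2.0, 7.0, 4.0, 9.0],
    ]
    print(find_biclusters(matrix, max_gap=0.2, min_rows=2, min_cols=3))
```

In this toy matrix, rows 0 and 1 each have nearly constant values over columns 0, 1 and 3, although at different levels, which is what a constant-row bicluster looks like. A practical version would replace the brute-force loop with a frequent or closed itemset miner.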
Additional information
This work was supported by the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (MEST) of Korea under Grant No. 2011–0016648.
The preliminary version of the paper was published in the Proceedings of EDB2012.
Cite this article
Lee, J., Jin, Y. & Lee, W.S. SUBic: A Scalable Unsupervised Framework for Discovering High Quality Biclusters. J. Comput. Sci. Technol. 28, 636–646 (2013). https://doi.org/10.1007/s11390-013-1364-y
DOI: https://doi.org/10.1007/s11390-013-1364-y