Finding Additive Biclusters with Random Background

(Extended Abstract)
  • Jing Xiao
  • Lusheng Wang
  • Xiaowen Liu
  • Tao Jiang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5029)


The biclustering problem has been extensively studied in many areas, including e-commerce, data mining, machine learning, pattern recognition, statistics, and, more recently, computational biology. Given an n × m matrix A (n ≥ m), the main goal of biclustering is to identify a subset of rows (called objects) and a subset of columns (called properties) such that some objective function specifying the quality of the resulting bicluster (formed by the chosen rows and columns of A) is optimized. The problem has been proved or conjectured to be NP-hard under various mathematical models. In this paper, we study a probabilistic model of the implanted additive bicluster problem, where each element in the n × m background matrix is a random number from [0, L − 1], and a k × k implanted additive bicluster is obtained from an error-free additive bicluster by randomly changing each element to a number in [0, L − 1] with probability θ. We propose an \(O(n^2 m)\) time voting algorithm to solve the problem. We show that for any constant δ such that \((1-\delta)(1-\theta)^2 -\frac 1 L >0\), when \(k \ge \max \left\{\frac 8 \alpha \sqrt{n\log n},~ \frac {8 \log n} c + \log (2L)\right\}\), where c is a constant, the voting algorithm correctly finds the implanted bicluster with probability at least \(1 - \frac{9}{n^{2}}\). We also implement our algorithm as a software tool, called VOTE, for finding novel biclusters in microarray gene expression data. The implementation incorporates several nontrivial ideas for estimating the size of an implanted bicluster, adjusting the threshold in voting, dealing with small biclusters, and dealing with multiple (and overlapping) implanted biclusters. Our experimental results on both simulated and real datasets show that VOTE can find biclusters with high accuracy and speed.
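The probabilistic model above can be illustrated with a minimal Python sketch that generates a test matrix. It is not the authors' code or the VOTE tool. It assumes, since the abstract does not spell these out, that an error-free additive bicluster has entries of the form (row offset + column offset) mod L, and that both background entries and noisy replacements are drawn uniformly from [0, L − 1]. The names `implant_additive_bicluster` and `condition_holds` are hypothetical.

```python
import random


def implant_additive_bicluster(n, m, k, L, theta, seed=0):
    """Return (A, rows, cols): an n x m random background matrix with a
    noisy k x k additive bicluster implanted at the given rows and columns."""
    rng = random.Random(seed)
    # Background: every entry is a random number from {0, ..., L-1}
    # (uniform sampling is an assumption of this sketch).
    A = [[rng.randrange(L) for _ in range(m)] for _ in range(n)]
    rows = sorted(rng.sample(range(n), k))
    cols = sorted(rng.sample(range(m), k))
    # Assumption: the error-free bicluster is (row offset + column offset) mod L.
    row_off = {i: rng.randrange(L) for i in rows}
    col_off = {j: rng.randrange(L) for j in cols}
    for i in rows:
        for j in cols:
            if rng.random() < theta:
                # With probability theta the entry becomes a random number.
                A[i][j] = rng.randrange(L)
            else:
                A[i][j] = (row_off[i] + col_off[j]) % L
    return A, rows, cols


def condition_holds(delta, theta, L):
    """Check the abstract's requirement (1 - delta)(1 - theta)^2 - 1/L > 0."""
    return (1 - delta) * (1 - theta) ** 2 - 1 / L > 0


if __name__ == "__main__":
    A, rows, cols = implant_additive_bicluster(n=20, m=15, k=5, L=10, theta=0.1)
    print("implanted rows:", rows)
    print("implanted columns:", cols)
    print("condition holds:", condition_holds(delta=0.1, theta=0.1, L=10))
```

With θ = 0 the difference between any two implanted rows, taken mod L, is the same in every implanted column, which is the kind of regularity a voting scheme can exploit.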


bicluster, Chernoff bound, polynomial-time algorithm, probability model, computational biology, gene expression data analysis





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Jing Xiao (1)
  • Lusheng Wang (2)
  • Xiaowen Liu (3)
  • Tao Jiang (4)
  1. Department of Computer Science and Technology, Tsinghua University
  2. Department of Computer Science, City University of Hong Kong, Hong Kong
  3. Department of Computer Science, University of Western Ontario, London, Canada
  4. Department of Computer Science and Engineering, University of California, Riverside
