Abstract
The biclustering problem has been extensively studied in many areas including e-commerce, data mining, machine learning, pattern recognition, statistics, and more recently in computational biology. Given an n ×m matrix A (n ≥ m), the main goal of biclustering is to identify a subset of rows (called objects) and a subset of columns (called properties) such that some objective function that specifies the quality of the found bicluster (formed by the subsets of rows and of columns of A) is optimized. The problem has been proved or conjectured to be NP-hard under various mathematical models. In this paper, we study a probabilistic model of the implanted additive bicluster problem, where each element in the n×m background matrix is a random number from [0, L − 1], and a k×k implanted additive bicluster is obtained from an error-free additive bicluster by randomly changing each element to a number in [0, L − 1] with probability θ. We propose an O(n 2 m) time voting algorithm to solve the problem. We show that for any constant δ such that \((1-\delta)(1-\theta)^2 -\frac 1 L >0\), when \(k \ge \max \left\{\frac 8 \alpha \sqrt{n\log n},~ \frac {8 \log n} c + \log (2L)\right\}\), where c is a constant number, the voting algorithm can correctly find the implanted bicluster with probability at least \(1 - \frac{9}{n^{2}}\). We also implement our algorithm as a software tool for finding novel biclusters in microarray gene expression data, called VOTE. The implementation incorporates several nontrivial ideas for estimating the size of an implanted bicluster, adjusting the threshold in voting, dealing with small biclusters, and dealing with multiple (and overlapping) implanted biclusters. Our experimental results on both simulated and real datasets show that VOTE can find biclusters with a high accuracy and speed.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsPreview
Unable to display preview. Download preview PDF.
References
Alon, N., Krivelevich, M., Sudakov, B.: Finding a Large Hidden Clique in a Random Graph. Random Structures and Algorithms 13(3-4), 457–466 (1998)
Barkow, S., Bleuler, S., Prelić, A., Zimmermann, P., Zitzler, E.: BicAT: a biclustering analysis toolbox. Bioinformatics 22(10), 1282–1283 (2006)
Ben-Dor, A., Chor, B., Karp, R., Yakhini, Z.: Discovering local structure in gene expression data: the order-preserving submatrix problem. In: Proceedings of Sixth International Conference on Computational Molecular Biology (RECOMB), pp. 45–55. ACM Press, New York (2002)
Berriz, G.F., King, O.D., Bryant, B., Sander, C., Roth, F.P.: Charactering gene sets with FuncAssociate. Bioinformatics 19, 2502–2504 (2003)
Cheng, Y., Church, G.M.: Biclustering of expression data. In: Proceedings of the 8th International Conference on Intelligent Systems for Molecular (ISMB 2000), pp. 93–103. AAAI Press, Menlo Park (2000)
Feige, U., Krauthgamer, R.: Finding and certifying a large hidden clique in a semirandom graph. Random Structures and Algorithms 16(2), 195–208 (2000)
Gasch, A.P., Spellman, P.T., Kao, C.M., Carmel-Harel, O., Eisen, M.B., Storz, G., Botstein, D., Brown, P.O.: Genomic expression programs in the response of yeast cells to enviormental changes. Molecular Biology of the Cell 11, 4241–4257 (2000)
Hartigan, J.A.: Direct clustering of a data matrix. J. of the American Statistical Association 67, 123–129 (1972)
Ihmels, J., Friedlander, G., Bergmann, S., Sarig, O., Ziv, Y., Barkai, N.: Revealing modular organization in the yeast transcriptional network. Nature Genetics 31, 370–377 (2002)
Ihmels, J., Bergmann, S., Barkai, N.: Defining transcription modules using large-scale gene expression data. Bioinformatics 20(13), 1993–2003 (2004)
Kluger, Y., Basri, R., Chang, J., Gerstein, M.: Spectral biclustering of microarray data: coclustering genes and conditions. Genome Research 13, 703–716 (2003)
Kucera, L.: Expected complexity of graph partitioning problems. Disc. Appl. Math. 57, 193–212 (1995)
Li, H., Chen, X., Zhang, K., Jiang, T.: A general framework for biclustering gene expression data. Journal of Bioinformatics and Computational Biology 4(4), 911–933 (2006)
Li, M., Ma, B., Wang, L.: On the closest string and substring problems. J. ACM 49(2), 157–171 (2002)
Liu, X., Wang, L.: Computing the maximum similarity biclusters of gene expression data. Bioinformatics 23(1), 50–56 (2007)
Lonardi, S., Szpankowski, W., Yang, Q.: Finding biclusters by random projections. In: Proceedings of the Fifteenth Annual Symposium on Combinatorial Pattern Matching, pp. 102–116 (2004)
Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics 1(1), 24–45 (2004)
Motwani, R., Raghavan, P.: Randomized algorithms. Cambridge University Press, Cambridge (1995)
Peeters, R.: The maximum edge biclique problem is NP-complete. Disc. Appl. Math. 131(3), 651–654 (2003)
Prelić, A., Bleuler, S., Zimmermann, P., Wille, A., Bühlmann, P., Gruissem, W., Hennig, L., Thiele, L., Zitzler, E.: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22(9), 1122–1129 (2006)
Shamir, R., Maron-Katz, A., Tanay, A., Linhart, C., Steinfeld, I., Sharan, R., Shiloh, Y., Elkon, R.: EXPANDER - an integrative program suite for microarray data analysis. BMC Bioinformatics 6, 232 (2005)
Tanay, A., Sharan, R., Shamir, R.: Discovering statistically significant biclusters in gene expression data. Bioinformatics 18, suppl. 1, 136–144 (2002)
Westfall, P.H., Young, S.S.: Resampling-based multiple testing. Wiley, New York (1993)
Yang, J., Wang, W., Wang, H., Yu, P.: δ-clusters: capturing subspace correlation in a large data set. In: Proceedings of the 18th International Conference on Data Engineering, pp. 517–528 (2002)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xiao, J., Wang, L., Liu, X., Jiang, T. (2008). Finding Additive Biclusters with Random Background. In: Ferragina, P., Landau, G.M. (eds) Combinatorial Pattern Matching. CPM 2008. Lecture Notes in Computer Science, vol 5029. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69068-9_25
Download citation
DOI: https://doi.org/10.1007/978-3-540-69068-9_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69066-5
Online ISBN: 978-3-540-69068-9
eBook Packages: Computer ScienceComputer Science (R0)