Recent Advances of Data Biclustering with Application in Computational Neuroscience

  • Neng Fan
  • Nikita Boyko
  • Panos M. Pardalos
Chapter
Part of the Springer Optimization and Its Applications book series (SOIA, volume 38)

Abstract

Clustering and biclustering are important techniques arising in data mining. Different from clustering, biclustering simultaneously groups the objects and features according to their expression levels. In this review, the background, motivation, data input, objectives, and history of data biclustering are carefully studied. The bicluster types and biclustering structures of a data matrix are defined mathematically. The most recent algorithms, including OREO, nsNMF, BBC, cMonkey, etc., are reviewed with formal mathematical models. Additionally, a match score between biclusters is defined to compare algorithms. The application of biclustering in computational neuroscience is also reviewed in this chapter.

Keywords

Lyapunov Exponent · Bipartite Graph · Data Matrix · Vagus Nerve Stimulation · Nonnegative Matrix Factorization

6.1 Introduction

6.1.1 Motivation

With the growing number of databases appearing in computational biology, biomedical engineering, consumer behavior surveys, and social networks, finding the useful information behind these data and grouping the data are important issues nowadays. Clustering is a method to classify objects into different groups, so that the objects in each group share some common traits [15, 31, 57]. After this step, the data are reduced to small subgroups, and research on each subgroup becomes easier and more direct. Clustering has been widely studied in the past 20 years; a general review of clustering is given by Jain et al. [31], and a survey of clustering algorithms is provided by Xu et al. [57]. Future challenges in biological networks are discussed in the book edited by Chaovalitwongse et al. [9].

However, clustering only groups the objects, without considering the features that each object may have. In other words, clustering compares two objects by the features that the two share, without depicting the features on which the two differ. A method that simultaneously groups the objects and the features, so that a specific group of objects is associated with a specific group of features, is called biclustering. More precisely, biclustering aims to find subsets of objects and features such that the objects are related to the features to some level. Such subsets are called biclusters. Meanwhile, biclustering does not require the objects in the same bicluster to behave similarly over all possible features, only to exhibit the specific features of this bicluster strongly.

Besides the differences from clustering mentioned above, biclustering also has the ability to find hidden features and to associate them with specific subsets of objects. We should also realize that biclustering is related to, but different from, other data mining techniques, such as classification, feature selection, and outlier detection. Classification is a kind of supervised clustering, while most algorithms used in biclustering are unsupervised; for some supervised biclustering, see [4, 40].

The biclustering problem is to find biclusters in data sets; it appears under different names, such as co-clustering or two-mode clustering, in parts of the literature.

6.1.2 Data Input

Usually, we refer to the objects as samples. Samples have different features, and each sample may or may not have a given feature. The degree to which a sample exhibits a specific feature is called the expression level. In the real world, samples may have quantitative or qualitative features. The expression levels of quantitative features can easily be expressed as numerical data, while qualitative features have to be transformed into data through some scale of measurement. Some biclustering algorithms allow qualitative features.

Biclustering algorithms generally start from matrices. Two kinds of matrices are usually used, and the first is the more common input for biclustering.
  • Expression Matrix. This data matrix has rows corresponding to samples and columns to features, with each entry measuring the expression level of a feature in a sample. Each row is called the feature vector of its sample. This matrix can also be called the sample-by-feature matrix.

    Sometimes, the matrix is formed from the feature vectors of all samples, and the levels of the features in each sample are observed directly. Generally, we just scale these vectors and stack them together to form a matrix if all vectors have the same length, which means they share the same set of features. However, the feature vectors may not conform to each other; in this case, we should add values (possibly 0) for the missing features in order to form vectors of equal length. In some applications, there is a large set of samples with only a limited number of features.

  • Similarity Matrix. This data matrix has both rows and columns corresponding to a set of samples, with each entry measuring the similarity between the two corresponding samples. It has the same number of rows and columns, and it is symmetric. This matrix can be called the sample-by-sample matrix.

    Note: this matrix can also be used as a dissimilarity matrix, with each entry denoting the dissimilarity between a pair of samples. There are many similarity measures for computing the (dis)similarity entries, such as the Euclidean distance and the Mahalanobis distance, so the similarity matrix can be computed from the expression matrix, as sketched below.
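As a small illustration, the following sketch (a hypothetical NumPy example; the conversion from distance to similarity is just one common choice) computes a sample-by-sample dissimilarity matrix from an expression matrix using the Euclidean distance.

```python
import numpy as np

# Hypothetical 4-samples-by-3-features expression matrix.
A = np.array([[1.0, 2.0, 0.5],
              [1.1, 2.1, 0.4],
              [5.0, 0.2, 3.3],
              [4.9, 0.1, 3.5]])

# Pairwise Euclidean distances give a symmetric sample-by-sample dissimilarity matrix.
diff = A[:, None, :] - A[None, :, :]
dissimilarity = np.sqrt((diff ** 2).sum(axis=2))

# One common way to turn distances into similarities (an assumption, not prescribed here).
similarity = 1.0 / (1.0 + dissimilarity)

print(dissimilarity.round(2))
print(similarity.round(2))
```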

Since the developments of biclustering include some time series models [38, 52], another kind of data, time series data, is also used in biclustering. Such data can also be viewed as stored in a matrix whose rows denote samples and whose columns, from left to right, denote observed time points.

In some cases with qualitative features, the data matrix is a kind of sign matrix, and some biclustering algorithms can still be applied.

Sometimes, before running algorithms on the matrix, some preprocessing steps are applied, such as normalization, discretization, value mapping, and aggregation; the details of these data preparation operations are available in [16].

In the following, the data matrix refers to the first kind, the expression matrix, unless stated otherwise.

6.1.3 Objective of Task

Obviously, the objective of biclustering is to find biclusters in the data. In clustering, the obtained clusters should have the property that the similarities among the samples within each cluster are maximized and the similarities between samples from different clusters are minimized.

For biclustering, the samples and features in each bicluster are highly related. This does not mean that the samples in a bicluster do not have other features; they simply exhibit the features of this bicluster more prominently while still sharing other features. Thus, in each bicluster, the relations between its samples and its features are closer than the relations between samples (features) of this bicluster and features (samples) of another bicluster.

Some biclustering algorithms allow one sample or feature to belong to several biclusters (called overlapping), while others produce exclusive biclusters. In addition, some algorithms have the property that each sample or feature must belong to some bicluster, while others need not be exhaustive and may extract only one submatrix, or several, from the data matrix to form the biclusters.

As mentioned above, most biclustering algorithms perform unsupervised classification and do not need any training sets. However, supervised biclustering methods are also useful in some biomedical applications [5, 4, 40].

In this chapter, an optimization perspective of biclustering will be studied, and different objective functions will be used by different algorithms to satisfy part of the objectives above. No single algorithm can satisfy all objectives, and, additionally, there is no standard way of judging the algorithms. In distinct applications of biclustering, one or several specific objectives should be met, so algorithms are designed to satisfy those requirements. There are some methods that attempt to compare different algorithms; we refer to [37, 44, 47, 61].

6.1.4 History

The first approach to biclustering was the “direct clustering of data matrix” by Hartigan [28] in 1972, but the term “biclustering” became popular after Cheng and Church [11] used this technique for gene expression analysis. Since then, many biclustering algorithms have been designed for applications in different areas, such as biological networks, microarray data, word-document co-clustering, and biomedical engineering, of which the most popular applications are microarray data and gene expression data.

In 2004, Madeira and Oliveira [37] surveyed the biclustering algorithms for biological data analysis. In this survey, they identified the biclusters into four major classes: biclusters with constant values, with constant values on rows or columns, with coherent values, and with coherent evolutions. The biclustering structures of a data matrix are classified into nine groups according to algorithms: single bicluster, exclusive row and column biclusters, checkerboard structure, exclusive rows biclusters, exclusive columns biclusters, nonoverlapping biclusters with tree structure, nonoverlapping nonexclusive biclusters, overlapping biclusters with hierarchical structure, and arbitrarily positioned overlapping biclusters. In addition, the authors have also divided the algorithms into five classes: Iterative row and column clustering combination, divide and conquer, greedy iterative search, exhaustive bicluster enumeration, and distribution parameter identification. A comparison of these algorithms according to the above three classes is given in this survey.

Another review of biclustering algorithms is by Tanay et al. [55] in 2004. In this survey, nine of the most used algorithms are reviewed and given with their pseudocodes. The most recent review of biclustering is by Busygin et al. [5], where 16 algorithms are reviewed with their applications in biomedicine and text mining. In that chapter, the authors mention that “many of the approaches rely on not mathematically strict arguments and there is a lack of methods to justify the quality of the obtained biclusters.”

In this chapter, we try to review and study the biclustering algorithms from mathematical and optimization perspectives. Not all algorithms are covered, but the most recent valuable ones are.

Since the development of biclustering algorithms, many software packages have been designed that include several algorithms, including BicAT [2], BicOverlapper [48], BiVisu [10], the R package biclust [32], etc. These packages support data processing, bicluster analysis, and visualization of results and can be used directly to construct images.

The toolbox BicAT [2] provides different facilities for data preparation, inspection, and postprocessing, such as discretization and filtering of biclusters. Several biclustering algorithms, such as Bimax, CC, xMotifs, and OPSM, are included, and three ways of viewing the data are presented: matrix (heatmap), expression, and analysis views. The software BicOverlapper [48] is a tool for visualizing overlapping biclusters. It can use three different kinds of data files, containing the original data matrix and the resulting biclusters, to construct clear and colorful images such as heatmaps, parallel coordinates, TRN graphs, bubble maps, and the overlapper view. BiVisu [10] is also a software tool for bicluster detection and visualization. Besides bicluster detection, BiVisu provides functions for preprocessing, filtering, and bicluster analysis. Another piece of software is the R package biclust [32], which contains a collection of bicluster algorithms, such as Bimax, CC, plaid, spectral, and xMotifs, together with preprocessing methods for two-way data and validation and visualization techniques for bicluster results. For individual biclustering algorithms, there are also some packages available [55, 5].

6.1.5 Outline

In this chapter, we follow the reviews of [37, 55, 5] and try to include the most recent algorithms and advances in biclustering. The perspective of this chapter is mathematical, drawing on linear algebra, optimization, bipartite graphs, probabilistic and statistical models, information theory, and time series. Section 6.1 has reviewed the motivation, data, objectives, history, and software of biclustering. In Section 6.2, the bicluster types and biclustering structures are formally defined in a mathematical way. The most recent biclustering algorithms are reviewed in Section 6.3, and a comparison score is also defined. The application of biclustering in computational neuroscience is reviewed in Section 6.4, and conclusions and future work are given in Section 6.5.

6.2 Biclustering Types and Structures

6.2.1 Notations

As mentioned in Section 6.1.2, the expression matrix is the one mostly used in biclustering. Let \(A=(a_{ij})_{n\times m}\) denote the sample-feature expression matrix, with n rows representing n samples, m columns representing m features, and the entry a ij denoting the expression level of feature j in sample i. In most cases, the matrix A is the required input of an algorithm, but some algorithms also work in the space of samples or features.

Let \(\mathcal{S}=\{S_1,S_2,{\cdots},S_n\}\) be the sample set, where \(S_i=(a_{i1},a_{i2},{\cdots},a_{im})\) is also called the feature vector of sample i. Similarly, for the features, it is denoted by \(\mathcal{F}=\{F_1,F_2,{\cdots},F_m\}\) with each vector \(F_j=(a_{1j},a_{2j},{\cdots},a_{nj})^T\), a column vector. Thus, the matrix \(A=(S_1,S_2,{\cdots},S_n)^T=(F_1,F_2,{\cdots},F_m)\).

A bicluster is a submatrix of the data matrix. It is denoted by \(B_k=(\mathcal{S}_k,\mathcal{F}_k)\), where \(\mathcal{S}_k\subseteq \mathcal{S}\), \(\mathcal{F}_k\subseteq \mathcal{F}\), and each entry of B k is the entry of A at the intersection of the corresponding row (sample) and column (feature). Assume that there are K biclusters found in the data matrix A; the set of biclusters is denoted by \(\mathcal{B}=\{B_k: k=1,2,\cdots,K\}\). Sometimes, we use \((\mathcal{S}_k,\mathcal{F})\) to denote a cluster of rows (samples) and \((\mathcal{S},\mathcal{F}_k)\) a cluster of columns (features). In some algorithms, the number of row clusters is not equal to that of column clusters. Let K, K ′ denote the numbers of row clusters and column clusters, respectively; then the set of biclusters is \(\mathcal{B}=\{(\mathcal{S}_k,\mathcal{F}_{k^{\prime}}):\,k=1,{\cdots},K,\,k^{\prime}=1,{\cdots},K^{\prime}\}\). Unless stated otherwise, we assume that \(K=K^{\prime}\).

Additionally, \(|\mathcal{S}_k|\) denotes the cardinality of \(\mathcal{S}_k\), i.e., the number of samples in bicluster \(B_k=(\mathcal{S}_k,\mathcal{F}_k)\), and \(|\mathcal{F}_k|\), similarly, the number of features. Clearly, \(|\mathcal{S}|=n,|\mathcal{F}|=m\). In the following, the notation \(i\in \mathcal{S}_k\) (\(j\in \mathcal{F}_k\)) is short for \(S_i\in \mathcal{S}_k\) (\(F_j\in \mathcal{F}_k\)) when no confusion arises.

Given a data matrix A, the biclustering problem is to design algorithms to find its biclusters \(\mathcal{B}=\{B_k: k=1,2,{\cdots},K\}\), i.e., a set of submatrices of A such that the samples (rows, \(\mathcal{S}_k\)) of each bicluster B k exhibit some similar behavior under the corresponding features (columns, \(\mathcal{F}_k\)). From this point of view, the biclustering problem is transformed into a mathematical problem satisfying some requirements (defined below under different bicluster types and structures). Usually, after finding the biclusters of a data matrix, the rows and columns are rearranged so that the samples/features of the same bicluster are placed together; the resulting matrix is called a proper rearrangement matrix. In the following discussion of bicluster types and biclustering structures, the requirements are all stated with respect to this rearrangement of the data matrix.

6.2.2 Bicluster Types

The type of a bicluster is defined by the relationships between the entries within the bicluster. As mentioned in Section 6.1.4, Madeira and Oliveira [37] identified the following four major classes of bicluster types; here we follow their classification and give the mathematical representations. For the first three cases, the data matrix A is required to be real valued, i.e., all entries of A are real numbers.
  1.
    Bicluster with constant values. For a bicluster \(B_k=(\mathcal{S}_k,\mathcal{F}_k)\), the following identity should be satisfied:
    $$a_{ij}=\mu, \forall i\in \mathcal{S}_k, \forall j\in \mathcal{F}_k,$$
    where μ is a constant number.
     
  2.
    Bicluster with constant values on rows or columns. For a bicluster \(B_k=(\mathcal{S}_k,\mathcal{F}_k)\) with constant values on rows, the identity for it is
    $$a_{ij}=\mu+\alpha_i,\ \textrm{or}\ a_{ij}=\mu\times \alpha_i, \forall i\in \mathcal{S}_k, \forall j\in \mathcal{F}_k,$$
    where μ is a constant and α i is an adjustment number for row i. The first identity is additive and the second one is multiplicative. Note that under some data transformations the two are equivalent; for example, a logarithmic transformation turns the multiplicative case into the additive one. For the case of constant values on columns, the identity is
    $$a_{ij}=\mu+\beta_j,\ \textrm{or}\ a_{ij}=\mu\times \beta_j, \forall i\in \mathcal{S}_k, \forall j\in \mathcal{F}_k,$$
    where μ is a constant and β j is an adjustment number for column j.
     
  3.
    Bicluster with coherent values. For a bicluster \(B_k=(\mathcal{S}_k,\mathcal{F}_k)\) with coherent values, there are two transferable expressions. The first one is additive,
    $$a_{ij}=\mu+\alpha_i+\beta_j, \forall i\in \mathcal{S}_k, \forall j\in \mathcal{F}_k,$$
    and the second one is multiplicative,
    $$a_{ij}=\mu\times\alpha_i\times\beta_j, \forall i\in \mathcal{S}_k, \forall j\in \mathcal{F}_k.$$
    The second can again be transformed into the first by applying a logarithmic transformation to the data matrix.
     
  4.

    Bicluster with coherent evolutions. In the above three cases, the data matrix A is real valued. In some cases, however, the algorithms look for relationships among the data on rows or columns without considering the actual values. For example, in the order-preserving submatrix (OPSM) algorithm, a bicluster is a group of rows whose values induce a linear order across a subset of columns. Thus, the value of a ij is not always required in this situation, since only the relationships between entries are considered. Other cases of biclusters with coherent evolutions will be discussed with the corresponding algorithms below.

     

Although the biclusters are classified into these four classes, there are still other forms, whenever the output bicluster is required to reflect some relationship between the rows and columns within it. For example, in [7], a δ-valid pattern is a bicluster satisfying \(\max_{j\in \mathcal{F}_k}(a_{ij})-\min_{j\in \mathcal{F}_k}(a_{ij})<\delta\) for every row \(i\in \mathcal{S}_k\).

Besides this, data initialization influences bicluster types; for example, row normalizing a bicluster with constant values on rows (type 2) results in a bicluster with constant values (type 1). Similarly, column normalizing a bicluster with constant values on columns (type 2) results in a bicluster with constant values (type 1).
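As a small numerical illustration of the first three types, the following sketch (a hypothetical NumPy example; the adjustment values are arbitrary) generates biclusters with constant values, constant rows/columns, and coherent values, and shows how the logarithmic transformation turns the multiplicative coherent model into the additive one.

```python
import numpy as np

rows, cols = 4, 5
mu = 2.0
alpha = np.array([0.0, 1.0, 2.0, 3.0])            # additive row adjustments alpha_i
beta = np.array([0.0, 0.5, 1.0, 1.5, 2.0])        # additive column adjustments beta_j
alpha_m = np.array([1.0, 1.5, 2.0, 2.5])          # multiplicative row adjustments
beta_m = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # multiplicative column adjustments

constant = np.full((rows, cols), mu)                                    # type 1
constant_rows = np.broadcast_to((mu + alpha)[:, None], (rows, cols))    # type 2 (rows)
constant_cols = np.broadcast_to((mu + beta)[None, :], (rows, cols))     # type 2 (columns)
coherent_add = mu + alpha[:, None] + beta[None, :]                      # type 3, additive
coherent_mult = mu * np.outer(alpha_m, beta_m)                          # type 3, multiplicative

# log a_ij = log mu + log alpha_i + log beta_j: the multiplicative case becomes additive.
print(np.log(coherent_mult).round(3))
```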

6.2.3 Biclustering Structures

The structure of a biclustering is defined by the relationships between the biclusters of \(\mathcal{B}=\{B_k=(\mathcal{S}_k,\mathcal{F}_k): k=1,2,{\cdots},K\}\) based on the data matrix A.

For the structures of biclustering, there are some properties which should be noticed: exclusive, overlapping, and exhaustive, although some concepts or terms have been used previously. For a data matrix A, and the corresponding set of biclusters \(\mathcal{B}=\{B_k=(\mathcal{S}_k,\mathcal{F}_k): k=1,2,{\cdots},K\}\), we have the following formal definitions.
  • Exclusive (nonexclusive). A biclustering structure is said to be row exclusive if \(\mathcal{S}_k \cap \mathcal{S}_{k^{\prime}}=\emptyset\) for any \(k,k^{\prime}\in \{1,{\cdots},K\}, k\neq k^{\prime}\); to be column exclusive if \(\mathcal{F}_k \cap\mathcal{F}_{k^{\prime}}=\emptyset\) for any \(k,k^{\prime}\in \{1,{\cdots},K\}, k\neq k^{\prime}\); and to be exclusive if it is both row exclusive and column exclusive.

  • Overlapping (nonoverlapping). A biclustering structure is said to be overlapping if some entry a ij belongs to two or more biclusters; otherwise, it is nonoverlapping.

  • Exhaustive (nonexhaustive). A biclustering structure is said to be row exhaustive if any row S i belongs to at least one bicluster; to be column exhaustive if any column F j belongs to at least one bicluster; to be exhaustive if it is both row and column exhaustive. Otherwise, it is said to be nonexhaustive if some row or column does not belong to any bicluster.

Here, exclusive and overlapping are not opposites of each other, as can be seen from structure 7 below. The following biclustering structures are based on these three properties.

Still following the classification of Madeira and Oliveira in [37], the biclustering structures are classified into the following nine groups.
  1.

    Single bicluster. In this biclustering structure, only one submatrix is found from A, i.e., \(K=1\) and \(\mathcal{B}=\{B_1=(\mathcal{S}_1,\mathcal{F}_1)\}\).

     
  2.
    Exclusive row and column biclusters. Given a data matrix A, as in Definition 1 of [5], the structure of exclusive row and column biclusters \(\mathcal{B}=\{B_k=(\mathcal{S}_k,\mathcal{F}_k): k=1,2,{\cdots},K\}\) should satisfy the following requirements: For rows
    $$\begin{cases} \mathcal{S}_k\subseteq\mathcal{S},(k=1,{\cdots},K),\\ \mathcal{S}_1\cup \mathcal{S}_2\cup\cdots\cup \mathcal{S}_K=\mathcal{S},\\ \mathcal{S}_k\cap\mathcal{S}_{k^{\prime}}=\emptyset, k,k^{\prime}=1,{\cdots},K, k\neq k^{\prime}, \end{cases}$$
    (6.1)
    and for corresponding columns
    $$\begin{cases} \mathcal{F}_k\subseteq\mathcal{F},(k=1,\cdots,K),\\ \mathcal{F}_1\cup \mathcal{F}_2\cup\cdots\cup \mathcal{F}_K=\mathcal{F},\\ \mathcal{F}_k\cap \mathcal{F}_{k^{\prime}}=\emptyset, k,k^{\prime}=1,\cdots\!\!,K, k\neq k^{\prime}. \end{cases}$$
    (6.2)

    In a proper rearrangement of the rows and columns of the data matrix A, the biclusters are submatrices arranged along the diagonal, without overlap between any two biclusters.

     
  3.
    Checkerboard biclusters. The clusters \(\{\mathcal{S}_k:k=1,{\cdots},K\}\) of samples \(\mathcal{S}\) and the clusters \(\{\mathcal{F}_k:k=1,{\cdots},K\}\) of features \(\mathcal{F}\) satisfy the same requirements (Equations (6.1) and (6.2)) as in structure 2. The set of checkerboard biclusters is
    $$\mathcal{B}=\{B_{kk^{\prime}}= (\mathcal{S}_k,\mathcal{F}_{k^{\prime}}):k,k^{\prime}=1,{\cdots},K\},$$

    i.e., every entry of A belongs to some bicluster.

    Considering each bicluster as an entry, the proper rearrangement matrix of A is a K × K matrix with entries B kk ′. In some cases, the number of sample clusters \(\mathcal{S}_k\) does not need to equal the number of feature clusters \(\mathcal{F}_{k^{\prime}}\); this yields a rectangular rather than a square block matrix.

     
  4.
    Exclusive rows biclusters. Given a data matrix A, the structure of exclusive rows biclusters \(\mathcal{B}=\{B_k=(\mathcal{S}_k,\mathcal{F}_k): k=1,2,{\cdots},K\}\) should satisfy the following requirements: For rows
    $$\begin{cases} \mathcal{S}_k\subseteq\mathcal{S},(k=1,{\cdots},K),\\ \mathcal{S}_1\cup \mathcal{S}_2\cup\cdots\cup \mathcal{S}_K=\mathcal{S},\\ \mathcal{S}_k\cap\mathcal{S}_{k^{\prime}}=\emptyset, k,k^{\prime}=1,{\cdots},K, k\neq k^{\prime}, \end{cases}$$
    (6.3)
    and for corresponding columns
    $$\begin{cases} \mathcal{F}_k\subseteq\mathcal{F},(k=1,{\cdots},K),\\ \mathcal{F}_1\cup \mathcal{F}_2\cup\cdots\cup \mathcal{F}_K=\mathcal{F}. \end{cases}$$
    (6.4)

    Compared with Equations (6.1) and (6.2) in structure 2, the requirements for rows are the same, but for columns, Equation (6.4) has no disjointness requirement between \(\mathcal{F}_k\) and \(\mathcal{F}_{k^{\prime}},k^{\prime} \neq k\). In this structure, some features (columns) may belong to two or more biclusters (submatrices), while each sample (row) belongs to exactly one bicluster (submatrix).

     
  5.
    Exclusive columns biclusters. Given a data matrix A, the structure of exclusive columns biclusters \(\mathcal{B}=\{B_k=(\mathcal{S}_k,\mathcal{F}_k): k=1,2,{\cdots},K\}\) should satisfy the following requirements: For rows
    $$\begin{cases} \mathcal{S}_k\subseteq\mathcal{S},(k=1,{\cdots},K),\\ \mathcal{S}_1\cup \mathcal{S}_2\cup\cdots\cup \mathcal{S}_K=\mathcal{S}, \end{cases}$$
    (6.5)
    and for corresponding columns
    $$\begin{cases} \mathcal{F}_k\subseteq\mathcal{F},(k=1,{\cdots},K),\\ \mathcal{F}_1\cup \mathcal{F}_2\cup\cdots\cup \mathcal{F}_K=\mathcal{F},\\ \mathcal{F}_k\cap \mathcal{F}_{k^{\prime}}=\emptyset, k,k^{\prime}=1,{\cdots},K, k\neq k^{\prime}. \end{cases}$$
    (6.6)

    Compared with Equations (6.1) and (6.2) in structure 2, the requirements for columns are the same, but for rows, Equation (6.5) has no disjointness requirement between \(\mathcal{S}_k\) and \(\mathcal{S}_{k^{\prime}},k^{\prime}\neq k\). In this structure, some samples (rows) may belong to two or more biclusters (submatrices), while each feature (column) belongs to exactly one bicluster (submatrix).

     
  6.

    Nonoverlapping with tree-structured biclusters. For a data matrix A, nonoverlapping means that no entry can belong to more than one bicluster; thus some entries may not belong to any bicluster. Tree structure means that, in the proper rearrangement matrix, the blocks of submatrices (biclusters) do not cross each other.

     
  7.

    Nonoverlapping nonexclusive biclusters. Nonoverlapping is the same as above. Nonexclusive means a sample or feature can belong to more than one bicluster; that is, a sample can belong to two sets of important features in two biclusters, and vice versa.

     
  8.

    Nonoverlapping hierarchically structured biclusters. Nonoverlapping is the same as above. Hierarchically structured means a bicluster may be contained in some other “bigger” bicluster, i.e., in the set of biclusters \(\mathcal{B}=\{B_k=(\mathcal{S}_k,\mathcal{F}_k): k=1,2,{\cdots},K\}\) of a data matrix A, there exist biclusters \(B_k=(\mathcal{S}_k,\mathcal{F}_k)\) and \(B_{k^{\prime}}=(\mathcal{S}_{k^{\prime}},\mathcal{F}_{k^{\prime}})\) such that \(\mathcal{S}_k \subseteq \mathcal{S}_{k^{\prime}}\) or \(\mathcal{F}_{k}\subseteq \mathcal{F}_{k^{\prime}}\).

     
  9.

    Arbitrarily positioned overlapping biclusters. In the set of biclusters \(\mathcal{B}=\{B_k=(\mathcal{S}_k,\mathcal{F}_k): k=1,2,{\cdots},K\}\) of a data matrix A, there exists some entry a ij such that \(a_{ij}\in B_k\) and \(a_{ij}\in B_{k^{\prime}}\) with k ≠ k ′. At the same time, the biclusters \(B_k,B_{k^{\prime}}\) may share some common samples or features.

     

Checking the nine biclustering structures against the above definitions of exclusive and exhaustive: structures 1 and 2 are exclusive; structure 3 is nonoverlapping; structure 1 is nonexhaustive; structures 2, 3, 4, and 5 are exhaustive; and the properties of the other structures can be read off from their classification. Note that these structures are not always strict. For example, structures 2, 3, 4, and 5 also have nonexclusive versions (which do not satisfy the above formal requirements); for details we refer to [37].
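As a small illustration of the exclusive, overlapping, and exhaustive properties, the following sketch (in Python; biclusters are represented as hypothetical pairs of row-index and column-index sets) checks them for a given biclustering.

```python
from itertools import combinations

def row_exclusive(biclusters):
    """True if no two biclusters share a row (sample)."""
    return all(ra.isdisjoint(rb) for (ra, _), (rb, _) in combinations(biclusters, 2))

def row_exhaustive(biclusters, n_rows):
    """True if every row index 0..n_rows-1 appears in at least one bicluster."""
    covered = set().union(*(rows for rows, _ in biclusters)) if biclusters else set()
    return covered >= set(range(n_rows))

def overlapping(biclusters):
    """True if some entry (i, j) belongs to two or more biclusters."""
    return any((ra & rb) and (ca & cb)
               for (ra, ca), (rb, cb) in combinations(biclusters, 2))

# Hypothetical biclustering of a 4 x 5 matrix (compare structure 4, exclusive rows).
B = [({0, 1}, {0, 1, 2}), ({2, 3}, {2, 3, 4})]
print(row_exclusive(B), row_exhaustive(B, 4), overlapping(B))   # True True False
```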

6.3 Biclustering Techniques and Algorithms

In this section, the biclustering techniques and algorithms are divided into several classes based on the methods they use, drawn from different areas of mathematics, probability, and optimization. Here we concentrate on the mathematical backgrounds.

6.3.1 Based on Matrix Means and Residues

For a bicluster \(B_k=(\mathcal{S}_k,\mathcal{F}_k)\), several means based on the bicluster are defined. The mean of row i of B k is
$$\mu_{ik}^{(r)}=\frac{1}{|\mathcal{F}_k|}\sum_{j\in \mathcal{F}_k}a_{ij},$$
(6.7)
the mean of column j of B k is
$$\mu_{jk}^{(c)}=\frac{1}{|\mathcal{S}_k|}\sum_{i\in \mathcal{S}_k}a_{ij},$$
(6.8)
and the mean of all the entries in B k is
$$\mu_k=\frac{\sum_{i\in \mathcal{S}_k}\sum_{j\in \mathcal{F}_k} a_{ij}}{|\mathcal{F}_k||\mathcal{S}_k|}.$$
(6.9)
The residue of the entry a ij in bicluster B k is
$$r_{ij}=a_{ij}-\mu_{ik}^{(r)}-\mu_{jk}^{(c)}+\mu_k,$$
(6.10)
the variance of bicluster B k is
$$\textrm{Var}(B_k)=\sum_{i\in \mathcal{S}_k}\sum_{j\in \mathcal{F}_k}(a_{ij}-\mu_k)^2,$$
(6.11)
and mean squared residue of the bicluster B k is
$$H_k=\frac{\sum_{i\in \mathcal{S}_k}\sum_{j\in \mathcal{F}_k}r_{ij}^2}{|\mathcal{F}_k||\mathcal{S}_k|}.$$
(6.12)
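The following sketch (a NumPy illustration on a hypothetical matrix) computes the row, column, and overall means, the residues, the variance, and the mean squared residue of Equations (6.7), (6.8), (6.9), (6.10), (6.11), and (6.12) for a given bicluster.

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    """Residues, variance, and mean squared residue H_k of bicluster (rows, cols) of A."""
    B = A[np.ix_(rows, cols)]
    row_means = B.mean(axis=1, keepdims=True)          # Eq. (6.7)
    col_means = B.mean(axis=0, keepdims=True)          # Eq. (6.8)
    overall = B.mean()                                 # Eq. (6.9)
    residues = B - row_means - col_means + overall     # Eq. (6.10)
    variance = ((B - overall) ** 2).sum()              # Eq. (6.11)
    H = (residues ** 2).mean()                         # Eq. (6.12)
    return residues, variance, H

# A perfectly additive matrix (a_ij = 5i + j): every bicluster has H_k = 0.
A = np.arange(20, dtype=float).reshape(4, 5)
_, var, H = mean_squared_residue(A, rows=[0, 1, 3], cols=[0, 2, 4])
print(var, round(H, 12))
```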
The first approach of biclustering by Hartigan [28] is known as block clustering, with the objective function as
$$\min \textrm{Var}(\mathcal{B})=\sum_{k=1}^K \textrm{Var}(B_k)=\sum_{k=1}^K\sum_{i\in \mathcal{S}_k}\sum_{j\in \mathcal{F}_k}(a_{ij}-\mu_k)^2,$$
where the number of biclusters is a given number. For each bicluster, the variance \(\textrm{Var}(B_k)\) is 0 if it is constant.
CC. Cheng and Church's Algorithm (CC) [11] defines a bicluster to be a submatrix for which the mean squared residue score is below a user-defined threshold δ ≥ 0, i.e., \(H_k \leq \delta\). To find the largest such bicluster in A, they propose a two-phase strategy: removing rows and columns, and then adding back removed rows and columns according to some rules. First, the row to be removed is the one
$$\arg \max_i \frac{1}{|\mathcal{F}_k|}\sum_{j\in \mathcal{F}_k} r_{ij}^2,$$
and column is
$$\arg \max_j \frac{1}{|\mathcal{S}_k|}\sum_{i\in \mathcal{S}_k} r_{ij}^2.$$
These removal steps are repeated until a bicluster with \(H_k \leq \delta\) is obtained. Then some previously removed rows and columns can be added back without violating the requirement \(H_k \leq \delta\). Yang et al. [58, 59] proposed an improved version of this algorithm, the heuristic flexible overlapped clustering (FLOC) algorithm, which allows missing data entries in A.
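A minimal sketch of the deletion phase is given below (the planted block, the value of δ, and the stopping guard are assumptions for illustration; the addition phase of [11] is not shown).

```python
import numpy as np

def cc_single_deletion(A, delta):
    """Greedily remove the row or column with the largest mean residue until H_k <= delta."""
    rows, cols = list(range(A.shape[0])), list(range(A.shape[1]))
    while True:
        B = A[np.ix_(rows, cols)]
        R2 = (B - B.mean(axis=1, keepdims=True) - B.mean(axis=0, keepdims=True) + B.mean()) ** 2
        H = R2.mean()
        if H <= delta or (len(rows) <= 2 and len(cols) <= 2):
            return rows, cols, H
        row_scores, col_scores = R2.mean(axis=1), R2.mean(axis=0)
        if row_scores.max() >= col_scores.max():
            del rows[int(row_scores.argmax())]
        else:
            del cols[int(col_scores.argmax())]

rng = np.random.default_rng(0)
A = rng.normal(scale=0.3, size=(30, 20))
A[:10, :8] += 5.0                      # plant a low-residue block
print(cc_single_deletion(A, delta=0.5))
```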

RWC. Angiulli et al. [1] proposed a random walk biclustering algorithm (RWC) based on a greedy technique and enriched with a local search strategy to escape poor local minima. The algorithm starts with an initial random bicluster B k and searches for a δ-bicluster by successive transformations of B k , as long as a gain function improves. The transformations consist of changing the membership (called a flip or move) of the row/column that leads to the largest increase of the gain function. If a bit is set from 0 to 1, it means that the corresponding sample or feature, which was not included in the bicluster B k , is added to B k ; vice versa, if a bit is set from 1 to 0, the corresponding sample or feature is removed from the bicluster.

The gain function combines mean squared residue, row variance, and size of the bicluster by means of user-provided weights \(w_{\textrm{res}}, w_{\textrm{var}}\), and \(w_{\textrm{vol}}( w_{\textrm{res}}+w_{\textrm{var}}+w_{\textrm{vol}}=1,0\leq w_{\textrm{res}},w_{\textrm{var}},w_{\textrm{vol}}\leq 1)\). The gain function is defined as
$$\textrm{gain}=w_{\textrm{res}}(2^{\Delta \textrm{res}}-1)-w_{\textrm{var}}(2^{\Delta \textrm{var}}-1)-w_{\textrm{vol}}(2^{\Delta \textrm{vol}}-1),$$
where Δ res, Δ var, Δ vol are the relative changes of the mean squared residue, the row variance, and the size between the new bicluster and the old one, respectively. This function takes values in the interval [−1,1]. By decreasing w res and increasing w var and w vol, biclusters with higher row variance and larger size can be obtained.

6.3.2 Based on Matrix Ordering, Reordering, and Decomposition

The following several biclustering algorithms are based on matrix reordering or decomposition.

OPSM. Ben-Dor et al. [3] proposed order-preserving submatrix algorithm (OPSM) for biclustering. A bicluster is defined as a submatrix that preserves the order of the selected columns for all of the selected rows. In other words, the expression values of the samples within a bicluster induce an identical linear ordering across the selected features. Based on a stochastic model, the authors [3] developed a deterministic algorithm to find large and statistically significant biclusters. This concept has been taken up in a recent study by Liu and Wang [36] as OP-cluster.

ISA. Ihmels et al. [30] proposed the iterative signature algorithm (ISA) for biclustering. Given the data matrix A, two matrices A s , A f are obtained by normalizing A such that \(\sum_i a^s_{ij}=0,\sum_i (a_{ij}^s)^2=1\) (zero mean and unit variance) for each feature F j and, similarly, \(\sum_j a_{ij}^f=0,\sum_j (a_{ij}^f)^2=1\) for each sample S i .

Starting with an initial set of samples, all features are scored with respect to this sample set and those features are chosen for which the score exceeds a predefined threshold. In the same way, all samples are scored regarding the selected features and a new set of samples is selected based on another predefined threshold. The entire procedure is repeated until the set of samples and the set of features do not change anymore. Multiple biclusters can be identified by running the iterative signature algorithm on several initial sample sets.
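A simplified sketch of this iteration is given below (the score thresholds, expressed as multiples of the score standard deviation, and the planted example are assumptions for illustration only).

```python
import numpy as np

def isa(A, seed_samples, t_feature=1.5, t_sample=1.5, max_iter=100):
    """Alternately select high-scoring features and samples until the sets stabilize."""
    A_f = (A - A.mean(axis=0)) / A.std(axis=0)                                  # per-feature normalization
    A_s = (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, keepdims=True)    # per-sample normalization
    samples = np.array(sorted(seed_samples))
    for _ in range(max_iter):
        f_scores = A_f[samples, :].mean(axis=0)
        features = np.where(f_scores > t_feature * f_scores.std())[0]
        if features.size == 0:
            return samples, features
        s_scores = A_s[:, features].mean(axis=1)
        new_samples = np.where(s_scores > t_sample * s_scores.std())[0]
        if new_samples.size == 0 or np.array_equal(new_samples, samples):
            break
        samples = new_samples
    return samples, features

rng = np.random.default_rng(1)
A = rng.normal(size=(60, 40))
A[:15, :10] += 3.0                     # plant an up-regulated block
print(isa(A, seed_samples=[0, 1, 2, 5]))
```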

xMotif. In the framework proposed by Murali and Kasif [39], biclusters are defined such that samples are nearly constantly expressed across the selection of features. In the first step, the input matrix is preprocessed by assigning each sample a set of statistically significant states. These states define the set of valid biclusters: a bicluster is a submatrix where each sample is exactly in the same state for all selected features. To identify the largest valid biclusters, an iterative search method is proposed that is run on different random seeds, similarly to ISA.

OREO. DiMaggio Jr. et al. [19] proposed an algorithm of optimal re-ordering (OREO) of the rows and columns of the data matrix A for biclustering. The idea of OREO is to optimally rearrange the rows and columns of the data matrix A so as to minimize the total dissimilarity between neighboring rows and columns in the rearranged matrix. The algorithm has three main iterative steps: optimally re-ordering the rows (or columns) of the data matrix; computing the median for each pair of neighboring rows (or columns) in the final rearranged matrix, sorting these values from highest to lowest, and placing cluster boundaries between the rows (or columns) to obtain submatrices; and optimally re-ordering the columns (or rows) of each submatrix and computing the cluster boundaries for the re-ordered columns (or rows) analogously to the second step.

Here we describe the reordering of the rows; the authors [19] defined three associated cost measures between row i and row i ′:
$$c_{ii^{\prime}}=\sum_{j=1}^m |a_{ij}-a_{i^{\prime}j}|, \quad \sum_{j=1}^m (a_{ij}-a_{i^{\prime}j})^2, \quad \sqrt{\frac{\sum_j (a_{ij}-a_{i^{\prime}j})^2}{m}}.$$
The authors [19] use two models to reorder the rows so as to minimize the total dissimilarity between neighboring rows of the final rearranged matrix: a network flow model and a TSP model, both ideas from network optimization. In the network flow model, define the binary variables
$$y_{{ii}^{\prime}}^{\textrm{row}}=\begin{cases} 1, \ \hbox{if row}\ i\ \hbox{is adjacent and above}\ i^{\prime}\ \hbox{in the final ordering};\\ 0, \ \hbox{otherwise,} \end{cases}$$
and two additional ones for the topmost and bottommost rows
$$\begin{aligned} &y\_\textrm{source}_i^{\textrm{row}}= \begin{cases} 1, \ \hbox{if row}\ i\ \hbox{is the topmost row in the final ordering;} \\ 0, \ \hbox{otherwise,}\end{cases}\\ &y\_\textrm{sink}_i^{\textrm{row}}= \begin{cases} 1, \ \hbox{if row}\ i\ \hbox{is the bottommost row in the final ordering;} \\ 0, \ \hbox{otherwise,} \end{cases} \end{aligned}$$
and choosing one of the three associated cost measures, the optimization problem is to determine the binary variables \(y_{ii^{\prime}}^{\textrm{row}},y\_\textrm{source}_i^{\textrm{row}},y\_\textrm{sink}_i^{\textrm{row}}\) (the continuous variables \(f_{ii^{\prime}}^{\textrm{row}}, f\_\textrm{source}_i^{\textrm{row}}, f\_\textrm{sink}_i^{\textrm{row}}\) in the constraints below are flow variables that force the selected adjacencies to form a single ordering):
$$\begin{aligned} \min \quad &\sum_i \sum_{i^{\prime}} c_{{ii}^{\prime}} y_{{ii}^{\prime}}^{\textrm{row}}\\ s.t. \quad &\sum_{i\neq i^{\prime}}y_{{ii}^{\prime}}^{\textrm{row}}+y\_\textrm{source}_i^{\textrm{row}}=1 \quad \forall i\\ &\sum_{i^{\prime}\neq i}y_{{ii}^{\prime}}^{\textrm{row}}+y\_\textrm{sink}_i^{\textrm{row}}=1 \quad \forall i\\ &\sum_i y\_\textrm{source}_i^{\textrm{row}}=1\\ &\sum_i y\_\textrm{sink}_i^{\textrm{row}}=1\\ &f\_\textrm{source}_i^{\textrm{row}}=n\cdot y\_\textrm{source}_i^{\textrm{row}} \quad \forall i\\ &\sum_{{i}^{\prime}} (f_{i^{\prime}i}^{\textrm{row}}-f_{{ii}^{\prime}}^{\textrm{row}})+f\_\textrm{source}_i^{\textrm{row}}-f\_\textrm{sink}_i^{\textrm{row}}=1 \quad \forall i\\ &f_{{ii}^{\prime}}^{\textrm{row}}\leq (n-1)\cdot y_{{ii}^{\prime}}^{\textrm{row}} \quad \forall (i,i^{\prime})\\ &f_{{ii}^{\prime}}^{\textrm{row}}\geq y_{{ii}^{\prime}}^{\textrm{row}} \quad \forall (i,i^{\prime})\\ &y_{{ii}^{\prime}}^{\textrm{row}}, y\_\textrm{source}_i^{\textrm{row}}, y\_\textrm{sink}_i^{\textrm{row}}\in \{0,1\}. \end{aligned}$$
In the TSP model, the variables are the same as in the network flow model except that the variables \(y\_\textrm{source}_i^{\textrm{row}}, y\_\textrm{sink}_i^{\textrm{row}}\) are not included, and the optimization problem is
$$\begin{aligned} \min \quad & \sum_i \sum_{{i}^{\prime}} c_{{ii}^{\prime}} y_{{{ii}^{\prime}}}^{\textrm{row}}\\ s.t. \quad & \sum_{{i}^{\prime}} y_{{ii}^{\prime}}^{\textrm{row}}=1 \quad \forall i\\ &\sum_{{i}^{\prime}}y_{i^{\prime}i}^{\textrm{row}}=1 \quad \forall i\\ &y_{{ii}^{\prime}}^{\textrm{row}}\in \{0,1\}. \end{aligned}$$

The optimization problems induced by the two models are mixed integer linear programs and can be solved by CPLEX [14].

After reordering the rows of the data matrix, for rows i and i + 1 of the final rearranged matrix, the median of each pairwise term of the objective function \(\phi(a_{i,j},a_{i+1,j})\) is computed as \(\textrm{MEDIAN}_j\,\phi (a_{i,j},a_{i+1,j})\). In [19], the row pairs attaining the top 10% of the largest median values are suggested as cluster boundaries between the re-ordered rows.
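The boundary computation of this last step can be sketched as follows (the choice \(\phi(a,b)=|a-b|\) and the example matrix are assumptions for illustration).

```python
import numpy as np

def cluster_boundaries(A_reordered, top_fraction=0.10):
    """Median of |a_ij - a_{i+1,j}| for neighboring rows; the largest medians mark boundaries."""
    pair_medians = np.median(np.abs(np.diff(A_reordered, axis=0)), axis=1)
    n_boundaries = max(1, int(np.ceil(top_fraction * pair_medians.size)))
    boundaries = np.sort(np.argsort(pair_medians)[-n_boundaries:])   # boundary after row i
    return pair_medians, boundaries

# Hypothetical re-ordered matrix with two blocks of rows at different levels.
A = np.vstack([np.full((5, 6), 1.0), np.full((5, 6), 9.0)])
medians, cuts = cluster_boundaries(A)
print(cuts)   # a single boundary after row index 4
```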

nsNMF. Pascual-Montano et al. [43] and Carmona-Saez et al. [8] proposed a biclustering algorithm based on nonsmooth nonnegative matrix factorization (nsNMF). The method approximates the data matrix A as a product of two matrices, W and H. Rows of H constitute basis samples, while columns of W are basis features. The coefficients in each pair of basis samples and features are used to sort the features and samples of the original matrix, respectively. The biclusters are submatrices of the sorted matrix.

Originally, nonnegative matrix factorization was used to analyze facial images [35]. Nonnegative matrix factorization (NMF) decomposes the matrix \(A=(a_{ij})_{n \times m}\) into two matrices, i.e.,
$$A\approx WH,$$
where \( W=(w_{ia})_{n \times k}\) contains the k (\(k\ll m\)) reduced basis vectors (factors), and \(H=(h_{aj})_{k\times m}\) contains the coefficients of the linear combinations of the basis vectors (encoding vectors). All matrices A, W, H are nonnegative, and the columns of W are normalized. Thus, the entry a ij can be expressed as
$$a_{ij}\approx (WH)_{ij}=\sum_{a=1}^k w_{ia}h_{aj}.$$
Based on Poisson likelihood, the objective function of this factorization is to minimize the divergence function, i.e.,
$$\min D(A,WH)=\sum_{i=1}^n\sum_{j=1}^m \left(a_{ij}\log \frac{a_{ij}}{(WH)_{ij}}-a_{ij}+(WH)_{ij}\right).$$
The matrices W, H minimizing this objective function are found with an iterative algorithm initialized with random numbers [8].
The nsNMF method, which according to [8] will “produce more compact and localized feature representation of the data than standard NMF” by finding sparse structures in the data matrix, is an improvement of NMF. The nsNMF method introduces a smooth distribution of the factors to obtain sparseness, and the decomposition of the data matrix A becomes
$$A\approx WSH,$$
where the matrix \(S=(1-\theta)I+\theta \frac{ee^T}{k}\) is a positive smoothness matrix, I is the identity matrix, e is a column vector of k ones, and θ controls the sparseness of the model, satisfying \(0 \leq \theta \leq 1\). The objective function for the nsNMF method is now
$$\min D(A,WSH)=\sum_{i=1}^n\sum_{j=1}^m \left(a_{ij}\log \frac{a_{ij}}{(WSH)_{ij}}-a_{ij}+(WSH)_{ij}\right).$$

When \(\theta = 0\), nsNMF reduces to NMF; when θ → 1, the vector SX (for a positive nonzero vector X) tends to a constant vector with all elements almost equal to the average of the elements of X, i.e., all entries equal to the same nonzero value, which is the smoothest possible vector in the sense of “nonsparseness.” The algorithm for solving this objective function is the same as for the previous one, with small changes [8].
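A sketch of the standard multiplicative updates for the divergence objective is given below (these are the well-known Lee-Seung rules for \(D(A,WH)\); extending them to the nsNMF factorization \(A\approx WSH\) is not shown, and the example matrix is hypothetical).

```python
import numpy as np

def nmf_kl(A, k, n_iter=500, eps=1e-9, seed=0):
    """Factorize A ~ W H by minimizing the divergence D(A, WH) with multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (A / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((A / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
        scale = W.sum(axis=0, keepdims=True)   # normalize the columns of W,
        W /= scale                             # rescaling H so that W H is unchanged
        H *= scale.T
    return W, H

rng = np.random.default_rng(2)
A = rng.random((20, 12))
W, H = nmf_kl(A, k=3)
print(round(float(np.abs(A - W @ H).mean()), 4))
```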

Bimax. Prelic et al. [44] presented a fast divide-and-conquer approach, the binary inclusion-maximal biclustering algorithm (Bimax). This algorithm assumes that the data matrix A is binary with \(a_{ij}\in \{0,1\}\), where an entry 1 means that feature j is important in sample i.

In this algorithm, a named inclusion-maximal bicluster is defined to be \(B_k=(\mathcal{S}_k,\mathcal{F}_k)\) such that \(a_{ij}=1\) for any \(i \in \mathcal{S}_k,j \in \mathcal{F}_k\), and there does not exist another bicluster \(B_{k}^{\prime}=(\mathcal{S}_{k}^{\prime},\mathcal{F}_{k}^{\prime})\) of A with \(a_{ij}=1\) for any entry in \(B_{k}^{\prime}\) and \(\mathcal{S}_k \subseteq \mathcal{S}_{k}^{\prime},\mathcal{F}_k\subseteq \mathcal{F}_{k}^{\prime},(\mathcal{S}_k,\mathcal{F}_k)\neq(\mathcal{S}_{k}^{\prime},\mathcal{F}_{k}^{\prime})\).

The Bimax algorithm finds such inclusion-maximal biclusters of A, which differs from SAMBA, where a 0 entry can be contained in a bicluster. More specifically, the idea behind the Bimax algorithm is to partition A into three submatrices, one of which contains only 0-cells and can therefore be disregarded in the following. The algorithm is then recursively applied to the remaining two submatrices U and V; the recursion ends if the current matrix represents a bicluster, i.e., contains only 1s. If U and V do not share any rows and columns of A, the two matrices can be processed independently of each other. If U and V have a set of rows in common, special care is necessary to generate only those biclusters in V that share at least one common column.
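The brute-force sketch below only illustrates the notion of an inclusion-maximal all-1s bicluster on a small binary matrix; it enumerates column subsets directly and is not the Bimax recursion itself.

```python
import numpy as np
from itertools import combinations

def inclusion_maximal_biclusters(A):
    """All inclusion-maximal all-1s biclusters of a small binary matrix (brute force)."""
    n, m = A.shape
    found = []
    for size in range(1, m + 1):
        for cols in combinations(range(m), size):
            rows = frozenset(np.where(A[:, cols].all(axis=1))[0])
            if rows:
                found.append((rows, frozenset(cols)))
    # Discard biclusters strictly contained in another one.
    return {(r, c) for (r, c) in found
            if not any(r <= r2 and c <= c2 and (r, c) != (r2, c2) for (r2, c2) in found)}

A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])
for rows, cols in inclusion_maximal_biclusters(A):
    print(sorted(rows), sorted(cols))
```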

6.3.3 Based on Bipartite Graphs

The following two algorithms are based on bipartite graphs, since there is a close relationship between the sample-feature expression matrix and a weighted bipartite graph.

A bipartite graph is defined as a graph \(G=(U,V,E)\), where U,V are two disjoint sets of vertices, and E is the set of edges between vertices from U and V, while no edge appears between any two vertices from U or V.

To formulate the biclustering problem, the data matrix A can be transformed into a bipartite graph where each vertex of one set U denotes a sample, while each vertex of the other set V denotes a feature. The expression level a ij between sample and feature is represented by the weighted edge \((u_i,v_j)\in E\) between vertices u i ∈ U and v j ∈ V with weight \(w_{ij}=a_{ij}\). A bicluster corresponds to a subgraph \(H_k=(U_k,V_k,E_k)\) of \(G=(U,V,E)\), where \(U_k \subseteq U,V_k \subseteq V\), and \(E_k \subseteq E\) is the set of edges induced by the vertices of U k and V k . Thus, the set \((\mathcal{S},\mathcal{F},A)\) corresponds to the bipartite graph \(G=(U,V,E)\), and the bicluster \(B_k=(\mathcal{S}_k,\mathcal{F}_k)\) to the subgraph \(H_k=(U_k,V_k,E_k)\). Sometimes, we may consider only one subgraph of G and denote it by \(H=(U ^{\prime},V ^{\prime},E ^{\prime})\). Clearly, here \(|U|=n,|V|=m\).

Spectral biclustering. The first algorithm of biclustering based on bipartite graphs is called spectral biclustering, proposed by Dhillon [17]. Since this biclustering algorithm has some close relationships, shown below, with spectral graph theory [13], it is given the name spectral biclustering. Before presenting this algorithm, several matrices are defined based on A and the bipartite graph \(G=(U,V,E)\) with edge weights \(w_{ij}=a_{ij}\).

The adjacency weighted matrix of the bipartite graph \(G=(U,V,E)\) is expressed in the form of data matrix A as
$$W=(w_{ij})_{(n+m)\times (n+m)}=\left(\begin{array}{cc} 0 & A \\ A^T & 0 \\ \end{array}\right),$$
and the weighted degree d i of vertex u i is defined as \(d_i=\sum_{j:(i,j)\in E} w_{ij},\) and the degree matrix \(D_u=(d_{ij})_{n \times n}\) of the graph is a diagonal matrix as
$$d_{ij}=\begin{cases} d_i, & \textrm{if}\ i=j , \\ 0, & \textrm{otherwise.} \end{cases}$$
Similarly, we can get the degree matrix D v . The degree matrix of the bipartite graph \(G =(U,V,E)\) is
$$D=\left(\begin{array}{cc} D_u & 0 \\ 0 & D_v \\ \end{array}\right),$$
where the diagonal elements of D u and D v are weighted degree of vertices belonging to U and V, and all other elements are 0. The Laplacian matrix of the bipartite graph \(G=(U,V,E)\) for data set A is defined as
$$L=D-W=\left(\begin{array}{cc} D_u & -A \\ -A^T & D_v \\ \end{array}\right).$$
Spectral biclustering produces exclusive row and column biclusters. Therefore, the corresponding subgraphs H k of G are disjoint from each other. The total weight of the edges between such subgraphs is called the cut. Without loss of generality, assume there are two subgraphs \(H_1=(U_1,V_1,E_1)\) and \(H_2=(U_2,V_2,E_2)\) such that \(U_1 \cup U_2=U,U_1 \cap U_2= \emptyset, V_1 \cup V_2=V, V_1\cap V_2= \emptyset\), and \(E_i \subseteq E\) consists of all edges between U i and V i . The subgraphs \(H_1=(U_1,V_1,E_1)\) and \(H_2=(U_2,V_2,E_2)\) are called a partition of G. The cut of such a partition of the bipartite graph is the sum of the weights of the edges between U 1, V 2 and U 2, V 1, i.e.,
$$\textrm{cut}(H_1,H_2)=\sum_{\scriptsize{\begin{array}{c} {i\in U_1,j \in V_2,(i,j) \in E} \\ \textrm{and}\ {i\in U_2,j\in V_1,(i,j)\in E} \end{array}}} w_{ij}.$$

Obviously, the objective of biclustering is to minimize such intersimilarities between biclusters (subgraphs). At the same time, the similarities within each bicluster should be maximized; the intrasimilarity of a bicluster (subgraph) H k can be defined as the total weight \(\sum_{i\in U_k,\, j\in V_k,\,(i,j)\in E} w_{ij}\) of the edges within it. In order to balance the intersimilarities and intrasimilarities of biclusters, several different cuts are defined, such as the ratio cut [27, 17, 33], normalized cut [51, 33], minimax cut [60], and ICA cut [45]. The most popularly used are the ratio cut and the normalized cut.

For a partition \(H_1=(U_1,V_1,E_1), H_2=(U_2,V_2,E_2)\) of the bipartite graph \(G=(U,V,E)\), the ratio cut is defined as
$$\frac{\textrm{cut}(H_1,H_2)}{|U_1\cup V_1|}+\frac{\textrm{cut}(H_2,H_1)}{|U_2\cup V_2|},$$
and the normalized cut is defined as
$$\frac{\textrm{cut}(H_1,H_2)}{d_{p_1}}+\frac{\textrm{cut}(H_2,H_1)}{d_{p_2}},$$
where \(d_{p_1}=\sum_{i \in(U_1\cup V_1)}d_i, d_{p_2}=\sum_{j\in (U_2 \cup V_2)}d_j\).
Define the indicator vector as
$$y_i=\begin{cases} \sqrt{(n_2+m_2)/((n_1+m_1)(n+m))}, &i\in U_1\cup V_1,\\ -\sqrt{(n_1+m_1)/((n_2+m_2)(n+m))}, &i\in U_2\cup V_2, \end{cases}$$
where \(|U_1|=n_1,|U_2|=n_2,|V_1|=m_1,|V_2|=m_2\), the objective of minimizing the ratio cut of partition \(H_1=(U_1,V_1,E_1),H_2=(U_2,V_2,E_2)\) can be expressed as
$$\begin{aligned} \min \quad &y^T Ly,\\ s.t. \quad &y^T y=1,y^T e=0. \end{aligned}$$
Relaxing y to take arbitrary real values, the solution is the eigenvector corresponding to the second smallest eigenvalue of L [13, 17]. Thus, after obtaining the indicator value for each vertex of U, V, the corresponding subgraphs can easily be transformed back into biclusters. Similarly, for the normalized cut, define the indicator vector as
$$y_i=\begin{cases} \sqrt{d_{U_2\cup V_2}/(d_{U_1\cup V_1}d)}, &i\in {U_1\cup V_1},\\ -\sqrt{d_{U_1\cup V_1}/(d_{U_2\cup V_2}d)}, &i\in {U_2\cup V_2}, \end{cases}$$
where \(d_{U_1 \cup V_1}=\sum_{i \in U_1 \cup V_1}d_i, d_{U_2\cup V_2}=\sum_{j \in {U_2 \cup V_2}}d_j\), the objective of minimizing the normalized cut of partition \(H_1=(U_1,V_1,E_1), H_2=(U_2,V_2,E_2)\) can be expressed as
$$\min \quad y^T Ly,$$
(6.13)
$$s.t.\quad y^T Dy=1, y^T De=0.$$
(6.14)

Now the solution of this program is the eigenvector corresponding to the second smallest generalized eigenvalue of \(Ly=\lambda Dy\) [51]. The above programs can also be modeled as mixed integer programs.

For a large data matrix A, solving the eigenvector problem is very difficult, and a method for this is proposed in [17]. For more details of spectral biclustering, see [22]. Above, only two biclusters are obtained instead of K. For K biclusters, Dhillon [17] used the k-means algorithm [31, 57] after obtaining the indicator vector y; another, direct, approach is given in [23] by defining an indicator matrix.
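For two biclusters, the ratio-cut relaxation above can be sketched as follows (a NumPy illustration assuming nonnegative entries; the sign-based split of the second eigenvector and the planted example are simplifications).

```python
import numpy as np

def spectral_bipartition(A):
    """Split samples and features into two biclusters via the bipartite Laplacian L = D - W."""
    n, m = A.shape
    W = np.block([[np.zeros((n, n)), A],
                  [A.T, np.zeros((m, m))]])
    D = np.diag(W.sum(axis=1))
    L = D - W
    _, eigvecs = np.linalg.eigh(L)
    y = eigvecs[:, 1]                  # eigenvector of the second smallest eigenvalue
    return ((np.where(y[:n] >= 0)[0], np.where(y[n:] >= 0)[0]),
            (np.where(y[:n] < 0)[0], np.where(y[n:] < 0)[0]))

# Hypothetical matrix with two strong blocks and weak background entries.
A = np.full((6, 8), 0.1)
A[:3, :4] = 5.0
A[3:, 4:] = 5.0
print(spectral_bipartition(A))
```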

SAMBA. Tanay et al. [54] presented a statistical-algorithmic method for bicluster analysis (SAMBA) based on bipartite graphs and probabilistic modeling. Under a bipartite graph model, the weight of each edge is assigned according to a probabilistic model; thus, finding biclusters of A becomes finding heavy subgraphs of G with high likelihood. This method is motivated by finding complete bipartite subgraphs (bicliques) of G. The idea of SAMBA has three steps: forming the bipartite graph and calculating the weights of edges and nonedges (two models are introduced in this step: a simple model and a refined model); applying a hashing technique to find the heaviest bicliques (biclusters) in the graph; and performing a local improvement procedure on the biclusters in each heap.

Given a data matrix A, the corresponding bipartite graph is \(G=(U,V,E)\). A bicluster corresponds to a subgraph \(H=(U^{\prime},V^{\prime},E^{\prime})\) as introduced above. The weight of a subgraph is the sum of the assigned weights of its edges \((u,v)\in E^{\prime}\) and nonedges \((u,v)\in \bar{E}^{\prime}=(U ^{\prime}\times V ^{\prime})\,\backslash\,E^{\prime}\). A weighted subgraph has a statistical significance, and finding a bicluster amounts to searching for a heavy subgraph with respect to this weight. Two models are introduced in [54]: a simple model and a refined model.

In the simple model, let \(|E|=k\), \(p=k/(mn)\), and assume that edges occur independently and equiprobably with density p. Let BT(k,p,n) denote the binomial tail probability of observing k or more successes in n independent trials with success probability p; then the probability of observing a graph at least as dense as H is \(p(H)=BT(k^{\prime},p,n^{\prime} m^{\prime})\), where k ′, n ′, m ′ are the corresponding quantities in \(H=(U^{\prime},V^{\prime},E^{\prime})\). Finding a maximum weight subgraph of G is equivalent to finding a subgraph H with the lowest p(H). In the refined model, each edge (u,v) is an independent Bernoulli variable with probability p u,v , which is the fraction of bipartite graphs with degree sequence identical to that of G that contain the edge (u,v). The probability of observing H is
$$p(H)=\left(\prod_{(u,v)\in E^{\prime}} p_{u,v}\right)\left(\prod_{(u,v)\in \bar{E^{\prime}}}(1-p_{u,v})\right).$$
In practice, a likelihood ratio is chosen, i.e.,
$$\log L(H)=\sum_{(u,v)\in E^{\prime}}\log\frac{p_c}{p_{u,v}}+\sum_{(u,v)\in \bar{E^{\prime}}}\log\frac{1-p_c}{1-p_{u,v}},$$
where \(p_c \geq \max_{(u,v) \in U \times V} p_{u,v}\), which corresponds to the weight of subgraph H with weight \(\log\frac{p_c}{p_{u,v}}>0\) of each edge (u,v) and \(\log\frac{1-p_c}{1-p_{u,v}} <0\) for each nonedge (u,v). Then a hash technique is applied to solve the maximum biclique problem in order to find the heavy subgraphs (biclusters). The final step of local improvement iteratively applies the best modification to the bicluster.

In a recent study of Tanay et al. [53], this SAMBA has been extended to integrate multiple types of experimental data.

6.3.4 Based on Information Theory

In [18], Dhillon et al. proposed a biclustering algorithm based on information theory. This information-theoretic algorithm, which simultaneously clusters both the rows and the columns, is called co-clustering by Dhillon et al.

By a proper transformation, the data matrix A is turned into a joint probability distribution matrix \(p(\mathcal{S},\mathcal{F})\) between two discrete random variables \(\mathcal{S},\mathcal{F}\). Let K be the number of disjoint clusters of samples and K ′ the number of disjoint clusters of features. The set of biclusters is \(\mathcal{B}=(\mathcal{S}^{\prime},\mathcal{F}^{\prime})=(\{\mathcal{S}_k:k=1,\cdots,K\},\{\mathcal{F}_{k^{\prime}}:k^{\prime}=1,\cdots,K^{\prime}\})\). The mappings C S , C F are the objects to find in this biclustering algorithm, such that
$$\begin{gathered} C_S: \{S_1,S_2,\cdots,S_n\}\rightarrow \{\mathcal{S}_1,\cdots,\mathcal{S}_K\},\\ C_F: \{F_1,F_2,\cdots,F_m\}\rightarrow \{\mathcal{F}_1,\cdots,\mathcal{F}_{K^{\prime}}\}. \end{gathered}$$
The mutual information \(I(\mathcal{S},\mathcal{F})\) of two random variables \(\mathcal{S},\mathcal{F}\) is the amount of information shared between these two variables and is defined as in information theory
$$I(\mathcal{S},\mathcal{F})=\sum_{i=1}^n\sum_{j=1}^m p(S_i,F_j)\log \frac{p(S_i,F_j)}{p(S_i)p(F_j)}=D(p(S,F)||p(S)p(F)),$$
where \(p(S_i,F_j),p(S_i),p(F_j)\) are probabilities from distribution matrix \(p(\mathcal{S},\mathcal{F})\), and \(D(p_1||p_2)=\sum_x p_1(x)\log\frac{p_1(x)}{p_2(x)}\) is the relative entropy between two probability distributions p 1 (x) and p 2 (x).
The objective of this biclustering is to find optimal biclusters of A such that the loss in mutual information is minimized, i.e.,
$$\min I(\mathcal{S},\mathcal{F})-I(\mathcal{S}^{\prime},\mathcal{F}^{\prime}).$$
In order to solve this objective function, \(q(x,y)=p(x^{\prime},y^{\prime})\,p(x|x^{\prime})\,p(y|y^{\prime})\) is defined, where x ′ and y ′ denote the clusters containing x and y, so that the objective function can be written as
$$\min I(\mathcal{S},\mathcal{F})-I(\mathcal{S}^{\prime},\mathcal{F}^{\prime})=D(p(\mathcal{S},\mathcal{F})||q(\mathcal{S},\mathcal{F})).$$

For a proof of this result, we refer to [18]. An iterative procedure is then used to minimize the transformed objective function [18].
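The objective itself is easy to evaluate for a candidate co-clustering, as the following sketch shows (the block-structured example distribution and the label encoding are assumptions for illustration).

```python
import numpy as np

def mutual_information(P):
    """I(X, Y) for a joint distribution matrix P; zero cells contribute zero."""
    px, py = P.sum(axis=1, keepdims=True), P.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = P * np.log(P / (px * py))
    return np.nansum(terms)

def information_loss(P, row_labels, col_labels):
    """Loss in mutual information I(S,F) - I(S',F') for given row/column cluster labels."""
    Q = np.zeros((row_labels.max() + 1, col_labels.max() + 1))
    for k in range(Q.shape[0]):
        for kp in range(Q.shape[1]):
            Q[k, kp] = P[np.ix_(row_labels == k, col_labels == kp)].sum()
    return mutual_information(P) - mutual_information(Q)

# A block-structured joint distribution loses no information under its natural co-clustering.
P = np.zeros((4, 4))
P[:2, :2] = 0.20
P[2:, 2:] = 0.05
print(information_loss(P, np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1])))   # ~0.0
```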

6.3.5 Based on Probability

The following two biclustering algorithms (named as BBC and cMonkey) use the theory of probability.

BBC. Gu and Liu [26] proposed a Bayesian biclustering model (BBC) and implemented a Gibbs sampling [34] procedure for its statistical inference. This model can also be considered an implementation of the plaid model [50] of biclustering.

Given data matrix A, assume the entry
$$a_{ij}=\sum_{k=1}^K ((\mu_k+\alpha_{ik}+\beta_{jk}+\epsilon_{ijk})\delta_{ik}\kappa_{jk})+e_{ij}\left(1-\sum_{k=1}^K \delta_{ik}\kappa_{jk}\right),$$
where μ k is the main effect of bicluster k, and α ik and β jk are the effects of sample i and feature j, respectively, in bicluster k, \(\epsilon_{ijk}\) is the noise term for bicluster k, and e ij models the data points that do not belong to any bicluster. Here \(\delta_{ik},\kappa_{jk}\) are binary variables: \(\delta_{ik} = 1\) indicates that row i belongs to bicluster k, and \(\delta_{ik} = 0\) otherwise; similarly, \(\kappa_{jk} = 1\) indicates that column j is in cluster k, and \(\kappa_{jk} = 0\) otherwise. In the plaid model [50], the entry a ij is assumed to have a similar decomposition with fewer factors considered.

In nonoverlapping feature biclustering, \(\sum_{k=1}^K\kappa_{jk}\leq 1\), and in nonoverlapping sample biclustering, \(\sum_{k=1}^K\delta_{ik}\leq 1\). Here, the nonoverlapping sample case is discussed. The priors of the indicators κ and δ are set so that a feature can be in multiple biclusters while a sample cannot be in more than one.

In this model, an observation a ij can belong to either one or none of the biclusters, and the probability distribution of a ij conditional on the bicluster indicators can be rewritten as
$$a_{ij}|\delta_{ik}=1,\kappa_{jk}=1\sim N(\mu_k+\alpha_{ik}+\beta_{jk},\sigma_{\epsilon k}^2)$$
if a ij belongs to bicluster k; otherwise,
$$a_{ij}|\delta_{ik}\kappa_{jk}=0\ \textrm{for\ all}\ k\sim N(0,\sigma_e^2).$$
With Gaussian zero-mean priors on the effect parameters, the marginal distribution of the a ij conditional on the indicators is
$$\mathcal{B}|\delta,\kappa\sim N(0,\Sigma),$$
where Σ is the covariance matrix of \(\mathcal{B}\) and \(\mathcal{B}=\{B_0,B_1,B_2,\cdots,B_K\}^T\) with \(B_k=\{a_{ij}:\delta_{ik}\kappa_{jk}=1\}, k \geq 1\), and B 0 being the vector of data points belonging to no bicluster. More specifically, Σ is a sparse matrix of the form
$$\Sigma=\left( \begin{array}{cccc} \sigma_e^2I & 0 & \cdots & 0 \\ 0 & \Sigma_1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma_K \\ \end{array} \right),$$
where \(\Sigma_k=\textrm{Cov}(B_k,B_k)\) is the covariance matrix of all data points belonging to cluster k.
To make inference from the above BBC model, a Gibbs sampling procedure is implemented. Initializing from a set of randomly assigned values of the δ's and κ's, the column indicators κ are sampled by calculating the log-probability ratio
$$\log \frac{P(V_2|\kappa_{jk}=1,\sigma_{\mu k}^2, \sigma_{\alpha k}^2, \sigma_{\beta k}^2,\sigma_{\epsilon k}^2,\sigma_{e}^2)P(\kappa_{jk}=1)} {P(V_2|\kappa_{jk}=0,\sigma_{\mu k}^2, \sigma_{\alpha k}^2,\sigma_{\beta k}^2, \sigma_{\epsilon k}^2,\sigma_{e}^2)P(\kappa_{jk}=0)},$$
where \(V_1 = \{a_{il}: \delta_{ik} = 0\ \textrm{or}\ \kappa_{lk} =0,l \neq j\}\) is the set of data points not in bicluster k, and \( V_2 = \{a_{il}: \delta_{ik} = 1,\kappa_{lk} = 1,l \neq j\}\cup \{a_{ij}:\delta_{ik} = 1\}\) is the set of data points that are, or could be, in bicluster k. This notation follows that in [26].

In order to calculate the likelihood terms in the above ratio, we need to compute the inverse and determinant of the covariance matrices of the vector \(V_2\) in both cases. For details of the rest of the BBC algorithm, we refer to [26].

cMonkey. Reiss et al. [46] proposed an integrated biclustering algorithm (named cMonkey) for heterogeneous genome-wide data sets, used for the inference of global regulatory networks. In this model, each bicluster is modeled via a Markov chain process in which the bicluster is iteratively optimized and its state is updated based upon conditional probability distributions computed from the bicluster's previous state. Three major distinct data types are used (gene expression, upstream sequences, and association networks), and accordingly p-values are computed for three model components: the expression component, the sequence component, and the network component. Here we review only the expression component.

Given the expression data matrix A, the variance in the measured levels of feature j is \(\sigma_j^2=\frac{1}{n}\sum_{i=1}^n (a_{ij}-\bar{a_j})^2\), where \(\bar{a_j}=\sum_{i=1}^n a_{ij}/n\). The mean expression level of feature j over the bicluster's samples \(\mathcal{S}_k\) is \(\bar{a_{jk}}=\mu_{ik}^{(r)}\) as defined previously. As defined in [46] the likelihood of an arbitrary measurement a ij relative to this mean expression level is
$$p(a_{ij})=\frac{1}{\sqrt{2\pi (\sigma_j^2+\varepsilon^2)}}\exp \left[-\frac{(a_{ij}-\bar{a_{jk}})^2+\varepsilon^2}{2(\sigma_j^2+\varepsilon^2)}\right],$$
where ε accounts for an unknown systematic error in condition j, here assumed to be the same for all j. The likelihood of the measurements of an arbitrary sample i over the features in bicluster k is \(p(S_i)=\prod_{j\in \mathcal{F}_k}p(a_{ij})\), and similarly the likelihood of a feature j's measurements is \(p(F_j)=\prod_{i\in \mathcal{S}_k} p(a_{ij})\).
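The expression component above can be evaluated directly; the following sketch computes \(p(a_{ij})\) together with the log-likelihoods corresponding to \(p(S_i)\) and \(p(F_j)\) for a current bicluster (logs are used purely for numerical stability). The function name, the index arrays, and the default value of ε are illustrative assumptions, not part of [46].

```python
import numpy as np

def expression_component(A, rows_k, cols_k, eps=0.1):
    """Expression-component likelihoods of a bicluster k (sketch).

    A      : n x m matrix of expression levels (samples x features)
    rows_k : indices of the samples currently in bicluster k
    cols_k : indices of the features currently in bicluster k
    eps    : assumed systematic error, taken equal for all features
    """
    sigma2 = A.var(axis=0)               # sigma_j^2 over all samples
    abar_k = A[rows_k, :].mean(axis=0)   # mean level of each feature over S_k
    var = sigma2 + eps ** 2

    # p(a_ij) relative to the bicluster mean, as in the formula above
    p = np.exp(-((A - abar_k) ** 2 + eps ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

    log_p_S = np.log(p[:, cols_k]).sum(axis=1)   # log p(S_i) over features F_k
    log_p_F = np.log(p[rows_k, :]).sum(axis=0)   # log p(F_j) over samples S_k
    return p, log_p_S, log_p_F
```

These per-sample and per-feature likelihoods are the quantities from which the conditional membership probabilities used in the iterative procedure below can be derived.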

The Markov chain process by which a bicluster is optimized requires "seeding" of the bicluster to start the procedure. The iterative steps then include searching for motifs in the bicluster, computing the conditional probability that each sample/feature is a member of the bicluster, and performing moves sampled from this conditional probability.

6.3.6 Comparison of Biclustering Algorithms

Since biclustering algorithms are designed on different bases, applied to different data, and subject to different application requirements, there is no standard rule for judging which produced biclusters are better. In [44], Prelic et al. defined the match score of two sample clusters \(\mathcal{S}_i\) and \(\mathcal{S}^{\prime}_i\) as
$$S(\mathcal{S}_i,\mathcal{S}^{\prime}_i)=\frac{|\mathcal{S}_i\cap\mathcal{S}^{\prime}_i|}{|\mathcal{S}_i\cup\mathcal{S}^{\prime}_i|},$$
and the match score between two sets \(\mathcal{B},\mathcal{B}^{\prime}\) of biclusters of the matrix A as
$$S^{\ast}(\mathcal{B},\mathcal{B}^{\prime})=\frac{1}{|\mathcal{B}|}\sum_{(\mathcal{S}_i,\mathcal{F}_i)\in \mathcal{B}}\max_{(\mathcal{S}^{\prime}_i,\mathcal{F}^{\prime}_i)\in \mathcal{B}^{\prime}}\frac{|\mathcal{S}_i \cap\mathcal{S}^{\prime}_i|}{|\mathcal{S}_i\cup\mathcal{S}^{\prime}_i|},$$
which reflects the average of the maximum match scores for all biclusters in \(\mathcal{B}\) with respect to the biclusters in \(\mathcal{B}^{\prime}\).
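A direct computation of this score is straightforward; the sketch below assumes each bicluster is represented simply by its set of sample indices, and the function name is illustrative.

```python
def match_score(B1, B2):
    """Average maximum match score S*(B1, B2) in the sense of [44].

    B1, B2 : lists of biclusters, each given as a set of sample indices.
    """
    return sum(max(len(S & Sp) / len(S | Sp) for Sp in B2) for S in B1) / len(B1)
```

Note that \(S^{\ast}\) is not symmetric, so \(S^{\ast}(\mathcal{B},\mathcal{B}^{\prime})\) and \(S^{\ast}(\mathcal{B}^{\prime},\mathcal{B})\) are in general different.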

In [44], Prelic et al. used this score to compare the algorithms Bimax, CC, OPSM, SAMBA, xMotifs, and ISA on a data set of a metabolic pathway map. In [12], Cho and Dhillon also used this score to compare several biclustering algorithms on human cancer microarray data sets.

6.4 Application of Biclustering in Computational Neuroscience

Epilepsy is one of the most common nervous system disorders. It affects about 1% of the world's population, with the highest incidence among infants and the elderly [20, 21]. For many years there have been attempts to control epileptic seizures by electrically stimulating the brain [25]. This alternative method of treatment has been the subject of much study since the approval of the chronic vagus nerve stimulation (VNS) implant for the treatment of intractable seizures [56, 24, 49]. The device consists of an electric stimulator implanted subcutaneously in the chest and connected, via subcutaneous electrical wires, to the left cervical vagus nerve. The VNS is programmed to deliver electrical stimulation at a set intensity, duration, pulse width, and frequency. Optimal parameters are determined on a case-by-case basis, depending on clinical efficacy (seizure frequency) and tolerability.

Busygin et al. used supervised consistent biclustering [6] to develop a physiologic marker for optimal VNS parameters (e.g., output current, signal frequency) using measures of scalp EEG signals.

The raw EEG data were obtained from two patients, A and B, at a 512 Hz sampling rate from 26 scalp EEG channels arranged in the standard international 10–20 system (see Fig. 6.1). The EEG was then transformed into a sequence of short-term largest Lyapunov exponent (\(\textrm{STL}_{\max}\)) values. A well-known practical application of the \(\textrm{STL}_{\max}\) measure of EEG time series is the prediction of epileptic seizures; see [29, 41, 42]. Thus, Lyapunov exponents are considered a useful descriptor of such an extremely complex dynamical system as the human brain.
Fig. 6.1 Montage for scalp electrode placement.

\(\textrm{STL}_{\max}\) values were computed for each scalp EEG channel recorded from two epileptic patients using the algorithm developed by Iasemidis et al. [29]. Then the \(\textrm{STL}_{\max}\) values were used as features of the two data sets. The averaged samples from stimulation periods were then separated from averaged samples from nonstimulation periods by feature selection performed within the consistent biclustering routine.

As each stimulation lasted 30 s and a 4-s time window was used to compute one element of the Lyapunov exponent time series, each stimulation provided seven data points. Since the EEG patterns of a patient may have been changing throughout the observed period due to changes in his/her condition not relevant to the investigated phenomenon, each of the seven samples was averaged across all stimulation cycles. Thus, seven averaged Lyapunov exponent samples were created to represent the positive class. To create the negative class, 10 Lyapunov exponent data points were taken starting 250 s after each stimulation. In a similar way, these 10 samples were averaged across all stimulation cycles. The negative class therefore contains 10 averaged Lyapunov exponent data samples from nonstimulation time intervals.
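The construction of the two classes can be summarized, per channel, by the following sketch; the variable names, the window indexing, and the assumption that the \(\textrm{STL}_{\max}\) series is aligned to whole 4-s windows are illustrative and not taken from [6].

```python
import numpy as np

def build_classes(stl, stim_starts, win_len=4, stim_len=30, gap=250, n_neg=10):
    """Averaged positive/negative Lyapunov-exponent samples for one channel (sketch).

    stl         : STL_max series for one EEG channel, one value per 4-s window
    stim_starts : window index at which each stimulation begins
    """
    n_pos = stim_len // win_len            # 7 points per 30-s stimulation
    off = gap // win_len                   # offset of roughly 250 s after stimulation onset
    pos = np.array([stl[s:s + n_pos] for s in stim_starts])              # cycles x 7
    neg = np.array([stl[s + off:s + off + n_neg] for s in stim_starts])  # cycles x 10
    # average each position across all stimulation cycles
    return pos.mean(axis=0), neg.mean(axis=0)   # 7 positive and 10 negative samples
```

Stacking the 7 + 10 = 17 averaged samples for each of the 26 channels gives the 26 × 17 matrices used in the experiment below.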

Then, the biclustering experiment was performed on two 26 × 17 matrices representing patients A and B. The patient A data were conditionally biclustering-admitting with respect to the given stimulation and nonstimulation classes without excluding any features. All but one feature were classified into the nonstimulation class, which indicates that for almost all EEG channels the Lyapunov exponent consistently decreased during stimulation, with one channel being the only exception.

Cross-validation was performed for the obtained biclustering by the leave-one-out method, examining for each sample whether it would be classified into the appropriate class if the feature selection were performed without it. All 17 samples were confirmed in their classes by this method.

To make the patient B data set conditionally biclustering-admitting with respect to the given stimulation and nonstimulation classes, only five features were selected. The leave-one-out experiment classified all but four samples correctly. The biclustering heatmaps are presented in Fig. 6.2.
Fig. 6.2 Heatmaps for patients A and B.

The obtained biclustering results allow one to assume that signals from certain parts of the brain consistently change their characteristics when VNS is switched on and could provide a basis for selecting desirable VNS stimulation parameters. A physiologic marker of optimal VNS effect could greatly reduce the cost, time, and risk of calibrating VNS stimulation parameters in newly implanted patients compared to the current method based on clinical response.

6.5 Conclusions

In this review, the formal definitions of biclustering with its different types and structures are given, and the algorithms are reviewed from a mathematical perspective.

Biclustering has recently become an active research area, with applications in bioinformatics as well as in text mining, marketing analysis, and other fields. In practical applications, issues such as missing data, noise, and data preprocessing strongly influence the results of biclustering. Moreover, the comparison of biclustering algorithms remains another direction to be studied.

References

  1. Angiulli, F., Cesario, E., Pizzuti, C. Random walk biclustering for microarray data. Inf Sci: Int J 178(6), 1479–1497 (2008)
  2. Barkow, S., et al. BicAT: A biclustering analysis toolbox. Bioinformatics 22, 1282–1283 (2006)
  3. Ben-Dor, A., Chor, B., Karp, R., Yakhini, Z. Discovering local structure in gene expression data: The order-preserving submatrix problem. J Comput Biol 10, 373–384 (2003)
  4. Busygin, S., Prokopyev, O.A., Pardalos, P.M. Feature selection for consistent biclustering via fractional 0–1 programming. J Comb Optim 10(1), 7–21 (2005)
  5. Busygin, S., Prokopyev, O.A., Pardalos, P.M. Biclustering in data mining. Comput Oper Res 35, 2964–2987 (2008)
  6. Busygin, S., Boyko, N., Pardalos, P., Bewernitz, M., Ghacibeh, G. Biclustering EEG data from epileptic patients treated with vagus nerve stimulation. AIP Conference Proceedings of the Data Mining, Systems Analysis and Optimization in Biomedicine, 220–231 (2007)
  7. Califano, A., Stolovitzky, G., Tu, Y. Analysis of gene expression microarrays for phenotype classification. Proceedings of the International Conference on Computational Molecular Biology, 75–85 (2000)
  8. Carmona-Saez, P., Pascual-Marqui, R.D., Tirado, F., Carazo, J.M., Pascual-Montano, A. Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC Bioinformatics 7, 78 (2006)
  9. Chaovalitwongse, W.A., Butenko, S., Pardalos, P.M. Clustering Challenges in Biological Networks. World Scientific Publishing, Singapore (2008)
  10. Cheng, K.O., et al. BiVisu: Software tool for bicluster detection and visualization. Bioinformatics 23, 2342–2344 (2007)
  11. Cheng, Y., Church, G.M. Biclustering of expression data. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, 93–103 (2000)
  12. Cho, H., Dhillon, I.S. Coclustering of human cancer microarrays using minimum sum-squared residue coclustering. IEEE/ACM Trans Comput Biol Bioinform 5(3), 385–400 (2008)
  13. Chung, F.R.K. Spectral Graph Theory. Conference Board of the Mathematical Sciences, Number 92, American Mathematical Society (1997)
  14. CPLEX: ILOG CPLEX 9.0 Users Manual (2005)
  15. Data Clustering. http://en.wikipedia.org/wiki/Data clustering, accessed Dec. 8 (2008)
  16. Data Transformation Steps. http://www.dmg.org/v2–0/Transformations.html, accessed Dec. 8 (2008)
  17. Dhillon, I.S. Co-clustering documents and words using bipartite spectral graph partitioning. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 26–29 (2001)
  18. Dhillon, I.S., Mallela, S., Modha, D.S. Information theoretic co-clustering. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 89–98 (2003)
  19. DiMaggio, P.A., McAllister, S.R., Floudas, C.A., Feng, X.J., Rabinowitz, J.D., Rabitz, H.A. Biclustering via optimal re-ordering of data matrices in systems biology: Rigorous methods and comparative studies. BMC Bioinformatics 9, 458 (2008)
  20. Engel, J. Jr. Seizures and Epilepsy. F. A. Davis Co., Philadelphia, PA (1989)
  21. Engel, J. Jr., Pedley, T.A. Epilepsy: A Comprehensive Textbook. Lippincott-Raven, Philadelphia, PA (1997)
  22. Fan, N., Chinchuluun, A., Pardalos, P.M. Integer programming of biclustering based on graph models. In: Chinchuluun, A., Pardalos, P.M., Enkhbat, R., Tseveendorj, I. (eds.) Optimization and Optimal Control: Theory and Applications. Springer (2009)
  23. Fan, N., Pardalos, P.M. Linear and quadratic programming approaches for the general graph partitioning problem. J Global Optim, DOI 10.1007/s10898-009-9520-1 (2010)
  24. Fisher, R.S., Krauss, G.L., Ramsay, E., Laxer, K., Gates, J. Assessment of vagus nerve stimulation for epilepsy: Report of the therapeutics and technology assessment subcommittee of the American Academy of Neurology. Neurology 49, 293–297 (1997)
  25. Fisher, R.S., Theodore, W.H. Brain stimulation for epilepsy. Lancet Neurol 3(2), 111–118 (2004)
  26. Gu, J., Liu, J.S. Bayesian biclustering of gene expression data. BMC Genom 9(Suppl 1), S4 (2008)
  27. Hagen, L., Kahng, A.B. New spectral methods for ratio cut partitioning and clustering. IEEE Trans Computer-Aided Design 11(9), 1074–1085 (1992)
  28. Hartigan, J.A. Direct clustering of a data matrix. J Am Stat Assoc 67, 123–129 (1972)
  29. Iasemidis, L.D., Principe, J.C., Sackellares, J.C. Measurement and quantification of spatiotemporal dynamics of human epileptic seizures. In: Akay, M. (ed.) Nonlinear Signal Processing in Medicine. IEEE Press (1999)
  30. Ihmels, J., Friedlander, G., Bergmann, S., Sarig, O., Ziv, Y., Barkai, N. Revealing modular organization in the yeast transcriptional network. Nat Genet 31(4), 370–377 (2002)
  31. Jain, A.K., Murty, M.N., Flynn, P.J. Data clustering: A review. ACM Comput Surveys 31(3), 264–323 (1999)
  32. Kaiser, S., Leisch, F. A toolbox for bicluster analysis in R. Tech. Rep. 028, Ludwig-Maximilians-Universität München (2008)
  33. Kluger, Y., Basri, R., Chang, J.T., Gerstein, M. Spectral biclustering of microarray cancer data: Co-clustering genes and conditions. Genome Res 13, 703–716 (2003)
  34. Lazzeroni, L., Owen, A. Plaid models for gene expression data. Stat Sinica 12, 61–86 (2002)
  35. Lee, D.D., Seung, H.S. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999)
  36. Liu, J., Wang, W. OP-cluster: Clustering by tendency in high dimensional space. Proceedings of the Third IEEE International Conference on Data Mining, 187–194 (2003)
  37. Madeira, S.C., Oliveira, A.L. Biclustering algorithms for biological data analysis: A survey. IEEE Trans Comput Biol Bioinform 1(1), 24–45 (2004)
  38. Madeira, S.C., Oliveira, A.L. A linear time biclustering algorithm for time series gene expression data. Lect Notes Comput Sci 3692, 39–52 (2005)
  39. Murali, T.M., Kasif, S. Extracting conserved gene expression motifs from gene expression data. Pacific Symp Biocomput 8, 77–88 (2003)
  40. Pardalos, P.M., Busygin, S., Prokopyev, O.A. On biclustering with feature selection for microarray data sets. In: Mondaini, R. (ed.) BIOMAT 2005 – International Symposium on Mathematical and Computational Biology, pp. 367–378. World Scientific, Singapore (2006)
  41. Pardalos, P.M., Chaovalitwongse, W., Iasemidis, L.D., Sackellares, J.C., Shiau, D.-S., Carney, P.R., Prokopyev, O.A., Yatsenko, V.A. Seizure warning algorithm based on optimization and nonlinear dynamics. Math Prog 101(2), 365–385 (2004)
  42. Pardalos, P.M., Chaovalitwongse, W., Prokopyev, O. Electroencephalogram (EEG) time series classification: Application in epilepsy. Ann Oper Res (2006)
  43. Pascual-Montano, A., Carazo, J.M., Kochi, K., Lehmann, D., Pascual-Marqui, R.D. Nonsmooth nonnegative matrix factorization (nsNMF). IEEE Trans Pattern Anal Mach Intell 28, 403–415 (2006)
  44. Prelic, A., Bleuler, S., Zimmermann, P., Wille, A., Buhlmann, P., Gruissem, W., Hennig, L., Thiele, L., Zitzler, E. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22(9), 1122–1129 (2006)
  45. Rege, M., Dong, M., Fotouhi, F. Bipartite isoperimetric graph partitioning for data co-clustering. Data Min Knowl Disc 16, 276–312 (2008)
  46. Reiss, D.J., Baliga, N.S., Bonneau, R. Integrated biclustering of heterogeneous genome-wide datasets for the inference of global regulatory networks. BMC Bioinformatics 7, 280 (2006)
  47. Richards, A.L., Holmans, P.A., O'Donovan, M.C., Owen, M.J., Jones, L. A comparison of four clustering methods for brain expression microarray data. BMC Bioinformatics 9, 490 (2008)
  48. Santamaria, R., Theron, R., Quintales, L. BicOverlapper: A tool for bicluster visualization. Bioinformatics 24, 1212–1213 (2008)
  49. Schachter, S.C., Wheless, J.W. (eds.) Vagus nerve stimulation therapy 5 years after approval: A comprehensive update. Neurology 59(Suppl 4) (2002)
  50. Sheng, Q., Moreau, Y., De Moor, B. Biclustering microarray data by Gibbs sampling. Bioinformatics 19, 196–205 (2003)
  51. Shi, J., Malik, J. Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8), 888–905 (2000)
  52. Supper, J., Strauch, M., Wanke, D., Harter, K., Zell, A. EDISA: Extracting biclusters from multiple time-series of gene expression profiles. BMC Bioinformatics 8, 334 (2007)
  53. Tanay, A., Sharan, R., Kupiec, M., Shamir, R. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc Natl Acad Sci USA 101, 2981–2986 (2004)
  54. Tanay, A., Sharan, R., Shamir, R. Discovering statistically significant biclusters in gene expression data. Bioinformatics 18, S136–S144 (2002)
  55. Tanay, A., Sharan, R., Shamir, R. Biclustering algorithms: A survey. In: Aluru, S. (ed.) Handbook of Computational Molecular Biology. Chapman & Hall, London (2005)
  56. Uthman, B.M., Wilder, B.J., Penry, J.K., Dean, C., Ramsay, R.E., Reid, S.A., Hammond, E.J., Tarver, W.B., Wernicke, J.F. Treatment of epilepsy by stimulation of the vagus nerve. Neurology 43, 1338–1345 (1993)
  57. Xu, R., Wunsch, D. II. Survey of clustering algorithms. IEEE Trans Neural Netw 16(3), 645–678 (2005)
  58. Yang, J., Wang, W., Wang, H., Yu, P. δ-Clusters: Capturing subspace correlation in a large data set. Proceedings of the 18th IEEE International Conference on Data Engineering, 517–528 (2002)
  59. Yang, J., Wang, W., Wang, H., Yu, P. Enhanced biclustering on expression data. Proceedings of the Third IEEE Conference on Bioinformatics and Bioengineering, 321–327 (2003)
  60. Zha, H., He, X., Ding, C., Simon, H., Gu, M. Bipartite graph partitioning and data clustering. Proceedings of the Tenth International Conference on Information and Knowledge Management, 25–32 (2001)
  61. Zhao, H., Liew, A.W.-C., Xie, X., Yan, H. A new geometric biclustering based on the Hough transform for analysis of large-scale microarray data. J Theor Biol 251, 264–274 (2008)

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Department of Industrial and Systems Engineering, Center for Applied Optimization, University of Florida, Gainesville, USA
