Factor analysis for gene regulatory networks and transcription factor activity profiles
- 7.6k Downloads
Most existing algorithms for the inference of the structure of gene regulatory networks from gene expression data assume that the activity levels of transcription factors (TFs) are proportional to their mRNA levels. This assumption is invalid for most biological systems. However, one might be able to reconstruct unobserved activity profiles of TFs from the expression profiles of target genes. A simple model is a two-layer network with unobserved TF variables in the first layer and observed gene expression variables in the second layer. TFs are connected to regulated genes by weighted edges. The weights, known as factor loadings, indicate the strength and direction of regulation. Of particular interest are methods that produce sparse networks, networks with few edges, since it is known that most genes are regulated by only a small number of TFs, and most TFs regulate only a small number of genes.
In this paper, we explore the performance of five factor analysis algorithms, Bayesian as well as classical, on problems with biological context using both simulated and real data. Factor analysis (FA) models are used in order to describe a larger number of observed variables by a smaller number of unobserved variables, the factors, whereby all correlation between observed variables is explained by common factors. Bayesian FA methods allow one to infer sparse networks by enforcing sparsity through priors. In contrast, in the classical FA, matrix rotation methods are used to enforce sparsity and thus to increase the interpretability of the inferred factor loadings matrix. However, we also show that Bayesian FA models that do not impose sparsity through the priors can still be used for the reconstruction of a gene regulatory network if applied in conjunction with matrix rotation methods. Finally, we show the added advantage of merging the information derived from all algorithms in order to obtain a combined result.
Most of the algorithms tested are successful in reconstructing the connectivity structure as well as the TF profiles. Moreover, we demonstrate that if the underlying network is sparse it is still possible to reconstruct hidden activity profiles of TFs to some degree without prior connectivity information.
KeywordsPosterior Distribution Independent Component Analysis Connectivity Matrix Factor Analysis Model Noise Covariance Matrix
Factor analysis (FA) as well as principal component analysis (PCA) is used to describe a number of observed variables by a smaller number of unobserved variables. Unlike PCA, FA also includes independent additive measurement errors on the observed variables. FA assumes that the observed variables become uncorrelated given a set of hidden variables called factors. It can also be seen as a clustering method where the variables described by the same factors are highly correlated, thus belonging to the same cluster, while the variables depending on different factors are uncorrelated and placed in different clusters.
FA has been successfully used in a number of areas such as computer vision, pattern recognition, economics and more recently in bioinformatics [1, 2, 3, 4]. The suitability of FA for gene expression analysis is also the motivation of this work. Genes are transcribed into mRNAs which in turn are translated into proteins. Some of these proteins activate or inhibit, as transcription factors (TFs), the transcription of a number of other genes creating a complex gene regulatory network. The number of transcription factors is much smaller than the number of transcribed genes and most genes are regulated only by a small number of transcription factors. Hence, the matrix that describes the connections between the transcription factors and the regulated genes is sparse. Using microarrays, mRNA levels of thousands of genes can be measured simultaneously, but no direct information is obtained about TF activity. Our aim is two-fold: to identify the genes regulated by a common TF, that is, to reconstruct the connectivity structure and weights in a two-layer network, and to reconstruct the activity profile of each TF.
Liao et al.  have suggested the use of a network component analysis (NCA) algorithm for reconstructing the profiles of the TFs (see also  and ), while Boulesteix and Strimmer  have used an approach based on partial least squares regression. They have both shown that such methods can faithfully reconstruct the expression profiles of the TFs. However, both methods rely heavily on the availability of connectivity information. Nonzero positions in the factor loadings matrix, which describes the connections between the factors and the genes, need to be specified in advance. The algorithms then estimate the values at these positions (which might turn out to be zero). This is a strong limitation since often only little information about genes regulated by specific TFs is available. FA models are faced with a much harder task where both the structure of the factor loadings matrix and the activity profiles of the factors have to be reconstructed. Independent component analysis (ICA) has also been widely used in bioinformatics (see for example [9, 10] and ). This approach assumes that the transcription factors are statistically independent. A comparison of NCA and ICA can be found in Liao et al. , and thus ICA will not be considered further here. A further advantage of the Bayesian FA models is that any information about the underlying structure can be easily incorporated through priors. This improves performance, but is not required for the algorithms to be applicable in the first place, as in the case of NCA and its generalisations.
Hinton et al.  first introduced an EM algorithm for factor analysis in order to model the manifolds of digitised images of handwritten digits. Later Ghahramani and Hinton  presented an exact EM algorithm for both factor analyzers and mixtures of factor analyzers. More recently, Utsugi and Kumagai  used a Gibbs sampler instead of the EM algorithm suggested by Ghahramani and Hinton  for mixtures of factor analyzers. West  was the first to introduce Bayesian factor analysis in the bioinformatics field. To accommodate the required sparsity regarding the connections between the factors and the genes, he suggested the use of a mixture prior on the factor loadings matrix. As is shown in the results section, the predicted factor loadings matrix has the desired sparsity, at the expense of increasing computing time as the number of hidden variables increases. Recently, Sabatti and James  have used the framework by West  for the reconstruction of transcription factor profiles. In order to avoid the computational burden of estimating the factor loadings matrix at each step of the Gibbs sampler and to facilitate the reconstruction process, they set a large number of entries to zero based on information obtained from the Vocabulon algorithm . This algorithm scans DNA sequences for multiple motifs and associates with each transcription factor a probability of binding to a specific site. This approach resembles the approach of Liao et al. , and Boulesteix and Strimmer  where the structure of the factor loadings matrix is given in advance.
Note that the algorithms of Ghahramani and Hinton , and Utsugi and Kumagai  have not previously been applied to biological data, and that the algorithm of Sabatti and James  is an adaptation of the algorithm of West  with the difference that an informative prior is used for the factor loadings matrix. Also, Sabatti and James  applied the FA model to yeast and E. coli data, while West  applied his algorithm to cancer data.
In this paper, we suggest the use of Fokoue's algorithm  as an alternative to West's algorithm . This algorithm utilises a Gamma prior distribution on the variance of the factor loadings matrix that imposes the required sparsity but, at the same time, avoids the computational burden introduced by the use of a mixture prior . Since this algorithm avoids the combinatorial problem of West's algorithm, a prior knowledge on the underlying model is not required. At the same time, we give a thorough review of all FA algorithms mentioned above and examine the applicability of those algorithms to biological data. To the best of our knowledge such a comparison of FA algorithms in the scope of analyzing microarray data has not been presented before. Moreover, we extend these algorithms by suggesting a further factor rotation analysis which produces additional sparsity of the factor loadings matrix. This additional sparsity not only facilitates the interpretation of the results, but it is also useful in a biological context where a very sparse matrix is required. Finally, we show that merging the information provided by each algorithm to obtain a combined result leads to better performance. The algorithms are compared based on their ability to reconstruct the underlying factor loadings matrix and the profiles of the transcription factors.
The comparison is done on both simulated data where the true answer is known and on experimental data. We evaluate the performance of the algorithms on the Hemoglobin data obtained by Liao et al.  and on the Escherichia coli (E. coli) data in Kao et al. . Although time series data show correlation that is ignored in a factor analysis, which in fact assumes independence across data points, we used these data sets for comparison of our results with that in Liao et al. , Kao et al. , and Boulesteix and Strimmer .
Factor analysis model
Let us assume that we have a random observed vector variable x of P dimensions, x = (x1,..., x P )'. We denote an instance of this vector with a superscript n and we assume that we have N such instances, x n where n = 1,..., N. Similarly, f = (f1,..., f K )' is a vector of K hidden variables, known as factors. Note that the number K of factors is always smaller than or equal to the number P of observed variables. The factor analysis model states that the observed variables are a linear combination of the factors plus a mean and an error term. For case n
where μ = (μ1,..., μ P )' and ε n = (,...,)' are column vectors of dimension P with elements corresponding to the mean and the error of the P observed variables. The vector μ is the same for all cases. Λ is the unobserved transition matrix also referred to as the factor loadings matrix. The factor loadings matrix has P × K dimensions. That is, each column corresponds to a factor and each row corresponds to an observed variable. The entries of the factor loadings matrix indicate the strength of the dependence of each observed variable on each factor. For example, if λ pk is zero, then variable x p is independent of factor f k . In matrix form equation 1 is
where X = (x1,..., x N ), F = (f1,..., f N ), E = (ε1,..., ε N ), M = μe N with e N an N dimensional row vector of ones. FA models assume that the error terms ε n are independent, and multivariate normally distributed with mean zero and covariance matrix Ψ, ε n ~ (0, Ψ), where Ψ = diag(,..., ). Thus the probability distribution of x for each observed case n has a multivariate normal density given by
or in matrix notation
where tr is the trace, the sum of the diagonal elements. In the methods section, we discuss in detail the prior and posterior probabilities of the parameters F, μ, Λ and Ψ, as well as algorithms for their estimation.
As shown in equation 5 in the methods section, the complete density of the data, when factors are integrated out, is given by a normal distribution with covariance matrix ΛΣ f Λ' + Ψ. There is a scale identifiability problem associated with Λ and Σ f . In order to avoid this problem, we could either restrict the columns of Λ to unit vectors or set Σ f to the identity matrix. The second approach is often preferred in factor analysis.
There is also an identifiability problem associated with equation 2. Let us assume that we have an orthogonal matrix Q of dimensions K × K with QQ' = Q'Q = I K . Then we can have
ΛF = ΛQQ'F = Λ*F*
with cov(F*) = cov(F). That is, it is not possible to distinguish between Λ and all its possible orthogonal transformations Λ* based on knowledge of the product ΛF only. However, as we show in the results section, if the loadings matrix underlying the data generating process is sparse enough, it can often be reconstructed. This can be done either by using sparsity priors on the entries of the loadings matrix in a Bayesian setting or by orthogonal rotations enforcing sparsity (see methods section).
Note that orthogonal transformations also include permutations of the factors. Factors could be ordered by the amount of variance explained. Or, as in the case of regulatory networks, we would have to map known TFs to the inferred factors. In Sabatti and James , the factors are constrained by assigning a priori zero values to the factor loadings matrix. Here, we map the TFs to the inferred factors based on previous knowledge about their activity profiles, as for example reported in Kao et al. .
Results and Discussion
We compare the algorithms by Ghahramani and Hinton  (Z), Utsugi and Kumagai  (U), Fokoue  (F), and West  (W) on simulated and real biological data. Algorithm W is based on updating hidden indicator variables representing network connections. For a full exploration of the posterior probability, all possible combinations of hidden values need to be evaluated, thus an exponential number of combinations of these variables. We therefore suggest and test a version (Ws) of the algorithm with independent updates of hidden variables. We also compare these Bayesian FA algorithms with classical FA (as implemented in the Matlab function factoran (M)).
In order to evaluate the strengths and weakness of such algorithms we simulate comparatively 'easy' data (that is from linear models) to be able to focus on the question how far sparsity in the connectivity allows identification of the loadings and the factor matrix. Moreover, as shown in the PhD thesis by Pournara  the assumption of linearity is not a severe one given the small amount of data and the significant amounts of noise present in microarray data, especially after taking logarithms of mRNA abundance levels or ratios (see also Kao et al. ). In a second step, instead of resorting to simulated nonlinear data, which would have invited questions about the choice of particular nonlinear functional forms, we apply the algorithms to real microarray data and evaluate their performance there directly.
We test the algorithms on simulated networks. For the generation of random networks we start with a description of network characteristics such as the indegree distribution of genes and outdegree distributions of TFs, which we take from known regulatory networks of E. coli. For each TF, we then select random genes subject to these constraints. The activity levels of the factors F are drawn from a Gaussian distribution with zero mean and covariance matrix I. The vector μ of means is set to zero. All non-zero loadings are set to 1. A noise term E p is added in each dimension p with zero mean and variance as
where is the variance of the data in dimension p, and snr is a signal to noise ratio. We evaluate the performance of the algorithms by calculating the mean of squared error (MSE) for the predicted factor loadings matrix Λ and the factor matrix F. We identify the labels of the factors by choosing the column permutation of F that gives the smallest MSE.
As discussed above, the loadings and factor matrices are only identifiable up to a rotation. Sparsity of the true loadings matrix helps to overcome this lack in identifiability. In algorithms F and W the parameters are estimated by imposing sparsity on the loadings matrix directly. Others, not imposing any prior sparsity, cannot be expected to find the correct solution without further processing, for example, by orthogonal transformations to a sparse form. Results can be improved by normalising the column vectors of the loadings matrix before the transformation, that is, by dividing each vector by its Euclidean length. The inverse of the orthogonal transformation of the loadings matrix is used to transform the factor matrix correspondingly. Finally, in order to assess how successful a factor analysis is independently of the identifiability problem for orthogonal transformations, we apply a procrustes orthogonal transformation (that is, one minimising squared vector distances, see methods section) of the column vectors of the reconstructed loadings matrix onto the column vectors of the true loadings matrix. Such rotation is possible since in the case of simulated data the true loadings matrix is known.
Simulated E. coli networks
Figure 3(a) also shows the MSE for the varimax and procrustes rotated matrix Λ rot . Varimax and quartimax rotation give similar results. The equamax rotation gives a slightly higher MSE for sparse matrices and lower MSE for dense matrices (results not shown). Once a varimax rotation is applied to matrices obtained by the FA algorithms, the difference between them regarding the MSE is significantly reduced. It appears that the performance of algorithm F for sparse matrices is better without a varimax rotation; actually so much so that algorithm F is still better than all the other algorithms even after application of varimax. The procrustes rotation indicates the ability of all the FA algorithms to reconstruct a factor loadings matrix that has a very small MSE. However, it also shows that finding the best possible rotation is difficult.
Two more tests were performed to investigate the behavior of the FA algorithms on datasets of different size (ranging from 25 to 100 cases) and data generated with different values of snr (ranging from 0.5 to 100). Note that the classical FA algorithm uses the covariance matrix of the data and thus the number of cases must be greater than the number of variables. That is, the factoran script was not run for datasets of 25 cases. These two tests were applied to networks with density 15. Figure 3(b) shows again that algorithms Z and U perform similarly regardless of the number of cases in the dataset. Moreover, algorithms F and W also perform similarly and have a much smaller MSE than the other algorithms. Once the varimax rotation is applied, all the algorithms give a similar performance with a smaller MSE achieved as the number of cases increases. For sparse networks with small densities even a very small dataset is enough to reconstruct the factor loadings matrix. The procrustes rotation indicates that algorithm W produces a factor loadings matrix which, if properly rotated, is very close to the true matrix for very sparse networks.
Figure 3(c) shows the results for different values of snr. As the amount of noise increases, the performance of most algorithms decreases. Algorithm F has the best performance overall. Varimax rotation improves the performances of the other algorithms and makes them comparable to the results of F and W. Note that algorithm W seems to perform worse when the data are free of noise (snr 100) than when there is at least some small amount of noise (snr 10). However, when we apply varimax rotation to this algorithm we see that the performance decreases indeed with increasing amounts of noise.
Summarising, algorithms F and W perform better on sparse matrices than algorithms Z, U and M because they implicitly capture the required sparsity on the factor loadings matrix. However, if an appropriate orthogonal rotation of the matrices Λ and F is applied, the performances of all the FA algorithms are enhanced and become comparable.
We further compare the FA algorithms to two biological datasets; the Hemoglobin dataset from Liao et al. , where the connectivity matrix and the profiles of the factors are known to some degree, and the E. coli dataset, where the TF profiles and some interactions have been suggested by Kao et al. .
The absorbance spectra of seven hemoglobin solutions (M 1,..., M 7) were measured in Liao et al. . Each spectrum is the outcome of a linear combination of the concentrations of three components: oxyhemoglobin (OxyHb), methemoglobin (MetHb) and cyano-methemoglobin (CyanoHb). This dataset consists of 321 measurements for each of the seven hemoglobin solutions.
We first compared the algorithms by Fokoue , West , and by Tran et al.  (GNCA) fixing the positions of zeros in the loadings matrix. Note that the algorithm by Tran et al.  requires this connectivity matrix as an input and is unlikely to work properly without this information. Tran et al.  have presented an extension of the NCA algorithm , the GNCA (generalised network component analysis) algorithm. For details regarding the different versions of the GNCA algorithm see . We present the results for versions GNCA and GNCA r . Each algorithm was run 20 times. For algorithms GNCA and GNCA r , we consider the run with the least MSE, while for the FA algorithms we consider the average of these runs.
Algorithm F and W perform quite well on both the reconstruction of the Λ and the factor profiles. Their performance is also improved by using any of the four rotation methods. Varimax rotation also improves the performance of algorithms Z and U. Again procrustes rotation shows that we can rotate the estimated Λ to match the true Λ very closely. However, as shown again by the MSE on the factors, the best rotation for Λ is not necessarily the best rotation for the factors.
Figure 7(a) shows the result for algorithm F when entries of the loadings matrix are restricted to stay close to 0 by a strong prior (shape parameter 10 for δ pk , scale parameter 0.01, see methods section). We also investigated the performance of algorithm F under a vaguer prior on matrix entries (shape parameter 1 for δ pk , scale parameter 0.01). As shown in Figure 7(c) and 7(d), this setting (Fu) performs better, but once the loadings matrix is rotated, the improvement is not as significant. Similarly, we set the prior probability π pk (see methods section) that an entry of the loadings matrix is nonzero to 1 in algorithm W (Wu) and to 0.2 (W). As expected, since the connectivity is not sparse for the hemoglobin data, the difference is small (Figure 7(c)). With rotation (except the procrustes rotation) the sparse prior seems to do considerably better though. Finally, the algorithm Ws (with prior probability 0.2) that we have suggested in order to avoid the combinatorial problem of algorithm W gives good results and comparable to the ones by W.
Figure 8 shows the reconstructed factor profiles after varimax rotation on the Λ without using prior information on nonzero entries. It also demonstrates that it is now harder to reconstruct the factor profiles, as seen in the greater variability of profiles from different MCMC runs when compared to Figure 6. All the profiles shown have very similar likelihoods, indicating that the overall distribution is multimodal. Algorithms F and W perform quite well. Algorithms Z and U reconstruct the second and third factors quite well but not as well the first one. As mentioned above the reconstruction of the factor profiles by algorithm M are quite poor, while algorithm F seems to find the best factor profiles.
Escherichia coli dataset
We evaluated the FA algorithms as well as the algorithm by Boulesteix and Strimmer  (S, as implemented in the R package plsgenomics) on an E. coli dataset from Kao et al. . These data consist of 25 time points for 100 genes. The first time point was ignored since all the values are zero. A matrix that indicates possible interactions between 16 TFs and the 100 genes has been suggested by Kao et al.  based on RegulonDB  and the current literature. We will refer to this matrix as the Kao connectivity matrix. This matrix also indicates whether a TF inhibits or activates a given gene. Each FA algorithm is run 10 times. The following results refer to an average value over these runs. The classical FA is not used in this analysis since the number of cases (24) is smaller than the number of observed variables (100).
Since the GNCA algorithm requires prior knowledge of zeros in the factor loadings matrix, for comparison we also run the FA algorithms of Fokoue  and West  providing prior information on zeros in the factor loadings matrix. Here, algorithms F and Ws treat the connectivity matrix simply as indicating whether there is a relationship or not between a gene and a TF and ignore the information on activation or inhibition. However, one could also include a more detailed prior information. We consider two different prior matrices for the GNCA algorithm: one where a simplified connectivity matrix that only indicates whether an interaction exists or not, and one with extra information on inhibition and activation. For each of the two different prior matrices, we run the GNCA algorithm 10 times, and we only consider the run with the least sum squared error.
MSEs of the reconstructed factor profiles for the E. coli data
MSE (with prior information)
MSE (without prior information)
Figure 9(b) shows the results when no prior information on connectivity is provided. For comparison, we match the resulting TF profiles with those of GNCA by minimum MSE and add the plots of the GNCA TF profiles from Figure 9(a). As is evident, the FA algorithms in the case of the E. coli dataset are still capable of reconstructing important aspects of the TF profiles even without any prior information on the connectivity. This is encouraging since prior information on TF binding is sometimes limited, difficult to obtain, or not always reliable. The profiles are slightly rougher than the ones inferred given the connectivity matrix and the FA algorithms show greater variability. However, it is still impressive how all FA algorithms are able to reconstruct the main trends of the TF profiles. The third column in Table 1 shows the MSE deviation of profiles of algorithms Z, U, F and Ws from profiles of GNCA with activation and inhibition information. The MSE is about twice as large if no prior information is available.
We also plot, in Figure 10(b), the ROC curve of each algorithm after applying procrustes rotation to the factor loadings matrix. Here, we use the Kao connectivity matrix as the target matrix for the procrustes rotation. The ROC curves have greatly improved indicating that an appropriate rotation of the learned loadings matrix for each algorithm can lead to a connectivity matrix that is very close to the Kao connectivity matrix. Again the combined loadings matrix gives a ROC curve that outperforms each of the ROC curves given by the FA algorithms.
We discussed and compared the performance of five factor analysis algorithms presented previously in the literature. Only one of these algorithms has been previously applied to biological data. We investigated the applicability of the algorithms on microarray data from E. coli, on data from hemoglobin spectroscopic measurements and on simulated data. In a gene regulatory context, we aim to identify regulatory relationships between genes and TFs and to reconstruct transcription factor activity profiles. That is, the expression levels of regulated genes are the observed variables and the TFs are the unobserved variables. Even after imposing a correlation structure on the factors, this is still an underdetermined problem. If, however, we assume that the connectivity matrix is sparse, that is, that most genes are regulated by a small number of TFs and most TFs regulate only a small number of genes, estimation of TF profiles and loadings becomes possible.
The sparsity requirement is implicit in the algorithms by Fokoue  and West , and thus these algorithms are shown to perform very well on sparse simulated networks where the underlying relationships are linear. However, we show that the performance of the algorithms by Ghahramani and Hinton , and Utsugi and Kumagai  is also very satisfactory after an orthogonal rotation of the loadings matrix. On the E. coli data, we see that all the FA algorithms reconstruct the factor loadings matrix and the factors profiles equally well. Moreover, we show, using the E. coli data, that such algorithms can reconstruct the underlying TF profiles to an acceptable degree even without any prior knowledge of the connectivity structure. In contrast, algorithms such as the GNCA algorithm of Tran et al. , depend heavily on prior connectivity information. Finally, we show that integrating results from several FA algorithms results in a connectivity matrix which has a better true positive rate given a specified false positive rate than each algorithm separately. Our analysis demonstrates the usefulness of FA algorithms for biological problems where prior information regarding the system under study is not fully available.
The FA algorithms discussed here ignore any time series information. We are currently working on an extension of the above methods to integrate time correlation. We expect that such correlation will smooth TF activity profiles further.
For completeness and to show commonalities and differences between the approaches to FA analysis discussed in this paper, we describe them in some detail in this section. We conclude this section with a short description of matrix rotation methods.
The factors are assumed to be normally distributed with mean zero and covariance matrix Σ f . That is,
f n ~ (0, Σ f )
To resolve identifiability problems (we will return to this issue later), we set Σ f equal to the identity matrix I K as suggested by Ghahramani and Hinton , Utsugi and Kumagai , and Fokoue . Sabatti and James  choose Σf = I K where is a constant value. Finally, West  assigns a more general prior, Σ f = diag(,..., ).
The posterior probability of the factors is now derived as
p(f n | x n , Λ, μ, Ψ) ∝ p(f n )p(x n | f n , Λ, μ, Ψ) = (f n | , )
where the posterior mean and variance are given by
We can now integrate F out of equation 4 to get the complete density of the data
The EM algorithm of Ghahramani and Hinton  consists of two steps: a) the E-step which calculates the expected values and the second moments of the factors for each case n given the current Λ and Ψ as given below
and b) the M-step which calculates the values of Λ and Ψ given the expected values of the factors that were computed in the E-step.
Mean vector μ
The prior probability assigned to the mean vector μ is the Gaussian distribution with a mean vector m μ and a covariance matrix Σ μ
μ ~ (m μ , Σ μ )
By using the above prior, we derive the following posterior distribution for μ
West , Sabatti and James , and Fokoue  suggest to centralise the data prior to the use of the FA model, and they also assume that μ = 0. We also suggest to standardise (centralise and scale by standard deviation) the data prior to the analysis.
Utsugi and Kumagai  use a different prior covariance matrix that ties the mean to the error term. That is,
where m μ is set to zero since they also centralize the data prior to the use of the FA model. The posterior distribution of μ, given the above prior is derived in the next section together with Λ.
Factor loadings matrix Λ
The main differences between the existing FA models lies in the assignment of the prior distribution of the factor loadings matrix Λ or in the prior distribution of its parameters. Let us discuss each of these priors separately.
Normal prior on Λ and Gamma prior on Λ's covariance parameter
Fokoue  uses the prior suggested by Tipping  in the context of Relevance Vector Machines to impose sparsity in the Λ matrix. That is, independent Gaussian priors are assigned to each element λ pk of Λ.
λ pk | δ pk ~ (0, )
To each δ pk , a Gamma prior is assigned as follows
δ pk | α δ , β δ ~ (α δ , β δ )
where a Gamma distribution with shape parameter α and a scale parameter β is defined as
In the context of biological data, we suspect that each gene is regulated by only a small number of TFs. Thus, we aim to identify a sparse factor loadings matrix that faithfully describes the relationship between the transcription factors and the regulated genes. The suggested prior leads to a Student t-distribution for each row of Λ. In two dimensions, such distribution assigns most probability mass to the origin where both λp 1and λp 2are zero and along the spines where one of the coefficients λ pk is zero.
We suggest that this type of prior on Λ is also applicable to biological data. We have also further extended this prior to include an extra level of hyperparameters for increased flexibility and for an easier assignment of the hyperparameters. Thus, the parameter β δ has also a Gamma prior of the form
The posterior probability of each row Λ p of Λ is given by
p(Λ p | X, F, μ, Ψ, Δ p ) ∝ p(Λ p | Δ p )p(X | F, Λ, μ, Ψ) = (Λ p | , ) (7)
Δ p = diag(,... ,), and A p is a row vector that corresponds to the p th row of A.
The posterior distribution of δ pk is also a Gamma distribution given by
Finally, the posterior distribution of the scale parameter β δ of Δ is given by
Mixture prior on Λ
West  has suggested a mixture prior on the elements λ pk that also induces sparsity on the factor loadings matrix Λ. Thus each element λ pk has the following prior
where δ0 is the unit point mass at zero, and π pk indicates the probability of λ pk to be different from zero. We set π pk to 0.2 in the case of unknown connectivity and to 0 and 1 in the case the connectivity is known. An auxiliary variable is usually used to enable the calculation of the posterior probabilities. Thus, let us introduce a matrix of indicator variables Z with each element z pk , corresponding to each element λ pk . The prior probability on Z is a product of independent Bernoulli distributions as follows
The z pk variables are called indicators, since they indicate whether the value of λ pk is to be drawn from the normal distribution or set to zero. That is,
The posterior probability of the vector variable Z p = (zp 1,..., z pK ) does not have a known form (see equation 8). Thus we have to calculate equation 8 for all possible configurations of Z p and then use the multinomial probability distribution to sample a new configuration for Z p . This is a combinatorial problem, and thus as the number of hidden variables increases, the computational cost increases exponentially.
where F[Z p ] denotes the submatrix of F obtained by removing those rows of F corresponding to z pk = 0, K' is the number of factors for which z pk = 1, and IK'is the identity matrix of K' dimensions. We also tested a version Ws of this algorithm in which equation 8 (with K = 1) is applied to each entry of the matrix individually, that is, without the need of a combinatorial evaluation of all possible 0,1 vectors Z p .
The posterior distribution of each row Λ p of Λ is the same as in equation 7 but F is now replaced by F[Z p ] and by I K' .
Normal prior with the covariance parameter depending on Ψ
Let us denote each column of the Λ matrix with Λ k where k = 1,..., K. A convenient conjugate prior for Λ k is the Gaussian distribution. Utsugi and Kumagai  set the mean of this distribution to zero and the covariance matrix to Ψ. That is,
Λ k ~ (0, Ψ)
where Ψ is the covariance of the noise. Thus, if the data are noisy then the above prior assigns large magnitude to the vector Λ k , while free of noise data suggest small magnitudes for Λ k .
The posterior distribution of the combined matrix = [μ, Λ] (see equation 6 for the prior on μ) is given
where ⊗ is the Kronecker tensor product,
and I K is a K dimensional vector of ones.
Moreover, Utsugi and Kumagai  suggest the use of a Gamma hyperprior on the parameters α1 and α2. That is,
The posterior distributions of those hyperparameters are also Gamma distributions given by
Noise covariance matrix Ψ
A convenient conjugate prior is assigned to the inverse of the noise covariance matrix Ψ so that its posterior distribution has a known form. Thus, the prior on each is a Gamma distribution given by
While the Gamma posterior distribution of ψ-2 in Utsugi and Kumagai  has a more complicated form since Ψ is tied to the covariance matrices of both μ, and Λ. Thus, it has the following form
where I p is the identity matrix of P × P dimensions and
West , and Sabatti and James  use a common variance ψ-2 for all dimensions P, while Ghahramani and Hinton , Utsugi and Kumagai , and Fokoue  allow the model to estimate a different variance in each dimension p. We also suggest the use of a second level of hyperpriors on the scale parameter of ψ-2 since it gives a greater flexibility to the model. The cost of this greater flexibility is that more parameters have to be estimated, but this disadvantage is compensated for by the easier assignment of the hyperparameters and the better estimation of the noise covariance matrix. We assign a Gamma prior on βΨ with parameters and . The posterior distribution is given by
Rotation of Λ matrix
We are usually interested in those rotations that result in interpretable factor loadings matrix. For example, a matrix that has as few nonzero loadings as possible. In a biological context that means that each gene is regulated by a small number of TFs. The algorithms of West  and Fokoue  implicitly look for sparse matrices. However, this is not true for the classical FA algorithm and the algorithms of Ghahramani and Hinton , and Utsugi and Kumagai . As shown in the results section, the performance of these algorithms can be improved by applying an additional orthogonal rotation Q on the learned factor loadings matrix that leads to a sparse one Λ rot , Λ rot = ΛQ.
Since different orthogonal rotation methods have different constraints as we discuss next, they can lead to different factor loadings matrix. Thus, a unique solution can not be achieved if a prior information regarding the position of the zeros in the factor loadings matrix is not given.
A number of metrics can be used as a measure of sparsity. For example, the varimax rotation  maximizes the row variances of the squares of the loadings.
Similarly, the quartimax rotation maximizes the column variances of the squares of the loadings (using that the sum of squares along columns is constant).
The equamax rotation is something between the varimax and quartimax rotation and gives better results for dense matrices.
We suggest a new method, the tanh rotation. It penalizes small deviations from zero but keeps the penalty constant for values far from zero.
where the parameter α determines the steepness of the tanh function.
Finally, the procrustes rotation  results in a factor loadings matrix Λ rot by minimizing the sum of squared differences to a target matrix T,
Thus, if the true factor loadings matrix is known, the procrustes method can be used to identify the best possible rotation. However, since this is not usually true for real data, the procrustes method can be used, for example, when assessing FA methods on synthetic data. That is, in this case the target matrix is the true matrix that we try to infer.
IP is supported by a BBSRC grant.
- 2.Aguilar O, West M: Bayesian dynamic factor models and portfolio allocation. Journal of Business and Economic Statistics 2000, 18: 338–357.Google Scholar
- 3.West M: Bayesian factor regression models in the "Large p, Small n" paradigm. Bayesian statistics 2003, 7: 733–742.Google Scholar
- 13.Ghahramani Z, Hinton G: The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96–1 1997.Google Scholar
- 16.Fokoue E: Stochastic determination of the intrinsic structure in Bayesian factor analysis. Technical Report 17 2004.Google Scholar
- 17.Pournara I: Reconstructing gene regulatory networks by passive and active Bayesian learning. PhD thesis. Birkbeck College, University of London; 2004.Google Scholar
- 18.Salgado H, Santos-Zavaleta A, Gama-Castro S, Millan-Zarate D, Diaz-Peredo E, Sanchez-Solano F, Prez-Rueda E, Bonavides-Martinez C, Collado-Vides J: RegulonDB (version 3.2): transcriptional regulation and operon organization in Escherichia coli K-12. Nucleic Acids Res 2001, 29: 72–74.PubMedCentralCrossRefPubMedGoogle Scholar
- 19.Factor analysis EM software[http://www.gatsby.ucl.ac.uk/~zoubin/software.html]
- 20.Tipping M: Sparse Bayesian Learning and the Relevance Vector Machine. Journal of Machine Learning Research 2001, 1: 211–244.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.