Time lagged information theoretic approaches to the reverse engineering of gene regulatory networks
A number of models and algorithms have been proposed in the past for gene regulatory network (GRN) inference; however, none of them address the effects of the size of time-series microarray expression data in terms of the number of time-points. In this paper, we study this problem by analyzing the behaviour of three algorithms based on information theory and dynamic Bayesian network (DBN) models. These algorithms were run on data sets of different sizes generated by synthetic networks. Experiments show that the inference accuracy of these algorithms reaches a saturation point beyond a specific data size, brought about by a saturation in the pair-wise mutual information (MI) metric; hence there is a theoretical limit on the inference accuracy of information theory based schemes that depends on the number of time points of microarray data used to infer GRNs. This suggests that MI might not be the best metric to use for GRN inference algorithms. To circumvent the limitations of the MI metric, we introduce a new method of computing time lags between any pair of genes and present the pair-wise time lagged Mutual Information (TLMI) and time lagged Conditional Mutual Information (TLCMI) metrics. Next we use these new metrics to propose novel GRN inference schemes that provide higher inference accuracy based on the precision and recall parameters.
It was observed that beyond a certain number of time-points (i.e., a specific size) of micro-array data, the performance of the algorithms measured in terms of the recall-to-precision ratio saturated due to the saturation in the calculated pair-wise MI metric with increasing data size. The proposed algorithms were compared to existing approaches on four different biological networks. The resulting networks were evaluated based on the benchmark precision and recall metrics and the results favour our approach.
To alleviate the effects of data size on information theory based GRN inference algorithms, novel time lag based information theoretic approaches to infer gene regulatory networks have been proposed. The results show that the time lags of regulatory effects between any pair of genes play an important role in GRN inference schemes.
Keywords: Mutual Information, Boolean Network, Minimum Description Length, Dynamic Bayesian Network, Joint Entropy
A GRN is a complex set of highly interconnected processes that govern the rate at which different genes in a cell are expressed in time, space, and amplitude. Such a network is commonly represented by many pairs of proteins and genes, in which one protein/gene regulates the abundance and/or activity of another protein/gene. GRNs can be modelled and simulated using various mathematical and computational approaches. The modelling and simulation of GRNs is performed over cDNA microarray data. There are two types of DNA microarray data: time series and time independent (or steady state). Time series data are obtained by sampling the measurement process over time, whereas time-independent data sets are obtained by recording the gene expressions from independent sources, for example, different individuals, tissues, and experiments. As time series data enable one to capture the time varying nature of a GRN, they are the preferred form of data used in reverse engineering algorithms. Moreover, time series data only give the expression levels of genes without any knowledge of other cellular elements like protein/metabolite concentrations. In this paper a GRN is represented as a graph which consists of a set of nodes that represent genes and a set of edges that represent the interactions between genes. Thus the GRN inference problem investigated in this paper refers to finding the regulatory relationships between the genes of an organism.
Reverse engineering of gene regulatory networks remains a major issue and area of interest in the field of bioinformatics and systems biology. A survey paper discusses a number of models related to this area, viz. Bayesian Networks, Dynamic Bayesian Networks, Boolean Networks, Probabilistic Boolean Networks [9, 10], Differential Equation Models and Information Theory Models [3, 12, 13, 14, 15]. There is no gold standard method to reverse engineer gene regulatory networks; each method has its own advantages and disadvantages. Based on simulations of different models, it has been observed that differential equation models and dynamic Bayesian networks provide higher accuracy, but they are computationally expensive and hence are applicable only to small data sets. Boolean networks can be used to study the coarse grained properties of genetic networks. Such a binary representation of gene expression is clearly an approximation, as most biological phenomena manifest their properties in the continuous domain. Even though it is inherently deterministic, the Boolean formalism has enjoyed success in predicting biological behaviour, such as the accurate qualitative distinction between known tumor sub-classes [9, 10]. This work suggests that meaningful biological information is not lost when measured, continuous-domain gene expression data are made binary. Information theoretic methods to reverse-engineer GRNs build on such Boolean network models of gene expression and have gained popularity due to their simplicity and lower computational cost. Each of the information theoretic schemes discussed in this paper, as well as DBNs, can however be easily extended to handle multiple levels of quantization to achieve higher inference accuracy at the cost of computational overhead.
ARACNE and REVEAL are two popular information theoretic approaches to GRN inference. Both of these methods establish relationships between genes based on the MI metric. Zhao analyzed the limitations of MI and proposed the conditional mutual information (CMI) based approach to infer GRNs. One of the major disadvantages of information theoretic approaches is the selection of the MI and CMI thresholds, for which Zhao proposed the Minimum Description Length (MDL) principle and showed its effectiveness in selecting the best MI threshold. The MDL principle states that if multiple theories exist, the one with the minimum description length is optimal. However, the definition of description length varies for different models and applications. In their MDL implementation, Zhao defined the description length as the sum of the model length (expressed as the memory usage of the inference algorithm) and the data length (expressed as the overall entropy of the inferred network). One limitation of the MDL principle was that the model length quantity in the description length expression could make the implementation arbitrary. To circumvent this problem, we have earlier proposed the Predictive Minimum Description Length (PMDL) principle approach, wherein we showed that by removing the model length quantity from the description length and using CMI, a higher inference accuracy can be obtained.
MI and CMI metrics are central in establishing the relationships between genes in information theory models. Hence, in order to design a smart GRN inference algorithm, it is important to study the behaviour of these MI and CMI metrics on microarray data of various sizes. The MDL implementation of Zhao  will henceforth be referred to as “network MDL” in the rest of the paper.
Another major disadvantage in information theory based models is that MI and CMI do not give directions between relationships. A unit time delay was assumed in our earlier PMDL  implementation. Zou  showed that the time lags in regulating one gene by another play an important role in inference accuracy as evident in their Dynamic Bayesian Network based approach. To incorporate the effects of time-lags in information theoretic methods, we propose a new time lag computation method in this paper, which is used to modify the standard MI and CMI computations. Based on our modified MI and CMI metrics, we next present novel time-lagged GRN inference schemes that show promising results in terms of improving the inference accuracy.
We show that the performance of the inference algorithms saturates beyond a certain data size due to the saturation in the information theory metric mutual information. Note that we have only varied the data size in our experiments to understand the effects of regulatory time-lags between genes on the algorithms. The overall performance of the algorithms would also be affected by other factors (e.g., the number of replicates and the number of external chemicals used in the experiments, to name a few), which might lead to other innovations that need to be considered in designing reverse-engineering schemes. This is however outside the scope of this paper.
A new way of computing time lags between any pair of genes is presented. Our scheme makes sure that time lags cannot be negative and we argue that a more biologically pragmatic view is that a gene can affect another gene only when it is up-regulated. This assumption makes more sense in the Boolean network formalism of GRNs where a gene can only be in two possible states: ON (i.e., up-regulated) and OFF (i.e., down-regulated).
We introduce the time lagged Mutual Information (TLMI) and time lagged Conditional Mutual Information (TLCMI) quantities.
We present novel GRN inference schemes based on TLMI, TLCMI, MDL and PMDL principles that provide higher accuracy over the existing information theoretic methods.
In this section, we first report the results of the existing inference schemes that were run on the time-series micro-array data of varying size and illustrate that the performance of the methods saturates beyond a certain number of time points. We also report how the pair-wise MI metric saturates beyond a certain data size. We then present our new time lag computation scheme and the modified version of the network MDL algorithm wherein, we replace the MI metric which considers unit time delay with the TLMI metric (considering a time-lag of τ). We next present a modified version of the PMDL algorithm, by replacing the MI and CMI metrics with the TLMI and TLCMI metrics. Finally the results from the network MDL, PMDL and modified network MDL and PMDL algorithms are compared.
Parameters to evaluate inference accuracy
Benchmark measures recall R and precision P are used to evaluate the performance of the algorithms. Although different definitions for recall and precision exist in the literature , in this paper, R is defined as Ce/(Ce+Me) and P is defined as Ce/(Ce+Fe), where Ce denotes the edges that exist in the true network and in the inferred network, Me are the edges that exist in the true network but not in the inferred network, and Fe are edges that do not exist in the true network but do exist in the inferred network.
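The definitions above can be illustrated with a minimal sketch. It assumes that each network is given as a set of directed (regulator, target) pairs; the function name `precision_recall` and the convention of returning 0.0 when a denominator is zero are choices made here for illustration only.

```python
def precision_recall(true_edges, inferred_edges):
    """Return (recall, precision) for two collections of directed edges."""
    true_edges = set(true_edges)
    inferred_edges = set(inferred_edges)
    ce = len(true_edges & inferred_edges)  # Ce: edges in both networks
    me = len(true_edges - inferred_edges)  # Me: true edges that were missed
    fe = len(inferred_edges - true_edges)  # Fe: inferred edges that are false
    recall = ce / (ce + me) if (ce + me) else 0.0
    precision = ce / (ce + fe) if (ce + fe) else 0.0
    return recall, precision


true_net = {("A", "B"), ("B", "C"), ("C", "D")}
inferred_net = {("A", "B"), ("D", "A")}
r, p = precision_recall(true_net, inferred_net)
print(r, p)  # Ce = 1, Me = 2, Fe = 1
```

In this toy case one of three true edges is recovered and one of two inferred edges is correct, so R = 1/3 and P = 1/2.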
Synthetic data generation methodology for the in silico experiments
It was imperative for us to use synthetic data over time series micro-array experimental data in this phase due to the following reasons:
Very few experimental data sets have equal time intervals between experiments and also the data size is generally limited to around 20 time points. In our in silico runs, we wanted to keep equal time intervals between each time point data such that we can understand the true effects of regulatory time-lags between genes on the inference accuracy. It is generally not possible to assign a single time-lag value to a gene-pair if the expression readings for each time point were not evenly spaced as mentioned in Zou .
Also, the saturation in inference accuracy generally requires a larger data size (> 30 time points as shown later) and it would have been difficult to identify the role of MI in bringing about this theoretical limit on the accuracy of information theoretic schemes with a smaller biological data set (of ~20 time points).
It should be noted that the Genenetweaver software derives the in silico GRNs from the prior knowledge database of yeast (Saccharomyces cerevisiae) which contains 4441 genes and 12873 interactions. Thus in order to create a sample GRN with 10 nodes, Genenetweaver clusters the yeast transcriptomic network into modules and chooses the module having number of genes closest to the given input (in this case 10 genes) to create the in silico network. Each such module maps to a particular biological function and this strategy essentially ensures that there is minimum cross-talk of these set of genes with the others in the yeast network resulting in a higher efficacy of the inference algorithms that use them.
Biological network data generation methodology to evaluate performance of proposed algorithms
The time series DNA microarray data from Spellman et al. were used to infer gene regulatory networks using the proposed algorithms. The Spellman experiment was chosen because it provides a comprehensive series of gene expression datasets for the yeast cell cycle. Four time series expression datasets were generated using four different cell synchronization methods: cdc15, cdc28, alpha-factor and elutriation, with 24, 17, 18 and 14 time points respectively. The alpha-factor dataset contained more time points than the cdc28 and elutriation datasets, and fewer missing values than cdc15. Therefore, we chose to use time series expression data from the alpha-factor method to infer the gene regulatory networks.
We used the same preprocessing steps as in . Initially the data is quantized to 0 or 1. In order to quantize the expression values of every gene, they are sorted in ascending order and the first and last values of the sorted list are discarded as outliers; then the upper 50% is converted to 1 and the lower 50% is converted to 0. Any missing time points are set to the mean of their respective neighbors. If the missing time point is the first or the last one, it is set to the nearest time point value.
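The preprocessing described above might be sketched as follows. Several details are assumptions made for this illustration and are not stated in the text: missing values are encoded as `None` and are filled before quantization, the cut between the "lower 50%" and "upper 50%" is taken as the median of the trimmed sorted list, values equal to the cut are mapped to 0, and each gene has at least three observed time points. The function names are hypothetical.

```python
from statistics import median


def fill_missing(values):
    """Fill None entries: interior gaps get the mean of the nearest observed
    neighbours, gaps at either end copy the nearest observed value."""
    filled = list(values)
    n = len(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = next((values[j] for j in range(i - 1, -1, -1)
                     if values[j] is not None), None)
        right = next((values[j] for j in range(i + 1, n)
                      if values[j] is not None), None)
        if left is not None and right is not None:
            filled[i] = (left + right) / 2
        else:
            filled[i] = left if left is not None else right
    return filled


def quantize(values):
    """Quantize one gene's expression profile to 0/1."""
    filled = fill_missing(values)
    trimmed = sorted(filled)[1:-1]  # discard smallest and largest as outliers
    cut = median(trimmed)           # assumed boundary between the two halves
    return [1 if v > cut else 0 for v in filled]


profile = [0.1, None, 0.5, 0.9, 0.2, 5.0, None]
print(quantize(profile))
```

The outliers are only excluded when the cut is determined; they are still quantized like every other time point, which is one possible reading of the procedure.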
Four separate biological networks (as discussed later) used for comparison purposes were derived from the yeast cell cycle pathway [26, 27, 28]. The fine tuning parameter required by the network MDL based algorithms is set to 0.1 to retain most of the connections (see  for more details on this).
Effects of data size on GRN inference
Effects on Information Theory models
For the PMDL algorithm, it was observed that the precision increased until 55 time points, and, beyond that, the precision remains relatively stable for the two smaller tested networks (with 20 and 30 genes respectively). For the larger network (with 40 genes), the precision increased until 70 time points before saturation. The recall for PMDL algorithm increased until 40 time points before saturation for each of the 3 tested networks. For the network MDL algorithm, it was observed that precision increased until 35 time points and fluctuated after that. The recall for the network MDL algorithm kept increasing for all the test cases with considerable fluctuations.
Performance saturation points
Method; No. of Genes
30, 45, 70
Effects on DBN based scheme
Why do information theory based models saturate?
The plots indicate that the saturation in the methods was due to the saturation in the mutual information quantity, which goes close to zero even though the entropy increases in the network. Conceptually, this means that there is room to improve on the inference accuracy (due to high entropy), yet the mutual information metric will not be able to point us in the right direction. Other information theoretic algorithms, like REVEAL, use the ratio of MI and entropy to infer the network for this purpose, which supposedly gives good performance. However, from the entropy and mutual information curves in Figure 5, we can see that the ratio of mutual information and entropy will also saturate as the entropy increases in the network, and hence this ratio might also not be the right metric to achieve better accuracy by making use of data from more time points. The recently proposed Directed Mutual Information metric might be a better metric than the conventional MI. We plan to conduct similar studies on the performance of GRN inference algorithms based on these different metrics as a function of the number of time points in the future. It is imperative to identify the right metric so that the research community can decide which class of GRN inference algorithms works best with time-series data and also understand the ideal data size for them.
The saturation in MI due to increasing number of time points would suggest that the MI should not be computed for the entire range of time points of micro-array data available from the experiments. GRNs are inherently time varying, and hence the pair-wise MI between any 2 genes needs to be computed over the time range where the first gene will have substantial regulatory effect on the other one. This can be best approximated by estimating the regulatory time-lags between each gene pair, and subsequently computing the MI between them for this particular time range. This concept was used to compute the time-lags between genes and the TLMI and TLCMI metrics as discussed in the Methods section. Note that, the time lag computation concept initially proposed in  to implement time-lagged DBN needs to change to avoid the case of negative time-lags.
TLMI based network MDL implementation
TLMI and TLCMI based PMDL Implementation
Performance of MI and TLMI based PMDL
Time Lagged PMDL
Performance: Time and Space complexities of proposed algorithms
Step 4 of the PMDL algorithm iterates $n^2(m - \tau)$ times, where $n$ is the number of genes, $m$ is the number of time points, and $\tau$ is the time lag. From line 5 to line 18 the algorithm iterates $n^4$ times, and lines 15 and 16 of the algorithm iterate $n^3(m - \tau)$ times. Finally, from lines 20 to 31 the algorithm iterates $n^3$ times. Thus the time complexity of the overall algorithm is $\Theta(n^4 + n^3(m - \tau))$.
From the time complexity it can be seen that if the number of genes is larger than the number of time points, then the run time of the PMDL algorithm is dominated by the number of genes, and if the number of time points is larger than the number of genes, then the run time is dominated by the number of time points. In the worst case $\tau$ is zero for all genes and the algorithm runs in $\Theta(n^4 + n^3 m)$ time.
Restrict the number of parents.
Take the next smallest description length, instead of using the smallest one.
The first approach will guarantee results when the number of parents is restricted to a small value, but this may lower the accuracy of the result. The second approach may take more time to run, but as we are not restricting the number of parents, the accuracy of the algorithm is not affected. Some benchmarking studies are required on the above two approaches to see which one works best.
In the MDL based implementation we discard lines 20 to 31 of the PMDL based implementation. The worst case time complexity is again $\Theta(n^4 + n^3 m)$.
In this paper, we have studied the effects of cDNA microarray data size on three algorithms: PMDL, network MDL, and a DBN based approach. The study shows that the data size plays an important role in the inference accuracy of each of these algorithms. The experiments were carried out on synthetically generated time-series data and the performance saturation points were listed for these algorithms. The immediate benefit of this work lies in helping biologists to devise cDNA microarray experiments intelligently depending on the class of GRN inference algorithms they are likely to use to achieve maximum accuracy. In a bid to understand the performance saturation of the information theoretic approaches, we also found out that mutual information saturates and effectively tends to zero as the entropy in the network increases with increase in the data size. These observations lead us to believe that MI by itself might not be the best metric in devising information theoretic approaches for GRN inference. The DBN approach however showed good performance only for a smaller data size which is non-intuitive and requires further analysis for validation. Based on these findings, we introduced two new information theory metrics viz. TLMI and TLCMI and used them in the network MDL and PMDL based algorithms to develop two novel GRN inference algorithms. The results indicate that transcriptional time lags play an important role in gene regulatory network inference methods as evidenced by the higher accuracy provided by our algorithm.
The concept of time lags was first introduced by Zou , where they proposed that the time difference between the initial expression change of a potential regulator (parent) and its target gene represents a biologically relevant time period. Here potential regulators are those set of genes whose initial expression change happened before the target gene. Also initial expression change is up or down-regulation (ON or OFF) of genes.
In biological networks the A ↔ B schema is quite common. Hence Zou's time lag computation scheme needs to change to handle such cases. We also argue that a gene can affect another gene only when it is up-regulated (ON). Based on the above discussion, we define the time lag as the difference between the initial up-regulation of the first gene and the initial expression change of the second gene after the up-regulation of the first gene. This solves the issue of negative time lags, besides being biologically more relevant than the existing method of calculating time lags. Figure 10 also illustrates the proposed time lag computation based on our approach. In the figure, Ua and Ub indicate the initial up-regulation of genes A and B at time points two and three respectively. Ca and Cb indicate that time points six and three are the time points where the expression values of genes A and B changed after the initial up-regulation of genes B and A respectively. The time lag between A and B is calculated as τ1 = Cb - Ua and the time lag between B and A is calculated as τ2 = Ca - Ub. In this example the time lag between A and B is one and the time lag between B and A is three.
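The proposed time lag computation can be sketched on quantized (0/1) profiles as below. This is only an illustration under stated assumptions: time points are numbered from 1, "initial up-regulation" is taken to be the first OFF to ON transition, an "expression change" is any difference from the previous time point, and `None` is returned when no lag can be determined. The two example profiles are made up so as to reproduce the values quoted for Figure 10 (Ua = 2, Ub = 3, Cb = 3, Ca = 6).

```python
def first_up_regulation(series):
    """1-based time point of the first 0 -> 1 transition, or None."""
    for t in range(1, len(series)):
        if series[t - 1] == 0 and series[t] == 1:
            return t + 1
    return None


def first_change_after(series, start):
    """1-based time point of the first expression change strictly after
    time point `start`, or None if the gene never changes again."""
    for t in range(max(start, 1), len(series)):  # index t is time point t + 1
        if series[t] != series[t - 1]:
            return t + 1
    return None


def time_lag(regulator, target):
    """Lag between the regulator's initial up-regulation and the target's
    first subsequent expression change (never negative by construction)."""
    u = first_up_regulation(regulator)
    if u is None:
        return None
    c = first_change_after(target, u)
    return None if c is None else c - u


gene_a = [0, 1, 1, 1, 1, 0, 0, 0]  # up-regulated at time point 2, changes at 6
gene_b = [0, 0, 1, 1, 1, 1, 1, 1]  # up-regulated at time point 3
print(time_lag(gene_a, gene_b), time_lag(gene_b, gene_a))
```

Because the target's change is searched only after the regulator's up-regulation, both directions of an A ↔ B pair receive a positive lag.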
Information Theoretic metrics
Entropy, joint entropy, mutual information and conditional mutual information
Entropy, H, is the measure of the average uncertainty in a random variable. If $p_i$ is the probability of observing a particular symbol $i$ in a sequence, then the entropy is given as $H = -\sum_i p_i \log p_i$.
As the proposed algorithm quantizes the microarray data to two levels, a gene takes two values, 0 and 1, corresponding to being in the OFF and ON states respectively. In this case, the entropy of a gene A is defined as $H(A) = -p_0 \log p_0 - p_1 \log p_1$, where $p_0$ and $p_1$ are the probabilities of observing gene A as 0 and 1 respectively over the sequence A. Here sequence A contains the values taken by a gene in the time series data; thus if we have time series data of m time points, then sequence A is of length m.
This standard definition of entropy has been used in the MDL and PMDL schemes where the entropy was computed for sequence length of m-1 to simulate a default time-lag of 1, i.e., compute the entropy between A (considering its expression values from 1, 2,..., m-1) and B (considering its expression values from 2, 3,..., m). In order to implement the TLMI and TLCMI metrics, we need to compute the entropies between A (considering its expression values from 1, 2,..., m-τ) and B (considering its expression values from τ+1, τ+2,..., m).
The joint entropy between two sequences A and B, H(A, B), is defined as $H(A, B) = -\sum_i \sum_j p_{i,j} \log p_{i,j}$, where $p_{i,j}$ is the joint probability of observing symbol $i$ in A and symbol $j$ in B.
Thus the joint entropy between two variables is an extension of entropy where the two sequences (A, B) are considered to be a single vector valued random variable dependent on each other.
As stated before, the proposed algorithm quantizes the microarray data into two levels. In this case the joint entropy between two sequences A and B, H(A, B), is defined as $H(A, B) = -p_{0,0} \log p_{0,0} - p_{0,1} \log p_{0,1} - p_{1,0} \log p_{1,0} - p_{1,1} \log p_{1,1}$, where $p_{0,0}$, $p_{0,1}$, $p_{1,0}$ and $p_{1,1}$ are the probabilities of observing both zeros, a zero and a one, a one and a zero, and both ones in sequences A and B respectively.
Mutual Information measures the amount of information that can be obtained about one random variable by observing another one .
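The quantities named in this subsection can be estimated from quantized sequences as in the following sketch. It relies on the standard identities MI(A;B) = H(A) + H(B) − H(A,B) and CMI(A;B|C) = H(A,C) + H(B,C) − H(C) − H(A,B,C); the use of base-2 logarithms, empirical frequencies as probability estimates, and the function names are assumptions of this illustration.

```python
from collections import Counter
from math import log2


def entropy(*seqs):
    """Empirical (joint) entropy in bits of one or more equal-length sequences."""
    n = len(seqs[0])
    counts = Counter(zip(*seqs))  # frequency of each observed symbol tuple
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(a, b):
    """MI(A;B) = H(A) + H(B) - H(A,B)."""
    return entropy(a) + entropy(b) - entropy(a, b)


def conditional_mutual_information(a, b, c):
    """CMI(A;B|C) = H(A,C) + H(B,C) - H(C) - H(A,B,C)."""
    return entropy(a, c) + entropy(b, c) - entropy(c) - entropy(a, b, c)


a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
c = [x ^ y for x, y in zip(a, b)]  # c is the XOR of a and b
print(mutual_information(a, b))                 # a and b look independent
print(conditional_mutual_information(a, b, c))  # but are dependent given c
```

The XOR example shows why CMI is useful alongside MI: two genes with zero pair-wise MI can still be fully dependent once a third gene is taken into account.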
Time lagged mutual information (TLMI) and time lagged conditional mutual information (TLCMI)
After implementing time lags, we no longer compute the entropy and joint entropy over the complete sequences of gene(s) as discussed before. If a time lag τ is computed between two genes A and B we remove the last τ symbols of sequence A and first τ symbols of sequence B to obtain reduced sequences (of length m-τ each) for A and B respectively. Computing the MI over these reduced sequences gives TLMI.
TLMI is not a symmetric quantity though, i.e. $TLMI(A;B) \neq TLMI(B;A)$.
Considering a time lag, τ, between A and B, we compute TLCMI(A;B|C) by deleting the last τ symbols in sequences A and C (i.e., look into the sequences from time points 1, 2,..., m-τ) and first τ symbols in sequence of B (i.e., look into the sequences from time points τ+1, τ+2,..., m) and computing the CMI of these reduced sequences to obtain TLCMI.
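The sequence trimming described for TLMI and TLCMI can be sketched as follows. The helper `entropy` (empirical entropy in bits), the function names `tlmi` and `tlcmi`, and the example profiles are assumptions made for illustration; the time lag τ is passed in as `tau` and is assumed to be smaller than the sequence length.

```python
from collections import Counter
from math import log2


def entropy(*seqs):
    """Empirical (joint) entropy in bits of one or more equal-length sequences."""
    n = len(seqs[0])
    counts = Counter(zip(*seqs))
    return -sum((c / n) * log2(c / n) for c in counts.values())


def tlmi(a, b, tau):
    """MI between A at time points 1..m-tau and B at time points tau+1..m."""
    a_red = a[:len(a) - tau]  # drop the last tau symbols of A
    b_red = b[tau:]           # drop the first tau symbols of B
    return entropy(a_red) + entropy(b_red) - entropy(a_red, b_red)


def tlcmi(a, b, c, tau):
    """CMI(A;B|C) with A and C trimmed at the end and B trimmed at the start."""
    a_red = a[:len(a) - tau]
    c_red = c[:len(c) - tau]
    b_red = b[tau:]
    return (entropy(a_red, c_red) + entropy(b_red, c_red)
            - entropy(c_red) - entropy(a_red, b_red, c_red))


gene_a = [0, 1, 0, 1, 1, 0, 0, 1]
gene_b = [0, 0, 0, 1, 0, 1, 1, 0]  # follows gene_a with a delay of two points
print(tlmi(gene_a, gene_b, 2), tlmi(gene_b, gene_a, 2))
```

With a lag of two, the trimmed profiles of `gene_a` and `gene_b` coincide, so the TLMI from A to B is high, whereas the reverse direction is close to zero. This is the asymmetry noted above.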
Minimum Description Length (MDL) and Predictive Minimum Description Length (PMDL) Principle
Estimating the MI threshold is one of the major drawbacks of information theory based models. Zhao first proposed the MDL principle to solve the problem. Based on the microarray data, the MI matrix is computed. If the microarray data has n genes, then two $n \times n$ matrices, viz. one connectivity matrix and one MI matrix, are stored. A time lag of one unit is assumed, thus the MI computations are not symmetric. Using every MI value as a threshold over the MI matrix, $n^2$ models are obtained. For every model, the description length (model length + data length) is computed, and the model with the minimum description length is selected as the best model. This algorithm involves a fine tuning parameter for the model length, which also makes the MDL principle method arbitrary. To overcome this issue we earlier proposed a PMDL based inference algorithm. In the PMDL principle, we discard the model length and compute only the data length of each model. This increased both the true edges and the false edges in the inferred networks. In order to reduce the false edges, CMI was applied over the best model, and hence the overall PMDL based approach improved the inference accuracy over the conventional network MDL scheme.
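The threshold search described above can be outlined as in the sketch below. It is not the implementation of either algorithm: the description length is deliberately left as a caller-supplied function (`description_length`, a hypothetical name), an edge is assumed to be kept when its MI value is at least the threshold, and the toy scoring function in the example is a stand-in with no relation to the model length or data length used in the paper.

```python
def select_model(mi, description_length):
    """Try every distinct MI value as a threshold and keep the connectivity
    matrix with the smallest description length.

    mi                 -- n x n list of lists; mi[i][j] scores the edge i -> j
    description_length -- callable mapping a connectivity matrix to a number
    """
    n = len(mi)
    best = None
    for threshold in sorted({v for row in mi for v in row}):
        connectivity = [[1 if mi[i][j] >= threshold else 0 for j in range(n)]
                        for i in range(n)]
        score = description_length(connectivity)
        if best is None or score < best[0]:
            best = (score, threshold, connectivity)
    return best


mi_matrix = [[0.00, 0.80, 0.10],
             [0.05, 0.00, 0.60],
             [0.20, 0.02, 0.00]]


def toy_length(conn):
    """Stand-in score for illustration only: prefers exactly two edges."""
    return abs(sum(map(sum, conn)) - 2)


score, threshold, connectivity = select_model(mi_matrix, toy_length)
print(threshold, connectivity)
```

Under the PMDL variant described in the text, the scoring function would contain only a data length term, and CMI would then be applied to the selected model to prune false edges.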
Time lagged based MDL and PMDL implementation
The basic information theoretic metrics with unit time lag were replaced in the network MDL and PMDL algorithms with the TLMI and TLCMI metrics. Figure 9 illustrates the existing and proposed algorithms. Figure 9A shows the MI and TLMI based PMDL algorithm. While both the existing and our proposed algorithms are shown in the same figure, it is to be noted that the difference lies in using the right information theoretic metrics (lines 4, 19 and 24 in Figure 9A). Figure 9B shows the MI and TLMI based network MDL algorithms. Further explanations on the PMDL and network MDL algorithms can be found in  and  respectively.
This work was supported by NSF through contract EPS-0903787 awarded to CZ and PG1, and the US Army Corps of Engineers Environmental Quality Program under contract # W912HZ-08-2-0011. Permission was granted by the Chief of Engineers to publish this information.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 6, 2010: Proceedings of the Seventh Annual MCBIOS Conference. Bioinformatics: Systems, Biology, Informatics and Computation. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S6.
- 1. Junker BH, Schreiber F: Analysis of biological networks. Wiley-Interscience; 2008.
- 5. Heckerman D, Geiger D, Chickering DM: Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 1995, 20: 197–243.
- 6. Akutsu T, Miyano S, Kuhara S: Algorithms for inferring qualitative models of biological networks. Pacific Symposium on Biocomputing 2000, 4: 17–28.
- 7. Murphy K, Mian S: Modelling gene expression data using dynamic Bayesian networks. In Technical report, Computer Science Division University of California, Berkeley, CA 1999.
- 11. Chen T, He HL, Church GM: Modeling gene expression with differential equations. Pacific Symposium Biocomputing 1999, 4: 29–40.
- 13. Dougherty J, Tabus I, Astola J: Inference of gene regulatory networks based on a universal minimum description length. EURASIP Journal on Bioinformatics and Systems Biology 2008. Article ID: 482090, 11 pages.
- 15. Shoudan L: REVEAL: A general reverse engineering algorithm for inference of genetic network architectures. Pacific Symposium on Biocomputing 1998, 3: 18–29.
- 17. Chaitankar V, Zhang C, Ghosh P, Perkins EJ, Gong P, Deng Y: Gene Regulatory Network Inference Using Predictive Minimum Description Length Principle and Conditional Mutual Information. Proceedings of International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing 2009, 487–490.
- 19. Cover TM, Thomas JA: Elements of information theory. Wiley-Interscience, New York 1991.
- 21. Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, Stolovitzky G: Revealing strengths and weaknesses of methods for gene network inference. PNAS, in press.
- 24. Rao A, Hero AO 3rd, States DJ, Engel JD: Inference of Biologically Relevant Gene Influence Networks Using the Directed Information Criterion. ICASSP Proceedings 2006., 2(II-II):
- 25. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Molecular Biology of the Cell 1998, 9: 3273–3297.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.