Integrated functional networks of process, tissue, and developmental stage specific interactions in Arabidopsis thaliana
Recent years have seen an explosion in plant genomics, as the difficulties inherent in sequencing and functionally analyzing these biologically and economically significant organisms have been overcome. Arabidopsis thaliana, a versatile model organism, represents an opportunity to evaluate the predictive power of biological network inference for plant functional genomics.
Here, we provide a compendium of functional relationship networks for Arabidopsis thaliana leveraging data integration based on over 60 microarray, physical and genetic interaction, and literature curation datasets. These include tissue, biological process, and development stage specific networks, each predicting relationships specific to an individual biological context. These biological networks enable the rapid investigation of uncharacterized genes in specific tissues and developmental stages of interest and summarize a very large collection of A. thaliana data for biological examination. We found validation in the literature for many of our predicted networks, including those involved in disease resistance, root hair patterning, and auxin homeostasis.
These context-specific networks demonstrate that highly specific biological hypotheses can be generated for a diversity of individual processes, developmental stages, and plant tissues in A. thaliana. All predicted functional networks are available online at http://function.princeton.edu/arathGraphle.
Keywords: Gene Pair; Specific Biological Process; Plant Ontology; Bayesian Integration; Flower Development Stage
Plants are complex and diverse organisms and have adapted evolutionarily to almost every ecological niche on the planet. Agricultural and pharmaceutical applications of plant genomics have focused on understanding the metabolic and biochemical potential of specific plant tissues and environmental responses. Arabidopsis thaliana is the most common model organism for plants, with a short life cycle, relatively few genes, and a fully sequenced genome. It is a multicellular organism with multiple tissue types and developmental stages, and much of its tissue-specific and stage-specific molecular biology has yet to be determined.
Many A. thaliana gene products are functional only in a specific tissue or during a specific developmental period [3, 4]. The ability to predict tissue- or development-stage-specific function from genomic data would aid in appropriately targeting experimental work; doing experiments on every plant structure at each of its development stages individually would be tedious and costly. Additionally, it would be challenging to summarize the resulting genomic data efficiently, since the combinatorics of 30 developmental stages by over 50 plant structures makes a large compendium of predictions unwieldy as raw data. With this as motivation, we have created probabilistic networks providing a data-driven view of protein functional relationships and co-expression in A. thaliana. A functional relationship between two genes indicates that their products are used by the cell to perform a particular biological process (for example, two proteins both participating in the DNA damage response). We assign a probability of interaction between all gene pairs in a specific biological context of interest based on experimental data and expert annotations of such relationships from controlled vocabularies.
Tools like Genevestigator https://www.genevestigator.com, AtGenExpress Visualization Tool http://jsp.weigelworld.org/expviz/expviz.jsp, and ATTED-II http://atted.jp/ enable analysis of expression patterns across microarrays of different types and platforms, but none of these three employ active gene function or functional relationship prediction. In general, each takes a set of genes as input and aggregates raw microarray experimental results into informative plots and tables, for example showing how experiments cluster by plant tissue. ATTED-II also integrates a large collection of microarray experiments and utilizes co-expression between gene pairs to suggest genes functionally related to a query. However, these tools do not provide genes related within specific biological processes, tissues, or developmental contexts. Additional tools such as Genemania http://genemania.org, AraNet http://www.functionalnet.org/aranet/, and STRING http://string-db.org/ do provide data integration for Arabidopsis thaliana; however, again, none of these provide tissue, development, or biological context specific inferences. Adding such information improves predictions, as is shown in Additional file 1, in which the inclusion of development-specific information consistently improves the accuracy of functional predictions.
We have integrated the abundance of genomic data for A. thaliana (over 60 datasets) to construct a compendium of biological networks describing functional relationships and co-expression among A. thaliana genes. This compendium demonstrates the usefulness of data integration and includes networks that are "global" in the sense that they describe the overall set of functional interactions predicted to occur among A. thaliana proteins, independently of plant tissue, developmental stage, or environmental context. However, most networks in this compendium are context-specific: they describe only the functional relationships predicted to occur at a specific time or in a specific tissue. Context-specific data integration does not use all gold standard genes for training. Rather, it trains and evaluates using a subset of genes present in the biological process, tissue, or development stage of interest. The integration up- or down-weights each integrated dataset on a per-context basis, emphasizing experimental results that are particularly informative in each biological area of interest, and it has been shown to significantly increase predictive accuracy in other organisms [8, 9]. In this way, biological researchers can use the system to determine whether a gene or genes of interest behave differently in various development stages or if they are active only in specific parts of the plant.
Here, we investigate over 300 resulting global and context-specific functional networks generated for A. thaliana biological processes, tissues, and developmental stages. We evaluated these networks computationally to determine the accuracy of their predictions, and we found that genomic datasets are differentially informative across varied contexts. Gene products' predicted roles and interactions also varied, and we found validation in the literature for specific interactions for many proteins. We highlight several of these interactions for a diversity of developmental and physiological processes, including those for PHOSPHOENOLPYRUVATE/PHOSPHATE TRANSLOCATOR 2 (AtPPT2) during leaf and root developmental stages, the disease resistance proteins RESISTANCE TO PSEUDOMONAS SYRINGAE pv. MACULICOLA 1 (RPM1) and RESISTANCE TO PSEUDOMONAS SYRINGAE 2 (RPS2), the root epidermal patterning protein WEREWOLF (WER), and the auxin hormone receptor TRANSPORT INHIBITOR RESPONSE 1 (TIR1). Finally, we provide an intuitive, interactive representation of these results online at http://function.princeton.edu/arathGraphle.
Results and Discussion
Overview of integrated functional networks inferred for A. thaliana pathways, tissues, and developmental stages
Table 1. Global and context-specific functional relationship networks.
Network class | Description | Number of networks | Evaluation (AUC range)
GLOBAL-PROCESS | Global functional network linking genes active in similar biological pathways and processes | 1 |
GLOBAL-DEVEL | Global functional network linking genes active in the same developmental stage(s) | 1 |
PROCESS | Networks linking genes active in similar pathways only within the context of each specific biological process | 208 | 0.46 - 0.79
DEVEL | Networks linking genes active in similar developmental stages only within the context of each specific developmental stage | 19 | 0.43 - 0.74
PROCESS-DEVEL | Networks linking genes active in the same pathways during the same developmental stage | 40 | 0.46 - 0.82
TISSUE-DEVEL | Networks linking genes active in the same plant tissues during the same developmental stage | 44 | 0.5 - 0.78
We additionally inferred two compendia of context-specific networks, each describing functional relationships between genes predicted to occur only during a specific biological process or developmental stage. Creating biological process-specific networks (i.e. context-specificity) has been explored for the yeast and human genomes [8, 13] and provides a more specific view of genes and their functional interactions tailored to individual biological areas of interest. Here, we expand context-specific inference to include developmental stages and plant tissues in addition to biological processes and pathways. As described in Table 1, this resulted in the PROCESS and DEVEL compendia of networks. Each PROCESS network represents the functional relationships predicted to occur during a specific biological process (e.g. autophagy, the cell cycle, photosynthesis, and so forth), and genes linked with high probability are expected to co-participate in this process. Each DEVEL network represents a plant developmental stage (germination, senescence, etc.), and genes linked with high probability are expected to be co-active in that stage.
Finally, in order to investigate the interactions among biological processes, temporal developmental stages, and spatial locality in tissues, we generated two additional network compendia. The first, PROCESS-DEVEL, includes 40 networks each specific to a process/developmental stage pair (e.g. photosynthesis during leaf senescence). Only 40 of the ~4,000 possible pairs were analyzed due to a lack of curated training data for the remaining process/stage combinations. Similarly, the TISSUE-DEVEL compendium includes 44 networks, each predicting gene pairs expected to be co-active in a specific tissue location and at a specific time during development. All networks in these compendia were inferred using probabilistic Bayesian reweighting of 60 genomic datasets, and the results are analyzed in detail below.
Context-specific data integration improves predictive accuracy
Development stages and tissues/biological processes of interest | Interaction with development
C globular stage | Strong
D bilateral stage; embryo development stages; flower development stages | Strong
0 germination; floral organ development stages; flower development stages | Weak
The globular stage and meristem combination network has the highest AUC in the TISSUE-DEVEL compendium, and the globular stage is indeed when primary meristems produce new cells that will ultimately differentiate and patterning of the shoot and root apical meristems begins. The globular stage also has a high AUC with other tissues (leaf, root, and seed) and biological processes (the organismal physiological process, the reproductive physiological process, and transcription), suggesting that meristem activity in these tissues is prominent and significant. Other predictions for the meristem are also informative: in the bilateral stage, the meristems become distinguished as shoot and root meristems; in the embryo development stages, the embryo develops radial patterning and primary shoot meristems are formed; and in the flower development stage, floral meristem genes help the transition from shoot to floral meristem. All of these TISSUE-DEVEL networks achieve high AUCs. In contrast, a specialized tissue like the carpel has both low and high predictive power across development stages. Since the stigma, not the carpel, is the receptive tissue where pollen germination happens, accuracy is low in the pollen germination development stage but higher in the flower development stage and floral organ development stages.
Bayesian integration highlights experimental datasets informative in specific biological contexts of interest
Regularization of Bayesian network parameters using dataset mutual information efficiently increases prediction accuracy
Naïve Bayesian models assume independence between all input datasets, which can artificially inflate predicted probabilities when this assumption is violated (e.g. when multiple very similar datasets are integrated). Conversely, a full Bayesian model accounting for naturally-occurring dependencies (similar experimental conditions, platform and lab effects, etc.) would be inefficient to learn and evaluate using dozens of whole-genome datasets. Our solution to this issue was to regularize the Bayesian learning process using mutual information between datasets as a prior to upweight or downweight the total possible contribution of each dataset. This mixes a uniform prior with each dataset's predictions, weighted relative to the amount of information it shares with other datasets, and does so as a preprocessing stage without diminishing the efficiency of naive Bayesian learning and inference. We show in Additional file 2 that regularization is critical to the accuracy of our networks (the GLOBAL-PROCESS network substantially outperforms the GLOBAL-PROCESS without regularization; similarly, the GLOBAL-DEVEL network outperforms the GLOBAL-DEVEL without regularization).
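The mixing step described above can be illustrated with a minimal sketch. The function name `regularize_cpt`, the linear mixing rule, and the way the weight is derived from shared information are all assumptions made for illustration; the text states only that a uniform prior is mixed with each dataset's predictions in proportion to the information it shares with other datasets.

```python
import numpy as np

def regularize_cpt(cpt, shared_info, total_info):
    """Mix a dataset's conditional probability table with a uniform prior.

    cpt         : array (n_classes, n_bins); each row is P(bin | class), summing to 1
    shared_info : mutual information this dataset shares with the other datasets
    total_info  : positive normalizer, so shared_info / total_info is a redundancy score
    """
    cpt = np.asarray(cpt, dtype=float)
    # Hypothetical weighting (an assumption): the more information a dataset
    # shares with others, the more its table is pulled toward uniform.
    alpha = min(max(shared_info / total_info, 0.0), 1.0)
    uniform = np.full_like(cpt, 1.0 / cpt.shape[1])
    return (1.0 - alpha) * cpt + alpha * uniform
```

Because a uniform table carries no evidence for or against a functional relationship, a highly redundant dataset contributes likelihood ratios closer to 1, and the regularized tables can be used unchanged by naive Bayesian inference.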
Additional file 3 shows normalized pairwise mutual information scores between all datasets. As expected, physical interaction datasets cluster together and are quite different from the main body of microarray expression data. The datasets fall into several large classes: abiotic stresses, biotic stresses, chemical treatments, hormone treatments, and physical protein-protein interactions. Abiotic treatments are the most similar (and thus downweighted), since they evoke strong transcriptional responses that are easy to detect during the integration process [21, 22, 23]. Similarly, the datasets "other abiotic treatments - different temperature treatments of seeds" and "hormone treatment - basic hormone treatment of seeds" are similar to each other and share more information than most dataset pairs. These datasets are unique in that they stress A. thaliana seeds as opposed to seedlings, and their upweighting (Figure 4) may indicate that the response to these stresses is easier to detect in seeds than in other experimental conditions.
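For readers who wish to reproduce a comparison of this kind, a pairwise score can be computed from two datasets discretized over the same gene pairs. The sketch below is one possible formulation; the function name and the choice to normalize by the smaller of the two entropies are assumptions, since the normalization used for Additional file 3 is not specified here.

```python
import numpy as np

def normalized_mutual_information(x, y):
    """Normalized mutual information between two equal-length sequences of
    non-negative integer bin labels (one label per gene pair)."""
    x = np.asarray(x)
    y = np.asarray(y)
    # Joint distribution of bin labels over all gene pairs.
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    # Assumed normalization: divide by the smaller marginal entropy.
    denom = min(hx, hy)
    return mi / denom if denom > 0 else 0.0
```

Under this normalization, two identical datasets score 1 and two statistically independent datasets score 0.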
Development-specific networks enable biological hypothesis generation
An interesting case study is the predicted functional relationship between genes AT4G37930 and AtPPT2 in the leaf development stage, which is most influenced by the following datasets: 1) a study of drought stress in shoots, 2) salt stress in shoots, 3) UVB stress in shoots, 4) osmotic stress in shoots, and 5) cold stress in shoots. A clear hypothesis implied by this prediction is thus that AT4G37930 and AtPPT2 both play a role in the cellular response to stress in shoots. Additional experiments not included in our input data show that AtPPT2 is highly expressed only in leaf development stages and not in the root development stages.
Predicted interactions in several networks are literature-validated
RPM1 INTERACTING PROTEIN 4 (RIN4), RESISTANCE TO PSEUDOMONAS SYRINGAE pv. MACULICOLA 1 (RPM1) and RESISTANCE TO PSEUDOMONAS SYRINGAE 2 (RPS2) were predicted to be co-active in the GLOBAL-PROCESS network and in the vegetative growth stages. RIN4 has been shown to physically interact with RPM1 and RPS2, and the three proteins are part of the plant's defense response to the bacterium P. syringae [26, 27]. In the vegetative stage, RIN4 is also predicted to be co-active with NDR1, which physically interacts with RIN4 in vivo. Further, in the GLOBAL-DEVEL network, RIN4 is predicted to be co-active with NPR1-like protein 4 (NPR4). Mutations in NPR4 result in susceptibility to P. syringae, and although NPR4 has not previously been shown to associate with RIN4, our predicted network suggests these proteins may interact.
Our GLOBAL-DEVEL network predicts an interaction between the root hair patterning regulator WEREWOLF (WER) and additional proteins in the root hair development pathway, including CAPRICE (CPC), GLABRA3 (GL3), and ENHANCER OF GLABRA3 (EGL3). In addition, this network predicts that GL3 and EGL3 interact, and that CPC interacts with EGL3 and GL3. WER is known to regulate expression of CPC, and both WER and CPC regulate expression of EGL3 and GL3. Further, GL3 and EGL3 physically interact. We also found that the transcription factors (TFs) MAGPIE (MGP), NUTCRACKER (NUC) and JACKDAW (JKD) are co-active in the seedling growth stage, while MGP and NUC are co-active in the root development stages. These three proteins are part of a network involved in ground tissue patterning in the root [32, 33]. MGP and NUC are downstream direct targets of the ground tissue patterning regulator SHORTROOT (SHR). JKD and MGP physically interact both with each other and with SHR and another key ground tissue patterning TF, SCARECROW (SCR). MGP transcription depends on SHR and SCR, while JKD transcription in embryogenesis is independent of SHR and SCR, but becomes dependent on these TFs at later stages. Though mgp mutants do not have a phenotype, jkd mutants show a small reduction in root length compared to wild type plants. Additionally, reducing MGP expression in the jkd mutant showed that these proteins have opposing effects on SHR and SCR in the ground tissue.
A third predicted network involves the plant hormone auxin. TRANSPORT INHIBITOR RESPONSE 1 (TIR1) encodes an auxin receptor that regulates auxin-mediated transcription [34, 35]. TIR1 has been shown to interact with ASK1, ASK2, AtCUL1, and AUX/IAA proteins [36, 37], all of which are predicted to be co-active in the GLOBAL-DEVEL network. Our network further predicts that TIR1 interacts with proteins not known to associate with the receptor, such as AT3G23640, a heteroglycan glucosidase involved in carbohydrate metabolism, and AT2G36720, an uncharacterized transcription factor, suggesting that these proteins may be involved in auxin-related processes.
Together, these results show that our networks can accurately predict interactions in different plant developmental stages in a wide array of physiological processes.
Here, we present an ensemble of genome-wide functional relationship networks predicted for A. thaliana using Bayesian integration of 60 experimental datasets. ArathGraphle is a hypothesis generation tool that integrates information from a variety of experiments to find consistent co-activities that might otherwise go unnoticed. We infer six classes of networks: one GLOBAL-PROCESS network predicting genes participating in related biological roles; one GLOBAL-DEVEL network predicting genes co-active in the same developmental stage(s); a compendium of PROCESS networks, each containing relationships specific to one biological process or pathway; a compendium of DEVEL networks, each predicting co-activity within an individual developmental stage; and the PROCESS-DEVEL and TISSUE-DEVEL compendia calling out processes and tissue-specific activity occurring during individual developmental stages. Each network reweights the genomic data compendium to yield predictions tailored to an individual biological context of interest. The leaf- and root-specific networks predicted that the AtPPT2 protein functions during leaf development but not root development, which has since been confirmed experimentally. We further identified several literature-validated interactions among our predicted interactions.
We anticipate that these context-specific predictions of A. thaliana functional relationships will be useful to drive future hypothesis generation regarding protein function and interactions as they change among A. thaliana tissues and developmental stages. With these networks, biologists can pose questions regarding individual genes' interactions within isolated plant tissues and at one or more times during development, allowing them to discover novel functional interactions more rapidly. A web interface to our predictions, available at http://function.princeton.edu/arathGraphle, makes these networks accessible to the wider biological and bioinformatics communities.
The experimental framework for this study consisted of the following processes: three primary gold standards were created indicating genes related or unrelated within biological processes, developmental stages, or plant tissues; A. thaliana genomic data was assembled and integrated using regularized Bayesian classifiers; and the resulting predicted genome-wide functional networks were evaluated computationally and experimentally.
Gold standard generation
We created three gold standards, each containing subsets of positive (related) and negative (unrelated) protein pairs. For the GLOBAL-PROCESS standard, we selected a set of interesting terms from the Gene Ontology as described previously. Gene pairs co-annotated to one of these terms were considered to be related, and pairs containing genes annotated to some term (but not co-annotated) were considered to be unrelated. This resulted in 188,343 positive and 1,183,813 negative pairs in the GLOBAL-PROCESS standard.
The GLOBAL-DEVEL standard was created similarly, save that genes were required to be co-annotated to a development stage in the Plant Ontology. These gold standards were decomposed into subsets for the PROCESS and DEVEL compendia by limiting positive pairs to individual processes and development stages, respectively, and randomly sub-sampling ten times as many negatives. The PROCESS-DEVEL and TISSUE-DEVEL standards intersected these PROCESS and DEVEL gold standards with an identically generated pathway- and tissue-specific standard using 43 PO terms.
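The pair-labeling rule above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the pipeline used for this study: the function names, the dictionary input format, and the fixed random seed are hypothetical, and the selection of ontology terms is taken as given.

```python
import random
from itertools import combinations

def build_gold_standard(annotations):
    """annotations: dict mapping an ontology term to the set of genes annotated to it.
    Returns (positives, negatives) as sets of alphabetically ordered gene pairs."""
    annotated = sorted(set().union(*annotations.values()))
    positives = set()
    for genes in annotations.values():
        # Genes co-annotated to the same term form related (positive) pairs.
        positives.update(combinations(sorted(genes), 2))
    # Annotated genes that never share a term form unrelated (negative) pairs.
    negatives = {p for p in combinations(annotated, 2) if p not in positives}
    return positives, negatives

def context_standard(term_genes, negatives, ratio=10, seed=0):
    """Context-specific subset: positives from one term, plus a random sample
    of up to `ratio` times as many negatives."""
    positives = set(combinations(sorted(term_genes), 2))
    k = min(len(negatives), ratio * len(positives))
    sampled = set(random.Random(seed).sample(sorted(negatives), k))
    return positives, sampled
```

For example, with two terms annotating genes {A, B} and {C, D}, the positive pairs are (A, B) and (C, D), and the four cross-term pairs are negatives.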
Bayesian data integration
Each functional relationship network was predicted by a corresponding Bayesian classifier trained as detailed previously. Briefly, a naive classifier was constructed for each gold standard as described above: one each for GLOBAL-PROCESS and GLOBAL-DEVEL, 208 PROCESS terms from the Gene Ontology, 19 DEVEL terms from the Plant Ontology, and 40 PROCESS-DEVEL intersections and 44 TISSUE-DEVEL intersections (each containing at least 10 genes).
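As a minimal sketch of how a naive Bayesian classifier of this kind combines evidence for one gene pair, the following function multiplies a prior by per-dataset likelihoods and normalizes over the two classes. The function name, the input layout, and the convention of skipping datasets with missing values are assumptions for illustration; the trained parameters themselves would come from the gold standards described above.

```python
import math

def posterior_fr(prior, likelihoods, observed_bins):
    """Posterior probability of a functional relationship (FR) for one gene pair.

    prior         : P(FR)
    likelihoods   : per dataset, a pair (P(bin | FR), P(bin | not FR)) of lists
    observed_bins : per dataset, the observed bin index, or None if missing
    """
    log_fr = math.log(prior)
    log_not = math.log(1.0 - prior)
    for (given_fr, given_not), b in zip(likelihoods, observed_bins):
        if b is None:
            continue  # assumed convention: a missing value contributes no evidence
        log_fr += math.log(given_fr[b])
        log_not += math.log(given_not[b])
    # Normalize in log space for numerical stability.
    m = max(log_fr, log_not)
    num = math.exp(log_fr - m)
    return num / (num + math.exp(log_not - m))
```

Working in log space keeps the product stable when dozens of datasets are integrated, which matters for the 60-dataset setting described here.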
Each classifier integrated the same data, broadly comprising coexpression data, protein sequence families, and physical and genetic protein-protein interactions. 55 microarray datasets were gathered from AtGenExpress and GEO and converted into pairwise scores by Pearson correlation, z-transformation to obtain a normal distribution, and z-scoring to distribute this with mean 0, standard deviation 1 for each dataset. These coexpression scores were discretized into 7 bins: -∞ to -1.5, -1.5 to -0.5, -0.5 to 0.5, 0.5 to 1.5, 1.5 to 2.5, 2.5 to 3.5, and 3.5 to ∞. Protein families were drawn from the automatically generated PFam B, and protein interactions were taken from BIND, BioGRID, computational predictions and enzyme assays used for functional annotations, and annotations extracted from literature in TAIR (The Arabidopsis Information Resource); all were binarized to indicate the presence or absence of an interaction. This resulted in 60 total datasets integrated in each classifier.
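The coexpression preprocessing can be sketched with NumPy as follows. The bin edges are those given above; the function names, the clipping of correlations to avoid infinite values, and the assignment of boundary values to the upper bin are assumptions, and the z-transformation is taken here to be Fisher's transformation (the inverse hyperbolic tangent).

```python
import numpy as np

# Six edges define the seven bins described in the text.
BIN_EDGES = np.array([-1.5, -0.5, 0.5, 1.5, 2.5, 3.5])

def coexpression_zscores(expr):
    """expr: genes x conditions matrix for one microarray dataset.
    Returns (rows, cols, z): gene index pairs and their standardized scores."""
    expr = np.asarray(expr, dtype=float)
    r = np.corrcoef(expr)                          # Pearson correlation between genes
    rows, cols = np.triu_indices(expr.shape[0], k=1)
    # Fisher z-transformation; clipping (an assumption) avoids infinities at |r| = 1.
    z = np.arctanh(np.clip(r[rows, cols], -0.999999, 0.999999))
    z = (z - z.mean()) / z.std()                   # mean 0, standard deviation 1
    return rows, cols, z

def discretize(z):
    """Map standardized scores to bin indices 0 (lowest) through 6 (highest)."""
    return np.digitize(z, BIN_EDGES)
```

The asymmetric bins give finer resolution to strongly positive coexpression, where most of the evidence for a functional relationship is expected to lie.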
Regularization using mutual information
In the integration formula, $P_{i,j}(FR)$ is the probability that genes $i$ and $j$ have a functional relationship, $d_k(g_i, g_j)$ is the supporting data in dataset $k$ for a pair of genes $g_i$ and $g_j$, and $P(D_k = d_k(g_i, g_j))$ is the probability of dataset $k$ containing some value for that pair of genes.
Computational performance evaluation
We randomly withheld 20% of genes from the positive pairs and 20% from the negative pairs in our gold standard set, using any gene pair including at least one of these genes as a test set excluded during training. All performance evaluations were performed exclusively on test sets selected this way using 5-fold cross validation.
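The gene-level holdout can be sketched as below. This is a simplified illustration: it draws held-out genes from a single pool rather than separately from the positive and negative pairs, and the function name and fixed seed are hypothetical.

```python
import random

def split_by_held_out_genes(pairs, fraction=0.2, seed=0):
    """Withhold a fraction of genes; any pair touching a withheld gene is a test pair."""
    genes = sorted({g for pair in pairs for g in pair})
    rng = random.Random(seed)
    n_held = max(1, round(fraction * len(genes)))
    held_out = set(rng.sample(genes, n_held))
    test = [p for p in pairs if p[0] in held_out or p[1] in held_out]
    train = [p for p in pairs if p[0] not in held_out and p[1] not in held_out]
    return train, test, held_out
```

Splitting by gene rather than by pair ensures that no gene seen during training appears in the evaluation set, which guards against overly optimistic accuracy estimates.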
The authors would like to thank the other members of the Troyanskaya lab for valuable feedback. This work was supported by NSF CAREER award DBI-0546275; NIH grants R01 GM071966 and T32 HG003284; and NIGMS Center of Excellence grant P50 GM071508.
- 2. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408: 796-815. 10.1038/35048692.
- 6. Avraham S, Tung CW, Ilic K, Jaiswal P, Kellogg EA, McCouch S, Pujar A, Reiser L, Rhee SY, Sachs MM, et al: The Plant Ontology Database: a community resource for plant structure and developmental stages controlled vocabulary and annotations. Nucleic Acids Res. 2008, 36: D449-454. 10.1093/nar/gkm908.
- 17. Barlow P: Meristematic tissues in plant growth and development. Ann Bot. Edited by: McManus MT, Veit BE. 2002, 90: 546-547. 10.1093/aob/mcf217.
- 20. The Arabidopsis Information Resource (TAIR). [http://www.arabidopsis.org/portals/expression/microarray/ATGenExpress.jsp]
- 22. Cho SK, Chung HS, Ryu MY, Park MJ, Lee MM, Bahk YY, Kim J, Pai HS, Kim WT: Heterologous expression and molecular and cellular characterization of CaPUB1 encoding a hot pepper U-Box E3 ubiquitin ligase homolog. Plant Physiol. 2006, 142: 1664-1682. 10.1104/pp.106.087965.
- 25. Knappe S, Lottgert T, Schneider A, Voll L, Flugge UI, Fischer K: Characterization of two functional phosphoenolpyruvate/phosphate translocator (PPT) genes in Arabidopsis--AtPPT1 may be involved in the provision of signals for correct mesophyll development. Plant J. 2003, 36: 411-420. 10.1046/j.1365-313X.2003.01888.x.
- 33. Welch D, Hassan H, Blilou I, Immink R, Heidstra R, Scheres B: Arabidopsis JACKDAW and MAGPIE zinc finger proteins delimit asymmetric cell division and stabilize tissue boundaries by restricting SHORT-ROOT action. Genes Dev. 2007, 21: 2196-2204. 10.1101/gad.440307.
- 39. Willis RC, Hogue CW: Searching, viewing, and visualizing data in the Biomolecular Interaction Network Database (BIND). Curr Protoc Bioinformatics. 2006, Chapter 8: Unit 8 9
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.