Data sources
To illustrate the functionality of VLAD, we analyzed genes from a previously published study that described the genome wide gene expression patterns across key developmental stages of normal mouse diaphragm (Russell et al. 2012). In this study, the investigators used time-series analysis (Ernst and Bar-Joseph 2006) of microarray-based expression data to identify over 650 genes whose expression levels increased significantly between embryonic day 11.5 and embryonic day 16.5 and over 360 genes whose expression levels decreased significantly over this same time period.
To demonstrate the extensibility of VLAD to user-provided ontologies, an OBO ontology of mouse biochemical pathways (mousecyc_obo.txt) and a corresponding set of annotations in GAF format (mousecyc_gaf.txt) from the curated MouseCyc database (Evsikov et al. 2009) were downloaded from the MouseCyc project ftp site (ftp://informatics.jax.org/pub/curatorwork/MouseCycDB/) and chosen as the basis for a custom term enrichment analysis using the Annotation Data Set options on the VLAD homepage. The mouse diaphragm gene lists, OBO ontology, and GAF files are available as supplemental data and from the following ftp site: ftp://informatics.jax.org/pub/supplemental/MammGenome2015.
Running VLAD
VLAD is preconfigured to work with either gene-function annotations from MGI using the GO and/or gene-phenotype annotations from MGI using the Mammalian Phenotype (MP) Ontology. The GO and MP annotations used by VLAD are updated weekly. A user may also upload a different ontology (a file in OBO format) and corresponding annotation dataset (a file in GAF format) to perform custom enrichments. Users may also use the built in MP and GO ontologies but supply their own gene-to-ontology term annotations.
To run an analysis with VLAD users submit one or more test sets of gene symbols or accession identifiers for the laboratory mouse. By default, the test set of genes is compared to the annotations for all genes in the mouse reference genome to determine the likelihood that the annotation terms associated with the test set would occur by chance. Alternatively, users can submit a custom list of genes to which their test set should be compared. This option may be preferable for the analysis of lists of genes from studies that use a targeted set of genes for analysis; the distribution of annotations for genes for such targeted studies may be quite different than the annotations for the genome as a whole. Annotations from all sources of evidence are included in the analysis by default. The user has the option of limiting the analysis to only those annotations derived from specific classes of evidence. For example, it may be desirable to limit analyses to only those annotations derived from direct experimental assays as opposed to those inferred from sequence similarity or homology. For custom enrichment analyses, users can include their own evidence codes in the input GAF file. These user-supplied codes can be specified in the evidence code parameter settings of VLAD to exclude specific sets of annotations from the enrichment analysis.
A unique feature of VLAD is the support for the analysis of multiple gene sets at a time. For example, the up-regulated and down-regulated gene sets from a transcriptomics experiment can be analyzed at the same time to evaluate the biological consequence of gene expression changes from the perspective of biological function. Each gene set is analyzed independently, and the results are shown in a combined display designed to reveal enrichment differences between the sets.
Calculating statistical significance of annotations for a gene set
Suppose out of a list of 100 genes up-regulated in a disease sample relative to normal, 40 are associated with mortality/aging phenotypes. Is 40 % significant? What would we expect to see if we simply picked 100 genes at random? Like many other ontology term enrichment tools, VLAD calculates significance based on the hypergeometric distribution. For every term, t, in the ontology, VLAD computes a p value, p
t
(k, n, K, N), where k is the number of genes in the query set annotated to t or its descendants, n is the size of the query set, K is the total number of genes in the database annotated to t or its descendants, and N is the total number of annotated genes. The results are sorted with the terms of highest significance at the top. The default analysis in VLAD is for term enrichment where the p value is the probability of drawing at least
k successes in n tries given a population of K out of N. VLAD also offers the option of performing a depletion analysis where p is the probability of drawing at most
k successes in n tries.
One issue that VLAD and similar tools must deal with is the multiple testing problem (Noble 2009), which in this case means that the reported p values are inflated simply because we are calculating them for so many terms (i.e., doing many tests). To account for multiple testing in VLAD an additional statistic, the q value is calculated, which is based on the concept of the positive false discovery rate (pFDR) (Storey 2002). The q value is the proportion of false positives when a given group of tests is called significant and is easily computed from the ordered p values. In terms of the results generated by VLAD, the q value in row i, q
i, is interpreted as the rate of false positives if we were to consider all terms in rows 0…i to be significant.
VLAD output
The output from the VLAD program includes both graphical and tabular representations of the ontology terms associated with the user-supplied gene list (i.e., the query set) and the calculated significance scores. The tabular display shows the detailed results, i.e., all the ontology terms, their scores, statistics, and associated genes, sorted in order of decreasing significance. The graphical display provides a high level visual summary of the most significant terms from table. Each node in the graph corresponds to a term and node sizes are scaled by term significance. VLAD uses GraphViz (Gansner and North 1999) (http://www.graphviz.org/) for graph layout and visualization. The nodes in the graph and the rows in the tabular view are cross-linked so the user can easily move between the two kinds of display. The results of a gene set analyses in VLAD can be downloaded and saved to a user’s local disk. VLAD also provides the option of downloading results as an Excel spreadsheet or tab-delimited file.
VLAD provides numerous options allowing the user to customize color, image size, and nodes to display. Even for small sets of genes, the number of associated ontology terms, and hence, the size of the resulting image, may be large. VLAD provides adjustable parameters that allow the user to limit the number and reduce the size of the nodes drawn in the image. The “limit nodes” option allows the user to display only those nodes that meet specific scoring criteria. The “cull interior nodes” option allows the user to further reduce the size of the graphic by omitting many uninformative interior nodes from the display. This option is “on” by default. The root node in the ontology is always included regardless of the settings because it is visually important and helps establish context for the user.
When analyzing multiple gene sets at once, the user may assign colors to the sets, which then appear in both displays to help with comparison (Fig. 1). The color and style of the edges that connect the nodes represent the relationship between the terms (i.e., is-a, part-of, positively regulates, negatively regulates).