1 Introduction

The use of bioinformatics to analyze protein and genomic sequences is based on the principle that functional regions in proteins and genomes are less likely to undergo random mutational changes; hence, conserved sequences are candidates for important structural or cis-regulatory function (1–7). The application of this principle to Hedgehog (hh) genes and proteins is particularly relevant. Not only are hh genes often highly conserved in their protein-coding sequence, but they also have highly conserved expression patterns among distantly related phylogenetic groups (8–12). This implies that homologs can be searched for in different taxa based on the conservation of protein domains. A history of the evolution of this protein family can then be deduced from analysis of the number of homologs in each taxon, the rate of amino acid substitutions, and the evolutionary distance between orthologs. In this chapter, we focus on the use of sequence analysis and comparative genomics for the identification of Hedgehog (Hh) family members in different taxa and the analysis of their evolutionary history.

Cis-regulatory modules (CRMs) play a crucial role in the correct spatial and temporal expression of genes. Mutations in CRMs can cause gene misexpression and disease or expose individuals to a higher risk of multifactorial diseases. For example, mutations mapping in the vicinity of sonic hedgehog regulatory elements have been suggested to cause preaxial polydactyly (13,14). Therefore, identification of CRMs is an important step in understanding the genetic basis of human diseases. We describe here the current methods for the identification of cis-acting regulatory elements of genes. Although no hh gene-specific protocols can be established for cis-regulatory sequence analysis, this chapter provides examples related to hh genes from the published literature. Rather than providing detailed protocols, we aim to give the reader general considerations and advice on how best to apply these biocomputing tools to the study of Hh proteins and genes.

2 Materials

All software and algorithms cited in this chapter can be downloaded from the internet. Some of these are commercial packages, but most are free. Their web sites are listed in Table 1. A selection of useful websites with more software for phylogenetic analyses and tools for the analysis of CRMs is listed in Table 2.

Table 1 Web Sites for Sequence, Phylogenetic and Cis-Regulatory Element Analyses
Table 2 Websites with Bioinformatic Tools Mentioned in this Article

3 Methods

3.1 Evolutionary Analysis of Hedgehog Proteins

The phylogenetic relationships and evolution of Hh proteins have been analyzed in considerable detail (15,16). Recently, further members of the Hh gene family have been reported in teleosts, with the description of a second indian hedgehog homolog and a desert hedgehog homolog (17).

3.1.1 Retrieving Protein Sequences for Phylogenetic Analyses

Protein sequences of conserved genes used to be predicted from a cDNA sequence, isolated either by degenerate polymerase chain reaction or by screening of cDNA libraries. Although these methods are still used in the case of nonmodel organisms, protein sequences are now mostly retrieved in silico. The sequences of interest can be found by searching protein databases or genomic databases. Searches can be performed with keywords (e.g., Hh or Shh) and/or using Blast searches. NCBI and EBI have search tools to scan GenBank and Swiss-Prot.

Alternatively, animal model genomes can be searched using Ensembl, which in its newest version (v.37) contains several genomes, although not all are complete and annotated (see Table 3).

Table 3 Genomes Available in Ensembl

3.1.2 Protein Sequence Alignment

Protein sequences must then be aligned. For our purpose, a global alignment method that performs progressive pairwise alignments should be used. ClustalW (18) and Clustal X (19) have been widely used. However, with the recent growth of sequence databases, it has been necessary to develop other algorithms that can align large protein families with speed and accuracy. Newer multiple sequence alignment programs include T-Coffee (20), which is slower than Clustal but tends to produce better alignments; MAFFT (21), which performs very well with sequences of different lengths (see Note 1) and appears to be faster than Clustal; and MUSCLE (22), which is advertised as faster than T-Coffee or Clustal.

Before proceeding with the inference of a phylogenetic tree, sequence alignments should be checked and edited to realign sequences and eliminate gaps. Jalview, which is provided with Clustal, MUSCLE, and T-Coffee, allows you to edit your sequence alignment, whereas the PHYLIP package contains its own sequence editing program. Once the alignment has been performed, the tree file should be saved in the appropriate format (see Note 2).

3.1.3 Building a Phylogenetic Tree

Methods to infer a phylogenetic tree fall into two main classes: character-based methods, which include maximum parsimony (MP) and maximum likelihood (ML) (23), and distance-based methods, which include the neighbor-joining (NJ) method (24). The former rely on character states, such as the amino acid found at a specific position, whereas with the latter evolutionary distances are calculated as the number of amino acid replacements between two proteins. None of these methods is fully satisfactory (i.e., guaranteed to infer the true tree) because each relies on several assumptions, for instance a constant rate of divergence of a taxon from an ancestor. NJPlot will build a tree based on the NJ method, whereas PHYLIP and PAUP allow for the inference of an evolutionary tree using NJ, MP, or ML methods. Because distance-based methods are more amenable to molecular data (such as protein sequences) and several methods, including bootstrap analyses, have been designed to establish the reliability of an evolutionary tree, NJ methods tend to be more widely used and have been the preferred method for analyses of Hh proteins (15,25). If using NJPlot, open the tree file (.nj) previously saved. If using PHYLIP, a tree can be drawn using DRAWGRAM. Both will draw rooted trees, which allow for evolutionary analyses, in contrast to unrooted trees, which only display the degree of relationship with no indication of the most recent ancestor. TREEVIEW is another software package to draw trees. It supports tree files in most formats and will display bootstrap values.
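As an illustration of the distance-based approach, the following Python sketch computes a pairwise distance matrix from an alignment. It is a simplified stand-in for what packages such as ClustalW or PHYLIP do, not their actual implementation. The function names and the toy sequences are invented for this example. The correction applied is Kimura's empirical formula for protein distances, d = -ln(1 - p - 0.2p^2), where p is the observed fraction of differing residues. Gapped positions are skipped pair by pair here, which is an assumption of this sketch and differs from removing every gapped column from the whole alignment.

```python
import math
from itertools import combinations


def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip columns that are gapped in either sequence
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


def kimura_protein_distance(p):
    """Kimura's empirical correction for multiple substitutions at one site."""
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0:
        raise ValueError("sequences too divergent for the Kimura correction")
    return -math.log(inner)


def distance_matrix(alignment):
    """Return {(name_a, name_b): corrected distance} for every pair of sequences."""
    distances = {}
    for (name_a, seq_a), (name_b, seq_b) in combinations(alignment.items(), 2):
        distances[(name_a, name_b)] = kimura_protein_distance(p_distance(seq_a, seq_b))
    return distances


# Toy aligned fragments, invented for illustration only (not real Hh sequences).
toy_alignment = {
    "species_A": "MKLLTR-AVLG",
    "species_B": "MKLLSRQAVLG",
    "species_C": "MRLLSR-AILG",
}
toy_distances = distance_matrix(toy_alignment)
```

A matrix of this kind is the input that an NJ implementation would cluster into a tree; the clustering step itself is left to the dedicated packages.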

If the assumption of rate constancy among taxa does not account for the actual rate of divergence, the inferred tree may appear erroneous (i.e., misplace a species or a group of species). These errors can be remedied by choosing an outgroup as a reference (i.e., a species for which we have previous knowledge that it diverged from a common ancestor prior to the other species listed). A new tree is then built based on a new distance matrix established from the reference (Fig. 1A).

Fig. 1. Phylogenetic analysis of Hedgehog (Hh) proteins. (A) Inferred phylogenetic tree of Hh proteins: Hh proteins (full length) were aligned using ClustalW. An inferred phylogenetic tree was established with the NJ method after eliminating gaps from the alignment and using the Kimura correction for distances. The annelid P. capitella was used as an outgroup. Bootstrap values, indicated at the nodes, were calculated from 1000 pseudosamples within ClustalW. Branch lengths are proportional to the distance. (B) Phylogeny of the Metazoa. The position of the bilaterian ancestor is indicated at the branching between protostomes and deuterostomes. (C) Phylogeny of vertebrates. Estimated evolutionary distances between some species are indicated at the nodes (in million years, my).

3.1.4 Phylogenetic Tree Analyses

Tree reliability: One of the advantages of using the NJ method is that it allows for bootstrap analysis, a computational method to apply statistics to a tree topology (26). This technique calculates the level of confidence for each clade of an inferred tree. This is done through a resampling technique in which a series of pseudosamples is generated (usually between 500 and 1000, see Note 3) and the trees deduced from them are compared with the inferred one. A bootstrap value, expressed as the percentage of trees having the same topology as the inferred tree, is then calculated. It is generally accepted that a bootstrap value of >95 corresponds to a high level of confidence in the clade, whereas values <70 show a low level of confidence. Bootstrap analysis can be run with Seqboot in PHYLIP or from within Clustal.
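The resampling step can be sketched in Python as follows. This is a minimal illustration of the general bootstrap idea (drawing alignment columns with replacement), not the code used by Seqboot or Clustal. The function names are invented for this example, and the tree-building step applied to each pseudosample is deliberately left out: `clade_support` assumes that the clades of each replicate tree have already been extracted as sets of taxon names.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("aligned sequences must have equal length")
    # The same column indices are used for every sequence so that columns stay intact.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][i] for i in columns) for name in names}


def clade_support(replicate_clades, clade):
    """Percentage of replicate trees that contain the given clade (a frozenset of taxa)."""
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return 100.0 * hits / len(replicate_clades)


# Toy aligned fragments, invented for illustration only.
toy = {"A": "MKLLTRAVLG", "B": "MKLLSRAVLG", "C": "MRLLSRAILG"}
rng = random.Random(0)  # fixed seed so that the run is reproducible
pseudosamples = [bootstrap_alignment(toy, rng) for _ in range(1000)]
```

Each pseudosample would then be passed to the same tree-building procedure as the original alignment, and the support for a clade is the percentage of the resulting trees in which that clade appears.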

Estimating divergence time: An estimate of the evolutionary divergence time can be calculated from a distance-based tree (Fig. 1A). This calculation is based on the hypothesis that the rate of amino acid substitutions is constant during evolution. First, the rate of divergence per site per million years, $r$, is calculated for two species for which the divergence time, $T_1$, is known from other data (paleontological records, molecular data). Usually, vertebrates are a better choice because there are many records available providing the best approximate divergence time (see Note 4).

The rate is given by $r = d / (2T_1)$, where $d$ is the average distance between the two species chosen and the distance is directly proportional to the rate of amino acid substitution. The factor of 2 accounts for substitutions accumulating along both lineages since their separation. Once $r$ is determined, it can be applied to the equation $T_2 = d_{avg} / (2r)$, where $T_2$ is the unknown divergence time between the two species or events of interest and $d_{avg}$ is the average distance between these two species or events.
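The two-step calculation can be sketched in Python as follows. The function names are chosen for this example, and the two distances (0.90 and 0.62 substitutions per site) are invented for illustration and are not taken from any Hh data set. The 450 my calibration is the mammals/fishes divergence time listed in Note 4.

```python
def substitution_rate(d, t1_my):
    """Rate r per site per million years from a calibration pair: r = d / (2 * T1)."""
    return d / (2.0 * t1_my)


def divergence_time(d_avg, r):
    """Unknown divergence time in million years: T2 = d_avg / (2 * r)."""
    return d_avg / (2.0 * r)


# Calibration: hypothetical average distance between mammalian and fish sequences.
r = substitution_rate(d=0.90, t1_my=450.0)

# Hypothetical average distance between the two groups whose split is to be dated.
t2 = divergence_time(d_avg=0.62, r=r)
print(f"r = {r:.4f} per site per my, T2 = {t2:.0f} my")
```

Because both steps are linear, the estimate reduces to T2 = T1 × d_avg / d, so any error in the calibration time carries over proportionally to the estimated divergence time.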

Using similar calculations, it was found that the divergence times between Shh and Ihh, and between Shh and Dhh, were 563 and 662 my, respectively (15), which suggests that the first duplication of the Hh gene, giving rise to the Dhh family, occurred prior to the emergence of chordates (550 my) (27,28). This is not consistent with the fact that prior to the emergence of vertebrates, a single Hh gene is found in all three phyla, Deuterostomia, Ecdysozoa, and Lophotrochozoa (Fig. 1A,B). In particular, the presence of a single Hh gene in the cephalochordate amphioxus Branchiostoma floridae (12) suggests that the duplication event that gave rise to Hh1 and Hh2 in the urochordate Ciona intestinalis occurred independently from the duplication events leading to the Dhh, Ihh, and Shh families (Fig. 1A–C) (29). An interesting exception to the existence of a single Hh is the nematode Caenorhabditis elegans, for which no true Hh ortholog was found. Instead, closer sequence comparisons with subdomains of the Hh protein revealed that several C. elegans proteins are homologous to the C-terminal region of Hh and form a family of proteins, the inteins, with endonuclease activity (30). Because earlier taxa such as the mollusc Patella vulgata and the annelid P. capitella do have a single Hh gene, this would suggest that nematodes once had Hh proteins but lost them during evolution. Alternatively, there is the possibility that nematodes do not belong to Ecdysozoa and form an earlier taxon (31). There are data consistent with a grouping of arthropods and vertebrates together (protostome and deuterostome), called Coelomata, that leaves out the nematodes, which form an earlier phylum, the Pseudocoelomata (32). If this were the case, Hh would have evolved after the emergence of nematodes and before the Coelomata group.

3.2 Detection of Cis-Regulatory Elements of Hedgehog Genes by Sequence Analysis

CRMs are not subject to the stringent directional, positional, and compositional constraints that apply to coding exons, which makes their automated detection with bioinformatics tools more difficult. One technique often used is phylogenetic footprinting (33), which is based on the principle that alignment of noncoding sequences from different species reveals evolutionarily conserved segments that are candidates for cis-regulatory function (1,3,5,7,34). Bioinformatic tools that utilize phylogenetic footprinting to detect such regions have been reviewed recently (35–38). Phylogenetic footprinting has been used extensively to identify putative CRMs of sonic hedgehog orthologs (36,39–42).

3.2.1 Choice of Sequence Alignment and Visualization Tools

Two main strategies can be followed in sequence alignment: The local alignment protocol (e.g., BLASTZ [43]) searches for short stretches of similarity between the sequences, which are then extended, whereas global alignment tools (e.g., LAGAN [44]) search for the best alignment over the entire length of the sequence using local similarities as anchors (see Note 5). A recent addition to LAGAN also allows for the detection of inversions between the two compared sequences (shuffle-LAGAN [44]). Global alignment tools have a higher sensitivity, whereas local tools provide better specificity in the detection of shorter conserved blocks (45). Results of sequence alignments are usually displayed through web-based graphical tools, such as PipMaker (46), ECR browser (47), and VISTA (48,49), which highlight regions conserved above chosen threshold levels. Because of their distinct designs, global and local alignment algorithms differ in their performance in the detection of conservation. Notably, the DiAlign tool (50,51) allows for both local and global alignment output modes.
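The thresholding used by such visualization tools can be illustrated with a sliding-window identity calculation over a pairwise alignment. The Python sketch below is a generic illustration and does not reproduce the algorithm of PipMaker, ECR browser, or VISTA. The function names are invented for this example, and the default window of 100 columns and threshold of 70% identity are assumed illustrative values that should be adapted to the species compared.

```python
def window_identity(aligned_a, aligned_b, window=100, step=1, gap="-"):
    """Yield (start, percent identity) for each window of alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as a match only if both residues are identical and not gaps.
    matches = [
        1 if a == b and a != gap else 0
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
    ]
    for start in range(0, len(matches) - window + 1, step):
        yield start, 100.0 * sum(matches[start:start + window]) / window


def conserved_blocks(aligned_a, aligned_b, window=100, threshold=70.0):
    """Merge touching windows at or above the threshold into (start, end) column ranges."""
    blocks = []
    for start, identity in window_identity(aligned_a, aligned_b, window):
        if identity < threshold:
            continue
        end = start + window  # end is exclusive
        if blocks and start <= blocks[-1][1]:
            blocks[-1] = (blocks[-1][0], end)  # extend the previous block
        else:
            blocks.append((start, end))
    return blocks


# Toy alignment: 20 identical columns followed by 20 mismatched columns.
seq_a = "ACGT" * 5 + "AAAA" * 5
seq_b = "ACGT" * 5 + "CCCC" * 5
print(conserved_blocks(seq_a, seq_b, window=10, threshold=100.0))
```

The reported ranges are alignment columns, so they need to be mapped back to sequence coordinates before candidate regions are extracted for testing.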

3.2.2 Choice of Genomes for Cross Species Comparison

Comparisons of multiple species (“phylogenetic shadowing”) (38), using a set of closely related species (e.g., Refs. [50,52]), may be applied for the identification of conserved elements. However, the efficiency of finding conserved CRMs by phylogenetic footprinting (both in terms of number and level of conservation) is dependent on the evolutionary distance between the species compared (38,53). Comparisons between mouse and human (approx. 90 million years, Fig. 1C) involve a short evolutionary distance, with a high degree of conservation among functionally relevant binding sites placed in conserved blocks (54–58). However, the slow rate of neutral divergence among vertebrates may result in the retention of conserved sequences with no regulatory role between species separated by a short evolutionary distance (59). Several vertebrate genomes representing most major classes have recently been sequenced (see Table 3), providing the raw material for comparative analyses of species with greater evolutionary distances than those among mammals. A note of caution applies, though: the greater the evolutionary distance, the more likely it is that regulatory elements will have diverged. Thus, a lower number of regulatory elements will have retained conserved transcriptional activities, reducing the likelihood of identifying conserved CRMs (60). However, it is generally observed that developmentally regulated genes (including hh genes) and transcription factors tend to be more conserved in their CRMs than other genes (40,61). This was particularly striking in the CRMs of fish and mammalian sonic hedgehog orthologs, which are separated by 450 my and still show remarkable conservation (36).

3.2.3 Variable Divergence of CRMs Within a Locus

CRMs within one gene locus may have different rates of change, as is the case for the shh locus itself. For example, four enhancers, named ar-A to ar-D, are involved in shh activation in the zebrafish midline tissues. These four CRMs show varying degrees of conservation between pufferfish and mouse (36,62–64). Interestingly, ar-A and ar-C are conserved between fish and mouse, whereas ar-B shows significant sequence similarity when zebrafish is compared with pufferfish (Tetraodon nigroviridis), indicating that the phylogenetic footprinting approach can result in the detection of additional functional regulatory elements when the evolutionary distance between the species used in the analysis matches the rate of change in regulatory sequences. The enhancer ar-C is significantly conserved in mouse, although less so than ar-A, and is active in the midline in zebrafish and mouse. Strikingly, no function has been assigned to the well-conserved ar-A in mouse. This may indicate conservation due to functional constraints other than CRM activity (reviewed in Ref. [65]). Significant sequence similarity in the 3′ UTR region of shh genes has also been observed between fish and mouse. However, no published data are available on a putative function of these conserved sequences.

3.2.4 Identification of Long Distance Regulatory Elements

It is not always trivial to assign a predicted conserved regulatory element to its cognate gene. The maximum distance at which regulatory elements can act on their target gene is not well understood, and looping of chromatin over 40 Mb to sites of transcriptional activity has been demonstrated (66). Bacterial or phage artificial chromosome vectors provide a technology for the analysis of regulatory elements over large distances (42). This approach allowed for the detection of shh regulatory elements that lie several hundred kilobases away from the coding sequence in the mouse. Several of the elements identified in the mouse (SBE 2, 3, and 4) are well conserved among human, chicken, and frog, but not teleost fish sequences (42). Interestingly, the function of these long distance elements is to drive shh expression in the ventral diencephalon, an activity covered by the intronic ar-C enhancer in the fish. This functional divergence of enhancers may explain the lack of conservation of SBE2-4 and suggests that subfunctionalization mechanisms may be involved in the evolution of shh CRMs (67).

A large number of genes are likely to contain CRMs at very long distances from the gene locus (68). An extreme example is the case of the sonic hedgehog limb enhancer, which lies 1 Mb away from the shh coding sequence in an intron of the lmbr1 gene (69). This enhancer is highly conserved among vertebrates both in terms of its sequence and its interdigital position in the lmbr1 gene (70). This example suggests that further regulatory elements placed at a large distance may function in the regulation of shh. Indeed, several conserved noncoding elements were found at long distances from shh (up to 50 kb in fugu) and, when tested in zebrafish embryos, provided enhancer activity (41). It may be possible to identify these elements by limiting the search to chromosomal regions that remain unchanged during evolution. The interdigitation of coding genes with embedded regulatory elements of other neighboring genes also implies an evolutionary constraint on chromosomal rearrangements to avoid breakpoints in such regions. Conserved chromosomal synteny has been suggested to aid in predicting the limits of the regulatory regions of a gene (71,72). Thus, by analyzing the breakpoints of syntenic fragments, comparisons between multiple species should establish how far the most distant CRMs are located from the promoter. To assist researchers in these analyses, the Ensembl genome server database provides mammalian and chick chromosomal synteny, whereas an independent web server provides fugu and human synteny analysis (73).

3.2.5 Identification of the Transcriptional Start Site and the Core Promoter

Core or basal promoters are positionally defined regulatory regions, which are located about 50–100 base pairs (bp) up- and/or downstream of the transcriptional start site (TSS), and are required for the formation of preinitiation complexes for subsequent transcription initiation (74) (see Note 6). The absence of experimental approaches to characterize TSSs and the diversity of promoter types have made it relatively difficult to predict core promoter regions accurately using sequence analysis, despite the large number of programs available on the internet (see Tables 1 and 2 for a selection of tools). Prediction of core promoters has recently improved substantially, due to the accumulation of large-scale data on TSSs (75,76). Promoter predictors based on searching for motifs such as the TATA box (reviewed in Ref. [74]) failed, as it is now known that only a subset of human genes transcribed by RNA polymerase II contain a TATA box (77). The characterization of motifs involved in transcription initiation of the remaining genes is still in progress (77,78). A TATA box is, however, present in vertebrate shh genes (79,80). Interestingly, transcription factors and brain-specific genes were found to have shorter conserved blocks than other genes (81). The core promoters of vertebrate shh genes have been characterized in fish and human (79,80) and were shown to contain two TSSs and to be regulated by retinoic acid and Foxa2 (HNF3β).

3.2.6 Transcription Factor-Binding Site Analysis

Information on transcription factor-binding sites is available in either commercial (e.g., TRANSFAC [82], Genomatix) or open access (JASPAR [83]) databases. Binding-site clustering is a feature of CRMs (84), which is utilized by several algorithms (85–91). The predictive value of such clustering approaches is enhanced by incorporating sequence conservation criteria (see Ref. [92] for an example). Ahab also detects clusters of weak sites (93,94), and this can be further improved with Stubb, which includes comparative information and allows for the prediction of regulatory modules (95,96). To search entire genomes for coexpressed genes, a software package (CisOrtho [97]) was developed that evaluates the co-occurrence of motifs in orthologous regions. CRMs of coregulated genes show “signatures”, i.e., transcription factor-binding site combinations with distinct spacing and orientation requirements (90,98), which seem to be retained between species even when the overall sequence similarity is low (90). On the basis of this finding, TraFaC identifies conserved TF-binding sites by scanning regions of conserved sequence similarity to detect co-occurrence of binding sites (99), whereas rVista (100,101) and ConSite (57) score aligned binding sites in conserved regions. CONREAL (102) applies a similar approach, uses binding-site predictions as anchors for sequence alignment, and performs better than other sequence alignment programs when aligning sequences from distant species. As more algorithms for motif detection that take into account phylogenetic conservation (e.g., PhyloCon [103], CompareProspector [104], Footprinter [105]) become available, functional binding sites in hedgehog genes and other developmentally regulated genes will be identified.
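The principle of binding-site clustering can be illustrated with a short Python sketch. It is a deliberately simplified stand-in for the cited tools, which typically score sites with position weight matrices rather than exact consensus matches. The function names are invented for this example, and the consensus `TGTTTRC` is a toy motif used as a placeholder, not a curated matrix from TRANSFAC or JASPAR. Only the forward strand is scanned.

```python
# IUPAC nucleotide codes used in the consensus string.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "K": "GT", "M": "AC", "N": "ACGT",
}


def find_sites(sequence, consensus):
    """Return start positions where the sequence matches an IUPAC consensus."""
    sequence = sequence.upper()
    k = len(consensus)
    return [
        i for i in range(len(sequence) - k + 1)
        if all(sequence[i + j] in IUPAC[consensus[j]] for j in range(k))
    ]


def site_clusters(positions, window, min_sites):
    """Return (first, last) site positions for groups of at least min_sites
    sites whose start positions all fall within one window."""
    positions = sorted(positions)
    clusters = []
    i = 0
    while i < len(positions):
        j = i
        while j < len(positions) and positions[j] - positions[i] < window:
            j += 1
        if j - i >= min_sites:
            clusters.append((positions[i], positions[j - 1]))
            i = j  # continue after this cluster so that clusters do not overlap
        else:
            i += 1
    return clusters


# Toy sequence containing two matches to the placeholder motif.
sites = find_sites("ttTGTTTACaaTGTTTGCgg", "TGTTTRC")
print(sites, site_clusters(sites, window=50, min_sites=2))
```

In practice, the positions of sites for several different factors would be pooled before clustering, and candidate clusters would be retained only if they also fall in regions conserved between species.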

4 Notes

  1. It has been reported that variations in sequence length affect the accuracy of sequence alignments. ClustalW seems to be more sensitive to this issue than MAFFT. Thus, it is recommended to include sequences covering regions of similar length, although a sufficiently large portion of the protein sequence should be included to make the analysis meaningful. Comparing fragments of Hh proteins with full-length Hh proteins, for instance, can only yield meaningless results.

  2. Take care to save the tree file corresponding to the sequence alignment in the correct format (.nj if you are to use NJPlot to draw the tree or .ph if you are to use PHYLIP).

  3. It is common in the literature to see bootstrap samples of 100 or 200. It is recommended to use 500–1000, especially if many species are involved.

  4. Some commonly used evolutionary divergence times are listed here (see Fig. 1C): rat/mouse, 41 my; mammals/fishes, 450 my; mammals/amphibians, 360 my; mammals/birds, 310 my.

  5. A consideration when choosing a particular program is that many algorithms have been optimized for comparisons between specific species (e.g., BLASTZ for human-mouse, WABA [106] for C. elegans-C. briggsae) and may not perform well with other species.

  6. A recent larger-scale analysis of mouse and human promoters identified conserved blocks within 500 bp of the start site, thereby defining the likely 5′ limit of proximal promoter regions (58).