Introduction

Evolution of eukaryotic genomes is characterized by complex genome rearrangements leading to gene duplication, fusion, fission, recombination, and loss of fragments. These events play a significant role in the evolution of many gene families and can lead to extensive domain rearrangements and evolution of novel domain architectures (Apic et al. 2001; Bjorklund et al. 2005; Patthy 2003; Weiner et al. 2006). Another phenomenon illustrating the dynamics of genome evolution is lineage-specific protein family expansion (Lespinet et al. 2002), seen first in Caenorhabditis elegans (Copley et al. 1999), but coming fully into focus only after sequencing of the sea urchin genome (Sodergren et al. 2006). In sea urchin, several innate immune- and apoptosis-related families underwent an unprecedented expansion as compared to any previously sequenced organism (Hibino et al. 2006; Robertson et al. 2006). There are many questions surrounding these expansions. How did the functions of recently diverged paralogs evolve? Is the number of paralogs after species-specific expansion similar in different species? Is it possible that proteins in such independently evolved groups would converge on similar functions? In this paper, we attempt to answer some of these questions for several groups of innate immune-related receptors and regulators, which display all the phenomena mentioned in this paragraph.

The regulation of innate immune responses relies on several families of pattern-recognition receptors (PRRs) that recognize pathogen- or damage-associated molecular patterns (PAMPs, DAMPs), which originate from invading pathogens or are released by dying or injured cells. In the absence of adaptive immunity, the number and diversity of PRRs may provide an advantage to an organism living in pathogen-rich environments. Two families of PRRs conserved from early invertebrates to mammals, the intracellular NOD-like receptors (NLRs) (Fritz et al. 2006; Martinon and Tschopp 2005; Ting et al. 2006) and the transmembrane toll-like receptors (TLRs; Beutler et al. 2006; Medzhitov 2001; West et al. 2006), are of particular interest because of their roles in a number of diseases.

The NLR family is a group of cytoplasmic PRRs that are characterized by the presence of a conserved nucleotide binding NACHT domain. The general domain organization of NLRs includes an N-terminal effector domain, such as a caspase recruitment domain (CARD), a PYRIN domain (also known as PAAD, PYD, or DAPIN), or several baculovirus inhibitor of apoptosis repeat (BIR) domains, all of which mediate protein–protein interactions for initiating downstream signaling; a centrally located NACHT domain, which is required for nucleotide binding and self-oligomerization; and an array of C-terminal leucine-rich repeat (LRR) domains, which mediate ligand sensing and autorepression (Kanneganti et al. 2007; Martinon and Tschopp 2005). Human NLRs can be classified into several subgroups according to their N-terminal effector domain: CARD-containing NODs, IPAF, and CIITA; PYRIN-containing NALPs; and BIR-containing NAIP (Kanneganti et al. 2007; Martinon and Tschopp 2005). The second family of innate immune receptors discussed here is the TLR family, which belongs to type I transmembrane receptors and is characterized by its C-terminal signaling domain–Toll/Interleukin-1 receptor (TIR) domain (West et al. 2006)–and N-terminal LRR domains. The TIR domain is also present in the interleukin-1 receptor (IL-1R) family and in the TIR-domain-containing adaptors (Boraschi and Tagliabue 2006; McGettrick and O'Neill 2004). The interleukin-1 receptors use the immunoglobulin (Ig) domain for ligand binding instead of the LRR domain as in TLRs (Boraschi and Tagliabue 2006).

As genomes of more and more organisms become available, the phylogenetic analysis of NLR and TLR families can be done across a large number of species, which is useful for deciphering the evolutionary relationships inside these families and helps us understand the evolutionary dynamics of the innate immune system. We show here that the above receptor families have especially interesting evolutionary histories, undergoing large expansions and extensive domain recombination in various lineages. In particular, several domain architectures, such as Death–NACHT–LRR, CARD–NACHT–LRR, PYRIN–NACHT–LRR, and Ig–TIR, have emerged multiple times in different lineages, suggesting that parallel evolution is a common phenomenon in the evolution of innate immunity.

Materials and methods

Sequence database searches

The v1.0 genome assemblies and related protein sets of amphioxus (Branchiostoma floridae) and sea anemone (Nematostella vectensis) were downloaded from the Joint Genome Institute (http://www.jgi.doe.gov). The genome assembly Spur_v2.0 and the GLEAN3 gene models for the sea urchin (Strongylocentrotus purpuratus) were obtained from the Baylor College of Medicine Human Genome Sequencing Center (http://www.hgsc.bcm.tmc.edu). The other genome sequences and corresponding protein sets were downloaded from Ensembl (http://www.ensembl.org). Several rounds of PSI-TBLASTN searches (Altschul et al. 1997) were performed against each genome by using known human NACHT or TIR domain amino acid sequences as seeds. The hits were then mapped to the corresponding genome protein set to acquire the full-length protein sequences; for sea urchin and sea anemone, some of the gene models were additionally predicted by GENSCAN (Burge and Karlin 1998). All identified genes were checked by reciprocal BLAST analysis, Pfam protein searches (Bateman et al. 2004), Conserved Domain Search (CD-Search), and Reverse PSI-BLAST (Marchler-Bauer and Bryant 2004). Domains verified by Pfam and CD-Search are evolutionarily conserved units in proteins (Bateman et al. 2004; Marchler-Bauer et al. 2002). Additionally, two lamprey TIR domain-containing proteins (laTLR14a and laTLR14b) identified by Ishii et al. (2007) were also included.
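The reciprocal check mentioned above can be illustrated with a minimal sketch. The text does not state the exact acceptance criterion, so the rule used here (a candidate is kept when its best hit in a reference proteome is a known family member) is an assumption, as are the function name and the identifiers in the example.

```python
def confirmed_by_reciprocal_search(best_reverse_hit, family_members):
    """Return candidates whose best reference hit is a known family member.

    best_reverse_hit: dict mapping a candidate ID to the ID of its best hit
                      when searched back against a reference proteome.
    family_members:   iterable of reference IDs known to belong to the family.
    Assumption: "best hit is a family member" is the acceptance rule; the
    source does not specify the criterion that was actually applied.
    """
    family = set(family_members)
    return sorted(
        candidate
        for candidate, hit in best_reverse_hit.items()
        if hit in family
    )


# Illustrative identifiers only; the candidate names are hypothetical.
best_hits = {"cand_1": "NOD1", "cand_2": "ABCA1", "cand_3": "NLRP3"}
known_nlrs = ["NOD1", "NOD2", "NLRP3"]
print(confirmed_by_reciprocal_search(best_hits, known_nlrs))
```

In a large lineage-specific expansion, many paralogs can share the same best reference hit, which is why this sketch does not require a strict one-to-one reciprocal best hit.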

Multiple sequence alignments and phylogeny reconstructions

In phylogenetic analysis of multidomain families, for both practical and conceptual reasons, it is critical to analyze each domain separately. In multidomain proteins, variable linker lengths, different mutation rates in different domains, and occasional domain losses, duplications, or substitutions often make it impossible to build high-quality alignments across more than one domain. At the same time, making alignments and performing phylogenetic analysis on only the subset of protein families with the same domain architectures would likely produce a misleading picture by neglecting possible gene recombination and domain rearrangement events. In this paper, the phylogenetic analyses of the NLR and TLR families are based on the NACHT domain and the TIR domain, respectively. To ensure alignment of homologous domains, collected protein sequences with NACHT or TIR domains were trimmed according to Pfam 21.0 models (Bateman et al. 2004). Multiple sequence alignments were produced by PROBCONS 1.12 (Do et al. 2005), MAFFT 6.240 (localpair, maxiterate 1000) (Katoh et al. 2005), and hmmalign from HMMER 2.3.2 (Eddy 1998; Nuin et al. 2006). Multiple sequence alignment columns with a gap in more than 50% of sequences were deleted.
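The column-filtering step can be sketched as follows. This is an illustration of the stated rule (delete columns with a gap in more than 50% of sequences), not the script used in the study; the function name, the gap characters, and the toy alignment are assumptions.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Remove alignment columns whose gap fraction exceeds max_gap_fraction.

    alignment: dict of sequence name -> aligned sequence (equal lengths).
    A column with exactly max_gap_fraction gaps is kept, matching the
    "more than 50%" wording of the rule.
    """
    names = list(alignment)
    if not names:
        return {}
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(names)
    keep = [
        i
        for i in range(length)
        if sum(alignment[name][i] in gap_chars for name in names) / n_seqs
        <= max_gap_fraction
    ]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


# Toy alignment (hypothetical sequences): only the third column has
# gaps in more than half of the sequences.
toy = {"seqA": "MK-LV", "seqB": "M--LV", "seqC": "MR-L-", "seqD": "M-AL-"}
filtered = drop_gappy_columns(toy)
print(filtered)
```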

Phylogenetic analysis was performed using three different approaches. For the Bayesian inference approach, MrBayes 3.1.2 was used with 4,000,000 generations, 64 chains, a sample frequency of 1,000, a mixture of amino-acid models with fixed-rate matrices and equal rates, and a 25% burn-in (Ronquist and Huelsenbeck 2003). For the maximum likelihood approach, RAxML 7.0.4 was used with rapid bootstrap analysis (100 steps) and a search for the best-scoring ML tree (“-f a” option), the variable time (VT) model, and four relative rate substitution categories with empirical base frequencies (Stamatakis 2006). For distance-based approaches, such as FastME 1.1 (Desper and Gascuel 2002), neighbor-joining from PHYLIP 3.66 (Felsenstein 1989; Saitou and Nei 1987), and BIONJ (Gascuel 1997), pair-wise distances were calculated by TREE-PUZZLE 5.2 using the VT model (Schmidt et al. 2002). Phylogenetic trees were drawn using Archaeopteryx 0.901 (http://www.phylosoft.org/archaeopteryx/). All conclusions presented in this work are robust under different multiple sequence alignment and phylogeny reconstruction methods. All sequence, alignment, and phylogeny files are available upon request.
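As a quick check of what the Bayesian settings imply, the sketch below works out how many sampled trees remain after burn-in. The function name is hypothetical, and the count is approximate: a sampler may also record its starting state, which would shift the totals by one.

```python
def mcmc_sample_counts(generations, sample_frequency, burn_in_fraction):
    """Return (total, discarded, retained) sample counts for one run.

    Assumption: one sample is written every `sample_frequency` generations
    and the starting state is not counted.
    """
    total = generations // sample_frequency
    discarded = int(total * burn_in_fraction)
    return total, discarded, total - discarded


# Settings quoted in the text: 4,000,000 generations, sampling every
# 1,000 generations, 25% burn-in.
total, discarded, retained = mcmc_sample_counts(4_000_000, 1_000, 0.25)
print(total, discarded, retained)  # 4000 1000 3000
```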

Domain composition analysis

Domains were analyzed with hmmpfam from HMMER 2.3.2 and Pfam 21.0 (Bateman et al. 2004; Eddy 1998).
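Once domain hits are available, a domain architecture such as CARD–NACHT–LRR is the ordered list of domains along the protein. The following sketch shows one way to derive it from a list of hits. The hit format, coordinates, and function name are assumptions for illustration and do not reproduce hmmpfam output or the authors' own post-processing.

```python
from itertools import groupby


def domain_architecture(hits, separator="-"):
    """Build an N- to C-terminal architecture string from domain hits.

    hits: iterable of (domain_name, start, end) tuples for one protein.
    Consecutive hits to the same domain (for example an LRR array) are
    collapsed into a single entry; this is a simplifying assumption.
    """
    ordered = sorted(hits, key=lambda hit: (hit[1], hit[2]))
    names = [name for name, _start, _end in ordered]
    collapsed = [name for name, _group in groupby(names)]
    return separator.join(collapsed)


# Hypothetical hits for a single protein, listed in arbitrary order.
example_hits = [
    ("LRR", 700, 723),
    ("NACHT", 200, 370),
    ("CARD", 5, 90),
    ("LRR", 730, 753),
]
print(domain_architecture(example_hits))  # CARD-NACHT-LRR
```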

Structural modeling

The crystal structure of Apaf-1 CARD from Apaf-1/procaspase-9 complex (PDB code 3YGS; Qin et al. 1999) was used as a template for modeling other CARD domains. The SCWRL program (Canutescu et al. 2003) was used for homology modeling, and APBS (Baker et al. 2001) was used for calculating surface potentials. All structure figures were prepared with PyMOL (http://www.pymol.org).

Results

The NACHT protein family

We collected the NACHT domain-containing genes from three recently sequenced marine invertebrate genomes whose sequences became publicly available in the last 2 years, including a cephalochordate (the amphioxus B. floridae; Putnam et al. 2008), an echinoderm (the sea urchin S. purpuratus; Sodergren et al. 2006), and a cnidarian (the sea anemone N. vectensis; Putnam et al. 2007). We also collected the NLR genes from other animals, including several vertebrates (the human Homo sapiens, the mouse Mus musculus, the dog Canis familiaris, the chicken Gallus gallus, the western clawed frog Xenopus tropicalis, the zebrafish Danio rerio, the Japanese pufferfish Fugu rubripes, and the green pufferfish Tetraodon nigroviridis), and a urochordate (the transparent sea squirt Ciona intestinalis). No NLR-like genes were found in the arthropod fruit fly Drosophila melanogaster or the nematode C. elegans, two very popular model organisms.

Previously, it was thought that the NLR genes are vertebrate-/deuterostome-specific, since they were absent in both Drosophila and C. elegans (Fritz et al. 2006). However, multiple copies of NACHT domain-containing proteins were found in sea anemone, an animal that belongs to the basal phylum Cnidaria, suggesting that this family emerged even before the protostome–deuterostome split (Darling et al. 2005; Putnam et al. 2007) and was lost in the arthropod and nematode lineages. The repertoire of NLR proteins in mammals is fairly stable at around 20, while there is significant (five to ten times) expansion of this receptor gene family in both invertebrate deuterostomes—amphioxus and sea urchin—and to a smaller extent in some fish genomes, such as pufferfish.

A detailed phylogenetic analysis of the NACHT domains of NLR proteins shows that all the invertebrate NLRs belong to lineage-specific groups (Fig. 1; Supplementary Fig. 1), suggesting that all extant NACHT-containing genes in invertebrates are a result of multiple rounds of duplication of a single ancestral gene. Interestingly, despite that, several recurring domain architectures of NLR proteins can be found throughout the tree. A possible explanation is a scenario that includes multiple, independent evolution of similar domain architectures in a process that fits the definition of parallel evolution (West-Eberhard 2003). We do not know what the ancestral domain architecture of the NLR protein at the internal node (A) was, as it could have been any of the following: Death–NACHT–LRR, CARD–NACHT–LRR, or NACHT–LRR (or, less likely, NACHT associated with other domains or by itself; Fig. 1). But no matter what the ancestral domain organization looked like, other domain architectures must have evolved independently. For example, if we assume that the Death–NACHT–LRR architecture is ancestral, then the CARD–NACHT–LRR architecture evolved independently in the IPAF branch and in the second amphioxus-specific branch. If the ancestor had the CARD–NACHT–LRR domain architecture, then the Death–NACHT–LRR association appeared later separately in the sea urchin branch and the amphioxus-specific branch. The same reasoning applies if the ancestor had the NACHT–LRR architecture or another NACHT domain association; in that case, both the Death–NACHT–LRR and the CARD–NACHT–LRR architectures must have evolved multiple times in the descendants. At another internal node (B), we have a similar situation: no matter whether the ancestral domain architecture was PYRIN–NACHT–LRR, CARD–NACHT–LRR, NACHT–LRR, or other (Fig. 1), one or more architectures had to evolve independently more than once from the same ancestor. This scenario fits the definition of parallel rather than convergent evolution. In parallel evolution, similar traits (here domain architectures) independently evolve from similar ancestral states (here the ancestral NACHT-containing gene), whereas in convergent evolution, similar traits evolve from unrelated (or distantly related) ancestral states. We can only speculate that the driving force behind this independent emergence of identical domain architectures is related to the pressure exerted by pathogens and the advantage provided by a quick response of the innate immune system.
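The logic of this argument can be checked on a toy example. The sketch below uses a hypothetical four-leaf topology (not the tree in Fig. 1) in which each of two descendant lineages contains both a Death–NACHT–LRR clade and a CARD–NACHT–LRR clade. For each candidate ancestral architecture, it enumerates every assignment of architectures to the two internal nodes and reports the best case for how often the most frequently gained architecture must arise. All names in the code are illustrative assumptions.

```python
from itertools import product

# Candidate architectures at the ancestral node (names follow the text).
ARCHITECTURES = ("Death-NACHT-LRR", "CARD-NACHT-LRR", "NACHT-LRR")

# Hypothetical toy topology, NOT the tree in Fig. 1: the ancestor has two
# descendant lineages (P and Q), and each lineage contains one
# Death-NACHT-LRR clade and one CARD-NACHT-LRR clade.
LEAVES = {
    "P": ("Death-NACHT-LRR", "CARD-NACHT-LRR"),
    "Q": ("Death-NACHT-LRR", "CARD-NACHT-LRR"),
}


def count_gains(root_state, internal_states):
    """Count, per architecture, the branches on which it newly appears."""
    gains = {arch: 0 for arch in ARCHITECTURES}
    for node, leaf_states in LEAVES.items():
        node_state = internal_states[node]
        if node_state != root_state:
            gains[node_state] += 1
        for leaf_state in leaf_states:
            if leaf_state != node_state:
                gains[leaf_state] += 1
    return gains


def fewest_repeated_origins(root_state):
    """Best case over all internal-node assignments: how many times the
    most frequently gained architecture must arise."""
    return min(
        max(count_gains(root_state, {"P": p, "Q": q}).values())
        for p, q in product(ARCHITECTURES, repeat=2)
    )


for ancestor in ARCHITECTURES:
    print(ancestor, fewest_repeated_origins(ancestor))
```

On this toy tree the result is 2 for every choice of ancestor: whichever architecture is assumed at the root, at least one architecture has to arise twice independently, which mirrors the reasoning applied to nodes A and B.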

Fig. 1

Overview of the phylogeny and domain organization of the NACHT protein family. On the phylogenetic tree, branches are assigned different background colors according to species. The numbers associated with each clade indicate bootstrap values for minimum evolution and maximum likelihood and the posterior probability from Bayesian inference, respectively (values lower than 50% are not shown). The size of each branch is proportional to the number of proteins inside that branch with the domain architectures shown on the right side. Duplication events are indicated by red circles and speciation events by green squares. Domain architectures at the internal nodes A and B are speculative. The detailed sequence information for each terminal node is shown in Supplementary Fig. 1. Some unusual NACHT gene models in sea urchin and amphioxus require further experimental verification. *The NALPs branch appears to be mammalian-specific, with the exception of one protein from chicken with the canonical PYRIN–NACHT–LRR architecture

The TIR protein family

The TIR domain-containing proteins constitute another group of proteins involved in the innate immune response. The TIR domain is present in several groups of proteins with different domain architectures: the transmembrane TLRs and IL-1Rs, as well as in the intracellular TIR domain-containing adaptors (such as myeloid differentiation factor 88 (MyD88), TIR domain-containing adaptor protein (TIRAP), TIR domain-containing adaptor inducing interferon-β (TRIF), TRIF-related adaptor molecule (TRAM), and sterile α and HEAT-Armadillo motifs containing protein (SARM) in human; McGettrick and O'Neill 2004; O'Neill and Bowie 2007).

Similar to the situation with the NACHT domain, the TIR protein family also underwent a large expansion in both amphioxus and sea urchin. There are around 24 TIR domain-containing proteins in mammals, 11 in Drosophila, and 2 in C. elegans; whereas there are more than 100 copies of such proteins in both amphioxus and sea urchin genomes.

The exact phylogenetic tree for the entire TIR domain family remains elusive because of extreme sequence diversity in this family, but the separation of the main sub-branches is quite robust under different multiple sequence alignment and tree-building methods. The tree can be divided into two parts: the TLR family together with the TIR domain-containing adaptors, and the IL-1R family (Fig. 2). Many TIR domain-containing proteins from invertebrates belong to species-specific branches. As in the previously discussed case of NLR proteins, we do not know what the ancestral TIR domain-containing protein looked like, but multiple lines of circumstantial evidence point to a role for parallel evolution in the evolution of the TIR domain family. In this case, we have an indirect argument that an IL-1R-like domain architecture with an Ig–TIR domain combination probably was not ancestral, as no IL-1- or IL-18-like gene (the main ligands for the vertebrate IL-1R family; Boraschi and Tagliabue 2006) was found in amphioxus or sea anemone. Although we cannot rule out the possibility that cytokines in these two animals are too divergent to be recognized, no matter what the ancestral domain architecture was at the internal node (A; Fig. 2), either the Ig–TIR or the LRR–TIR domain combination has evolved independently more than once.

Fig. 2

Phylogeny of the TIR protein family. Protein names are colored according to species of origin. Lineage-specific subtrees are collapsed for clarity with the number of sequences shown in brackets. Support values for each clade are indicated by differently colored boxes to increase their visibility. Bootstrap values higher than 50% from minimum evolution and maximum likelihood approaches are colored in green and blue, respectively. Posterior probability values from a Bayesian approach higher than 0.5 are colored in red. The simplified domain architectures of each branch are presented on the right side of the phylogenetic tree. Domain architectures at the internal node A are speculative. Ig–TIR domain-containing sequences from amphioxus and sea anemone are separate from the vertebrate IL-1R branch, which has the same domain architecture. The detailed sequence information can be found in Supplementary Table 2

Parallel evolution can result in proteins with identical domain architectures, such as amphioxus and sea anemone IL-1R-like proteins, which look like the vertebrate IL-1R family, but most likely have evolved independently (Fig. 2). Another, better-known example can be found in the TLR family, where human and Drosophila TLR proteins, despite the similar size of the family and numbering scheme that may suggest one-to-one orthology between individual proteins in fruit fly and human, have also evolved independently (except Toll-9, which is the only Drosophila toll family member that groups with the vertebrate TLRs).

Structural features of the associated protein–protein interaction domain

Pattern-recognition receptors use the associated protein–protein interaction domains, such as CARD, Death, and PYRIN, to connect to the downstream part of the signaling cascade. Proteins that evolved by parallel evolution arose independently from each other; even though they have identical domain architectures, their individual domains come from non-orthologous branches and may have different functions.

Phylogenetic analysis alone cannot tell us more about functional differences or similarities between such proteins. However, for proteins for which the functions are well understood and the three-dimensional structures are available, we can use other tools, such as protein structure modeling and model analysis, to reason about their functional similarities or differences. In the example in Fig. 3, the two amphioxus proteins with similar domain architectures, both containing a CARD–fn3 domain combination, most likely evolved independently. On the other hand, chicken and mouse CARD domains are part of proteins with different domain architectures that both represent a specific vertebrate expansion of the CARD family. For CARD domains, their function (protein–protein binding) is determined by the charge distribution of their surfaces (Qin et al. 1999) and in particular by the details of the dimer interface (see the top panel in Fig. 3). We built three-dimensional models of the four CARD domains mentioned above and calculated the charge distributions of their surfaces using the APBS package (Baker et al. 2001). Surface similarity analysis can be used to compare two protein structures (Binkowski and Joachimiak 2008; Dlugosz and Trylska 2008; Pawlowski and Godzik 2001; Sael et al. 2008; Sasin et al. 2007), much as sequence similarity is used to compare sequences, with the difference that the resulting surface similarity score is related directly to function rather than to the evolutionary relation between two proteins. Here, the surface feature differences between the last structure (02_BRAFL_CARD in Fig. 3) and the other three are noticeable even by visual inspection. As seen in Fig. 3, even though the two CARD domains from amphioxus come from proteins with similar domain architectures and are both part of the amphioxus-specific expansion of CARD domains, one of them has surface features suggesting functional similarity to CARD domains from mouse or chicken, despite their low sequence similarity (sequence identity around 20%), while the other domain has a very different charge distribution and is likely to interact with a different downstream partner.
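For readers who want to reproduce a sequence identity figure of this kind, the sketch below computes percent identity from a pairwise alignment. The text does not state which convention was used, so the choice made here (identical residues divided by the number of columns in which neither sequence has a gap) is an assumption, and the example sequences are invented.

```python
def percent_identity(aligned_a, aligned_b, gap_chars="-."):
    """Percent identity over columns where neither sequence has a gap.

    Assumption: gapped columns are excluded from the denominator. Other
    conventions (alignment length, shorter sequence length) give
    different values for the same alignment.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a in gap_chars or res_b in gap_chars:
            continue
        compared += 1
        if res_a.upper() == res_b.upper():
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Invented toy alignment: 6 comparable columns, 4 of them identical.
identity = percent_identity("MKT-LLDA", "MRTALL-S")
print(round(identity, 1))  # 66.7
```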

Fig. 3

Structure comparison between different CARD domains. A phylogeny based on the CARD domains is shown on the left, and the related structure models are displayed on the right. The electrostatic potential was mapped on the protein surface of each 3D model. The acidic surface (negatively charged) is colored in red, and the basic surface (positively charged) is in blue. All protein surfaces are displayed in the same view with the putative binding surface (the surface used by the Apaf-1 CARD that binds the Caspase-9 prodomain (Qin et al. 1999) shown on the top panel) facing the viewer. The potential binding surface of the last structure (02_BRAFL_CARD) is dominated by a positive charge while the other three structures are mainly negatively charged

Discussion

The conservation of NLR and TLR receptors from (at least) cnidarians to mammals highlights the importance and ancient evolutionary history of these innate immunity families. The presence of multiple proteins with similar domain architectures creates the impression that all these proteins and, by extension, possibly even the specific pathways in which they participate, could have been present in ancestral species. However, we show here that the appearance of conservation hides a very complex evolutionary history of these receptor families, which underwent massive species-specific expansions and independently evolved identical domain architectures. This phenomenon is most obvious in the NACHT protein family, where all invertebrate NLR proteins evolved by species-specific expansions (Fig. 1). However, it also plays an important role in the evolution of the TIR protein family (Fig. 2) and possibly other families. The expansions of these protein families in fish and amphioxus were noted earlier by several independent studies (Laing et al. 2008; Oshiumi et al. 2008; Stein et al. 2007; Zhang et al. 2008). Also, studies comparing TLRs and other innate immune-related protein families between arthropods and vertebrates reached the similar conclusion that members of innate immune systems could have evolved independently (Hughes 1998; Hughes and Piontkivska 2008), which reinforces our parallel evolution hypothesis.

The most interesting observation is that such massive expansions and domain shuffling only resulted in a relatively small number of protein architectures. Clearly, the number of possible solutions must be limited by functional considerations that act as constraints, limiting the potentially huge number of possible domain architectures to the same, independently rediscovered ones. The presence of such constraints limiting the number of functional domain combinations provides a possible alternative explanation for the conservation of domain architectures in eukaryotes, where the majority of the genomic proteins are multidomain proteins (Han et al. 2007), but only a small fraction of all possible domain combinations are present.

Some studies suggested that domain architectures are largely inherited (Gough 2005). However, a more recent study indicates that domain architecture reinvention is a more common phenomenon than previously thought (Forslund et al. 2008). These authors suggested that between 5.6% and 12% of all domain architectures could have been created more than once in different genomes. In this paper, we show specific examples of parallel evolution in families of innate immune receptors: the NACHT and TIR protein families. Both the NACHT and TIR domains are protein–protein interaction domains that contribute to signal transduction, and this functional class of proteins has been called “promiscuous” because of their tendency to associate with different domains (Basu et al. 2008). A comparison with the list of the top 215 highly promiscuous domains in eukaryotes (Basu et al. 2008) shows that not only the NACHT and TIR domains themselves, but also the domains they associate with, such as the Death, CARD, and Ig domains, are on that list. However, only a small fraction of the possible domain combinations actually exist in nature, suggesting that domain architectures are under strong evolutionary selection (Han et al. 2007). The NLR family, in which only a limited number of protein–protein interaction domains, such as the Death, CARD, DED, or PYRIN domains, can appear at the amino terminus, provides a clue as to how such selection may operate. These four domains belong to the death domain superfamily, whose members have very similar structures and modes of action (Reed et al. 2004). Reshuffling between these domains would not incur much structural conflict with the function of controlled oligomerization facilitated by the NACHT domain. In this context, it may be worth mentioning that the PYRIN domain, which is not found in any currently sequenced invertebrate genome, has probably evolved from other death domain superfamily members and represents another example of domain reshuffling. It is found in several very different types of protein architectures.

While the emergence of similar domain architectures can be clearly shown by comparing predicted genes identified in genome projects, we still do not know if proteins with the same domain architecture share similar functions in different species. Some examples, including those from the families discussed here, suggest that this is not always true. For instance, while Drosophila toll-like receptors mainly carry out roles in embryonic development, their mammalian homologs are key regulators of immune responses (Kambris et al. 2002; Leulier and Lemaitre 2008). For other proteins, we have some indirect arguments about their functional divergence. For example, both amphioxus and sea anemone have Ig–TIR domain-containing sequences, the same architecture as IL-1R family members in vertebrates. This architecture was likely reinvented in various animal lineages by parallel evolution. The function of the IL-1R-like proteins in amphioxus and sea anemone is not clear and could differ from that of their vertebrate counterparts, as no IL-1- or IL-18-like genes were found in these two genomes. Further experimental work is needed to unravel the precise roles of these proteins. We also show here that proteins that evolved independently by parallel evolution can have very divergent surface features (Fig. 3). Therefore, extrapolation of protein function based on the domain architecture must be done very carefully.