TALEs in Archaeplastida are divided into two groups, KNOX and non-KNOX
To collect all the available homeobox protein sequences, we performed BLAST and Pfam-motif searches against non-plant genomes and transcriptome assemblies throughout the Archaeplastida, identifying 338 proteins from 56 species as the Archaeplastida homeobox collection (30 genomes and 18 transcriptomes; Additional file 1: Table S1). Of these, 104 possessed the defining feature of TALE proteins, a three-amino-acid insertion between aa positions 23 and 24 of the homeodomain. At least two TALE genes were detected in most genomes; the exceptions were five genomes in the Trebouxiophyceae class of the Chlorophyta (Additional file 1: Table S1; see Additional file 2: Note S1 for further discussion of the absence of TALEs in Trebouxiophyceae).
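The TALE criterion stated above (three extra residues between homeodomain positions 23 and 24) can be illustrated with a minimal Python sketch. It assumes a pairwise alignment of a query against a typical 60-residue homeodomain reference is already available; the function names and the placeholder sequences are hypothetical and are not taken from the pipeline used in this study.

```python
def insertion_length(ref_aln: str, qry_aln: str, ref_pos: int) -> int:
    """Count query residues inserted between reference positions
    ref_pos and ref_pos + 1 (1-based) in a gapped pairwise alignment."""
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    seen = 0    # reference residues consumed so far
    count = 0   # query residues aligned to reference gaps at the target site
    for r, q in zip(ref_aln, qry_aln):
        if r != "-":
            if seen == ref_pos:
                break           # reached reference position ref_pos + 1
            seen += 1
        elif seen == ref_pos and q != "-":
            count += 1
    return count


def looks_like_tale(ref_aln: str, qry_aln: str) -> bool:
    """Assumed rule: exactly three residues inserted after position 23."""
    return insertion_length(ref_aln, qry_aln, 23) == 3


# Toy placeholder alignments ("X" stands for any residue), not real sequences.
ref_aln = "X" * 23 + "---" + "X" * 37    # 60-residue reference with a gap
tale_like = "X" * 23 + "XXX" + "X" * 37  # carries the three-residue insertion
typical = "X" * 23 + "---" + "X" * 37    # no insertion

print(looks_like_tale(ref_aln, tale_like))
print(looks_like_tale(ref_aln, typical))
```

In practice the alignment step decides how reliable such a check is, so the rule is best read as a filter applied after homeodomain alignment rather than a stand-alone classifier.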
The collected TALE sequences were then classified by their homeodomain features using a phylogenetic approach, with TALEs from animals, plants, and early-diverging eukaryotes (Amoebozoa and Excavata) as outgroups (Additional file 3: Figure S1). The resultant TALE homeodomain phylogeny distinguished two groups in all three phyla of Archaeplastida (Fig. 2). (1) The KNOX-group as a well-supported clade displayed a phylum-specific cladogram: two Glaucophyta sequences at the base (as KNOX-Glauco) were separate from the next clade, which combines Rhodophyta sequences (as KNOX-Red1) and a Viridiplantae-specific clade with strong support (92/90/1.00). (2) The non-KNOX group, including the BELL and GSP1 homologs, contained clades of mixed taxonomic affiliations. These analyses showed that the TALE proteins had already diverged into two groups before the evolution of the Archaeplastida and that the KNOX-group is highly conserved throughout Archaeplastida.
KNOX group sequences share the same heterodimerization domains throughout Archaeplastida
The next question was whether the plant KNOX class originated prior to the Viridiplantae phylum. The plant KNOX proteins and the Chlorophyta GSM1 possess the KNOX-homology, consisting of KN-A, KN-B, and ELK domains, required for their heterodimerization with other TALE proteins; therefore, the presence of the KNOX-homology in a KNOX-group sequence would suggest its potential for heterodimerization. To collect homology domains without prior information, we performed ad-hoc homology domain searches among the KNOX-group sequences. Using the identified homology domains as anchors, we carefully curated an alignment of the KNOX-group sequences combined with any other TALE sequences with a KNOX-homology (Additional file 3: Figure S2). From this KNOX alignment, we found that all KNOX-group sequences (excluding partial sequences) showed amino acid similarity scores > 50% for at least two of the three domains comprising the KNOX-homology region (see Additional file 1: Table S3 for calculated domain homology). To test whether the observed similarity is specific to the TALE sequences, we generated HMM motifs for the KN-A and KN-B domains from the KNOX alignment, searched for them in the target genomes, and confirmed that KN-A and KN-B domains are found only in the TALE sequences (Additional file 4: Data S1 and S2). We thereby defined KNOX-homologs as the TALE sequences possessing searchable KNOX-homology (Fig. 2, marked by red dots following their IDs). Their distribution suggests that the KNOX-homolog already existed before the evolution of eukaryotic photosynthesis as represented by the Archaeplastida.
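The similarity pattern reported above (scores > 50% for at least two of the three KNOX-homology domains) can be written as a simple check. The following Python sketch is illustrative only: the function name, the dictionary layout, and the example scores are assumptions, and the actual values are those given in Additional file 1: Table S3.

```python
from typing import Mapping, Optional

KNOX_DOMAINS = ("KN-A", "KN-B", "ELK")


def meets_knox_similarity(
    scores: Mapping[str, Optional[float]],
    threshold: float = 50.0,
    min_domains: int = 2,
) -> bool:
    """Return True if at least `min_domains` of KN-A, KN-B, and ELK have a
    similarity score strictly above `threshold` (percent).
    A missing or None score is treated as an absent domain."""
    passing = [
        d for d in KNOX_DOMAINS
        if scores.get(d) is not None and scores[d] > threshold
    ]
    return len(passing) >= min_domains


# Hypothetical scores for illustration, not values from Table S3.
print(meets_knox_similarity({"KN-A": 62.0, "KN-B": 55.5, "ELK": None}))
print(meets_knox_similarity({"KN-A": 48.0, "KN-B": 71.0, "ELK": 30.0}))
```

Treating a missing domain as a failed score lets sequences that lack one domain entirely still pass on the other two.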
In addition to the KNOX-homology, the same search also revealed two novel domains at the C-terminus of the homeodomain (Additional file 3: Figure S2): the first (KN-C1) was shared among the Chlorophyta sequences, and the second (KN-C2) was shared among a group of KNOX homologs in a clade outside the KNOX-group (KNOX-Red2).
KNOX classes diverged independently among the algal phyla
In Viridiplantae, we found a single KNOX homolog in most Chlorophyta species, whereas KNOX1 and KNOX2 divergence was evident in the Streptophyta division, including the charophyte Klebsormidium flaccidum and land plants (Fig. 2). The newly discovered KN-C1 domain was specific to the Chlorophyta KNOX sequences and found in all but one species (Pyramimonas amylifera). The absence of similarity between KN-C1 and the C-terminal extensions of KNOX1/KNOX2 sequences suggests independent, lineage-specific KNOX evolution in the Chlorophyta and Streptophyta (Additional file 3: Figure S2). We, therefore, refer to the Chlorophyta KNOX classes as KNOX-Chloro in contrast to the KNOX1 and KNOX2 classes in the Streptophyta.
The KNOX homologs in the Rhodophyta were divided into two classes: a paraphyletic group close to the KNOX-Chloro clade, named KNOX-Red1, and a second group near the PBX-Outgroup, named KNOX-Red2. KNOX-Red1 lacked a KN-A, whereas KNOX-Red2 lacked an ELK and shared a KN-C2 domain (Additional file 3: Figure S2). We consider KNOX-Red1 as the ancestral type, since the KNOX-Red1 sequences were found in all examined Rhodophyta taxa, whereas the KNOX-Red2 sequences were restricted to two taxonomic classes (Cyanidiophyceae and Florideophyceae). Interestingly, the KNOX-Red2 clade included two green algal sequences, with strong statistical support (89/89/0.97; Fig. 2); these possessed a KN-C2 domain, suggesting their ancestry within the KNOX-Red2 class (Additional file 3: Figure S2; see Additional file 2: Note S2 for further discussion about their possible origin via horizontal gene transfer).
Available TALE sequences were limited for the Glaucophyta. We found a single KNOX homolog in two species, which possessed KN-A and KN-B domains but lacked an ELK domain. We termed these KNOX-Glauco.
Non-KNOX group TALEs possess an animal-type PBC-homology domain, suggesting a shared ancestry between Archaeplastida and Metazoa
Following the identification of KNOX homologs, the remaining TALE sequences were combined as the non-KNOX group that lacks KN-A and KN-B domains in Archaeplastida. Further classification of the non-KNOX group was challenging due to its highly divergent homeodomain sequences. However, we noticed that the number of non-KNOX genes per species was largely invariable: one in most Rhodophyta and Glaucophyta genomes and two in the majority of Chlorophyta genomes, suggesting their conservation within each radiation.
Our ad-hoc homology search provided critical information for non-KNOX classification, identifying a homology domain shared among all Glaucophyta and Rhodophyta non-KNOX sequences (Fig. 3a, b). Since this domain showed similarity to the second half of the animal PBC-B domain (Pfam ID: PF03792), known as a heterodimerization domain, we named this domain PBL (PBC-B Like). Accordingly, we classified all the non-KNOX TALEs in Glaucophyta and Rhodophyta as a single PBC-related homeobox class, PBX-Glauco or PBX-Red. PBX-Glauco sequences also possessed the MEINOX motif, conserved in the animal PBC-B domain, indicating common ancestry of PBC-B and PBL domains (Fig. 3a).
GSP1 shares distant PBC-homology together with other non-KNOX group sequences in Viridiplantae
A remaining question was the evolution of the Chlorophyta non-KNOX sequences that apparently lacked a PBC-homology. To uncover even a distant homology, we compared the newly defined PBL domains with the Chlorophyta sequences by BLAST (cut-off E-value of 1E-1) and multiple sequence alignments. This query collected three prasinophyte and one charophyte TALE sequences that possessed a MEINOX motif and a putative PBL domain; however, they showed very low sequence identity among themselves (Fig. 3c). A further query using these four sequences identified 11 additional non-KNOX sequences. Nine of these were assembled into two alignments, one including GSP1 homologs and the other combining most prasinophyte sequences (Additional file 3: Figure S3). The two remaining sequences (Picocystis_salinarum_04995 and Klebsormidium_flaccidum_00021_0250) showed homology to a PBX-Red sequence of Chondrus cruentum (ID:41034) in a ~200-aa-long extension beyond the PBL domain, suggesting their PBX-Red ancestry (another potential case of horizontal transfer; Additional file 3: Figure S4). All the Chlorophyta non-KNOX sequences that carry the PBL-homology domains were classified as GLX (GSP1-like homeobox) in recognition of the GSP1 protein of Chlamydomonas as the first characterized member of this class.
Is the plant BELL class homologous to the Chlorophyta GLX class?
The BELL class is the only non-KNOX class in land plants, sharing a POX (Pre-homeobox) domain (PF07526) and lacking an identifiable PBL domain. The K. flaccidum genome, one of the two genomes available in the charophytes, from which land plants emerged, contained three non-KNOX sequences, all possessing a PBL domain (Fig. 3, Additional file 3: Figures S3, S4). The second charophyte genome, that of Chara braunii, contained one putative BELL homolog that appears to lack the N-terminal sequences outside its C-terminal homeodomain, possibly due to an incomplete gene model. Therefore, the lack of PBL-homology in the plant BELL class appears to be due to divergence or domain loss from an old charophyte class that had PBL-homology. We found an intron at the 24(2/3) homeodomain position of a K. flaccidum GLX homolog, a position previously identified as being specific to the plant BELL class (Additional file 3: Figure S5), suggesting that the plant BELL class evolved from an ancestral GLX gene. More taxon sampling in charophytes is needed to confirm this inference.
Two non-KNOX paralogs of Chlorophyta heterodimerize with the KNOX homologs
Even with our sensitive iterative homology search, we could not identify a PBC/PBL-homology in about half of the Chlorophyta non-KNOX sequences. Since most Chlorophyta genomes possess one GLX homolog and one non-KNOX sequence without the PBL-homology domain, we refer to the latter collectively as Class-B (Additional file 3: Figure S6). Exceptions were found in one prasinophyte clade (class Mamiellophyceae), whose six high-quality genomes all contain two non-KNOX sequences lacking the PBL-homology. Nonetheless, these non-KNOX sequences formed two groups, one more conserved and the other less conserved and polyphyletic, referred to as the Mam-A and Mam-B classes, respectively (Additional file 3: Figures S7, S8). Considering the reductive genome evolution of the Mamiellophyceae, the conserved Mam-A class may be derived from an ancestral GLX class.
Two divergent non-KNOX classes in Chlorophyta led to a critical question about their dyadic networks. Previous studies had shown that TALE heterodimers required interaction between MEIS and PBC domains in animals and between KNOX and PBL domains in Chlamydomonas [6, 10]. It was, therefore, predicted that all Glaucophyta and Rhodophyta TALEs form heterodimers via their KNOX- and PBL-homology domains. On the other hand, it remained to be tested whether the Chlorophyta TALEs lacking a PBL domain can form heterodimers with other TALEs.
To characterize the interaction network of TALE class proteins in Chlorophyta, we selected three prasinophyte species for protein-protein interaction assays: two species containing Mam-A and Mam-B genes (Micromonas commoda and Ostreococcus tauri), and another species (Picocystis salinarum), whose transcriptome contained one GLX and one Class-B sequence. In all three species, we found that KNOX homologs interacted with all examined non-KNOX proteins in the Mam-A, Mam-B, Class-B, and GLX classes (Fig. 4a–c). No interaction was observed between the two non-KNOX proteins in any of the three species (Fig. 4a–c). Similar to the GLX-KNOX heterodimerization, Mam-A and Mam-B also required additional domains outside the homeodomain for their heterodimerization with the KNOX homologs (Additional file 3: Figure S9). These results showed that all the divergent non-KNOX TALEs maintained their original activity to form heterodimers with the KNOX homologs. The observed interaction network among the TALE sequences is summarized in Additional file 3: Figure S10.
TALE heterodimerization evolved early in eukaryotic history
Our discovery of the PBC-homology in Archaeplastida suggests common ancestry of the heterodimerizing TALEs between Metazoa and Archaeplastida. It also predicts that other eukaryotic lineages might possess TALEs with the PBC-homology. Outside animals, the Pfam database contains only two PBC-B domain-harboring sequences, one from a Cryptophyta species (Guillardia theta, ID:137502) and the other from an Amoebozoa species (Acanthamoeba castellanii, ID:XP_004342337). We further examined the Excavata group, which lies near the posited root of eukaryotic phylogeny. A search of two genomes (Naegleria gruberi and Bodo saltans) collected 12 TALE homeobox sequences in N. gruberi and none in B. saltans; among the N. gruberi sequences, we found one with a PBC-homology domain (ID:78561, Fig. 3a) and one with a MEIS/KNOX-homology (ID:79931, Additional file 3: Figure S2). We searched additional genomes in the Amorphea and found the PBC-homology and MEIS/KNOX-homology in the TALE sequences collected from Apusozoa, Ichthyosporea, and Choanoflagellata but not from Fungi (Additional file 3: Figures S11-S14). Our data suggest that the heterodimerization domains (the PBC-homology and MEIS/KNOX-homology) originated early in eukaryotic evolution and persisted throughout the major eukaryotic radiations.
Intron-retention supports the parallel evolution of the heterodimeric TALE classes during eukaryotic radiations
The ubiquitous presence of dyadic TALEs raised the next question: are all the dyadic TALEs reported in this study the descendants of a single ancestral dyad, or do they result from lineage-specific evolution from a single prototypical TALE (proto-TALE) that did not engage in heterodimerization? To probe deep ancestry, we examined intron-retention, which is regarded as a long-preserved character that is less prone to homoplasy (a character displayed by a set of species but not present in their common ancestor). Five intron positions were shared by at least two TALE classes, of which the 44/45 and 48(2/3) introns qualified as the most ancestral since they were found throughout the Archaeplastida and Metazoa (Additional file 3: Figure S5).
The 44/45 and 48(2/3) introns showed an intriguing mutually exclusive distribution between the two dyadic partners of each phylum: one possesses the 44/45 intron and the other possesses the 48(2/3) intron (Additional file 3: Figure S5). This mutually exclusive pattern suggested that two TALE genes with distinct intron positions existed at the onset of the eukaryotic radiation. We consider the 44/45 intron position as the most ancestral, given that it was conserved in most non-TALE homeobox genes. In this regard, we speculate that acquisition of the 48(2/3) intron, and loss of the 44/45 intron, accompanied an early event wherein the proto-TALE with the 44/45 intron was duplicated to generate a second TALE with the 48(2/3) intron. Since the 48(2/3) intron position was found within the KNOX/MEIS group genes in Viridiplantae and Metazoa and also in the PBX group genes in Rhodophyta and Cryptophyta, we may speculate that the duplicated TALEs arose early and diversified to establish lineage-specific heterodimeric configurations during eukaryotic radiations. Alternatively, the 48(2/3) intron position in the TALE homeodomain might have been acquired many times during eukaryotic radiations.
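The exclusivity pattern described above amounts to a simple set test on the intron positions of two dyadic partners. The Python sketch below states that test explicitly; the function name and the example inputs are hypothetical, and the observed intron positions are those shown in Additional file 3: Figure S5.

```python
from typing import Set

ANCESTRAL = "44/45"    # position conserved in most non-TALE homeobox genes
DERIVED = "48(2/3)"    # position proposed to mark the duplicated TALE


def mutually_exclusive(partner_a: Set[str], partner_b: Set[str]) -> bool:
    """Return True if one partner carries only the 44/45 intron and the
    other carries only the 48(2/3) intron (other introns are ignored)."""
    a = (ANCESTRAL in partner_a, DERIVED in partner_a)
    b = (ANCESTRAL in partner_b, DERIVED in partner_b)
    return {a, b} == {(True, False), (False, True)}


# Illustrative inputs only, not data from Figure S5.
print(mutually_exclusive({"44/45"}, {"48(2/3)"}))
print(mutually_exclusive({"44/45"}, {"44/45"}))
```

The check is symmetric in its two arguments, so it does not assume which partner (KNOX/MEIS-type or PBX-type) carries which intron, in keeping with the lineage-specific configurations discussed above.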
Given that the heterodimeric TALEs evolved in a lineage-specific manner, we asked what the proto-TALE looked like at the time it underwent duplication. The following observations suggest that the proto-TALE was a homodimerizing protein. First, the PBC-homology domains of PBX/GLX class proteins identified in the Archaeplastida include the MEINOX-motif that was originally defined for its similarity to the MEIS/KNOX-homology domains (Fig. 3). Second, PBX-Glauco sequences possess the ELK-homology within their PBL domain (Fig. 3), which aligns well to the ELK domains of KNOX class sequences in Viridiplantae (Additional file 3: Figure S15). Therefore, the presence of the MEINOX-motif and ELK-homology in both the heterodimerizing KNOX and PBX groups supports the common origin of the heterodimerizing TALE groups from a single TALE by duplication followed by subfunctionalization.