What is the phylogenetic signal limit from mitogenomes? The reconciliation between mitochondrial and nuclear data in the Insecta class phylogeny
- 8.9k Downloads
Efforts to solve higher-level evolutionary relationships within the class Insecta by using mitochondrial genomic data are hindered due to fast sequence evolution of several groups, most notably Hymenoptera, Strepsiptera, Phthiraptera, Hemiptera and Thysanoptera. Accelerated rates of substitution on their sequences have been shown to have negative consequences in phylogenetic inference. In this study, we tested several methodological approaches to recover phylogenetic signal from whole mitochondrial genomes. As a model, we used two classical problems in insect phylogenetics: The relationships within Paraneoptera and within Holometabola. Moreover, we assessed the mitochondrial phylogenetic signal limits in the deeper Eumetabola dataset, and we studied the contribution of individual genes.
Long-branch attraction (LBA) artefacts were detected in all the datasets. Methods using Bayesian inference outperformed maximum likelihood approaches, and LBA was avoided in Paraneoptera and Holometabola when using protein sequences and the site-heterogeneous mixture model CAT. The better performance of this method was evidenced by resulting topologies matching generally accepted hypotheses based on nuclear and/or morphological data, and was confirmed by cross-validation and simulation analyses. Using the CAT model, the order Strepsiptera was recovered as sister to Coleoptera for the first time using mitochondrial sequences, in agreement with recent results based on large nuclear and morphological datasets. Also the Hymenoptera-Mecopterida association was obtained, leaving Coleoptera and Strepsiptera as the basal groups of the holometabolan insects, which coincides with one of the two main competing hypotheses. For the Paraneroptera, the currently accepted non-monophyly of Homoptera was documented as a phylogenetic novelty for mitochondrial data. However, results were not satisfactory when exploring the entire Eumetabola, revealing the limits of the phylogenetic signal that can be extracted from Insecta mitogenomes. Based on the combined use of the five best topology-performing genes we obtained comparable results to whole mitogenomes, highlighting the important role of data quality.
We show for the first time that mitogenomic data agrees with nuclear and morphological data for several of the most controversial insect evolutionary relationships, adding a new independent source of evidence to study relationships among insect orders. We propose that deeper divergences cannot be inferred with the current available methods due to sequence saturation and compositional bias inconsistencies. Our exploratory analysis indicates that the CAT model is the best dealing with LBA and it could be useful for other groups and datasets with similar phylogenetic difficulties.
KeywordsMitochondrial Genome Phylogenetic Signal Compositional Heterogeneity Mitochondrial Data Nuclear Protein Code Gene
From the seminal comprehensive study of Hennig , to the impressive descriptive work of Kristensen [2, 3], to the increasingly common molecular approaches [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], Insecta class systematics has been a challenging field of study. Molecular phylogenies have become a powerful tool that shed light on many parts of the Tree of Life. At the same time, due to the increasing number of sequences and genomes published, methodological questions are broadly explored by researchers in order to correctly and fully infer evolutionary relationships and patterns. In fact, it is widely accepted that many factors can influence final tree topologies, not to mention supports. Among these factors, we can cite 1) the quality of the sequences and the alignment; 2) the amount of phylogenetic information present in the sequences; 3) the presence of evolutionary biases that are not taken into account by most used evolutionary models (compositional heterogeneity, heterotachy...); 4) the use of markers whose evolution does not reflect the species evolutionary history (paralogs, xenologs); 5) the accuracy of the evolutionary model and the efficiency of the tree search algorithm used for the study [15, 16, 17, 18, 19]. Thus, different strategies in the analyses can often lead us to arrive at mutually contradictory conclusions starting from the same dataset. This seems to be particularly true when comparing the studies of relationships among the main taxonomic groups of Arthropoda [20, 21, 22, 23, 24, 25, 26]. Intra- and inter-ordinal insect relationships are not an exception and represent a ceaseless source of debate. They have been commonly explored using different types of molecular data: rDNA 18S and 28S, mitochondrial genes, complete mitochondrial genomes, nuclear protein coding genes, the presence of shared intron positions  or mitochondrial gene rearrangements . Among the most controversial insect groups with regard to systematic position we can mention the Strepsiptera, an order of obligate endoparasitic and morphologically derived insects. The most basal relationships within the holometabolous and the paraneopteran insects are another example of long-debated relationships.
Mitochondrial genomes have been successful in recovering intra-ordinal phylogenetic relationships concordant with other sources of data, with convincing levels of support, such as in Diptera , Hymenoptera , Orthoptera  and Nepomorpha (Heteroptera) . Nevertheless, mitogenomes proved so far to be generally inadequate to study inter-ordinal relationships of insects and deeper levels of Arthropoda, frequently resulting in strong incongruence with morphological and nuclear data, poor statistical supports, and high levels of inconsistency among different methods [16, 24, 25, 26, 32]. Indeed, comparative studies that contrast nuclear and mitochondrial datasets suggest that nuclear markers are better suited to deal with deep arthropod relationships, as the mitochondrial genome is on average more saturated, biased, and generally evolves at a much faster rate than the nuclear genome [33, 34]. Thus, knowing the specific limits for each set of mitogenomes analyzed, i.e. when substitution rates result in saturation that distorts the phylogenetic signal at deeper nodes, is crucial to assess their usefulness in phylogenetics .
It is well known that arthropod mitochondrial genomes present some anomalous characteristics, like very high percentage of AT content, frequent gene rearrangements  or accelerated evolutionary rates likely related to phenotypic changes in body size or to parasitic lifestyle , all of which can limit their applicability in phylogenetic reconstruction. These biases in the data can result in systematic errors when the evolutionary model used for phylogenetic inference does not take them into account. Thus, homogeneous models of substitution or replacement where all sites evolve under the same substitution process  and constantly through time [19, 39] are not adequate for Arthropoda. One of the most usual artefacts, especially in deep relationships where mutational saturation exists , is the long-branch attraction (LBA), a systematic error where two or more branches tend to cluster together producing false relationships . Also, models not accounting for heterogeneity in nucleotide composition among taxa  can lead to artefactually group unrelated taxa with similar base composition [42, 43, 44, 45].
For all these reasons, artropods in general and insects in particular, constitute an excellent model to tackle challenging questions of phylogenetic methodological interest. Several strategies have been designed to minimize potential biases: 1) Increasing the taxon sampling as far as possible, although generally counteracted by the removal of taxa with an evidently incorrect placement disturbing the reconstruction. 2) Filtering genes in large phylogenomic analyses to avoid paralogy problems and unexpected effects of missing data [45, 46, 47]. 3) The use of more specific substitution/replacement models. For example, matrices of amino acid replacement have been designed for Arthropoda (MtArt) [48, 49] and Pancrustacea (MtPan) . 4) Removing fast-evolving sites according to discrete gamma category [40, 50, 51, 52]. 5) Removing third codon position or recoding them as purines and pyrimidines (RY-coding) in DNA alignments [23, 53] to reduce the effects of saturation. 6) Using a site-heterogeneous mixture model (CAT) to allow flexible probabilities of the aminoacid replacement equilibrium frequencies, in order to minimise LBA effects [38, 54, 55].
In this work, we test the performance of different phylogenetic methodological strategies, using mitochondrial genomes of the Class Insecta as a model and including long-branched problematic taxa within Hymenoptera, Strepsiptera, Thysanoptera and Phthiraptera orders that have been usually excluded from mitochondrial datasets. We address controversial taxonomical questions at three different levels of divergence, for which solid hypotheses based on nuclear phylogenies and morphological data exist. Our results show strong differences among the methods tested in their power to resolve inter-ordinal relationships. Using both real and simulated data (see Additional file 1), we confirm the capacity of the site-heterogeneous mixture model (CAT) under a Bayesian framework, currently implemented in software PhyloBayes [38, 56], to substantially avoid the LBA artefacts. We show for the first time that the reconciliation between mitochondrial and previous nuclear and morphological knowledge is possible in the cases studied.
Results and Discussion
About the exploratory phylogenetic framework
Methodologies that produced topologies identical or very similar to these hypotheses were considered better than those that resulted in very different trees. We noticed a strong susceptibility of our data to the type of analyses performed, confirming once more the instability of phylogenies based on insect mitochondrial genomes. Indeed, almost each approach resulted in a different topology and only the Bayesian inference using the CAT model with amino acid sequences (BI-AA-CAT) was able to obtain trees fitting potentially correct hypotheses. A better performance of the CAT model was confirmed using simulations (Additional file 1; Figure S1). No differences were observed between the MtRev and MtArt models in ML trees for any dataset. Cross-Validation statistics were performed to test the fit of the replacement models for proteins MtRev and CAT to the data. A better fit of the CAT model for Paraneoptera (mean score = 9.216 ± 12.038) and Eumetabola (mean score = 5.75, SD = ± 32.3539), and a similar fit for Holometabola (mean score = -0.055, SD = ± 32.3539) were detected.
The site-heterogeneous mixture model CAT  assumes the existence of distinct substitution processes, which usually results in a better fit to the data than site-homogeneous models based on empirical frequencies of amino acid or nucleotide substitutions, like MtRev or GTR [57, 58, 59, 60]. In fact it has already been shown in other taxonomical groups that the CAT model is very powerful to overcome LBA artefacts [45, 54, 61, 62, 63]. Thus, the use of models accounting for compositional heterogeneity in the replacement process seems to be more effective than strategies focused on the removal of saturated positions in the case of Insect mitogenomes. Combining the CAT model with the use of amino acid sequences instead of DNA, which should reduce saturation biases, under a Bayesian framework produced the most satisfactory results. The topologies resulting from the analyses with different methods are discussed further on.
Holometabola phylogeny and the Strepsiptera problem
Since their discovery, Strepsiptera has been associated with Diptera, Siphonaptera, Odonata, Ephemerida, Hymenoptera and Lepidoptera . More recently, four different placements have been suggested as possibilities: membership in the Coleoptera , sister group to the Coleoptera [67, 68], outside the Holometabola  and sister to Diptera [4, 69]. Based on molecular studies of 18S rDNA, Whiting et al  proposed the grouping of Diptera + Strepsiptera under the name Halteria. Chalwatzis et al [70, 71] reached similar conclusions about the relationships between Diptera and Strepsiptera using a larger dataset of 18S rDNA sequences. Later, other authors [72, 73, 74] attributed that grouping to an artefact due to LBA, becoming one of the best examples of LBA ever, called "the Strepsiptera problem".
Further phylogenetic evidence using other nuclear data contradicted the Halteria hypothesis, and supported associations between Strepsiptera and Coleoptera. Rokas et al (1999)  pointed to an intron insertion in en class homeoboxes of Diptera and Lepidoptera but not in Strepsiptera, Coleoptera or Hymenoptera, arguing that an intron loss is an improbable event. Based on a different approach, Hayward et al (2005)  used the structure of the USP/RXR hormone receptors, which showed a strong acceleration of evolutionary rate in Diptera and Lepidoptera, to reject the Halteria clade and to provide strong evidence for Mecopterida. Bonneton et al (2003, 2006) [77, 78], confirmed the USP/RXR approach of Hayward et al and added the ecdysone receptor (ECR; NR1H1) to the analysis, confirming a Mecopterida monophyletic group. However, these two studies were not able to define clear associations for Strepsiptera. Misof et al (2007)  published a large phylogeny of Hexapoda using 18S rDNA and applying mixed DNA/RNA substitution models. Although they recovered well-supported hexapod basal relationships, they obtained very low resolution and unclear relationships within Holometabola.
Recent molecular studies using extensive nuclear data seemed to contradict the Halteria hypothesis again, recovering a close relationship between Coleoptera and Strepsiptera [11, 13, 14]. First, Wiegmann et al (2009)  used a complete dataset of six nuclear protein coding genes including all holometabolan orders. They recovered Coleopterida and provided statistical evidence discarding LBA effects. They found some conflicting signal using individual genes like cad, which recovered Halteria, a result that was attributed to LBA because this is a rapidly evolving locus. Longhorn et al (2010)  used a total of 27 ribosomal proteins and tested several nucleotide-coding schemes for 22 holometabolan taxa, including two strepsipteran species, where a majority of the schemes tested recovered Coleopterida. McKenna and Farrell (2010)  raised identical conclusions using a total of 9 nuclear genes for 34 holometabolan taxa. Also, the Coleopterida have been recently recovered when using large morphological datasets [79, 80]. Thus, evidence supporting Coleopterida has grown in recent years, suggesting that the phylogenetic placement of Strepsiptera has been definitely identified.
In summary, classical and most recent morphological and molecular studies based on nuclear data support the Mecopterida and Coleopterida hypotheses. Until now no mitochondrial evidence backed these hypotheses and our results are the first to fully agree with the most generally accepted point of view.
The Hymenoptera position and the basal splitting events of Holometabola
Depending on algorithm conditions, we observed inconsistencies among analyses in the Hymenoptera position (Figure 2). For example, the fact of using six gamma rate categories instead of four in ML-AA, or simply performing 5000000 instead of 1000000 runs (each with chain stability checked with Tracer) for BI-DNA, or assigning different partitions for DNA, tRNA and rRNA produced alternative results, either ((Diptera + Lepidoptera) Hymenoptera) Coleoptera) or (Diptera + Lepidoptera) + (Coleoptera + Hymenoptera) (not shown). Similar problems when using mitochondrial data have been previously described by Castro and Dowton (2005, 2007) [81, 82] regarding this question, namely inconsistencies depending on the ingroup and outgroup selection and the analytical model. Overall, they described a tendency in their analyses to group Hymenoptera as sister taxa to Mecopterida, but they also found Hymenoptera or Hymenoptera + Coleoptera as the most basal lineages in some of their trees.
When using the BI-AA-CAT method our mitochondrial overview suggests a sister relationship of Hymenoptera with Mecopterida, placing Coleopterida outside a clade comprising the other examined holometabolan insects. This result coincides with one of the classical morphological points of view [3, 83, 84], some nuclear evidence , and with morphological and nuclear combined analyses [4, 5] that recovered Coleoptera at the base of Holometabola (but not Strepsiptera). A phylogeny inferred from 356 anatomical characters by Beutel et al (2010)  placed Hymenoptera as the basal holometabolous insects and recovered a paraphyletic Mecopterida, although these groups were not strongly resolved. A morphological study based on characters of the thorax contributed by Friedrich and Beutel (2010)  offered two scenarios depending on the phylogenetic algorithm used: Coleopterida as the most basal group in the Bayesian analysis, but Hymeoptera as the most basal when using parsimony. Several hypotheses based on morphology situate hymenopterans as sister to Mecopterida [85, 86], grouping coleopterans with the basal Endopterygota [see references in ], or with Neuroptera (not present in our dataset) [2, 3, 5, 84, 85, 86, 87, 88, 89, 90]. Also, based on the analysis of wing characters, Kukalova-Peck & Lawrence (1993)  proposed an alternative phylogenetic hypothesis consisting in a most basal position for the Hymenoptera. Such discrepancies enhance the view that morphological characters are rather useless in order to determine the phylogenetic position of Hymenoptera within the Holometabola .
Our results do not support the most recent molecular studies based on nuclear data, all of them reporting Hymenoptera as the most basal holometabolan insects, for example, the phylogenomic results contributed by Savard et al (2006)  using a total of 185 nuclear genes. Since these authors were using emerging genome projects to assemble and analyze all the genes, they only were able to use 8 taxa with 4 orders of holometabolan insects represented (Diptera, Lepidoptera, Coleoptera and Hymenoptera). Their phylogeny resulted in a supported Coleoptera sister to Mecopterida clade, leaving Hymenoptera at the base. Zdobnov and Boork (2007)  obtained the same conclusions in another phylogenomic approach, using 2302 single copy orthologous genes for 12 genomes representing the same 4 holometabolous insect groups. Based on a dataset with similarly limited taxon sampling, and using the gain of introns close to older pre-existing ones as phylogenetic markers, Krauss et al (2008)  arrived to the same conclusion identifying 22 shared derived intron positions of Coleoptera with Mecopterida, in contrast to none of Hymenoptera with Mecopterida. Additionally, phylogenies with a large number of markers and a complete taxon sampling also gave rise to the same conclusions [11, 13, 14]. Therefore, mitochondrial data under the CAT model avoids obviously wrong relationships caused by LBA and recovers one of the two main current hypotheses. This hypothesis has been proposed mostly based on morphological evidence and differs from most recent nuclear and genomic results. This issue remains thus an open question deserving deeper study.
Paraneoptera phylogeny and the position of Phthiraptera
Classically, Hemiptera is divided in two suborders: Homoptera and Heteroptera. Homoptera includes Sternorrhyncha and Auchenorrhyncha (Cicadomorpha + Fulgoromorpha). However, according to inferred phylogenies from 18S rDNA, Euhemiptera (Heteropterodea (Heteroptera + Coleorrhyncha) + Auchenorrhyncha) were proposed as sister group of Sternorrhyncha, leaving Homoptera as paraphyletic [91, 92, 93], which is currently the most accepted hypothesis. Our mitochondrial analyses with BI-AA-CAT produced the same conclusions as 18S rDNA datasets. Thus, Euhemiptera was recovered as a robust clade formed by Cicadomorpha plus Fulgoromorpha (a group known as Auchenorrhyncha) plus Heteropterodea, while Homoptera (Sternorrhyncha + Cicadomorpha + Coleorrhyncha) was paraphyletic with respect to the heteropteran Triatoma dimidiata, which appeared in all the tested methods as sister to Cicadomorpha with high support. In this study we were not able to test the Auchenorrhyncha paraphyly due to the lack of a Fulgoromorpha genome when the analyses were performed.
A sistergroup relationship between the Hemiptera and Thysanoptera, jointly known as Condylognatha [94, 95], has been proposed based on morphological characters and supported by 18S rDNA data . Moreover, the closest relatives of this group seem to be the Psocodea (= 'Psocoptera' + Phthiraptera). Although Homoptera paraphyly is fully accepted, at molecular level it has just been tested with nuclear single-gene phylogenies and the full Paraneoptera has never been studied with mitochondrial genomes. There is a broad acceptance that Paraneoptera is a monophyletic group of hemimetabolous insects, comprising the Hemiptera, Thysanoptera, and Psocodea, but the basal relationships within this group are quite controversial. The Condylognatha proposal (Hemiptera + Thysanoptera) was supported by several studies [83, 97, 98, 99, 100, 101, 102], although spermatological characters , fossil studies [103, 104] and combined molecular and morphological data  suggested an alternative sister-group relationship between Psocodea and Thysanoptera. Psocodea, however, is a fully accepted clade, even if the two orders included have been proposed to be mutually paraphyletic [96, 105].
In our results, even using the AA-BI-CAT, which seemed to eliminate LBA artefacts for other clades, the basal Paraneoptera relationships were in contradiction with the generally accepted hypotheses. Psocodea was not monophyletic because Thysanoptera was recovered as the closest to Phthiraptera, and consequently Condylognatha is not supported. In fact, Thrips imaginis and the Phthiraptera genomes, were recovered as sister with high support in most of the methods tested. This result, although unexpected, cannot be readily dismissed as wrong and deserves more scrutiny (see for example ).
Eumetabola: Assessing the limits of the mitogenomic data
Given the observed inconsistencies when the divergence is increased in insects, we should question the utility of the Arthropoda mitogenomes to recover supra-ordinal phylogenetic information because of mutational saturation, at least with the current methodological offer.
Mitogenomic data have been used to successfully address several phylogenetic questions within mammals [107, 108] and birds . In both cases, however, relationships at the root level were not fully resolved, like the basal relationships between paleognaths and neognaths in birds, and Theria (marsupials plus placentals) versus Marsupionta (monotremes plus marsupials) hypotheses in mammals. In an ecdysozoan mitogenomics study testing the affinities of the three Panarthropoda phyla and the Mandibulata vs. Myriochelata hypothesis, Rota-Stabelli et al (2010)  also described difficulties caused by LBA. They obtained reasonable results only by removing rapidly evolving lineages and appyling the CAT model. Within Arthropoda, Nardi et al (2003)  presented an unexpected result based on a mitogenomic phylogeny: the paraphyly of the hexapods. They found crustaceans as sister to Insecta, and Collembola as sister to both. This result was discarded by Delsuc et al (2003) , who tried to avoid saturation and composition heterogeneity by recoding nucleotides as purines (R) and pyrimidines (Y), recovering then a monophyletic Hexapoda. Later, Cameron et al (2004)  performed a detailed battery of analysis including the major arthropod groups to test the hexapod monophyly. They removed hymenopteran and paraneopteran genomes from the analysis due to their extreme divergences. Even so, they could not obtain a conclusion about the relationships due to the strong topological instability of the trees.
Cook et al (2005)  assessed the same question concluding that Crustacea and Hexapoda were mutually paraphyletic, although not including unstable lineages like Hymenoptera, the wallaby louse (Heterodoxus macropus) and others with long branches. Removing some important lineages, may strongly affect the general topology and the inferred evolutionary history. Carapelli et al (2007)  obtained the same conclusions by Cook et al (2005)  when including several additional genomes in the analysis and using a new model of amino acid replacement for Pancustracea, MtPan. They presented two large phylogenies based on DNA and protein alignments that supported a non-monophyletic Hexapoda, both obtaining a higher likelihood score under MtPan than when using MtArt or MtRev. However Carapelli et al did not test as many strategies and combinations like others did, in which case they would have probably arrived to similar contradicting conclusions. Some of the relationships within the Insecta they recovered were in strong disagreement with previous morphological and molecular evidence. For example: 1) They recovered a supported association between Strepsiptera and the Crustacean Armillifer armillatus (Pentastomida), two problematic yet clearly not related organisms sharing exceptional rates of evolution. 2) Diptera was included in the polyneopteran insect lineage when using BI-DNA and as an independent lineage from all the rest of the Insecta class when using BI-AA with MtPan model. 3) The positions of the orthropterans Gryllotalpa orientalis and Locusta migratoria remained unclear in their analysis. 4) The plecopteran Pteronarcys was recovered outside the polyneopterans, clustering with the Diptera. All of these cases were strongly supported by Bayesian posterior probabilities, but it is known that biases in deep phylogenies might increase supports of incorrect relationships. They attributed the non-monophyletic clade of Holometabola to a biased sampling, lacking orders like Mecoptera, Siphonaptera, Trichoptera or Neuroptera, but once more they removed the Hymenoptera from the analyses. Timmermans et al (2007)  re-evaluated the Collembola position using ribosomal protein gene sequences, which resulted in the supported monophyly of Hexapoda for all methodologies used (MP, ML and BI). They also found the inconsistency between nuclear and mitochondrial data when analyzing pancustracean relationships and clearly claimed that "caution is needed when applying mitochondrial markers in deep phylogeny".
The limitations of the mitochondrial genome as phylogenetic marker were already pointed out by Curole and Kocher (1999) , when an increasing number of mitogenomes were sequenced and resulting phylogenies conflicted with morphological and nuclear hypotheses in the deep relationships of tetrapods and arthropods, as well as in mammals . Within Insecta, more than one hundred mitogenomes are available now in GenBank/DDBJ/EMBL and they have been used to successfully resolve intra-ordinal relationships, such as in Diptera , Hymenoptera , Orthoptera  and Nepomorpha (Heteroptera) . We report the difficulties to work on inter-ordinal relationships within Insecta, although showing that they can be generally avoided by using the BI-AA under the site-heterogeneous mixture model (CAT). However, we conclude that divergences in mitochondrial sequences above super-order levels represent an insurmountable problem for current methods. This result is at least valid for Arthropoda mitochondrial genomes, but difficult to extrapolate to other groups of organisms. We must remember some exceptional characteristics of the Insecta and Arthropoda in general, like high AT-content, the parasitic life-styles present in some groups or explosive radiation events in others. It is thus possible that a more relaxed evolutionary process in other metazoans allows for slightly deeper studies, and the limits of each dataset should be independently assessed.
Mitochondrial single genes reliability
Scale-factor and Robinson-Foulds distances for individual mitochondrial genes.
Scale-factor and Robinson-Foulds distance when comparing five concatenated genes versus whole genome in different datasets.
(long-branched taxa excluded)
(long-branched taxa excluded)
(long-branched taxa excluded)
Although several innovative phylogenetic methods have been developed to improve mitochondrial phylogenetic trees in some groups of organisms, results have been controversial in insects, leading to different conclusions that most often disagree with more generally accepted relationships obtained from nuclear and morphological data. Thus, insects constitute a perfect model to test different methodologies and to better understand phylogenetic inference behaviour. Here we tested a battery of those strategies with three datasets of complete mitochondrial genomes of Insecta, including problematic taxa usually excluded from the analyses, and we compared the results with the current nuclear and morphological state of knowledge. The results suggested that the use of amino acid sequences instead of DNA is more appropriated at the inter-ordinal level and that the use of the site-heterogeneous mixture model (CAT) under a Bayesian framework, currently implemented in the software PhyloBayes, substantially avoids LBA artefacts. We show that inferring phylogenies above the super-order level constitutes the limit of the phylogenetic signal contained in insect mitochondrial genomes for currently available phylogenetic methods. For many of the relationships studied, we demonstrate for the first time that, with the proper methodology, mitochondrial data supports the most generally accepted hypotheses based on nuclear and morphological data. Thus, we confirm the non-monophyly of Homoptera within Paraneoptera, and recover Strepsiptera as a sister order to Coleoptera. In the basal splitting events in Holometabola we recover the Hymenoptera-Mecopterida association, and Coleoptera + Strepsiptera form a clade sister to the rest of Holometabola, which coincides with one of the two most accepted hypotheses. Recovered basal relationships in Paraneoptera differ from the currently accepted hypothesis in the position of Phthiraptera, which is recovered as sister to Thysanoptera, resulting in a paraphyletic Psocodea. By comparing single-gene to whole genome tree topologies, we select the five genes best performing for deep Insect phylogenetic inference. The combined used of these five genes (cox1, nad1, cytb, nad2 and nad4) produces results comparable to those of mitogenomes, and we recommend the prioritary use of these markers in future studies.
A total of 55 complete or almost complete Eumetabola mitochondrial genomes (17 of Paraneoptera and 38 of Holometabola) were downloaded from GenBank (Additional file 1: Table S1). Analyses were conducted using 3 datasets; 1) Holometabola, 2) Paraneoptera and 3) Eumetabola (Paraneoptera + Holometabola) in order to assess phylogenetic behaviour in a higher divergence level.
Every gene was translated to protein according to the arthropod mitochondrial genetic code and individually aligned using Mafft 5.861 . To produce the DNA alignments, gaps generated in the protein alignment were transferred to the non-aligned DNA sequences using PutGaps software . The resulting DNA and protein alignments for each gene were concatenated after removing problematic regions using Gblocks 0.91  under a relaxed approach  with the next set of parameters: "Minimum Number Of Sequences For A Conserved Position" = 9, "Minimum Number Of Sequences For A Flank Position" = 13, "Maximum Number Of Contiguous Nonconserved Positions" = 8, "Minimum Lenght Of A Block" = 10, "Allowed Gap Positions" = "With Half", and the kind of data was "by codons" for DNA and "Protein" for the aminoacids.
tRNA and rRNA sequences were individually aligned using ProbconsRNA 1.1  and ambiguously aligned regions removed with Gblocks with the same parameters used for DNA. For the Paraneoptera dataset, both tRNA-Leu sequences from Aleurodicus dugesii were removed because they were extremely long in comparison to the rest and affected the alignment mechanism. For the Holometabola dataset, the large subunit ribosomal RNA sequence from Anophophora glabripennis, the tRNA-Met from Ostrinia nubilalis and Ostrinia furnacalis, and the tRNA-Trp from Cysitomia duplonata were unusually short and were not included. All these fragments were excluded from the Eumetabola dataset as well. Sequences were concatenated, and gaps were used instead of the removed RNAs and the few lacking coding genes.
Strategies for phylogenetic analysis
We tested several strategies for phylogenetic analyses on the three datasets. These differed in the phylogenetic algorithm, the treatment of saturation, and the use of different models of replacement: 1) Maximum likelihood on protein alignments under the MtRev and MtArt models (ML-AA); 2) Maximum likelihood on protein alignments under Empirical profile mixture models (20 and 60 profiles) (ML-AA-CAT) [38, 115] 3) Bayesian inference on protein alignments under the MtRev model; 4) Bayesian inference on DNA alignments including only first and second codon positions for the 13 coding genes under the GTR+I+G model (BI-DNA); 5) Bayesian inference on DNA alignments including first and second codon positions of the 13 coding genes, plus 22 tRNA and 2 rRNA  and under the GTR+I+G model; 6) Bayesian inference with a site specific rate model for all DNA + RNA positions  7) Bayesian inference under the CAT model on DNA alignments including only first and second codon positions of the 13 coding genes; 8) Bayesian inference under the CAT model of protein alignments from the 13 coding genes (BI-AA-CAT) (Additional file 1: Table S2).
For maximum likelihood analyses, the software PhyML 2.4.4  with the empirical MtRev model and six gamma rate categories was used. PhyML-CAT applying mixture models (C20 and C60)  was used when testing an alternative to empirical rate matrices in ML. For Bayesian inference, we used MrBayes v. 3.1.2  and PhyloBayes 2.3 . For MrBayes calculations in DNA alignments we used two partitions (first and second position of every codon), the GTR+I+G model, and four chains of 5.000.000 trees, sampling every 5000 generations. When including coding genes + tRNA + rRNA, sequences were partitioned in three independent partitions, one for each sequence type. For MrBayes analyses on protein alignments we used the MtRev model and four chains of 1.000.000 trees, sampling every 1000 generations, and applied a burn-in of 10% generations. For PhyloBayes analyses we used the site-heterogeneous mixture model CAT model for aminoacid sequences and the GTR-CAT model for the nucleotide sequences, and we run two independent chains of 5000 cycles, removing the first 1000 and sampling one point every five. For the site-specific rate model, characters were divided into six discrete rate categories using TreePuzzle  and partitioned in MrBayes from fastest to slowest, following a similar approach than in Kjer & Honeycutt . Convergence of independent runs was checked with the software Tracer v1.4.
For the whole Eumetabola dataset, the CAT-BP model was tested, using the software nhPhylobayes v.023 [119, 120]. This model is supposed to better account for amino-acid compositional heterogeneity, because it allows breakpoints along the branches of the phylogeny at which the amino acid composition can change. The number of components in the mixture were fixed to 120, according to the previous CAT-based phylogeny for Eumetabola. Four independent chains were run, and only two of them converged after highly demanding computation. Taking every tenth sampled tree, a 50% majority rule conseus tree was computed using the converged chains.
To statistically compare the CAT model with the stardard site-homogeneus models, cross validation statistics with PhyloBayes 3.3b were performed between the amino acid models (MtRev and CAT), as described in Philippe et al (2011) .
Mitochondrial single genes reliability
In order to explore the contribution of each individual gene to the concatenated tree, 13 single-gene phylogenies from the Paraneoptera dataset were reconstructed with BI-DNA excluding third codon positions. We scored each single-gene resulting phylogeny based on Robinson-Foulds distances and relative scale-factor values  using the complete mitochondrial tree as reference. The 5 best-scored genes were selected according to Robinson-Foulds distances and they were used to infer Paraneoptera, Holometabola and Eumetabola 5-gene phylogenies. Again, Robinson-Foulds distances and relative scale-factor values were calculated. In the same way, we also tested 5-gene performance when following a common practice in mitochondrial phylogenies of insects: the removal of rapidly evolving lineages with branch lengths deviating from the mean of the reference tree. To do that, taxa with a divergence to the root of the tree higher than 0.5 substitutions/position for Paraneoptera and Holometabola datasets and higher than 0.6 substitutions/position for the Eumetabola were removed. Thus, a total of 6 datasets were scored for the Robinson-Foulds distance and the scale-factor.
We wish to thank José Castresana for his contribution and helpful ideas in designing the experiments and for comments on the manuscript, Cymon Cox for help with the nhPhylobayes analysis, and three anonymous reviewers for highly useful comments. Support for this research was provided by the Spanish Ministerio de Ciencia e Innovación project CGL2007-60516/BOS and CGL2010-21226/BOS; and predoctoral fellowships I3P-CPG-2006 and BES-2008-002054 to GT.
- 1.Hennig W: Die Stammesgeschichte der Insekten. 1969, Kramer, FrankfurtGoogle Scholar
- 2.Kristensen NP: Phylogeny of extant hexapods. The insects of Australia: A textbook for students and research workers. 1991, Melbourne University PressGoogle Scholar
- 3.Kristensen NP: Phylogeny of Endopterygote insects, the most successful lineage of living organisms. European Journal of Entomology. 1999, 96 (3): 237-253.Google Scholar
- 5.Wheeler WC, Whiting M, Wheeler QD, Carpenter JM: The phylogeny of the extant hexapod orders. Cladistics. 2001, 17 (4): 403-404.Google Scholar
- 7.Ogden TH, Whiting MF, Wheeler WC: Poor taxon sampling, poor character sampling, and non-repeatable analyses of a contrived dataset do not provide a more credible estimate of insect phylogeny: a reply to Kjer. Cladistics. 2005, 21 (3): 295-302. 10.1111/j.1096-0031.2005.00061.x.Google Scholar
- 11.Wiegmann BM, Trautwein MD, Kim JW, Cassel BK, Bertone MA, Winterton SL, Yeates DK: Single-copy nuclear genes resolve the phylogeny of the holometabolous insects. BMC Biology. 2009, 7: 36-10.1186/1741-7007-7-36.Google Scholar
- 20.Giribet G, Ribera C: A review of arthropod phylogeny: New data based on ribosomal DNA sequences and direct character optimization. Cladistics. 2000, 16 (2): 204-231. 10.1111/j.1096-0031.2000.tb00353.x.Google Scholar
- 21.Cameron SL, Miller KB, D'Haese CA, Whiting MF, Barker SC: Mitochondrial genome data alone are not enough to unambiguously resolve the relationships of Entognatha, Insecta and Crustacea sensu lato (Arthropoda). Cladistics. 2004, 20 (6): 534-557. 10.1111/j.1096-0031.2004.00040.x.Google Scholar
- 26.Timmermans M, Roelofs D, Mariën J, van Straalen NM: Revealing pancrustacean relationships: Phylogenetic analysis of ribosomal protein genes places Collembola (springtails) in a monophyletic Hexapoda and reinforces the discrepancy between mitochondrial and nuclear DNA markers. BMC Evolutionary Biology. 2008, 8: 83-10.1186/1471-2148-8-83.PubMedPubMedCentralGoogle Scholar
- 27.Boore JL, Collins TM, Stanton D, Daehler LL, Brown WM: Deducing the pattern of arthropod phylogeny from mitochondrial-DNA rearrangements. Nature. 1995, 13: 163-165.Google Scholar
- 28.Cameron SL, Lambkin CL, Barker SC, Whiting MF: A mitochondrial genome phylogeny of Diptera: whole genome sequence data accurately resolve relationships over broad timescales with high precision. Systematic Entomology. 2007, 32 (1): 40-59. 10.1111/j.1365-3113.2006.00355.x.Google Scholar
- 30.Fenn J, Song H, Cameron SL, Whiting MF: A preliminary mitochondrial genome phylogeny of Orthoptera (Insecta) and approaches to maximizing phylogenetic signal found within mitochondrial genome data. Molecular Phylogenetics and Evolution. 2008, 49 (1): 59-68. 10.1016/j.ympev.2008.07.004.PubMedGoogle Scholar
- 41.Felsenstein J: Cases in which parsimony or compatibility methods will be positively misleading. Systematic Zoology. 1978, 27 (4): 401-410. 10.2307/2412923.Google Scholar
- 46.Philippe H, Delsuc F, Brinkmann H, Lartillot N: Phylogenomics. Annual Review of Ecology, Evolution and Systematics. 2005, 36: 541-562. 10.1146/annurev.ecolsys.35.112202.130205.Google Scholar
- 48.Abascal F, Posada D, Knight RD, Zardoya R: Parallel evolution of the genetic code in arthropod mitochondrial genomes. PLoS Biology. 2006, 4: 711-718.Google Scholar
- 57.Philippe H, Derelle R, Lopez P, Pick K, Borchiellini C, Boury-Esnault N, Vacelet J, Renard E, Houliston E, Quéinnec E, Da Silva C, Wincker P, Le Guyader H, Leys S, Jackson DJ, Schreiber F, Erpenbeck D, Morgenstern B, Wörheide G, Manuel M: Phylogenomics Revives Traditional Views on Deep Animal Relationships. Current Biology. 2009, 19 (8): 706-712. 10.1016/j.cub.2009.02.052.PubMedGoogle Scholar
- 60.Rota-Stabelli O, Kayal E, Gleeson D, Daub J, Boore JL, Telford MJ, Pisani D, Blaxter M, Lavrov DV: Ecdysozoan mitogenomics: evidence for a common origin of the legged invertebrates, the Panarthropoda. Genome Biology and Evolution. 2010, 2: 425-440. 10.1093/gbe/evq030.PubMedPubMedCentralGoogle Scholar
- 66.Crowson RA: The biology of the Coleoptera. 1981, London: Academic PressGoogle Scholar
- 67.Kathirithamby J: Review of the order Strepsiptera. Systematic Entomology. 1989, 14 (1): 41-92. 10.1111/j.1365-3113.1989.tb00265.x.Google Scholar
- 68.Kukalova-Peck J, Lawrence J: Evolution of the hind wing in coleoptera. Canadian Entomologist. 1993, 125: 181-258. 10.4039/Ent125181-2.Google Scholar
- 69.Whiting MF: Phylogeny of the holometabolous insect orders: molecular evidence. Zoologica Scripta. 2002, 31: 3-17. 10.1046/j.0300-3256.2001.00093.x.Google Scholar
- 70.Chalwatzis N, Baur A, Stetzer E, Kinzelbach R, Zimmermann F: Strongly expanded 18S rRNA genes correlated with a peculiar morphology in the insect order of Strepsiptera. Zoologica. 1995, 98: 115-126.Google Scholar
- 71.Chalwatzis N, Hauf J, Van de Peer Y, Kinzelbach R, Zimmermann F: 18S ribosomal RNA genes of insects: Primary structure of the genes and molecular phylogeny of the Holometabola. Annals of the Entomological Society of America. 1996, 89: 788-803.Google Scholar
- 79.Friedrich F, Beutel RG: Goodbye Halteria? The thoracic morphology of Endopterygota (Insecta) and its phylogenetic implications. Cladistics. 2010, 26: 1-34. 10.1111/j.1096-0031.2009.00297.x.Google Scholar
- 80.Beutel RG, Friedrich F, Hörnschemeyer T, Phol H, Hünefeld F, Beckmann F, Meier R, Misof B, Whiting MF, Vilhelmsen L: Morphological and molecular evidence converge upon a robust phylogeny of the megadiverse Holometabola. Cladistics. 2010, 26: 1-15. 10.1111/j.1096-0031.2009.00297.x.Google Scholar
- 82.Castro LR, Dowton M: Mitochondrial genomes in the Hymenoptera and their utility as phylogenetic markers. Systematic Entomology. 2007, 32 (1): 60-69. 10.1111/j.1365-3113.2006.00356.x.Google Scholar
- 83.Boudreaux HB: Arthropod phylogeny with special reference to insects. 1979, New York: John Wiley & SonsGoogle Scholar
- 84.Beutel R, Gorb S: Ultrastructure of attachment specializations of hexapods, (Arthropoda): evolutionary patterns inferred from a revised ordinal phylogeny. Journal Of Zoological Systematics And Evolutionary Research. 2001, 39: 177-207. 10.1046/j.1439-0469.2001.00155.x.Google Scholar
- 85.Krenn HW: Evidence from mouthpart structure on interordinal relationships in Endopterygota?. Arthropod Systematics & Phylogeny. 2007, 65 (1): 7-14.Google Scholar
- 86.Whiting M: Phylogeny of the Holometabolous Insects. Assembling the Tree of Life. 2004, Oxford University Press, 345-359.Google Scholar
- 87.Gullan PJ, Cranston PS: The Insects: An Outline Of Entomology (3rd Edition). 2004, Wiley-BlackwellGoogle Scholar
- 88.Hornschemeyer T: Phylogenetic significance of the wing-base of the Holometabola (Insecta). Zoologica Scripta. 2002, 31: 17-29. 10.1046/j.0300-3256.2001.00086.x.Google Scholar
- 89.Grimaldi D, Engel M: Evolution of the insects. 2005, Cambrigde University PressGoogle Scholar
- 90.Beutel RG, Pohl H: Endopterygote systematics - where do we stand and what is the goal (Hexapoda, Arthropoda)?. Systematic Entomology. 2006, 31 (2): 202-219. 10.1111/j.1365-3113.2006.00341.x.Google Scholar
- 91.Campbell BC, Steffen-Campbell JD, Sorensen JT, Gill RJ: Paraphyly of Homoptera and Auchenorrhyncha inferred from 18S rDNA nucelotide sequences. Systematic Entomology. 1995, 20 (3): 175-194. 10.1111/j.1365-3113.1995.tb00090.x.Google Scholar
- 92.Sorensen JT, Campbell BC, Gill RJ, Steffen-Campbell JD: Non-Monophyly of Auchenorrhyncha (Homoptera), Based Upon 18s rDNA Phylogeny - Eco-Evolutionary and Cladistic Implications within Pre-Heteropterodea Hemiptera (s.l) and a Proposal for New Monophyletic Suborders. Pan-Pacific Entomologist. 1995, 71 (1): 31-60.Google Scholar
- 94.Borner C: Zur Systematik der Hexapoden. Zool Anz. 1904, 27: 511-533.Google Scholar
- 95.Yoshizawa K, Saigusa T: Phylogenetic analysis of paraneopteran orders (Insecta: Neoptera) based on forewing base structure, with comments on monophyly of Auchenorrhyncha (Hemiptera). Systematic Entomology. 2001, 26 (1): 1-13. 10.1046/j.1365-3113.2001.00133.x.Google Scholar
- 97.Kristensen NP: The phylogeny of hexapod orders a critical review of recent accounts. Journal of Zoological Sytematics and Evolutionary Research. 1975, 13 (1): 1-44.Google Scholar
- 98.Kristensen NP: Phylogeny of insect orders. Annual Review Of Entomology. 1981, 26: 135-157. 10.1146/annurev.en.26.010181.001031.Google Scholar
- 99.Kristensen NP: The phylogeny of hexapod orders a critical review of recent accounts. Journal of Zoological Sytematics and Evolutionary Research. 1975, 13 (1): 1-44.Google Scholar
- 100.Kristensen NP: Phylogeny of insect orders. Annual Review Of Entomology. 1981, 26: 135-157. 10.1146/annurev.en.26.010181.001031.Google Scholar
- 101.Hamilton K: Morphology and evolution of the rhynchotan head (insecta, hemiptera, homoptera). Canadian Entomologist. 1981, 113 (11): 953-974. 10.4039/Ent113953-11.Google Scholar
- 102.Hennig W, Pont A, Schlee D: Insect phylogeny. 1981, John Wiley & SonsGoogle Scholar
- 103.Sharov AG: Basic arthropodan stock with special reference to insects. 1966, Pergamon Press, New YorkGoogle Scholar
- 104.Sharov AG: The phylogenetic relations of the order Thysanoptera. Entomol Obozr. 1972, 51 (4): 854-858.Google Scholar
- 112.Fitzpatrick D, Pentony M: PutGaps: DNA gapped file from Amino Acid alignment. [http://bioinf.nuim.ie/software/putgaps]
- 115.Quant LS, Gascuel O, Lartillot N: Empirical profile mixture models for phylogenetic reconstruction. Bioinformatics. 2008, 24: 2317-2323. 10.1093/bioinformatics/btn445.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.