Identification and characterization of insect-specific proteins by genome data analysis
Insects constitute the vast majority of known species, and their importance spans biodiversity, agriculture, and human health. It is likely that the successful adaptation of the Insecta clade depends on specific components in its proteome that give rise to specialized features. However, proteome determination is an intensive undertaking. Here we present results from a computational method that uses genome analysis to characterize insect and eukaryote proteomes as an approximation complementary to experimental approaches.
Homologs common to Drosophila melanogaster, Anopheles gambiae, Bombyx mori, Tribolium castaneum, and Apis mellifera were compared to the complete genomes of three non-insect eukaryotes (opisthokonts): Homo sapiens, Caenorhabditis elegans and Saccharomyces cerevisiae. This operation identified 154 groups of orthologous proteins in Drosophila as insect-specific homologs; 466 groups were determined to be common to eukaryotes (represented by three opisthokonts). ESTs from the hemimetabolous insect Locusta migratoria were also considered in order to approximate their corresponding genes in the insect-specific homologs. Stress and stimulus response proteins were found to constitute a higher fraction in the insect-specific homologs than in the homologs common to eukaryotes.
The significant representation of stress response and stimulus response proteins in proteins determined to be insect-specific, along with specific cuticle and pheromone/odorant binding proteins, suggests that communication and adaptation to environments may distinguish insect evolution relative to other eukaryotes. The tendency for low Ka/Ks ratios in the insect-specific protein set suggests purifying selection pressure. The generally larger number of paralogs in the insect-specific proteins may indicate adaptation to environment changes. Instances in our insect-specific protein set have been arrived at through experiments reported in the literature, supporting the accuracy of our approach.
Keywords: Cuticular Protein, Paralogous Group, NCBI Protein Database, Drosophila Protein, Proteome Characterization
Insects constitute nearly 80% of species on earth and are among the most diverse groups of organisms in the history of life, giving them considerable potential to provide insight into evolutionary mechanisms. Insects, with their large number of species, their biomass, diversity of adaptation, and ecological impact, support the structure and function of terrestrial ecosystems and biodiversity. Numerous crops rely on insects for pollination, and the importance of insects extends into other agricultural and human health concerns. Insects have been in existence for at least 400 million years, making them among the earliest land animals. Though nearly one million insect species have been classified and named, their actual number is believed to be between 2.5 and 10 million. It is widely accepted that insects diverged as members of one of the largest subphyla in arthropods more than 390 million years ago. During this time, insects experienced rapid evolution and a radiation that is considered faster than that of any other group, migrating into nearly all available environmental niches except the benthic zone. Mitochondrial DNA strongly supports an insect-crustacean clade as a sister group, which excludes the other arthropod subphyla collectively known as the myriapods. The insects are a monophyletic group, a universally held view supported by morphological and molecular features.
The structure of an organism is an outgrowth of development tailored to meet functional demands in an idiosyncratic evolutionary history. Like other segmented animals, insects are composed of a series of repeated units called metameres. Extant arthropods share many taxonomical characteristics, such as an exoskeleton, jointed appendages, and reduced coeloms and hemocoels. The segments of the insect body are organized into three major tagmata unique to this subclass: the head, thorax, and abdomen. The thorax has three pairs of legs and, in pterygotes, the wings. The abdomen bears an ovipositor in females. In addition to the macro-scale features mentioned above, other defining features of the Insecta include: the loss of musculature and the presence of Johnston's organ in the antenna, loss of articulations between the coxae and the sterna, sub-segmentation of the tarsus into units called tarsomeres, articulation of the pretarsal claws with the apical-most tarsomere, and the presence, at least primitively, of a long terminal filament. Insects are one of only four lineages of animals with powered flight, the others being pterosaurs, birds, and bats. Wings refine insect design, vastly improving mobility, dispersal, and complex behaviors to adapt to environmental challenges. It is widely held that insects evolved flight just once, at least 100 million years before pterosaurs, perhaps 170 million years ago. Other noteworthy features include the development of the posterior tentorium into a transverse bar, and metamorphosis and segmentation of metameres [7, 8].
It is likely that the specialized features of the Insecta clade are based on components specific to its proteome. Characterization of this protein set should improve understanding of the molecular basis for the diversification of insects and their extensive success in ecological niches. Toward elucidating this molecular basis, we have characterized the eukaryote and insect proteomes. The large number of eukaryote genome sequences now available, including various insect genomes, makes it possible to characterize proteomes computationally. In this work, we utilized the genome sequences of five insects (fruit fly, mosquito, silkworm, beetle, and honeybee), locust ESTs, and the non-insect eukaryote genomes of nematode, human, and yeast. (The insect species in our study cover holometabolous and hemimetabolous development.) Since our approach utilizes genome sequence for approximating the proteome, the resolution of the proteome characterization improves as more genomes become available. This rapid characterization of proteomes through computation facilitates rational hypothesis generation and experiment design in applied research in many areas, such as biodiversity, agriculture and human health.
Insect and Eukaryote protein sets
We modeled the insect proteome by selecting the subset of Drosophila protein sequences with homology to predicted genes in all insect species studied here. Similarly, we defined the subset in Drosophila common to the eukaryote species studied here: mosquito, silkworm, beetle, honeybee, human, nematode and yeast. Because at this time it is not possible to definitively determine the eukaryote and insect proteomes, estimates are useful for comparative assessments. Our protein sets were derived from a collection of 13,525 protein sequences established for Drosophila melanogaster, which we reduced to 10,018 orthologous groups; proteins with significant similarity to one another were treated as a single entry in our processing, since paralogs may have arisen after speciation.
To determine the proteins in the Drosophila orthologous groups common to all insects studied here, called the insect core set, we used predicted proteins from insect genome sequences and EST sequences. We obtained 1346 orthologous groups from the intersection of the whole genomes of five holometabolous insects (see Methods). One aspect of our approximation is to use homologs to Drosophila proteins to characterize proteomes, implicitly assuming that function follows structure. This could contribute to differences in our characterization from the actual proteome, but it does not significantly detract from our use of the characterizations. We discuss further implications of our approximation in more detail below.
We found 466 proteins with homology to all eukaryotes considered in this study using methods similar to those above [see Additional file 3].
GO annotations and functional categories
We categorized proteins in the eukaryote (466 groups in opisthokont) and insect-specific sets (154 groups) using high-level gene ontology categories with results shown in Figure 2. In both the eukaryote and insect-specific sets, metabolic proteins constituted the highest fraction, 25% and 20%, respectively. Disproportionately represented categories are interesting to consider for candidate proteins that confer distinguishing characteristics. In the eukaryote/opisthokont set, genes responsible for processes such as cell division, cell motility, cell cycle, reproduction and cellular process are more highly represented by factors from about two to twenty. These proteins and their respective functional categories may distinguish insects less from eukaryotes/opisthokont than those proteins in categories that have a significant representation in the insect-specific set and are underrepresented in the eukaryotic/opisthokont set. These more highly represented categories in the insect-specific set are: larval development (2% in opisthokont, 4% in insect); defense response (0 in opisthokont, 6% in insect); and stress response (0.2% in opisthokont, 6% in insect). In addition, a significant number of the insect-specific proteins were found to be related to pheromone/odorant binding proteins (OBP), insect cuticle proteins, and proline-rich proteins [see Additional file 2].
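To make the comparison of category shares concrete, the following minimal sketch computes the fold difference between the two sets for the three categories whose percentages are quoted above. The function name `representation_ratio` and the dictionary layout are illustrative assumptions, not part of the original analysis.

```python
import math

# Percentages quoted in the text: (eukaryote/opisthokont set, insect-specific set).
go_fractions = {
    "larval development": (2.0, 4.0),
    "defense response": (0.0, 6.0),
    "stress response": (0.2, 6.0),
}


def representation_ratio(eukaryote_pct, insect_pct):
    """Fold difference insect/eukaryote (hypothetical helper).

    Returns infinity when a category is absent from the eukaryote set
    but present in the insect-specific set.
    """
    if eukaryote_pct == 0:
        return math.inf if insect_pct > 0 else float("nan")
    return insect_pct / eukaryote_pct


ratios = {name: representation_ratio(e, i) for name, (e, i) in go_fractions.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

A ratio well above one flags a category as a candidate for insect-distinguishing function; an infinite ratio simply means no protein in the eukaryote set carried that annotation.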
Biological process categories
The five insects with whole genomes are all holometabolous and might not be representative of all insects. At present, no complete genome has been sequenced for a hemimetabolous insect, most likely because hemimetabolous insects often have large genomes (more than 2 gigabases). Fortunately, 45,474 high quality EST sequences from the hemimetabolous migratory locust permit us to include a hemimetabolous insect in the analysis. We determined the insect-specific orthologs in the locust ESTs to arrive at a collection of six sets of insect-specific proteins. Our analysis found the functional distribution of the orthologous proteins in the six insects to be similar to that of the largest set from the five holometabolous insects [see Additional file 2].
As noted above, the computed insect-specific protein dataset is an approximation dependent on available genome sequence. Inclusion of additional genomic data could alter the protein set. The lack of many representative outgroups might cause false positives, i.e. some proteins might be inaccurately included in our list. For example, the gene CG6895 related to immune function is identified as an insect-specific gene in this study, but its homolog was recently reported in the sea urchin. Improved quality of genome sequences and gene annotations for the insects used in this study will improve the accuracy of our computed protein sets [13, 14].
Molecular function categories
A considerable number of the 51 insect-specific proteins were found to be related to insect cuticle proteins and pheromone/odorant binding proteins (OBP) [see Additional file 2]. Molting and metamorphosis are crucial processes in the developmental history of the insects involving cuticular proteins. Cuticular proteins are involved in important composite structural materials for insect cuticles, which provide protection, support, and locomotion; these prevent water loss via a wax layer, provide sites for waste product deposition, and protect from ultraviolet radiation. Olfaction is essential to insect survival and reproduction, such as in location of food sources and mate selection. These olfactory driven behaviors contribute significantly to the ability of insects to adapt to the environment. The odorant-binding proteins, which compose the insect olfactory system, are involved in the recognition of odorants of plants by insects [16, 17]. The pheromone binding proteins (PBP), abundantly present in the sensillum lymph of pheromone-responsive antennal hairs, are thought to be important in the recognition and discrimination of species-specific pheromones [18, 19]. The olfactory system in insects evolved as a remarkably selective and sensitive system, approaching the theoretical limit for a detector. Even a single pheromone molecule is enough to elicit impulses at the olfactory neuron [20, 21]. The large number of odorant and olfactory proteins in the insect-specific set suggests that in the evolution and diversification of insects, communication with and adaptation to the environment played key roles in shaping their morphological and physiological characteristics.
Other proteins in our insect-specific set have been found essential to development through experimental procedures [22, 23, 24, 25], supporting our insect-specific proteome characterization. Moreover, these have been found to be active in insects and are of interest for evolutionary reasons including their suspected roles in diversification. For example, the gene sinuous (CG10624), which is active in tracheal system development, can partially rescue the tracheal defects of sinuous mutants. The Exuperantia (Exu) protein in our insect-specific set is the earliest factor known to be required for the localization of bicoid mRNA to the anterior pole of the Drosophila oocyte. Exu is highly enriched in the sponge bodies; mutation of exu in Drosophila may result in defects in embryonic development. Larval serum proteins (Lsp), another type of protein in the insect-specific set, belong to the hemocyanin superfamily. This family is thought to function as storage proteins that provide amino acids and energy during non-feeding periods of immature and adult development [24, 25].
Low mutation rate of insect-specific proteins
Our analysis suggests that our working set of insect-specific proteins has been shaped by strong natural selection, with environment as one of the selective influences.
An analysis of the genetic basis of evolution and development in insects was performed by characterizing the eukaryote/opisthokont core and insect-specific proteomes through genome analysis. Studies of the conservation and divergence between different organisms can provide clues to the molecular basis of species diversity and adaptation. The characterization of proteomes based on genome sequences provides a rapid method to approximate and update putative proteomes as genome sequences become available. Using this approach, we isolated fifty insect-specific proteins, many supported by experimental studies.
Proteins related to stress and immune responses constitute a significantly larger fraction of the proteins in our characterization of the insect-specific proteome, in contrast to our characterization of the eukaryote/opisthokont core proteome. The large component of olfaction and cuticle development proteins specific to the insect suggests the significance of communication and adaptation to the environment in insect evolution. Purifying selection in the evolution of insects was indicated by the analysis of nonsynonymous-to-synonymous substitution ratios, with a larger fraction of multi-paralog proteins possibly providing insects with an adaptive advantage over other eukaryotes. Due to the nature of our computational method, the number of proteins in our insect-specific set can increase or decrease with the inclusion of additional genome data from insects and non-insect species.
The protein sets in this work were founded on 18,282 protein sequences of Drosophila melanogaster obtained from Ensembl. Genes were predicted in genome sequences for Anopheles gambiae (mosquito) and Bombyx mori (silkworm) [33, 34]. Proteins of Tribolium castaneum and Apis mellifera were obtained from HGSC. Homologs to the insect protein sequences were isolated in annotated genomes of human, yeast and nematode. We obtained the Anopheles gambiae (mosquito) genome annotated with 16112 proteins (anopheles-21.2b) from Ensembl. The annotated human genome sequence draft (hg17) was obtained from UCSC, the worm genome (celegans-21.116a) from Ensembl, and the yeast genome from the Saccharomyces Genome Database (SGD). Proteins were obtained for D. yakuba from FlyBase for use in Ka/Ks analysis. The locust (Locusta migratoria) UniGene collection with 12,161 ESTs and cDNA sequences was obtained from LocustDB [11, 42].
Sequence alignment was performed with BLAST using the BLOSUM62 scoring matrix and default parameters. Gene prediction was performed using the gene-finder algorithm BGF used in BGI GeneFinder, based on GenScan and FgeneSH.
We grouped homologous protein sequences into paralogous groups. Protein sequences were considered paralogous if their alignment had an E-value less than or equal to 1e-5 and the alignment covered 70% or more of one of the aligned proteins. We represented paralogous groups by the longest member in the group, with the size of the group determined by the number of unique sequences in it.
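The grouping rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not say how pairwise relations were merged into groups, so single-linkage clustering via union-find is assumed here, and the tuple layout of `hits`, the protein identifiers, and the function names are hypothetical. Only the two thresholds (E-value of 1e-5, 70% coverage) and the longest-member representative come from the text.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-5  # from the text
MIN_COVERAGE = 0.70    # from the text: 70% or more of one of the aligned proteins


def is_paralogous(hit, lengths):
    """hit = (query_id, subject_id, evalue, query_aligned_len, subject_aligned_len)."""
    q, s, evalue, q_aln, s_aln = hit
    if q == s or evalue > E_VALUE_CUTOFF:
        return False
    # The alignment must cover at least 70% of either protein.
    return q_aln / lengths[q] >= MIN_COVERAGE or s_aln / lengths[s] >= MIN_COVERAGE


def paralogous_groups(lengths, hits):
    """Merge paralogous pairs into groups (single linkage is an assumption).

    Returns {representative_id: set_of_member_ids}; the representative is the
    longest member, with ties broken by identifier for determinism.
    """
    parent = {p: p for p in lengths}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for hit in hits:
        if is_paralogous(hit, lengths):
            a, b = find(hit[0]), find(hit[1])
            if a != b:
                parent[a] = b

    groups = defaultdict(set)
    for p in lengths:
        groups[find(p)].add(p)
    return {max(m, key=lambda p: (lengths[p], p)): m for m in groups.values()}


# Hypothetical example: A and B pass both thresholds; B-C fails coverage; C-D fails E-value.
lengths = {"A": 300, "B": 280, "C": 500, "D": 120}
hits = [("A", "B", 1e-40, 250, 250), ("B", "C", 1e-8, 100, 100), ("C", "D", 1e-3, 110, 110)]
groups = paralogous_groups(lengths, hits)
```

The size of each group is then `len(members)`, matching the count of unique sequences described above.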
Proteome characterization using a genome-based pipeline
We defined protein sets based on Drosophila proteins in our processing pipeline to characterize proteomes. Similarity with genome sequences, predicted proteins, and ESTs was used to cull sets determined in the processing pipeline as described below. Thus, it is important to note that the various protein sets we computationally arrive at characterize insect and eukaryote proteomes through homology.
The insect core set was arrived at by selecting proteins in the Drosophila protein data set with similarity to mosquito and silkworm protein sequences predicted by genome analysis, and with similarity to the locust EST sequence data. Protein sequences for predicted genes in silkworm and mosquito were aligned against fruit fly using blastp and considered homologous with an E-value cutoff of 1e-5 or less; in addition, we required that the length of the aligned sequences be within 70% of each other (Figure 5).
The insect-specific protein set was derived from the insect core set by including proteins without significant alignment (E-value of 1e-5 or less) to the genome sequences of human, nematode, or yeast. In addition, sequences in the insect core set were retained for the insect-specific set if any alignment covered less than 30% of the insect protein sequence. The insect-specific proteins were further assessed against the NCBI protein database, retaining sequences without significant similarity and less than 30% alignment coverage with all non-insect proteins (Figure 5).
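As a rough illustration of the core-set selection, the sketch below keeps only those fruit fly proteins that have a qualifying homolog in every comparison species. The phrase "within 70% of each other" is read here as a shorter-to-longer length ratio of at least 0.70, which is an assumption; the identifiers, the `hits_by_species` layout, and the function names are likewise hypothetical. EST comparisons would likely need a different length criterion, which is not modeled.

```python
E_VALUE_CUTOFF = 1e-5    # from the text
MIN_LENGTH_RATIO = 0.70  # assumed reading of "within 70% of each other"


def is_homolog(evalue, fly_len, other_len):
    """True when the hit passes the E-value cutoff and the length-ratio check."""
    ratio = min(fly_len, other_len) / max(fly_len, other_len)
    return evalue <= E_VALUE_CUTOFF and ratio >= MIN_LENGTH_RATIO


def insect_core(fly_proteins, hits_by_species):
    """Intersect the fly proteins supported by a homolog in each species.

    hits_by_species: {species: [(fly_id, evalue, fly_len, other_len), ...]}
    """
    core = set(fly_proteins)
    for hits in hits_by_species.values():
        supported = {
            fly_id
            for fly_id, evalue, fly_len, other_len in hits
            if is_homolog(evalue, fly_len, other_len)
        }
        core &= supported
    return core


# Hypothetical example with made-up identifiers and values.
fly = ["CG1", "CG2", "CG3"]
hits_by_species = {
    "mosquito": [("CG1", 1e-30, 400, 380), ("CG2", 1e-20, 400, 390), ("CG3", 1e-2, 200, 210)],
    "silkworm": [("CG1", 1e-12, 400, 350), ("CG2", 1e-9, 400, 150)],
}
core = insect_core(fly, hits_by_species)
```

In this example CG2 drops out because its silkworm hit is much shorter than the fly protein, and CG3 drops out on E-value alone.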
Proteins in the insect core set with an E-value cutoff of 1e-5 or less in alignments with each of the non-insect eukaryotes, and involving 50% or more of the insect protein in the alignments, were included in the eukaryote core protein set.
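The two derived sets can be viewed as a three-way classification of each insect core protein. The sketch below is one possible reading of the criteria: a protein is called insect-specific when no non-insect species yields a hit with an E-value of 1e-5 or less that also covers at least 30% of the insect protein, and eukaryote core when every non-insect species yields such a hit covering 50% or more. The `classify` function, its input layout, and the "unassigned" label are hypothetical, and the additional screen against the NCBI protein database is not modeled.

```python
E_VALUE_CUTOFF = 1e-5                       # from the text
NON_INSECTS = ("human", "nematode", "yeast")
SPECIFIC_MAX_COVERAGE = 0.30                # below this, a hit does not disqualify
CORE_MIN_COVERAGE = 0.50                    # required in every non-insect species


def classify(best_hits):
    """Classify one insect core protein.

    best_hits: {species: (evalue, fraction_of_insect_protein_aligned)};
    a species missing from the dict means no alignment was found.
    """

    def significant(species, min_coverage):
        hit = best_hits.get(species)
        return hit is not None and hit[0] <= E_VALUE_CUTOFF and hit[1] >= min_coverage

    if all(significant(sp, CORE_MIN_COVERAGE) for sp in NON_INSECTS):
        return "eukaryote core"
    if not any(significant(sp, SPECIFIC_MAX_COVERAGE) for sp in NON_INSECTS):
        return "insect-specific"
    return "unassigned"


# Hypothetical examples.
no_hits = classify({})
short_hit = classify({"human": (1e-10, 0.20)})
conserved = classify({"human": (1e-50, 0.90), "nematode": (1e-20, 0.60), "yeast": (1e-7, 0.55)})
partial = classify({"human": (1e-50, 0.90)})
```

Proteins that fall between the two definitions, such as one with a strong hit in human only, belong to neither set under this reading.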
Interpro annotation of insect proteins
Functional annotations for proteins in each of the working insect proteomes were determined using the annotation tool InterProScan and Gene Ontology nomenclature. GO terms were downloaded from the Gene Ontology Consortium.
Ka/Ks ratio calculation
For each Drosophila melanogaster protein, we selected the most similar ortholog in the Drosophila yakuba proteome and used YN00 to calculate Ka/Ks ratios.
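The ortholog selection and the interpretation of the resulting ratios can be sketched as below. This is an illustration only: ranking hits by bit score is an assumed definition of "most similar", the Ka and Ks values are taken as already estimated by an external program, and the function names, identifiers, and the `tolerance` band around 1 are hypothetical. The convention that Ka/Ks below 1 indicates purifying selection is standard.

```python
def best_orthologs(hits):
    """Keep the top-scoring D. yakuba hit for each D. melanogaster protein.

    hits: iterable of (mel_id, yak_id, bit_score); bit score as the
    similarity measure is an assumption.
    """
    best = {}
    for mel_id, yak_id, score in hits:
        if mel_id not in best or score > best[mel_id][1]:
            best[mel_id] = (yak_id, score)
    return {mel_id: yak_id for mel_id, (yak_id, _) in best.items()}


def ka_ks(ka, ks):
    """Nonsynonymous-to-synonymous substitution ratio; None when Ks is zero."""
    return None if ks == 0 else ka / ks


def selection_regime(ratio, tolerance=0.05):
    """Coarse label for a Ka/Ks ratio (tolerance is an illustrative choice)."""
    if ratio is None:
        return "undefined"
    if ratio < 1 - tolerance:
        return "purifying"
    if ratio > 1 + tolerance:
        return "positive"
    return "neutral"


# Hypothetical example with made-up identifiers and rates.
pairs = best_orthologs([("mel1", "yakA", 200.0), ("mel1", "yakB", 350.0), ("mel2", "yakC", 90.0)])
regime = selection_regime(ka_ks(0.02, 0.25))
```

A set dominated by ratios well below 1 is what the text describes as evidence of purifying selection.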
This project was supported by the National Basic Research Program of China (No:2006CB102002), Chinese Academy of Sciences (GJHZ0518), Ministry of Science and Technology under program CNGI-04-15-7A, National Natural Science Foundation of China (90208019; 90403130; 30221004), and China National Grid. Other support came from Danish Platform for Integrative Biology, Ole Rømer grants from the Danish Natural Science Research Council and National Science Foundation (DBI 0217241). We thank four anonymous reviewers for their generous and constructive suggestions.
- 2. Gibert P, Capy P, Imasheva A, Moreteau B, Morin JP, Petavy G, David JR: Comparative analysis of morphological traits among Drosophila melanogaster and D. simulans: genetic variability, clines and phenotypic plasticity. Genetica. 2004, 120: 165-179. 10.1023/B:GENE.0000017639.62427.8b.
- 4. Heming BS: Insect Development and Evolution. 2003, New York: Cornell University Press, 139-151.
- 6. Kristensen NP: Phylogeny of extant hexapods. The insects of Australia; A textbook for students and research workers. Edited by: Naumann ID, Carne PB, Lawrence JF, Nielsen ES, Spradberry JP, Taylor RW, Whitten MJ, Littlejohn MJ. 1991, Melbourne: Melbourne Univ. Press, 125-140. 2nd edition.
- 9. Sabatier L, Jouanaguy E, Dostert C, Zachary D, Dimarcg JL, Bulet P, Imler JL: Pherokine-2 and -3: two Drosophila molecules related to pheromone/odor-binding proteins induced by viral and bacterial infections. Europ J Biochem. 2003, 270: 3398-3407. 10.1046/j.1432-1033.2003.03725.x.
- 21. Leal WS: Pheromone reception. Topics in current chemistry. 2005, 240: 1-36.
- 28. Li W-H: Molecular Evolution. 1997, Sunderland, Massachusetts: Sinauer Associates.
- 30. Celniker SE, Wheeler DA, Kronmiller B, Carlson JW, Halpern A, Patel S, Adams M, Champe M, Dugan SP, Frise E, Hodgson A, George RA, Hoskins RA, Laverty T, Muzny DM, Nelson CR, Pacleb JM, Park S, Pfeiffer BD, Richards S, Sodergren EJ, Svirskas R, Tabor PE, Wan K, Stapleton M, Sutton GG, Venter C, Weinstock G, Scherer SE, Myers EW, Gibbs RA, Rubin GM: Finishing a whole-genome shotgun: release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome Biol. 2002, 3: research0079.1-14. 10.1186/gb-2002-3-12-research0079.
- 31. Ensembl Genome Browser. [http://www.ensembl.org/index.html]
- 32. Holt Robert, Mani Subramanian G, Halpern Aaron, Sutton Granger, Charlab Rosane, Nusskern Deborah, Wincker Patrick, Clark Andrew, Ribeiro José, Wides Ron, Salzberg Steven, Loftus Brendan, Yandell Mark, Majoros William, Rusch Douglas, Lai Zhongwu, Kraft Cheryl, Abril Josep, Anthouard Veronique, Arensburger Peter, Atkinson Peter, Baden Holly, de Berardinis Veronique, Baldwin Danita, Benes Vladimir, Biedler Jim, Blass Claudia, Bolanos Randall, Boscus Didier, Barnstead Mary: The genome sequence of the malaria mosquito Anopheles gambiae. Science. 2002, 298: 129-49. 10.1126/science.1076181.
- 33. Wang J, Xia Q, He X, Dai M, Ruan J, Chen J, Yu G, Yuan H, Hu Y, Li R, Feng T, Ye C, Lu C, Wang J, Li S, Wong GK, Yang H, Wang J, Xiang Z, Zhou Z, Yu J: SilkDB: a knowledgebase for silkworm biology and genomics. Nucleic Acids Research. 2005, 33 (Database issue): D399-402.
- 34. Xia Q, Zhou Z, Lu C, Cheng D, Dai F, Li B, Zhao P, Zha X, Cheng T, Chai C, Pan G, Xu J, Liu C, Lin Y, Qian J, Hou Y, Wu Z, Li G, Pan M, Li C, Shen Y, Lan X, Yuan L, Li T, Xu H, Yang G, Wan Y, Zhu Y, Yu M, Shen W: A Draft Sequence for the Genome of the Domesticated Silkworm (Bombyx Mori). Science. 2004, 306: 1937-40. 10.1126/science.1102210.
- 36. Honeybee Genome Project. [http://www.hgsc.bcm.tmc.edu/projects/honeybee/]
- 39. Stein LD, Bao Z, Blasiar D, Blumenthal T, Brent MR, Chen N, Chinwalla A, Clarke L, Clee C, Coghlan A, Coulson A, D'Eustachio P, Fitch DH, Fulton LA, Fulton RE, Griffiths-Jones S, Harris TW, Hillier LW, Kamath R, Kuwabara PE, Mardis ER, Marra MA, Miner TL, Minx P, Mullikin JC, Plumb RW, Rogers J, Schein JE, Sohrmann M, Spieth J: The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biol. 2003, 1: E45. 10.1371/journal.pbio.0000045.
- 40. The UCSC Genome Browser Database. [http://genome-test.cse.ucsc.edu/]
- 41. Saccharomyces Genome Database (SGD). [http://www.yeastgenome.org/]
- 44. Li Heng, Gao Lei, Fang Lin, Liu Tao, Li Hai-Hong, Li Yan, Fang Li-Jun, Xie Hui-Min, Zheng Wei-Mou, Liu Jin-Song, Xu Zhao, Jin Jiao, Li Yu-Dong, Xing Zi-Xing, Gao Shao-Gen, Hao Bai-Lin: Test datasets and evaluation of gene prediction programs on the rice genome. J Comput Sci & Technol. 2005, 20: 446-453. 10.1007/s11390-005-0446-x.
- 48. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-9. 10.1038/75556.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.