Genome analysis of a major urban malaria vector mosquito, Anopheles stephensi
Anopheles stephensi is the key vector of malaria throughout the Indian subcontinent and Middle East and an emerging model for molecular and genetic studies of mosquito-parasite interactions. The type form of the species is responsible for the majority of urban malaria transmission across its range.
Here, we report the genome sequence and annotation of the Indian strain of the type form of An. stephensi. The 221 Mb genome assembly represents more than 92% of the entire genome and was produced using a combination of 454, Illumina, and PacBio sequencing. Physical mapping assigned 62% of the genome onto chromosomes, enabling chromosome-based analysis. Comparisons between An. stephensi and An. gambiae reveal that the rate of gene order reshuffling on the X chromosome was three times higher than that on the autosomes. An. stephensi has more heterochromatin in pericentric regions but less repetitive DNA in chromosome arms than An. gambiae. We also identify a number of Y-chromosome contigs and BACs. Interspersed repeats constitute 7.1% of the assembled genome while LTR retrotransposons alone comprise more than 49% of the Y contigs. RNA-seq analyses provide new insights into mosquito innate immunity, development, and sexual dimorphism.
The genome analysis described in this manuscript provides a resource and platform for fundamental and translational research into a major urban malaria vector. Chromosome-based investigations provide unique perspectives on Anopheles chromosome evolution. RNA-seq analysis and studies of immunity genes offer new insights into mosquito biology and mosquito-parasite interactions.
Keywords: Gene Ontology; Malaria; Simple Repeat; Synteny Block; Pericentric Heterochromatin
Mosquitoes in the genus Anopheles are the primary vectors of human malaria parasites, and the resulting disease is one of the most deadly and costly in history. Publication and availability of the Anopheles gambiae genome sequence accelerated research that has not only enhanced our basic understanding of vector genetics, behavior, physiology, and roles in transmission, but also contributed to new strategies for combating malaria. Recent application of next-generation sequencing technologies to mosquito genomics offers exciting opportunities to expand our understanding of mosquito biology in many important vector species and to harness the power of comparative genomics. Such information will further facilitate the development of new strategies to combat malaria and other mosquito-borne diseases. An. stephensi is among approximately 60 species considered important in malaria transmission and is the key vector of urban malaria on the Indian subcontinent and in the Middle East. The possibility that a recent resurgence of human malaria in Africa was caused by the sudden appearance of An. stephensi indicates that this species may pose an even greater risk to human health in the future. Of the three forms (type, mysorensis, and intermediate), the type form is responsible for the majority, if not all, of urban malaria transmission across its range and accounts for approximately 12% of all transmission in India. Thus efforts to control it can be expected to contribute significantly to the malaria eradication agenda. An. stephensi is amenable to genetic manipulations such as transposon-based germline transformation, genome-wide mutagenesis, site-specific integration, genome editing, and RNAi-based functional genomics analysis. Our understanding of the interactions between An. stephensi and the malaria parasites is rapidly improving. Thus An. stephensi is emerging as a model species for genetic and molecular studies.
We report the draft genome sequence of the Indian strain of the type form of An. stephensi as a resource and platform for fundamental and translational research. We also provide unique perspectives on Anopheles chromosome evolution and offer new insights into mosquito biology and mosquito-parasite interactions.
Results and discussion
Draft genome sequence of An. stephensi: Assembly and verification
[Table: assembly statistics. Fields: scaffold N50 size; maximum, minimum, and total scaffold length; contig N50 size; maximum, minimum, and total contig length.]
[Table: physical map information. Fields: scaffolds per arm (n); mapped genome (%); total genome (%).]
Global transcriptome analysis
We identified 241 and 313 genes with female- or male-biased expression, respectively (Additional file 2: Sex-biased genes list and GO terms). The male-biased genes are enriched for those whose products are involved in spermatogenesis and auditory perception. Male mosquitoes detect potential mates using their Johnston’s organ, which has twice the number of sensory neurons as that of females. The female-biased genes are enriched for those whose products are involved in proteolysis and other metabolic processes likely relevant to blood digestion.
Manual annotation was performed on genes involved in innate immunity including those that encode the LRR immune (LRIM) and the Anopheles Plasmodium-responsive leucine-rich repeat 1 (APL1) proteins, and the genes of the Toll, immune deficiency (IMD), insulin/insulin-like growth factor signalling (IIS), mitogen-activated protein kinase (MAPK), and TGF-β signalling pathways. A number of studies have demonstrated the importance of these genes or pathways in mosquito defense against parasites or viruses. Manual analysis showed overall agreement with the automated annotation and improved the gene models in some cases (Additional file 2). A high level of orthology is generally observed between An. stephensi and An. gambiae, and we highlight here a few potentially interesting exceptions. An. stephensi may have only one APL1 gene (ASTEI02571) instead of the cluster of three APL1 genes found in An. gambiae (Additional file 1: Figure S1). We also observed the apparent lack of TOLL1B and TOLL5B sequences in An. stephensi, which in An. gambiae are recent duplications of TOLL1A and TOLL5A, respectively.
Expression profiles of all immunity genes were analyzed using the 11 RNA-seq samples to provide insights into their biological functions (Additional file 2: RNA-seq expression profile of immunity-related genes). For example, FKBP12, a protein known to regulate both transforming growth factor (TGF)-β and target of rapamycin (TOR) signaling, showed abundant transcript levels across immature stages and adult tissues (Additional file 1: Figure S2). The high expression levels of AsteFKBP12 in all examined stages and tissues were unexpected. Examination of existing publicly available microarray data confirmed these expression levels and patterns. FKBP12 in mammals forms a complex with rapamycin and FKBP-rapamycin-associated protein (FRAP) to inhibit TOR. Given that TOR signaling is fundamental to many biological functions in mammals and cumulative data support the same for D. melanogaster, a high level of FKBP12 expression may be critical for tight regulation of TOR activity in An. stephensi and perhaps An. gambiae. Expression patterns of the An. gambiae FKBP12 ortholog, AGAP012184, from microarray datasets support the hypothesis that this protein is involved in a broad array of anopheline physiologies including development, blood-feeding, molecular form-specific insecticide resistance, circadian rhythms, desiccation resistance, mating status, and possibly also broad regulation of infection, based on studies with murine (Plasmodium berghei) and human (Plasmodium falciparum) malaria parasites. Whether these same physiologies and others are regulated by FKBP12 in An. stephensi will require experimental confirmation.
Given that signalling pathways regulating embryonic pattern formation in Drosophila (for example, the Toll pathway) have been co-opted in the adult fly for regulation of various physiologies including metabolism and immune defense, the data presented here support the hypothesis that pathways integral to the biology of adult anophelines have been similarly co-opted from important developmental roles.
Saliva of blood-feeding arthropods contains a cocktail of pharmacologically active components that disarm the vertebrate host's blood clotting and platelet aggregation, induce vasodilation, and affect inflammation and immunity. These salivary proteins are under accelerated evolution, most likely due to the host's immune pressure. A previous salivary gland transcriptome study identified 37 corresponding salivary proteins in An. stephensi, most of which are shared with An. gambiae, including mosquito- and Anopheles-specific protein families. A more extensive sialotranscriptome based on approximately 3,000 ESTs identified the templates for 71 putative secreted proteins for An. gambiae. The combined data verify the identity of 71 putative salivary secreted proteins for An. stephensi, seven of which have no similarities to An. gambiae proteins (Additional file 2: Automatic annotated salivary genes). The current assembly of the An. stephensi genome shows that many salivary gland genes are present as tandem repeated genes and represent families that arose by gene duplication events. Tandem repeated gene families often are poorly annotated by automated approaches; therefore, manual annotation was necessary to improve the salivary gland gene models (Additional file 2). In particular, An. gambiae has eight genes of the D7 family, which has modified odorant binding domains (OBD) that strongly bind agonists of platelet aggregation and vasoconstriction (histamine, serotonin, epinephrine, and norepinephrine). Three of these genes have two OBDs while the remaining five have only one domain each. As in An. gambiae, the short forms are oriented in tandem and in the opposite orientation to the long-form genes. However, An. stephensi has apparently collapsed the second long form to create a sixth short form.
Comparative analysis of additional gene families
Functional annotations of a number of gene families in An. stephensi were obtained based on their InterPro ID (Additional file 2: Gene families counts table). We also compared gene numbers in these gene families across several species. An. stephensi and An. gambiae showed similar gene numbers in most of the gene families, and this is consistent with the close phylogenetic relationship between the two species. As observed with manually annotated immunity-related genes (Additional file 1: Figure S3), a strong one-to-one relationship was observed between An. stephensi and An. gambiae genes in odorant binding proteins (OBPs) (Additional file 1: Figure S4A) and other gene families studied. A few gene families showed obvious differences in numbers between An. stephensi and An. gambiae. We performed phylogenetic analysis of these gene families. The results (Additional file 1: Figure S4B and Figure S4C) indicate gene expansion in the odorant receptors (OR) and fibrinogen-related proteins in An. gambiae. Interestingly, a plurality of expanded genes is physically clustered in An. gambiae, suggesting that the gene expansions in An. gambiae may have arisen from local duplications. For example, the An. stephensi single-copy OR gene ASTEI08685 has four orthologs in An. gambiae (AGAP004354, AGAP004355, AGAP004356, and AGAP004357). The putative orthologs of these 'expanded' genes tend to be single- or low-copy in An. stephensi and other related species in VectorBase, supporting the interpretation that the lack of duplicated copies in An. stephensi is not due to assembly or annotation error. Further analysis that includes all species in the ongoing 16 Anopheles genomes project will facilitate future comparative analysis of gene family expansions and gene losses.
Transposable elements and other interspersed repeats
[Table: transposable elements and other interspersed repeats. Field: length occupied (bp).]
Genome landscape: a chromosomal arm perspective
An. stephensi has a lower density of transposable elements across all chromosome arms than An. gambiae (Figure 5; Additional file 1: Tables S2 and S3; Additional file 2: Genome Landscape). The density of transposable elements on the An. stephensi X is more than twice that of the autosomes. A comparison of the An. stephensi simple repeats with those in An. gambiae euchromatin showed that densities in the latter were approximately 2 to 2.5× higher (Figure 5; Additional file 1: Tables S2 and S3). The greatest densities of simple repeats were found on the X chromosome, and this is consistent with a previous study in An. gambiae. Although An. stephensi shows lower densities of simple repeats across all arms compared to An. gambiae, its X appears to harbor an over-representation of simple repeats compared to its autosomes. Scaffold/Matrix-associated regions (S/MARs) can potentially affect chromosome mobility in the cell nucleus and rearrangements during evolution, and these were found to be enriched in the 2L and 3R arms (Figure 5; Additional file 1: Tables S2 and S3).
Molecular organization of pericentric heterochromatin
Anopheles mosquitoes have heteromorphic sex chromosomes where males are heterogametic (XY) and females homogametic (XX). The high repetitive DNA content of Y chromosomes makes them difficult to assemble, and they often are ignored in genome projects. An approach called the chromosome quotient (CQ) was used to identify 57 putative Y sequences spanning 50,375 bp (Additional file 2). All of these sequences are less than 4,000 bp in length and appear to be highly repetitive. Five BACs that appeared to be Y-linked based on the CQs of their end sequences were analyzed by sequencing, and their raw PacBio reads were assembled with the HGAP assembler. Eleven contigs spanning 196,498 bp of predicted Y-linked sequences were obtained (Additional file 2). The 57 Y-linked sequences and 11 contigs from the Y-linked BACs currently represent the most abundant set of Y sequences in any Anopheles species. RepeatMasker analysis using the annotated An. stephensi interspersed repeats showed that approximately 65% of the An. stephensi Y sequences are interspersed repeats. LTR retrotransposons alone occupy approximately 49% of the annotated Y (Additional file 2).
Synteny and gene order evolution
Rates of chromosome evolution in Drosophila and Anopheles
Recent studies have established that both Anopheles and Drosophila species have high rates of chromosomal evolution as compared with mammalian species. We compared the number of breaks per megabase for the X chromosome and all chromosomes to understand the differences in the dynamics of chromosome evolution between Drosophila and Anopheles (Additional file 1: Table S7). These results reveal a higher ratio of the rates of evolution of sex chromosome to all chromosomes in Anopheles than Drosophila, with means of 2.116 and 1.197, respectively (Figure 8B). We correlated densities of different molecular features including simple repeats, TEs, genes, and S/MARs with the rates of rearrangement calculated for each arm (Additional file 1: Tables S8-S13). The strongest correlations were found among the rates of evolution across all chromosome arms and the densities of microsatellites, minisatellites, and satellites in both An. gambiae and An. stephensi. The highly positive correlations between rates of inversion across all chromosome arms and satellites of different sizes are most likely due to the co-occurring abundance of satellites and inversions on the X chromosome. Rates of inversions and satellite densities are much lower on the autosomes. S/MARs in autosomes were correlated negatively and genes correlated positively with polymorphic inversions.
Genetic diversity of the genome
The genome sequencing effort reported in the current study is based on an inbred laboratory strain to ensure good assembly. Nonetheless, we performed genome-wide SNP analysis based on the available data. A total of 530,997 SNPs were detected (Additional file 2: SNP analysis raw data). A total of 319,751 SNPs were assigned to chromosomes based on mapping information (Additional file 1: Table S14). The SNP calls were assessed for their effect on the primary sequence of transcripts (Additional file 2: Summary of transcript consequences for An stephensi Indian strain SNP calls). These analyses will help future population genomic studies and facilitate association studies. We found that the X chromosome has a markedly lower frequency of SNPs than the autosomes, in agreement with a similar observation in An. gambiae. The observed pattern may be explained by a smaller effective population size of the X chromosome due to male hemizygosity and by lower sequence coverage of the X chromosome.
The genome assembly of the type-form of the Indian strain of An. stephensi was produced using a combination of 454, Illumina, and PacBio sequencing and verified by analysis of BAC clones and ESTs. Physical mapping was in complete agreement with the genome assembly and resulted in a chromosome-based assembly that includes 62% of the genome. Such an assembly enabled analysis of chromosome arm-specific differences that are seldom feasible in next-gen genome projects.
Comparative analyses between An. stephensi and An. gambiae showed that the Anopheles X has a high rate of chromosomal rearrangement when compared with autosomes, despite the lack of polymorphic inversions in the X chromosomes in both species. Additionally, the difference between the rates of X chromosome and all chromosome evolution is much more striking in Anopheles than in Drosophila. The high rate of evolution on the X correlates well with the density of simple repeats. Our data indicate that overall high rates of chromosomal evolution are not restricted to Drosophila but may be a feature common to Diptera.
The genome landscape of An. stephensi is characterized by relatively low repeat content compared to An. gambiae. An. stephensi appears to have larger amount of repeat-rich heterochromatin in pericentric regions but far less repetitive sequences in chromosomal arms as compared with An. gambiae. Using a newly developed chromosome quotient method, we identified a number of Y-chromosome contigs and BACs, which together represent currently the most abundant set of Y sequences in any Anopheles species.
The current assembly contains 11,789 predicted protein coding genes, 127 miRNA genes, 434 tRNA genes, and 53 fragments of rRNA genes. An. stephensi appears to have fewer gene duplications than An. gambiae according to orthology analysis, which may explain the slightly lower number of gene models.
This genome project is accompanied by the first comprehensive RNA-seq-based transcriptomic analysis of an Anopheles mosquito. Twenty gene clusters were identified according to gene expression profiles, many of which are stage- or sex-specific. GO term analysis of these gene clusters provided biological insights and leads for important research. For example, male-biased genes were enriched for genes involved in spermatogenesis and auditory perception.
Close attention was paid to genes involved in innate immunity including LRIMs, APL1, and proteins in the Toll, IMD, insulin, and TGF-β signaling pathways. A high level of orthology is generally observed between An. stephensi and An. gambiae. RNA-seq analysis, which was corroborated by other expression analysis methods, provided novel insights. For example, a protein known to interact with both TOR and TGF-β signaling pathways showed abundant mRNA expression in a wide range of tissues, providing new leads for insights into both TOR and TGF-β signaling in mosquitoes.
Material and methods
The Indian strain of An. stephensi, a representative of the type form, was sequenced. The lab colony from which we selected mosquitoes for sequencing was originally established from wild mosquitoes collected in India. The lab colony has been maintained continuously for many generations, so we did not attempt to inbreed it.
DNA was isolated from more than 50 adult male and female An. stephensi using the Qiagen (Hilden, Germany) DNeasy Blood & Tissue Kit following the suggested protocol. The integrity of the DNA was verified by running an aliquot on a 1% agarose gel to visualize any degradation. Total RNA was isolated using the standard protocol of the mirVana RNA isolation kit (Life Technologies, Carlsbad, CA, USA) and quality was verified using a Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA).
The An. stephensi genome was sequenced to 19.4× coverage using 454 FLX Titanium sequencing performed by the Virginia Bioinformatics Institute (VBI) core laboratory. Sequencing was performed on four different libraries: a single-end shotgun library, and 3 kb, 8 kb, and 20 kb mate-pair libraries. A 200 bp insert size library produced from male An. stephensi genomic DNA was prepared and subjected to a single lane of Illumina HiSeq. Genomic DNA from male An. stephensi was subjected to 10 SMRT cells of Pacific Biosciences (PacBio) v1 sequencing. Only males were sequenced with PacBio because we were interested in increasing the probability of finding Y chromosome sequences. Sanger sequencing performed by Amplicon Express was used to sequence 7,263 BAC-ends.
We used several approaches to combine the Illumina and 454 data to generate a better assembly. Newbler can take raw Illumina data as input, so we tried a Newbler assembly with the 454 and Illumina data. However, this resulted in a worse assembly than 454 alone. We had much more success with the strategy used to assemble the Solenopsis invicta genome. We assembled the Illumina data first, and then cut the assembly into pseudo-454 reads. These reads were then used along with the real 454 data as input to Newbler.
De novo Illumina assembly with Celera
We assembled the paired-end Illumina reads using the Celera assembler  with the parameters: `overlapper = ovl; unitigger = bogart; utgBubblePopping = 1; kickOutNonOvlContigs = 1; cgwDemoteRBP = 0; cgwMergeMissingThreshold = 0.5; merSize = 14'. The Celera assembler output comprised 41,213 contigs spanning 212.8 Mb. The N50 contig size of this assembly was 16.8 kb.
De novo 454 and Illumina pseudo-454 reads assembly with Newbler 2.8
The contigs of the aforementioned Illumina assembly were shredded informatically into 400 bp pieces with overlapping 200 bp to approximate 454 reads. To artificially simulate coverage depth, we started the shredding at offsets with the values of 0, 10, and 20. Shredding the Illumina assembly resulted in 2,452,038 pseudo-454 reads simulating 4.17× coverage.
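The shredding step can be illustrated with a minimal Python sketch. It is not the script used in this study: the function name `shred` is hypothetical, and the handling of contig ends (trailing pieces shorter than 400 bp are dropped) is an assumption, since the text does not say how they were treated. A 200 bp overlap between 400 bp pieces corresponds to a step of 200 bp.

```python
def shred(contig, read_len=400, step=200, offsets=(0, 10, 20)):
    """Yield (start, fragment) pseudo-reads cut from one contig sequence.

    read_len and step follow the text (400 bp pieces, 200 bp overlap);
    the offsets 0, 10, and 20 shift the starting point of each pass.
    Assumption: a trailing piece shorter than read_len is discarded.
    """
    for offset in offsets:
        last_start = len(contig) - read_len
        for start in range(offset, last_start + 1, step):
            yield start, contig[start:start + read_len]


if __name__ == "__main__":
    toy_contig = "ACGT" * 250  # 1,000 bp toy sequence
    for start, fragment in shred(toy_contig):
        print(start, len(fragment))
```

Each pass yields roughly 2× coverage of a long contig, so the number of offsets controls the simulated depth.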
We generated an assembly of the 454 and pseudo-454 reads with Newbler 2.8 using the `-het -scaffold -large -s 500' parameters. The resulting assembly contained 23,595 scaffolds spanning 221 Mb. The scaffold N50 size was 1.34 Mb. Mitochondrial DNA (1 scaffold) and other contamination (87 scaffolds) were identified by blastn and removed from the assembly.
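For readers unfamiliar with the N50 statistic quoted here, the following Python sketch shows the usual definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly length. The function name `n50` is illustrative, and the assemblers report this value themselves.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or contig lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no lengths given")


if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```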
Gap-filling with PacBio reads
PacBio data were used to fill gaps in the scaffolds to further improve the genome assembly. We error-corrected raw PacBio reads using the 454 sequencing data with the Celera pacBioToCA pipeline. pacBioToCA produced 0.88 Gb of error-corrected PacBio reads. Using the error-corrected PacBio data as input, PBJelly was used to fill gaps with parameters: `-minMatch 30 -minPctIdentity 98 -bestn 10 -nCandidates 5 -maxScore -500 -nproc 36 -noSplitSubreads'. PBJelly filled 1,310 gaps spanning 5.4 Mb.
Further scaffolding with BAC-ends
The scaffolds of the assembly were improved subsequently through the integration of 3,527 BAC-end pairs (120 kb ± 70 kb) using the Bambus scaffolder (Additional file 2: BAC-ends dbGSS accession numbers). The BAC-end sequences were mapped to the scaffolds using Nucmer. The output files were used to generate the `.contig' format files required for Bambus. In total, 275 links between scaffolds were detected. Of these, 169 were retained as potential valid links, which are links connected by uniquely mapped BAC-ends. Links confirmed by fewer than two BAC-ends were rejected. A total of 46 links were retained that together connected 22 scaffolds, increasing the N50 scaffold size from 1,378 kb to 1,572 kb.
CEGMA (Core Eukaryotic Genes)
We used CEGMA to search for core eukaryotic genes to test the completeness and correctness of the genome assembly. CEGMA also reports whether each core eukaryotic gene is present in full (>70%) or only partially present (>20% and <70%). In total, CEGMA found 96.37% of the 248 core eukaryotic genes to be present, and 97.89% of the core eukaryotic genes to be partially present.
We checked whether BAC-ends align concordantly to the genome to study the structural correctness of the de novo assembly. BAC-ends were aligned to the scaffolds using NUCMER. To ensure unambiguous mapping, only sequences that aligned to a unique location with >95% coverage and 99% identity were used. In total, 21.6% of the BAC-end sequence pairs could be aligned to a unique position in the An. stephensi genome with these stringent criteria. Pairs of BAC-end sequences that aligned discordantly to a single scaffold were considered indicative of potential misassembly. Only four of 717 aligned BAC-end pairs aligned discordantly with the assembly, confirming overall structural correctness.
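A concordance check of this kind might be sketched as follows in Python. The exact rule used to call a pair discordant is not given in the text, so the criteria below are assumptions made for illustration: both ends on one scaffold, facing inward, and separated by a span within the 120 kb ± 70 kb insert range quoted for the BAC library. The `Hit` record and `classify_pair` function are hypothetical names.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    scaffold: str
    start: int   # leftmost aligned coordinate on the scaffold
    end: int     # rightmost aligned coordinate on the scaffold
    strand: str  # "+" or "-"


def classify_pair(a, b, min_span=50_000, max_span=190_000):
    """Classify one uniquely aligned BAC-end pair.

    Assumed criteria (not stated in the text): a concordant pair has its
    left end on "+" and its right end on "-", and spans 50 to 190 kb.
    """
    if a.scaffold != b.scaffold:
        return "different_scaffolds"
    left, right = sorted((a, b), key=lambda hit: hit.start)
    span = right.end - left.start + 1
    inward = left.strand == "+" and right.strand == "-"
    if inward and min_span <= span <= max_span:
        return "concordant"
    return "discordant"


if __name__ == "__main__":
    pair = (Hit("scf1", 1_000, 1_700, "+"), Hit("scf1", 120_000, 120_700, "-"))
    print(classify_pair(*pair))
```

Pairs landing on different scaffolds are reported separately because, in the text, only discordance within a single scaffold is treated as a sign of misassembly.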
An. stephensi EST sequences were downloaded from both the NCBI and VectorBase. We screened the EST sequences to remove any residual vector sequence. The screened ESTs were aligned to the assembly with GMAP . In total, 35,367 of 36,064 ESTs aligned to the assembly. Of these, 26,638 aligned over at least 95% of their length with an identity of >98%. The high percentage of aligned ESTs demonstrates the near-completeness of the An. stephensi genome assembly.
Fluorescent in situ hybridization (FISH): Slides were prepared from ovaries of lab-reared, half-gravid females of the An. stephensi Indian wild-type strain. Slide preparation and hybridization experiments followed the techniques described in Sharakhova et al. Fluorescent microscope images were converted to black and white and inverted in Adobe Photoshop. FISH signals were mapped to specific bands or interbands on the physical map for An. stephensi presented by Sharakhova et al.
Constructing the physical map
For the chromosome-based genome assembly, all probes mapped by in situ hybridization by Sharakhova et al. and in this study were aligned to the final version of the An. stephensi genome using NCBI BLAST+ blastn. Different blastn parameters were used for probes from different sources to determine whether a probe was kept in the final assembly. An e-value of 1e-40 and an identity of >95% were required for probes from An. stephensi. An e-value of 1e-5 was required for probes from species other than An. stephensi. Probes that mapped to more than one location in the genome were discarded. Sharakhova et al. hybridized 345 probes; however, only approximately 200 probes from that study were retained in the final chromosomal assembly. An additional 27 PCR products and BAC clones were hybridized to increase the coverage of our chromosomal assembly.
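The probe-filtering rules described above can be expressed as a short Python sketch. This is an illustration rather than the procedure actually run: the input tuple layout, the function name `keep_unique_probes`, and the treatment of the e-value cutoffs as inclusive upper bounds are all assumptions. Each passing BLAST hit is also treated as one genomic location, which is a simplification when a probe yields several HSPs at the same locus.

```python
from collections import defaultdict


def keep_unique_probes(hits, same_species_probes):
    """Return {probe_id: (scaffold, start)} for probes with one passing hit.

    hits: iterable of (probe_id, scaffold, start, evalue, pct_identity).
    same_species_probes: IDs of probes derived from An. stephensi, which
    must meet e-value <= 1e-40 and identity > 95%; all other probes need
    only e-value <= 1e-5 (cutoffs from the text, inclusivity assumed).
    """
    passing = defaultdict(list)
    for probe, scaffold, start, evalue, pident in hits:
        if probe in same_species_probes:
            ok = evalue <= 1e-40 and pident > 95.0
        else:
            ok = evalue <= 1e-5
        if ok:
            passing[probe].append((scaffold, start))
    # Probes that map to more than one location are discarded.
    return {p: locs[0] for p, locs in passing.items() if len(locs) == 1}


if __name__ == "__main__":
    toy_hits = [
        ("p1", "scf1", 100, 1e-50, 99.0),
        ("p2", "scf2", 500, 1e-50, 99.0),
        ("p2", "scf3", 900, 1e-45, 97.0),
        ("p3", "scf4", 10, 1e-8, 80.0),
    ]
    print(keep_unique_probes(toy_hits, {"p1", "p2"}))
```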
The genome assembly was annotated initially using the MAKER pipeline . This software synthesizes the results from ab initio gene prediction with experimental gene evidence to produce final annotations. Within the MAKER framework, RepeatMasker  was used to mask low-complexity genomic sequence based on the repeat library from previous prediction. First, ESTs and proteins were aligned to the genome by MAKER using BLASTn and BLASTx, respectively. MAKER uses the program Exonerate to polish BLAST hits. Next, within the MAKER framework, SNAP  and AUGUSTUS  were run to produce ab initio gene predictions based on the initial training data. SNAP and AUGUSTUS were run once again inside of MAKER using the initial training obtained from the ESTs and protein alignments to produce the final annotations.
Orthology and molecular species phylogeny
Orthologs of predicted An. stephensi genes were assigned by OrthoDB. Information about orthologous genes for An. gambiae, Ae. aegypti, and D. melanogaster also was downloaded from OrthoDB. Enrichment analysis was performed for categories of orthologs using the methods provided in the ontology section. The molecular phylogeny of the 10 selected species was determined from the concatenated protein sequence alignments using MUSCLE (default parameters) followed by alignment trimming with trimAl (automated1 parameters) of 3,695 relaxed single-copy orthologs (a maximum of three paralogs allowed in no more than two species, longest protein selected) from OrthoDB. The resulting 2,246,060 amino acid columns with 932,504 distinct alignment patterns were analyzed with RAxML with the PROTGAMMAJTT model to estimate the maximum likelihood species phylogeny with 100 bootstrap samples.
RNA-seq from 11 samples including: 0 to 1, 2 to 4, 4 to 8, and 8 to 12 h embryos, larva, pupa, adult males, adult females, non-blood-fed ovaries, blood-fed ovaries, and female carcasses without ovaries as described  were used for transcriptome analysis. These RNA-seq samples are available from the NCBI SRA (SRP013839). Tophat  was used to align these RNA-seq reads to the An. stephensi genome and HTSeq-count  was used to generate an occurrence table for each gene in each sample. The numbers of alignments to each gene in each sample then were clustered using MBCluster.Seq , an R package designed to cluster genes by expression profile based on Poisson or Negative-Binomial models. MBCluster.Seq generated 20 clusters. To visualize these results we performed regularized log transformation to the original occurrence tables for all 20 clusters using DESeq2 . The results were plotted using ggplot2 .
Gene ontology (GO) terms were assigned for the 20 clusters of predicted An. stephensi genes. GO terms were assigned using Blast2GO. The predicted proteins were searched with BLAST against the NCBI non-redundant protein database and scanned with InterProScan against InterPro's signatures. After GO terms were assigned, GO-slim results were generated for the available annotation based on the Generic GO slim mapping. The GO terms assigned by Blast2GO were subjected to GO term enrichment analysis. Over-represented GO terms were identified with a hypergeometric test using the GOstats package in R.
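The hypergeometric test for over-representation can be written out in a few lines of Python. This sketch is a stand-in for illustration only; the analysis itself used the GOstats package in R, and the function name `hypergeom_upper_tail` and argument names are hypothetical. It computes the probability of seeing at least k genes annotated with a term in a cluster of n genes, when K of the N genes in the background carry that term.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """Return P(X >= k) for a hypergeometric draw.

    N: genes in the background set; K: background genes with the GO term;
    n: genes in the cluster; k: cluster genes with the GO term.
    """
    upper = min(n, K)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / comb(N, n)


if __name__ == "__main__":
    # Toy numbers: 4 of 10 background genes carry the term; a cluster of
    # 3 genes contains 2 of them.
    print(hypergeom_upper_tail(k=2, n=3, K=4, N=10))
```

In practice the resulting P values would still need correction for the number of GO terms tested, a step not shown here.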
Functional annotation of key gene families
We obtained the InterPro ID information for proteins in An. stephensi from the ontology analysis. We functionally annotated gene families based on the assigned InterPro ID. Gene families including genes involved in immunity, chemosensation, and detoxification were studied. For comparative genome analysis, we retrieved the InterPro ID for seven other species (An. gambiae, An. darlingi, Ae. aegypti, Culex quinquefasciatus, D. melanogaster, Bombyx mori, and Tribolium castaneum) using BioMart from VectorBase and Ensembl Metazoa. We compared gene numbers in gene families of interest. For gene families with obvious differences in numbers between An. stephensi and An. gambiae, we performed phylogenetic analysis of these genes. First, we aligned these genes from Anopheles species using MUSCLE. Then, we constructed a phylogenetic tree using the neighbor-joining method with 1,000 bootstrap replicates in CLC Genomics Workbench 4.
We used tRNAScan-SE  with the default eukaryotic mode to predict 434 tRNAs in the An. stephensi genome (Additional file 1: Table S15; Additional file 2: Non-coding RNA annotation). Other non-coding RNAs were predicted with INFERNAL  by searching against Rfam database version 11.0 . A total of 53 fragmental ribosomal RNA, 34 snRNA, 7 snoRNA, 127 miRNA, and 148 sequences with homology to the An. gambiae self-cleaving riboswitch were predicted with an e-value cutoff of 1e-5.
Transposable elements and other interspersed repeats
Transposable element discovery and classification were performed on the An. stephensi scaffold sequences using previously described pipelines for LTR-retrotransposons, non-LTR-retrotransposons, SINEs, DNA-transposons, and MITEs, followed by manual inspection. The manually annotated TE libraries then were compared with the RepeatModeler output to remove redundancy and to correct misclassification by RepeatModeler. A repeat library was produced that contains all manually annotated TEs and non-redundant sequences from RepeatModeler. The repeat library was used to run RepeatMasker at default settings on the An. stephensi assembly to calculate TE copy number and genome occupancy.
The numbers of microsatellites, minisatellites, and satellites present in the mapped scaffolds for each chromosome were derived by dividing the scaffolds into strings of 100,000 bp and then concatenating them into a multi-FASTA file to represent an An. stephensi pseudo-chromosome. Scaffolds were oriented when possible, and all unoriented scaffolds were given the default positive orientation for that chromosome. The multi-FASTA file for each pseudo-chromosome was analyzed using a local copy of Tandem Repeats Finder v4.07b. Parameters for the analysis followed those used by Xia et al.: microsatellites were those of period size 2 to 6 with a copy number of >8. Minisatellites had a period size of 7 to 99, while repeats were considered satellites if they had a period size of >100. Both satellites and minisatellites were considered only if they had a copy number of >2. Simple repeats were recorded only if they had at least 80% identity.
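The classification thresholds above can be written as a small Python function. This is a minimal sketch rather than the script used in the study: the function name `classify_tandem_repeat` and its arguments are hypothetical. Applying the 80% identity filter to every repeat class is an assumption, since the text states it only for simple repeats. Period sizes outside the stated ranges (1 and exactly 100) are left unclassified rather than guessed.

```python
from typing import Optional


def classify_tandem_repeat(period: int, copies: float, identity: float) -> Optional[str]:
    """Classify one tandem repeat record; return None if it is filtered out."""
    # Assumption: the 80% identity filter is applied to all classes.
    if identity < 80.0:
        return None
    if 2 <= period <= 6:
        # Microsatellites need a copy number strictly greater than 8.
        return "microsatellite" if copies > 8 else None
    if 7 <= period <= 99:
        # Minisatellites need a copy number strictly greater than 2.
        return "minisatellite" if copies > 2 else None
    if period > 100:
        # Satellites need a copy number strictly greater than 2.
        return "satellite" if copies > 2 else None
    # Period sizes not covered by the stated ranges (1 and 100).
    return None
```

Each record parsed from the tandem repeat output for a pseudo-chromosome would be passed through this function, and the non-`None` labels tallied per chromosome.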
Identification of S/MARs
Scaffold/matrix-associated regions (S/MARs) were identified using the SMARTest bioinformatic tool provided by Genomatix. Densities of genes and TEs per 100 kb window were calculated using BEDTools coverage based on the genome annotation and TE annotation, respectively.
Synteny, gene order evolution, and inversions
One-to-one orthologs from An. gambiae and An. stephensi were identified using OrthoDB, and their locations on the An. gambiae and An. stephensi scaffolds were determined. Comparative positions of the genes on the scaffolds, based on orthology relationships, were plotted using genoPlotR. Scaffolds that were mapped using two or more probes were oriented properly, but those anchored by only one probe were used in their default orientation. The number of synteny blocks for each pair of homologous chromosome arms between An. stephensi and An. gambiae was determined from the images output by genoPlotR. Two criteria were imposed to determine the number of synteny blocks: the orientation of two or more orthologous genes, and whether the genes remained in the same order on the chromosome of An. stephensi as in An. gambiae. Thus, a group of two or more genes was assigned to the same synteny block if it had the same orientation and order in both species. Synteny blocks were numbered 1, 2, 3, 4, and so on along the chromosome, with An. gambiae taken as the default gene order. An. stephensi was considered rearranged compared to An. gambiae when the numbering of synteny blocks was the same in both species but the order was rearranged in An. stephensi. After quantifying the number of synteny blocks and the amount of gene rearrangement between the two species, we estimated the number of chromosomal inversions between them using the program Genome Rearrangements in Mouse and Man (GRIMM).
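The two block criteria (shared order and shared orientation) can be illustrated with a short Python sketch. In this study the blocks were determined from the genoPlotR images, so the code below is not the method that was used. It assumes a signed-permutation encoding that the text does not state: each An. stephensi gene is represented by the 1-based rank of its ortholog in An. gambiae, negated when the gene lies on the opposite strand. The function name `synteny_blocks` is hypothetical.

```python
from typing import List


def synteny_blocks(signed_order: List[int], min_genes: int = 2) -> List[List[int]]:
    """Split a signed gene order into maximal runs in which each step is exactly +1."""
    blocks: List[List[int]] = []
    current: List[int] = []
    for gene in signed_order:
        if current and gene - current[-1] == 1:
            # Same order and orientation as the reference (or a cleanly inverted run).
            current.append(gene)
        else:
            if len(current) >= min_genes:
                blocks.append(current)
            current = [gene]
    # Flush the final run.
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks
```

Under this encoding a run such as 1, 2, 3 is a collinear block, and a run such as -6, -5, -4 is a block that has been inverted as a unit. Isolated genes are dropped because a block requires at least two genes. Signed block orders of this kind are the usual input for inversion-distance estimation.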
We used CLC Genomics Workbench 4 to identify SNPs using a combination of the male and female Illumina data (Accession number: SRP013838). The required coverage was 20 and the minimum variant frequency was 35%. SNP calls made on the assembly were assessed for their effect on transcripts from the gene build using the Ensembl eHive, the variation database, and the variation consequence pipeline (available from GitHub at https://github.com/Ensembl/ensembl-hive and https://github.com/Ensembl/ensembl-variation/). The Ensembl variation consequence pipeline uses the Ensembl API in the same manner as the Variant Effect Predictor and produces equivalent output. The variation consequence pipeline loaded the analysis results directly into an Ensembl MySQL variation database, which was used to generate summary statistics of transcript consequences classified using Sequence Ontology terms.
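The two SNP-calling thresholds can be expressed as a simple filter. The sketch below is illustrative only and is not the CLC Genomics Workbench implementation. The `Site` record, the function name `passes_filters`, the inclusive comparisons, and the reading of the frequency threshold as a percentage of reads are all assumptions.

```python
from typing import NamedTuple


class Site(NamedTuple):
    """Read support at one candidate variant position (hypothetical record layout)."""
    scaffold: str
    position: int
    coverage: int       # total reads covering the position
    variant_reads: int  # reads supporting the variant allele


def passes_filters(site: Site, min_coverage: int = 20, min_freq_pct: float = 35.0) -> bool:
    """Return True if the site meets both the coverage and variant-frequency thresholds."""
    if site.coverage < min_coverage:
        return False
    freq_pct = 100.0 * site.variant_reads / site.coverage
    # Assumption: thresholds are inclusive.
    return freq_pct >= min_freq_pct
```

For example, a site with 20 covering reads of which 7 support the variant sits exactly on both thresholds and is retained under the inclusive reading.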
The An. stephensi genome assembly has been deposited in GenBank under the accession number ALPR00000000 and is available at VectorBase (https://www.vectorbase.org/Anopheles_stephensiI/Info/Index). The raw sequence data used for genome assembly are available in the NCBI SRA: 454 - SRP037783, Illumina - SRP037783, and PacBio - SRP037783. The BAC-ends used for scaffolding are available from NCBI dbGSS under accession numbers KG772729 - KG777469. RNA-Seq data can be accessed at the NCBI SRA with ID SRP013839.
This work is supported by the Fralin Life Science Institute and the Virginia Experimental Station, and by NIH grants AI77680 and AI105575 to ZT, AI094289 and AI099528 to IVS, AI29746 to AAJ, AI095842 to KM, AI073745 to MAR, AI080799 and AI078183 to SL, and AI042361 and AI073685 to KDV. AP and IVS are supported in part by the Institute for Critical Technology and Applied Science and the NSF award 0850198. RMW is supported by Marie Curie International Outgoing Fellowship PIOF-GA-2011-303312. XC is supported by GDUPS (2009). This work was also supported in part by NSF grant CNS-0960081 and the HokieSpeed and BlueRidge supercomputers at Virginia Tech. YS thanks the Department of Biotechnology, Government of India for the financial support. Drew Cocrane provided assistance with in situ hybridization.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.