Background

Understanding the genetic basis underlying phenotypic variation responsible for evolutionary change and adaptation in natural populations remains a major goal and one of the most interesting challenges in evolutionary biology. Not long ago, despite the development of new molecular tools, establishing genotype-phenotype associations, mapping adaptive loci, and identifying gene function, was limited to a few taxa due to technological and cost constraints. With the latest advances in sequencing technologies, the relationships between genetic variation and adaptive traits can now be investigated in a broader range of species for which, in some cases, there is extensive knowledge of ecological and evolutionary history, but few genomic resources [17]. Moreover, with the development of population genomics it has become possible not only to assess the genetic basis of adaptation directly at a genomic level, but also to distinguish the evolutionary effects of forces acting on the whole genome from those influencing only particular loci [8, 9].

Intraspecific colour variation is commonly found in many different taxa, including mammals [10], fishes [11], amphibians [12], reptiles [13, 14], birds [15, 16], and many invertebrates (e.g. land snails, spiders, grasshoppers and butterflies; see [17] for references). Colour patterns may serve a wide variety of adaptive functions, ranging from a visual signal used in mate choice, to crypsis or aposematism to avoid predators, to aiding in the regulation of body temperature [18]. Through their interactions with other physiological and/or ecological traits, colour polymorphisms may also influence the habitat choice, dispersal capability and adaptation to a changing or novel environment, thus influencing the ecological success and evolutionary dynamics of populations and species [19]. When combined with genomic and ecological data, these colour polymorphisms can be an excellent system for understanding adaptation and speciation and for the study of the micro-evolutionary forces that maintain genetic variation [20]. Negative frequency-dependent selection, resulting from processes such as predation or sexual selection [2123], heterozygote advantage [24], and disruptive selection/divergence with gene-flow [25, 26] are some of the mechanisms suggested to be involved in the maintenance of colour polymorphisms. Alternative strategies that result in almost the same fitness values for colour morphs have also been reported [27].

The meadow spittlebug, Philaenus spumarius (Linnaeus, 1758) (Hemiptera, Aphrophoridae), a widespread and highly polyphagous sap-sucking insect species in the Holarctic region, shows a well studied balanced polymorphism of dorsal colour/pattern variation [28]. It is the most investigated species of its genus and has high genetic and morphological variation [29]. Sixteen adult colour phenotypes are known to occur in natural populations [30] but only 13 are referred in the literature. These are divided into non-melanic (populi, typicus, vittatus, trilineatus and praeustus) and melanic forms (marginellus, flavicollis, gibbus, leucocephalus, lateralis, quadrimaculatus, albomaculatus and leucopthalmus) [28, 3032]. The occurrence and frequency of the colour phenotypes differ among populations and may result from different selective pressures such as habitat composition, climatic conditions (including altitudinal and latitudinal gradients), industrial melanism and predation (reviewed in [30, 32]). Silva and colleagues [33] have shown higher longevity and fertility of the trilineatus phenotype in laboratory conditions, which was also found to have the highest reflectance [34] and to be more prone to parasitoid attacks [35], supporting the idea that complex mechanisms are involved in the maintenance of this polymorphism. Crossing experiments have revealed the Mendelian inheritance of this trait, which is mainly controlled by an autosomal locus p with seven alleles, with complex dominance and co-dominance relationships, being likely regulated by other loci [31, 36]. The typicus phenotype is the most common (over 90% frequency in most populations) and it is the bottom double recessive form. It is believed to be the ancestral form because its main colour pattern characteristics are shared with several other cercopid species [36]. The completely melanic form leucopthalmus is dominant over typicus, and several other forms, with pale heads and/or spots, are dominant over the completely dark form. The trilineatus phenotype, pale with three dark stripes, is controlled by the top dominant allele p T [36, 37]. Halkka and Lallukka [38] suggested the colour genes may be linked to genes involved in response to the physical environment through epistatic interactions, constituting a supergene, and selection may not be directly related to colour. Evidence that balanced polymorphisms can result from tight genetic linkage between multiple functional loci, known as supergenes [39], has been reported in mimetic butterflies [40, 41], land snails [42] and birds [43]. In P. spumarius the genetic architecture of its balanced dorsal colour polymorphism and the possible existence of a supergene remain to be investigated.

A genome-wide association study has the potential to identify the genetic and/or genomic region(s) associated with these dorsal colour patterns. In this study we used restriction site-associated DNA (RAD) sequencing [1] to obtain a set of Single Nucleotide Polymorphisms (SNPs) that were tested for associations with three dorsal colour phenotypes in P. spumarius. The phenotypes used were: typicus (TYP), the most common and non-melanic recessive phenotype; trilineatus (TRI), the non-melanic dominant phenotype; and marginellus (MAR), the most common melanic phenotype found in the population from which samples were collected. The first partial draft genome and transcriptome of P. spumarius are presented here and were used to help the characterisation of the genomic regions found to be associated with colour variation. The size of the genome of this insect species was also estimated by flow cytometry.

Methods

This research does not involve any endangered or protected species and did not require any permits to obtain the spittlebug individuals.

Sampling and DNA extraction

A total of 36 female specimens of P. spumarius from three different colour phenotypes – 12 typicus (TYP), 12 trilineatus (TRI), and 12 marginellus (MAR) – were collected from a Portuguese population near Foz do Arelho locality (39°25'2.95"N; 9°13'39.18"W) in 2011. Adult insects were captured using a sweep net suitable for low-growing vegetation and an entomological aspirator (pooter). Specimens were preserved in absolute ethanol and stored at 4°C. The wings and abdomen were removed to avoid DNA contamination by endosymbionts, parasitoids and parasites and only the thorax and head were used. Genomic DNA was extracted using the DNeasy Blood & Tissue Kit (Qiagen).

Illumina sequencing of genomic libraries

Three RAD libraries with twelve individuals each were prepared following a modified RAD sequencing protocol [1], using PstI-HF (New England BioLabs) restriction enzyme to digest 300 ng of genomic DNA per sample. Digested DNA was ligated to P1 barcoded adapters using twelve different barcodes for each library. Adapter-ligated fragments were pooled and sheared targeting a 500 bp average fragment size using a sonicator. To remove adapter dimers, libraries were purified with Agencourt AMPure XP (Beckman Coulter) magnetic beads after P2 adapter ligation with a volume DNA/beads ratio of 1:0.8. After end-repair using a commercial kit (New England BioLab), libraries were amplified by Polymerase Chain Reaction (PCR) performing an initial denaturation step at 98°C for 30 s, followed by 18 cycles of one denaturation step at 98°C for 10 s, annealing at 65°C for 30 s, extension at 72°C for 30 s and a final 5 min extension step. PCR-enriched libraries were purified with AMPure XP beads and the DNA concentration of each library was quantified in a Qubit 2.0 (Invitrogen). Libraries, in a proportional representation, were paired end sequenced in three lanes of an Illumina HiSeq 2000 at Genepool (Ashworth Laboratories).

SNP calling and genotyping

Raw reads were trimmed, demultiplexed and aligned using the pyRAD software pipeline v3.0.5 [44], which follows the method of [45]. Reads were first clustered by individual and highly similar reads assembled into “clusters” using the programs MUSCLE v3.8.31 [46] and VSEARCH v1.9.3 [47] that allowed reads within “clusters” to vary not only for nucleotide polymorphisms but also for indels. All bases with a Phred quality score below 20 were converted to N (undetermined base). For each individual, consensus sequences based on estimates of the sequencing error-rate and heterozygosity were obtained for each locus. Similarity threshold required to cluster reads together and individuals into a locus was 0.88. Minimum “cluster” depth for each individual was six reads. Only loci with a minimum coverage of nine individuals (25%) were retained in the final dataset. To limit the risk of including paralogs in analysis, loci sharing more than 50% heterozygous sites were not considered and the maximum number of heterozygous sites in a consensus sequence (locus) allowed was five. After clustering sequences, a data matrix for each locus was generated. Further filtering and summary statistics were, posteriorly, performed using VCF Tools v 0.1.13 [48]. Loci were excluded from the final matrix based on (i) a missing data higher than 90% per individual, (ii) a minor allele frequency lower than 5% and (iii) a missing data per loci higher than 25%. Linkage disequilibrium (LD) was also measured using the squared correlation coefficient (r 2) in VCFtools. In association analysis, the detection of statistical associations may be affected when a marker is replaced with a highly correlated one [49]. Taking this into account, highly correlated SNPs in the same locus (r 2 = 1) were randomly eliminated and only one of them was retained in the final VCF matrix. The filtered VCF file with the genotypes for each individual was converted into the file formats needed for further analyses using PGDSpider v 2.0.4.0 [50], fcGENE v1.0.7 [51] and/or using customised python scripts.

Association with dorsal colour phenotypes

For the SNPs dataset, single-SNP associations between allele frequencies and dorsal colour phenotypes were tested using a Fisher’s exact test of allelic association in PLINK v 1.07 [52]. Three pairwise analyses were performed: MAR vs. TRI, MAR vs. TYP and TRI vs. TYP. Allele frequencies in each pair, the odds ratio and p-values were obtained for each SNP and a false discovery rate (FDR) of 5% was applied [53] to each pairwise analysis to test for false positives.

To test for single and multi-SNP correlations between SNPs and colour morphs, a Bayesian Variable Selection Regression (BVSR) model proposed by [54] was also performed in the same three pairs and carried out in piMASS v 0.9. Generally used for association studies with continuous response variables, piMASS is also appropriate for studies with binary phenotypes [54]. This method uses the phenotype as the response variable and genetic variants (SNPs) as covariates to evaluate SNPs that may be associated with a particular phenotype [54]. SNPs statistically associated with phenotypic variation are identified by the posterior distribution of γ, or the posterior inclusion probability (PIP). In our multi-locus analyses, markers with a PIP greater than 99% empirical quantile (PIP0.99 SNPs) were considered as highly associated with colour morphs. For all PIP0.99 SNPs we reported their PIP and the estimates of their phenotypic effect (β). A positive β in the pairwise morph1-morph2 (e.g. MAR-TRI) analysis means that the frequency of the minor allele (maf) is higher in morph2 (TRI in the example) and a negative β means that maf is higher in morph1 (MAR in the example). Thus, to investigate the phenotypic effect size of each PIP0.99 SNP, the | β | was considered. The model contains additional parameters that are estimated from the data: proportion of variance explained by the SNPs (PVE), the number of SNPs in the regression model (nSNPs) and the average phenotypic effect of a SNP that is in the model (σSNP). For all pairwise analyses, we obtained 4 million Markov Chain Monte Carlo samples from the joint posterior probability distribution of model parameters (recording values every 400 iterations) and discarded the first 100,000 samples as burn-in. piMASS also outperforms a single-SNP approach to detect causal SNPs even in the absence of interactions between them [54]. For single-marker tests, SNPs above 95% empirical quantile for Bayes Factor (BF) (BF0.95 SNPs) were considered to be strongly associated to the colour phenotypes. Those above 99% empirical quantile for BF (BF0.99 SNPs) were considered to have the strongest associations. Imputation of the missing genotypes was performed in BIMBAM v1.0 [55].

Genetic differences among populations were tested using a G–test [56] and estimates of F ST were obtained following the method of [57] implemented in GENEPOP v4.2.2 [58]. To better visualise and explore the correlation between significant SNPs, obtained in the several association analyses, and colour phenotypes, a Principal Component Analysis (PCA) was done using R Package SNPRelate (Bioconductor v3.2; R v3.2.3) implemented in the vcf2PCA.R script [59].

De novo sequencing and assembly of the meadow spittlebug genome

To attempt potential de novo assembly of the genome, genomic DNA of one P. spumarius individual from Quinta do Bom Sucesso, Lagoa de Óbidos (Portugal) was extracted using the DNeasy Blood & Tissue Kit (Qiagen) and sequenced externally in GenoScreen (Lille, France) (http://www.genoscreen.fr/). A whole-genome shotgun sequencing approach using one lane of Illumina HiSeq 2000 to generate a paired-end library of approximately 366 million 100 bp reads was carried out. After sequencing, the quality of the sequence reads was assessed in FastQC v0.10.1 [60] and low quality sequences were trimmed by using Trimmomatic v 0.35 [61] and the default parameters. De novo assembly of large genomes tends to be computationally demanding, requiring very large amounts of memory to facilitate successful assembly. Taking these conditions into account, the assembler SOAPdenovo2 [62, 63] was chosen to assemble the sequenced P. spumarius genome. This assembler implements the de Bruijn graph algorithm tailored specifically to perform the assembly of short Illumina sequences and is optimised for large genomes. A k-mer parameter of 33 was used for this assembly. The quality of the assembly results was investigated through several metrics: N50, percentage of gaps, number of contigs, number of scaffolds and genome coverage (total number of base pairs).

De novo sequencing and assembly of the meadow spittlebug transcriptome

Fresh adult specimens of P. spumarius were obtained from Lexington, Fayette Co., Kentucky, USA in July 2013 and frozen at −80°C. Total RNA was extracted from 6 adult specimens by first grinding the entire body using a 1 mL glass tissue grinder with 1 mL Trizol (Invitrogen). This was followed by passing the homogenate over a Qiagen Qiashredder column. The eluate was extracted with 200 μL chloroform, and the RNA was precipitated with 500 μL isopropanol. Pellets were resuspended in RNAse-free water.

Paired-end RNA libraries were prepared using Illumina’s TruSeq Stranded RNA sample preparation kit with an average cDNA size of 250 bp (range 80–550 bp). These libraries were sequenced using an Illumina HiSeq2500 machine with a TruSeq SBS sequencing kit version 1 analysed with Casava v1.8.2. Raw reads were filtered for duplicates using a custom script and trimmed for 5′ bias and 3′ quality using the FASTX-toolkit [64]. Transcriptome was assembled using SOAPdenovo-Trans v1.02 [65] with a k-mer of 49.

Genome size estimation by flow cytometry

Genome size estimates were obtained through flow cytometry [66]. A total of 22 individuals were analysed, seven females and six males of P. spumarius, and nine females of P. maghresignus, a closely related species of the same genus. A suspension of nuclei from both the Philaenus sample and a reference standard (Solanum lycopersicum, S.l., ‘Stupické’ with 2C = 1.96 pg; [67]) were prepared by chopping the thorax and the head of the insect together with 0.5 cm2 of S. lycopersicum fresh leaf with a razor blade in a Petri dish containing 1 mL of WPB (0.2 M Tris. HCl, 4 mM MgCl2.6H2O, 1% Triton X-100, 2 mM EDTA Na2.2H2O, 86 mM NaCl, 10 mM metabisulfite, 1% PVP-10, pH adjusted to 7.5 and stored at 4°C; [68]). The nuclear suspension was filtered through a 30 μm nylon filter and 50 μg mL−1 of propidium iodide (PI, Fluka, Buchs, Switzerland) and 50 μg mL−1 of RNAse (Fluka, Buchs, Switzerland) were added to stained DNA and avoid staining of double stranded RNA, respectively. After 5 minutes of incubation, the nuclear suspension was analysed in a Partec CyFlow Space flow cytometer (532 nm green solid-state laser, operating at 30 mW; Partec GmbH., Görlitz, Germany). Data was acquired using the Partec FloMax software v 2.4d (Partec GmbH, Münster, Germany) in the form of four graphics: histogram of fluorescence pulse integral in linear scale (FL); forward light scatter (FS) vs. side light scatter (SS), both in logarithmic (log) scale; FL vs. time; and FL vs. SS in log scale. To remove debris, the FL histogram was gated using a polygonal region defined in the FL vs. SS histogram. At least 1,300 nuclei were analysed per Philaenus’ G1 peak [69]. Only CV values of 2C peak of Philaenus below 5% were accepted [70]. The homoploid genome size (2C in pg; [71]) was assessed through the formula: sample nuclear DNA content (pg) = (sample G1 peak mean/S. lycopersicum G1 peak mean) * genome size of S. lycopersicum. The obtained values were expressed in picograms (pg) and in giga base pairs (Gb), using the formula by [72] (1 pg = 0.978 Gb).

Differences in genome size between males and females were evaluated using a one-way analysis of molecular variance (ANOVA), followed by a Tukey test for multiple comparisons at P < 0.05. Statistical analyses were performed using SigmaPlot for Windows v. 12.5 (Systat Software).

Characterisation of RAD loci

A consensus sequence, with IUPAC ambiguity codes for variable sites, was generated for each RAD locus across individuals using the python script loci_consensus.py [73].

Homology to non-coding and coding regions was investigated for the inferred loci by locally querying consensus sequences against Arthropoda sequences available in the NCBI nucleotide database (RefSeq release 73, last modified 2 November 2015 and GenBank release 211, last modified 14 December 2015), using BLASTN 2.2.28+ [74]. A protein blast (RefSeq release 73, last modified 2 November 2015 and GenBank release 211, last modified 14 December 2015), using BLASTX 2.2.28+ [75], was also performed. An E-value threshold of 1e-5 was used.

RAD loci were also queried using BLASTN against the drafts of the P. spumarius genome and transcriptome assembled in this study. In this case, an E-value threshold of 1e-15 was chosen as the cutoff for restricting the alignments to the most significant ones. The top five contigs and/or scaffolds were subsequently investigated by querying them using BLASTN against Arthropoda sequences available in nucleotide and protein databases of NCBI.

Results

RAD sequencing and SNPs data matrix

The sequencing set produced a total of 341 million reads. After filtering reads based on quality scores, 269 million reads were retained, corresponding to an average of 7.4 million reads per individual. Before filtering, individuals yielded 335,767 to 12,711,816 sequenced reads of 90 bp each (Additional file 1: Figure S1).

The average number of reads per locus per individual used to estimate a consensus sequence was 51.0 (Additional file 1: Figure S2). For the clustering results, a total of 133,127 loci, consisting of 12,144,351 aligned nucleotides, inferred with a minimum of nine individuals (25%) per locus, and a total of 470,470 SNPs with a mean percentage of missing data per individual of 63.92%, were produced. Aligned loci, including gaps inserted in the course of the alignment, ranged from 90 to 109 bp in length (mean = 91 bp). When filtering by percentage of missing data, three individuals (TYP_5, TYP_13 and TRI_13; Additional file 1: Figure S1, S2 and S3) had more than 90% missing data and were excluded. After filtering, a set of 928 loci, 85,056 bases and 2,195 SNPs was retained. However, only 1,837 SNPs on 928 loci were considered for the analyses after those in the same locus sequence with a complete LD (r 2 = 1) were randomly excluded.

Single-SNP associations with colour phenotypes

The dataset was tested for allele frequency differences between pairs of dorsal colour phenotypes – MAR vs. TYP, TRI vs. TYP and MAR vs. TRI – using the Fisher’s exact test and a Bayesian regression approach. Single-marker association analyses performed using the frequentist method found 205 SNPs with p-value < 0.05, corresponding to 11.16% of the analysed SNPs, but these were not significant after FDR correction (Additional file 2: Table S1). Single-SNP analyses using the Bayesian regression approach identified a total of 230 BF0.95 SNPs (>95% quantile Bayes Factor) associated with dorsal colour phenotypes, corresponding to 12.52% of the analysed markers. When a more strict, 99% quantile, threshold was applied 50 BF0.99 SNPs (2.7%) showed the strongest associations to colour morphs, including eight shared among colour morph comparisons (Fig. 1) (Table 1) The number of BF0.95 SNPs and BF0.99 SNPs for each pairwise comparison were: 92 and 19, respectively, for MAR-TYP; 92 and 20, respectively, for TRI-TYP; 101 and 19, respectively, for MAR-TRI. Estimates of the phenotypic effects associated with BF0.99 SNPs for each comparison were moderate with 0.10 < | β | < 0.15 but much higher than the overall average for each pairwise analysis (| β | = 0.0001, MAR-TRI; | β | = 0.0037, MAR-TYP; | β | = 0.0028, TRI-TYP) (Table 1). Allele frequencies for the 50 SNPs involved in the differentiation of these colour morphs varied across the three colour phenotypes (Table 1). For the 50 BF0.99 SNPs, F ST estimates between pairs of colour morphs were highly significant (p-value < 0.0001) (Additional file 2: Table S2), with the highest genetic differentiation between TRI and MAR (F ST  = 0.2145), intermediate between TRI and TYP (F ST = 0.2125) and the lowest between MAR and TYP (F ST  = 0.1787) (Additional file 2: Table S3). Principal Component Analysis using the associated BF0.99 SNPs showed a clear distinction among the three morphs when compared with the PCA using all 1,837 SNPs (Fig. 2a). Principal component 1 explained 13% of the total variation and indicated a differentiation between TRI and the other two colour morphs while PC2 explained 10% of the differences, separating TYP from MAR (Fig. 2b).

Fig. 1
figure 1

Bayes factor for each SNP in each pairwise comparison in single-SNP association tests. a MAR vs. TRI; b MAR vs. TYP; and c TRI vs. TYP. The horizontal dash lines correspond to the Bayes factor 95% empirical quantile threshold and the straight lines to the 99% empirical quantile. Light grey dots: SNPs with a BF < 99% empirical quantile; Dark grey dots: SNPs with a BF > 99% empirical quantile; Red dots: SNPs with a BF > 99% empirical quantile and shared among comparisons

Table 1 SNPs associated with dorsal colour morphs for each pairwise comparison and obtained through Single-SNP association tests using Bayesian regression approach
Fig. 2
figure 2

Genetic variation of the 33 individuals summarised on principal component axis 1 (PC1) and 2 (PC2) from a Principal Component Analysis using SNPs identified through Bayesian regression analyses. a All 1,837 SNPs; b 50 SNPs BF0.99 identified in Single-SNP association tests; c 50 SNPs PIP0.99 identified in Multi-SNP Association tests; and d 40 SNPs shared between both association analyses

Multi-SNP Associations with colour phenotypes

The 1,837 SNPs dataset explained between 60 and 65% of the variance in dorsal colour phenotypes across all pairwise analyses of colour morphs. The highest proportions of variation explained by the investigated SNPs were detected in comparisons involving the TRI phenotype (Table 2). The highest proportion was observed in TRI-TYP analysis (PVE = 0.6515) while the lowest proportion was found in MAR-TYP analysis (PVE = 0.6018) (Table 2). Estimates of the mean number of SNPs (nSNPs) underlying dorsal colour variation ranged from 63 to 67 (Table 2). However, 95% credible intervals for these parameters estimates were typically large. The average effect of associated SNPs was high and similar among analyses but once again higher in comparisons involving TRI (σSNP = 1.1200, MAR-TRI; σSNP = 0.9776, TRI-TYP; σSNP = 0.9495, MAR-TYP) (Table 2). When considering models with the highest BFs (log10(BF) > 10) only, the mean number of SNPs included in the model (nSNPs_BF) for each comparison decreased up to values between nine and 12 while the mean effect size of the SNPs (σSNP_BF) increased ranging between 2.4 and 4.1 (Table 2). The posterior inclusion probabilities (PIPs) for the analysed SNPs were quite similar among all pairwise analyses but slightly higher in comparisons involving TRI (PIP = 0.0366, MAR-TRI; PIP = 0.0362, TRI-TYP and PIP = 0.0345, MAR-TYP) (Fig. 3) (Table 2). A subset of 19 SNPs with the highest inclusion probabilities (PIP0.99 SNPs) were identified for each analysis and investigated (Table 3). This number was within the 95% credible intervals for the number of SNPs found to be associated with dorsal colour variation by the models with the highest BF (Additional file 1: Figure S4) (Table 3). Estimates of the strength of association between genotypic variation at individual SNPs and phenotypic variation (| β |) varied among the analyses and all were greater than 0.5. We obtained SNPs with larger effect sizes for MAR-TRI analysis than for all other analyses. Six PIP0.99 SNPs were shared between two pairwise analyses (Table 3). In total, 50 different SNPs revealed a multi-association with colour morphs and, from those, 40 were also significant in the single-SNP analyses shown previously. For the 50 PIP0.99 SNPs, population differentiation tests were also highly significant (p-value < 0.000) (Additional file 2: Table S2). Similarly, the highest genetic differentiation was observed between TRI and TYP (F ST  = 0.2159), intermediate between TRI and MAR (F ST  = 0.1907) and the lowest genetic differences were observed between MAR and TYP (F ST  = 0.1650) (Additional file 2: Table S3). Principal Component Analysis for all 50 PIP0.99 SNPs of multi-association tests (Fig. 2c) and for the 40 intersected SNPs (Fig. 2d) showed the expected differentiation among dorsal colour morphs. Principal Component 1 explained 13 to 14% of the variance, differentiating TRI from the other morphs while PC2 explained 11% of the differences and revealed a differentiation between TYP and MAR.

Table 2 Parameter estimates from Bayesian variable selection regression for each pairwise analysis
Fig. 3
figure 3

Posterior inclusion probabilities (PIPs) for each SNP in each pairwise comparison in multi-SNP association tests. a MAR vs. TRI; b MAR vs. TYP; and c TRI vs. TYP. The horizontal dash lines correspond to the PIP 95% empirical quantile threshold and the straight lines to the 99% empirical quantile. Light grey dots: SNPs with a PIP < 99% empirical quantile; Dark grey dots: SNPs with a PIP > 99% empirical quantile; Red dots: SNPs with a PIP > 99% empirical quantile and shared among comparisons

Table 3 SNPs PIP0.99 associated with dorsal colour morphs obtained through Multi–SNP association tests using Bayesian regression approach

Linkage patterns

The associated loci detected here had on average low levels of linkage disequilibrium for both analyses including all samples or analyses on each colour phenotype separately (Additional file 1: Figure S5). However, strong allelic correlations (r 2 > 0.7) were found for five pairs of SNPs within MAR and for two pairs in TYP phenotypes (Additional file 2: Table S4). Only two pairs, in MAR, consisted of SNPs present in the same RAD locus.

Genome size estimation

Philaenus spumarius and P. maghresignus estimates of genome size were 5.27 ± 0.25 pg (5.15 Gb) and 8.90 ± 0.20 pg (8.90 Gb), respectively. In P. spumarius, males and females differed significantly in genome size (F 1,11 = 14.292, p-value = 0.0030), with males presenting on average a lower genome size (5.07 ± 0.20 pg; 4.96 Gb) than females (5.44 ± 0.15 pg; 5.33 Gb) (Additional file 2: Table S5). Overall, the quality of the analyses was excellent, with a mean CV value of 2.97% being obtained for the sample’s G1 peak.

De novo sequencing and assembly of meadow spittlebug genome and transcriptome

The genome sequencing set produced a total of 366 million reads. After filtering reads based on quality, 353 million reads (96.46%) were retained (Additional file 2: Table S6). SOAPdenovo2 produced 6,843,324 contigs and 4,010,521 scaffolds. The N50 was 686 bp and the percentage of gaps was 20.47%. In total, 1,218,749,078 bp were assembled which based on the total estimated genome size of 5.3 Gb, corresponds to approximately 24% of the P. spumarius genome.

For the transcriptome, the total number of 150 nt reads for each paired-end of the library was 17 million resulting in 5110.8 Mb of sequence (Additional file 2: Table S6). After quality filtering, 14 million (86.81%) read pairs were used in the assembly (Additional file 2: Table S6). The transcriptome assembly produced 173,691 contigs and 31,050 scaffolds. In this case, the observed N50 obtained was 803 bp and the percentage of gaps 0.39%. A total of 81,442,967 bp were assembled. Assembly statistics for the genome and transcriptome are summarised in Additional file 2: Table S6.

Characterisation of RAD loci

No significant hits were found when querying the 928 RAD loci against Arthropoda sequences of NCBI nt database and only 15 hits (E-value < 1e-05) were found against Arthropoda sequences of NCBI nr database (Additional file 2: Table S7). However, this was not unexpected considering RAD loci sequences are less than 100 bp and the most closely related insect species with an available genome is the pea aphid Acyrthosiphon pisum, which belongs to a separate hemipteran infraorder.

A total of 392 RAD loci (42.24%) aligned to the draft of P. spumarius genome (E-value threshold of 1e-15), 18 of which were associated with colour morphs (34.62% of the colour-associated loci sequences) (Additional file 2: Table S8). On the other hand, 134 loci, corresponding to 14.44% of the total loci, aligned to P. spumarius transcriptome assembly. Five of those were colour-associated (9.62% of the colour-associated loci) (Additional file 2: Table S8). From the 18 colour-associated loci that aligned with the genome, four (22.22%) also aligned with the transcriptome. The proportion of colour-associated loci that aligned either with the genome or with the transcriptome was not significantly different from the proportions of the other loci (Fisher’s exact test p-value = 0.8096). Some RAD loci had more than one contig/scaffold hit (Additional file 1: Figure S6).

Transcriptome and genome scaffolds/contigs with RAD loci alignments, ranging from 100 to 12,325 bp (gaps included), were queried against Arthropoda nt and nr databases using BLASTN and BLASTX. Out of 210 transcriptome sequences, 22 (E-value < 1e-05) had homology with the nucleotide database (Additional file 2: Table S9) and 98 with the protein database (Additional file 2: Table S10). The majority of those sequences hits have E-values < 1e-12 in nucleotide (86.36%) and in protein (69.38%) blasts. On the other hand, one genome scaffold, out of 484 with RAD loci hits, matched with the nucleotide sequences (E-value < 1e-25) (Additional file 2: Table S11) and 90 with the protein database (Additional file 2: Table S12). The majority of those protein hits have E-values < 1e-12 (55.55%). Of the transcriptome and genome sequences with protein hits, five and three included associated loci, respectively (Additional file 2: Table S13). Four of these genome and transcriptome sequences matched with two known proteins, the other four with uncharacterised ones. One of the identified proteins, to which the colour-associated locus 16628 aligned (genome scaffold 1372429 and transcriptome scaffolds 17697 and 17698), was a lysosomal-trafficking regulator, known to be involved in the trafficking of materials into lysosomes. Furthermore, a mutation of this protein in humans is associated with a pigmentation disorder [76]. The other identified protein, to which the colour-associated locus 22795 aligned (transcriptome scaffold 29739), was the nucleolar and coiled-body phosphoprotein 1. This locus is one of the eight shared among colour morph comparisons.

Discussion

In this study, we aimed to identify candidate genomic regions associated with colour polymorphism in the meadow spittlebug P. spumarius, an insect species with a very large genome (5.3 Gb), as estimated here by flow cytometry. This large size is among the largest genomes reported in insects [77], making genomic analysis in this species particularly challenging. By using restriction site-associated DNA (RAD) sequencing in individuals of three dorsal colour phenotypes (typicus, marginellus, and trilineatus), we were able to detect association with colour in 3% of the analysed SNPs (60 out of 1,837). These phenotypes did not reveal significant genome-wide differences but when considering only the associated SNPs, the three colour morphs were differentiated and the trilineatus phenotype showed the highest genetic differentiation. Interestingly, greater differences involving life-history traits such as longevity, number of eggs, and number of oviposition events are also known to occur in trilineatus [33]. It may be that the genetic differences detected in this morph also reflect some part of the genetic basis of these life-history differences among colour morphs. Therefore, we may not only be on the track to finding a colour gene but also perhaps an extensive region, or several regions of the genome, that links colour variation and other life-history or physiological traits, as previously suggested [38]. Our finding of several colour-associated SNPs, some of them mapped to different genome and transcriptome scaffolds, suggests a complex genetic architecture involving this colour polymorphism.

In the single-SNP association analyses, the 50 individual SNPs found to be associated at 99% quantile (BF0.99 SNPs) showed moderate phenotypic effects (0.10 < | β | < 0.15). In the multi-SNP association analyses, 50 SNPs with posterior inclusion probabilities at quantile 99% (PIP0.99 SNPs) showed large effects for pairs of colour phenotypes (σSNP > 0.9 and individual PIP0.99 SNPs | β | > 0.5). From these, 40 were common to the SNPs identified in single SNP analyses (BF0.99 SNPs), increasing the confidence for the detected associations. Although inferences about the genetic architecture are only tentative in this study, due to the relatively small proportion of the genome covered, our results suggest that differences among the three dorsal colour phenotypes are associated with several loci with large effects. However, it is still not entirely clear if these constitute the major locus determining dorsal colour pattern, revealed by Mendelian crosses in P. spumarius [31, 36]. Large effect loci controlling colour pattern have been reported for Heliconius species [78, 79], land snail Cepaea nemoralis [42], and more recently in Timema cristinae stick insects [80]. Several other examples [8183] have shown that adaptive traits are affected by loci with large phenotypic effects and that this genetic architecture may be more common than initially thought. The majority of the colour-associated loci that we detected here did not show significant allelic correlations, being likely in independent genomic regions. However, a few loci were strongly correlated, indicating either physical linkage, random drift of rare alleles, or occurrence of recent mutations. The existence of tightly linked loci (a supergene) that can be maintained due to chromosomal rearrangements or selection of co-adapted loci with epistatic effects is also possible. In the mimetic butterfly Heliconius melpomene, a cluster of three tightly linked loci (HmN, HmYb and HmSb), lying just a few centimorgans apart, as well as other unlinked loci have been shown to control distinct wing colour pattern elements in this species [84]. In a closely related species Heliconius numata, polymorphic colour variation is controlled by a single locus P, forming a supergene, resulting from chromosomal rearrangements [85]. A single gene, doublesex, with closely linked mutations, also controls supergene mimicry in Papilio polytes [86].

Various genes and pathways have been reported to be involved in insect coloration and pigmentation. These pathways comprise genes regulating the distribution of pigments in space and time, as well as genes that are involved in the synthesis of pigments [87]. Several colour genes have been described, mostly in Drosophila spp. (see [87] for a review) and are known to be involved in colour variation in the silk worm Bombyx mori [88] and Papilio spp. [86, 89, 90] as well. Novel and unexpected genes were found to be responsible for wing colour patterning in Heliconius species. Red wing elements are associated with expression of the transcription factor optix [91], which in turn is regulated by two distinct cis-regulatory loci [92]. Another gene, cortex, a member of a conserved cell cycle regulator family, appears to have adopted a novel function controlling colour pattern in Heliconius and probably across the Lepidoptera [93]. Regulatory regions are also known to control colour pattern in Drosophila flies [94, 95]. However, none of the colour-associated loci that we found in our study matched these candidate genes and/or genomic regions of other insects. Approximately 10% of the loci with colour associations aligned with the P. spumarius transcriptome indicating that those loci are in coding regions that are expressed in adult stage. A similar proportion of alignment to the transcriptome was found between associated loci and all loci (Fisher’s exact test p-value = 0.8096), suggesting that there is no enrichment/depletion of coding regions in the associated loci in relation to the total number of loci. Around 35% of the colour-associated loci aligned with the genome and 22% also aligned with the transcriptome. If we assume a good representation of the total transcriptome, this result point to the majority of the associated loci being in non-coding regions. Considering that our assembled genome represents only 24% of the total genome size, the low percentage of hits in the genome was expected. Also, the low number of nucleotide and protein matches of genome and transcriptome sequences is certainly due to the degree of similarity of P. spumarius to other available Arthropoda sequences being too low to allow significant matches. Increasing the genomic resources for this or related species will allow exploring the candidate loci here described and provide insight into some of the key questions that remain to be answered. What are the specific genes contributing to this balanced colour polymorphism? What mutations cause allelic differences in these genes and how do they contribute to the different colour phenotypes? Are there epistatic or additive effects among the alleles responsible for the polymorphism? Does this constitute a supergene? Are coding or regulatory mutations involved? In the future, it would also be interesting to investigate the evolutionary history of the colour polymorphism within Philaenus since identical variation in dorsal colour/pattern can be observed in the other species of the genus, suggesting an ancestral polymorphism maintained through the speciation process.

Conclusions

This work was a first approach to investigate the genetic architecture of P. spumarius dorsal colour polymorphism, by studying single and multi-SNP association with three of the colour phenotypes. We detected several loci with large effects occurring in multiple genomic regions. Trilineatus was found to be the most differentiated colour phenotype for these loci, and since it is also the most differentiated for several life-history and physiological traits, we suggest that genetic bases of colour and of these traits are linked. The development of genomic and transcriptomic resources in this work was a first step toward characterizing these loci and will be very useful for further research on the genetic basis of dorsal colour pattern variation in P. spumarius.