The genome of the largest bony fish, ocean sunfish (Mola mola), provides insights into its fast growth rate
The ocean sunfish (Mola mola), which can grow up to a length of 2.7 m and weigh 2.3 tons, is the world’s largest bony fish. It has an extremely fast growth rate and its endoskeleton is mainly composed of cartilage. Another unique feature of the sunfish is its lack of a caudal fin, which is replaced by a broad and stiff lobe that results in the characteristic truncated appearance of the fish.
To gain insights into the genomic basis of these phenotypic traits, we sequenced the sunfish genome and performed a comparative analysis with other teleost genomes. Several sunfish genes involved in the growth hormone and insulin-like growth factor 1 (GH/IGF1) axis signalling pathway were found to be under positive selection or accelerated evolution, which might explain its fast growth rate and large body size. A number of genes associated with the extracellular matrix, some of which are involved in the regulation of bone and cartilage development, have also undergone positive selection or accelerated evolution. A comparison of the sunfish genome with that of the pufferfish (fugu), which has a caudal fin, revealed that the sunfish contains more homeobox (Hox) genes although both genomes contain seven Hox clusters. Thus, caudal fin loss in sunfish is not associated with the loss of a specific Hox gene.
Our analyses provide insights into the molecular basis of the fast growth rate and large size of the ocean sunfish. The high-quality genome assembly generated in this study should facilitate further studies of this ‘natural mutant’.
KeywordsOcean sunfish Mola mola Growth rate Body size Cartilaginous skeleton Positive selection
The ocean sunfish Mola mola (Family Molidae; Order Tetraodontiformes), is the heaviest known bony fish with the largest recorded specimen measuring 2.7 m in length and weighing 2.3 tons . It is widely distributed in tropical and temperate sea zones, such as the Mediterranean, the North and South Atlantic, the Gulf of Mexico, and the East and West Pacific . The ocean sunfish (hereafter referred to as ‘sunfish’) has several unique characteristics compared with other tetraodontiform fishes such as pufferfish, boxfish and triggerfish. The sunfish passes through two distinct larval stages during its transition to adult stage: the first is a typical pufferfish-like stage in which the fry resembles a miniature pufferfish possessing large pectoral fins, a tail fin and body spines; and the second is a highly transformed stage during which the tail is completely absorbed . The most notable characteristics of sunfish are its fast growth rate and large body size. A captive sunfish gained approximately 400 kg in just 15 months with an average growth rate of 0.82 kg/day , which is in stark contrast to the typical growth rate of other teleost fishes (0.02 to 0.49 kg/day) [3, 4]. Furthermore, although sunfish is a bony fish (Osteichthyes), its endoskeleton is mainly composed of cartilage . This lighter cartilaginous skeleton, together with its thick layer of low-density, subcutaneous, gelatinous tissue, may contribute to the buoyancy of this enormous fish . Another unique characteristic of the sunfish is the degeneration of the vertebral column resulting in the loss of the caudal fin, which is instead replaced by a broad and stiff lobe called the ‘clavus’ . As a result, the lateralis muscles of the sunfish insert upon the deep muscles of the anal and dorsal fins that function as the main locomotory organs. Because of this morphological change, the sunfish swims in a peculiar manner. Unlike the drag-based swimming of most bony fishes, the sunfish swims by moving its dorsal and anal fins synchronously to generate a lift-based thrust that resembles the flight of a bird . This unusual swimming mode may involve modifications in the nervous system that controls locomotion. Indeed, anatomical studies have shown that the peripheral nervous system of the sunfish differs from other Tetraodontiformes fishes . The sunfish is also the most fecund vertebrate with a 137 cm female producing an estimated 300 million eggs .
Although the behavioural and ecological features of the sunfish have been studied widely, the genomic basis of its unique phenotype remains unexplored. In this study, we have sequenced the genome of the sunfish, and performed comparative genomic analyses with several other fish genomes. We analyzed genes and pathways associated with the regulation of growth and found notable changes in several genes in the growth hormone and insulin-like growth factor 1 (GH/IGF1) axis, an important pathway for regulating growth. In addition, many genes that are associated with the extracellular matrix (ECM) exhibit elevated dN/dS values, which may be related to the evolutionary changes in the endoskeleton of the sunfish.
Genome assembly and annotation
Population size history
We identified approximately 489,800 heterozygous single nucleotide polymorphisms (SNPs) in the genome assembly of the sunfish and estimated the heterozygosity to be 0.78 × 10−3, which is lower than other marine fishes such as Atlantic cod (2.09 × 10−3) and stickleback (1.43 × 10−3) . Based on the identified heterozygous sites, we ran the pairwise sequentially Markovian coalescent (PSMC) model  to infer the historical changes in the effective population size (Ne) of sunfish. The PSMC analysis suggests that there was an increasing trend of Ne from ~3 to ~0.9 mya (Additional file 1: Figure S1). Around 2.15 mya, a large asteroid (more than 1 km in diameter) is thought to have fallen into the Southern Ocean in the Eltanin Fault zone and generated a super-tsunami that resulted in a large-scale marine extinction . This event might have released more habitats that enabled sunfish to expand its population size. Around 0.9 mya, the Ne stopped expanding and began to decline slightly, which could be related to the mid-Pleistocene climate transition (MPT, ~1.2-0.55 mya, Additional file 1: Figure S1) . The MPT period was accompanied by the extinction of many marine species such as Stilostomellidae and Pleurostomellidae . We also found a Ne peak around 150 thousand years ago (kya), followed by a rapid decrease of Ne. However, the bootstrap support for these estimates is rather weak (Additional file 1: Figure S1).
Positively selected and fast-evolving genes
We found that both copies of igf1r (igf1ra and igf1rb) contain positively selected sites in the sunfish (Fig. 2 and Additional file 1: Table S7). Interestingly, some of these positively selected sites are located within functional domains of these receptors. For example, positively selected sites within Igf1ra were located within the fibronectin type III domain (FnIII, Pfam identifier PF00041) and receptor L domain (RL, PF01030) (Fig. 2). Previous studies have shown that the FnIII domain is important for ligand binding [22, 23], thus mutations in this domain could change the affinity between IGF1R and its ligand. Although the RL domain does not directly bind ligands, many of the determinants responsible for hormone binding and ligand specificity map to this central site . Conversely, Igf1rb contains three positively selected sites located within the furin-like cysteine-rich region (CR, PF00757) and two sites in the protein tyrosine kinase domain (TK, PF07714) (Fig. 2). These two domains are involved in receptor aggregation  and other functions such as enzyme activity, subcellular localization and interactions with other molecules . Interestingly, mutations in the ligand-binding domain and the RL domain of IGF1R can result in growth retardation in humans . Thus, positive selection of sites within these domains of Igf1ra and Igf1rb in the sunfish may have enhanced the function of these genes. We also found evidence for positively selected sites in the insulin receptor gene (insr), which is known to bind to IGF-1 and promote growth  (Fig. 2). Mutations in human INSR are known to cause Donohue syndrome, which is characterized by stunted growth . A positively selected site (R457) in the sunfish Insr maps to a mutation (K487E) in the human INSR associated with Donohue syndrome.
Genes involved in bone formation
In contrast to other bony fishes (Osteichthyes), the endoskeleton of the sunfish is mainly composed of cartilage . The genetic mechanism underlying this derived phenotype is not known. To look for clues to this mechanism, we analyzed genes known to be involved in bone formation such as those encoding proteoglycans, the bone morphogenetic protein (BMP) signalling pathway, transcription factors, bone differentiation and secretory calcium-binding phosphoproteins (SCPP). However, the sunfish possesses intact orthologues for most of these genes except for some SCPP genes (see Additional file 4). The details of the searches and the genes identified are given below.
Genes encoding proteoglycans are important regulators of cartilage and bone formation. We searched for the small leucine-rich proteoglycan (SLRP) gene family clusters: (a) Fmod, Prelp, Optc; (b) Ecm2, Aspn, Omd, Ogn; (c) Dcn, Lum, Kera, Epyc; and (d) Ecm2-like and Bgn. We first ran BLASTP (with default settings) of the human, zebrafish and/or fugu reference proteins against the annotated sunfish proteins. For genes that could not be identified using this method, we proceeded to a TBLASTN of the human and fish proteins against the sunfish genome assembly followed by a BLASTX of the resulting sunfish genomic loci against the NCBI non-redundant (NR) protein database. Using this strategy, we identified orthologues for all the above genes in the sunfish genome on (a) scaffold10.1, (b) scaffold39.1, (c) scaffold20.1, and (d) scaffold77.1, except Optc and Omd. In addition, we identified second copies of Ogn and Fmod located within nine genes of each other on scaffold13.1 (Additional file 5). We BLASTX-searched (with default settings) the sunfish loci of (a) and (b) against the NCBI NR protein database to identify Optc and Omd respectively, but did not identify these genes. Optc and Omd are present in zebrafish and cavefish but are absent in the Percomorphaceae fishes such as fugu, Tetraodon, tilapia and platyfish. Thus, they may not be responsible for the cartilaginous skeleton of the sunfish. We also searched for the lectican-hyaluronan- and proteoglycan-binding link protein (HAPLN) gene family clusters (Hapln2 and Bcan; Vcan and Hapln1; Acan and Hapln3; Ncan and Hapln4) and other non-clustered proteoglycans (Fn1, Lepre1, Tuft1, Podn). Orthologues for all these genes are present in the sunfish (Additional file 4).
The BMP signalling pathway
We identified homologues for Bmp2, Bmp4, Bmp5, Bmp6 and Bmp7; the receptors Bmpr1a, Bmpr1b, and Bmpr2; the Smad family genes (Smad1, Smad4, Smad5, Smad6, Smad7) and the Smurf family genes (Smurf1, Smurf2) in the sunfish genome. For Smad4, we identified up to four copies in the sunfish (Additional file 4).
We identified homologues for Sp7/osterix, Bapx1/Nkx3-2, Pdlim7, Mitf, Nfatc1, Msx1, Msx2 (the zebrafish homologue is known as msxd), Hand1 and Hand2 in the sunfish genome (Additional file 4). Runx2 is a key transcription factor involved in bone formation. Knockout of Runx2 in the mouse results in the formation of cartilaginous skeleton . The sunfish Runx2 gene is located on scaffold14.1:1,387,125..1,422,518 (SUNFISH_GLEAN_10009694, see Additional file 4; the first 45 amino acids were removed because they are not present in other fish Runx2 proteins). An alignment of Runx2 proteins shows that the sunfish Runx2 is highly conserved (e.g. its DNA-binding domain is perfectly conserved and its central and C-terminal domains also look intact) (Additional file 1: Figure S2).
Bone differentiation genes
We searched for genes responsible for bone differentiation in the sunfish genome. We identified leptin and leptin receptor, Bglap/osteocalcin, Fam20c, Bmp1, osteocrin, osteopotentia homologue, sclerostin, Phospho1 and Phospho2. We did not find Mmp1 in sunfish. Mmp1 may have arisen through tandem duplications in tetrapods based on our observations from the Genomicus database (Ensembl release 76). For osteocrin, we identified a partial prediction (one exon only) on scaffold56.1:442 kb between the genes Gmnc and Iws1, which are linked to osteocrin in the tilapia genome (on LG14). Thus, sunfish seems to have all the important genes involved in bone differentiation in teleost fishes (Additional file 4).
SCPP gene family
The SCPP gene family encodes secretory calcium-binding phosphoproteins that participate in the mineralization of collagenous bone and dentin as well as noncollagenous enamel. The SCPP genes originated from tandem duplication of the secreted protein acidic and rich in cysteine (SPARC)-like 1 gene (Sparcl1) that was itself derived from SPARC around the time a mineralized skeleton arose in vertebrates . SCPP genes encode ECM proteins and are divided into two categories: acidic and proline/glutamine-rich (P/Q-rich). Whereas acidic SCPP genes are involved in the mineralization of collagenous bone and dentin, P/Q-rich SCPP genes participate in the deposition of non-collagenous enamel . The elephant shark, which is a cartilaginous fish (Chondrichthyes) and lacks endochondral bone, does not contain any SCPP genes , whereas bony vertebrates contain both categories of SCPP genes although the complement of each category varies between lineages due to lineage-specific expansion and losses .
Hox genes specify segmental identities along the anterior-posterior axis of developing vertebrate embryos and are also important for the anterio-posterior and proximal-distal patterning of limbs [39, 40]. Loss of Hox gene function can lead to morphological changes [41, 42]. Therefore, we hypothesized that loss of Hox genes could contribute to phenotypic evolution. To determine whether sunfish has uniquely lost any Hox gene(s), we analyzed the Hox gene complement in the sunfish. The sunfish possesses seven Hox gene clusters similar to fugu but contains more Hox genes (47 genes in sunfish compared with 45 in fugu) (Additional file 1: Figure S4). It possesses an intact hoxa7a and hoxb7a that are pseudogenes in fugu, making fugu the only known teleost completely lacking a Hox7 paralogous group member. In addition, sunfish has an intact hoxc1a, which is a pseudogene in fugu but is completely lost in medaka (Additional file 1: Figure S4). Conversely, sunfish lacks hoxd11b, which is present in fugu. However, this gene is also lost in fishes such as medaka and the East African cichlid (Additional file 1: Figure S4) . Thus, it is unclear whether the loss of this gene has any effect on the overall morphology of sunfish. As such, the unusual morphology of the sunfish does not seem to be related to the loss of any specific clustered Hox gene. However, we cannot rule out the possibility of altered expression pattern(s) of Hox genes contributing to this unusual phenotype. In fact, the loss of the pelvic fin in fugu has been shown to be associated with the altered expression pattern of Hoxd9a .
Given its large size, fast growth rate and unusual body shape, the sunfish has received considerable attention from the research community. However, the genetic mechanisms underlying this ‘natural mutant’ have not been explored. The reference genome of the sunfish presented in this study provides first insights into the genomic changes that have occurred in this natural mutant. A major finding of our study is the identification of fast evolving and positively selected genes that are associated with growth in vertebrates. Although growth factors such as GH and IGF-1 do not show marked changes, several crucial receptor genes (ghr1, igf1ra/b in the GH/IGF1 axis) were found to have experienced positive selection or fast evolution. In addition, we found evidence for fast evolution and/or positive selection in some downstream genes in the GH/IGF1 axis (irs2a, akt3, grb2 and jak2a), indicating that the GH/IGF1 axis might have played a crucial role in the large body size and rapid growth rate of the sunfish.
Despite being a member of the bony fish clade, the sunfish possesses a degenerate and cartilaginous skeleton. We found that a number of genes involved in the ECM, especially the genes encoding collagen, have undergone positive selection or fast evolution. In particular, several collagen genes involved in cartilage formation were found to be positively selected, which might have contributed to the predominantly cartilaginous skeleton of this gigantic fish. In addition, it has been demonstrated that the GH/IGF1 axis also has important roles in the regulation of bone and cartilage development . We suspect that molecular changes in the GH/IGF1 axis unravelled in this study could also have affected the evolution of skeletal structures in the sunfish. We did not find any definitive molecular signatures related to the loss of the caudal fin in the sunfish. One reason could be that the molecular changes responsible for this phenotype could have occurred mainly in cis-regulatory elements.
In summary, our analyses provided first insights into the molecular basis of the fast growth rate and large size of the sunfish. It also provides some clues to the genetic basis of the predominantly cartilaginous skeleton of this teleost fish. The high-quality genome assembly generated in this study should facilitate future studies of adaptations and population genetics of this enigmatic fish.
Genome sequencing and assembly
The sunfish blood sample was collected in 1998 from a sunfish stranded in an intra-coastal waterway in Florida, USA and stored frozen at −80 °C. The frozen blood was transported to Singapore in dry ice. Genomic DNA was extracted using the standard phenol/chloroform extraction method, suspended in TE buffer, and stored at 4 °C. We constructed eight paired-end libraries with insert sizes of 170 base pairs (bp), 500 bp, 800 bp, 2 kb, 5 kb, 10 kb, 20 kb and 40 kb and generated a total of 98.22 Gb sequence data (Additional file 1: Table S1, NCBI Project number PRJNA305960, SRA number SRA319445) on an Illumina HiSeq 2000 platform. We used SOAPdenovo 2  to assemble the genome using K = 23 (map_len = 32 for libraries with shorter than 1 kb inserts, and map_len = 35 for libraries with inserts longer than 1 kb). Reads with low-quality bases (i.e., reads with more than 40 bases with quality scores less than 8 (Phred + 64), or containing more than ten Ns) or potential sequencing errors were removed or corrected by k-mer frequency-based methodology. After these quality control and filtering steps, a total of 68.87 Gb (96-fold coverage) clean data were retained for the sunfish assembly (Additional file 1: Table S2). The total length of the assembled genome is 642 Mb (Additional file 1: Table S4). To assess the assembly quality, we used the Core Eukaryotic Genes Mapping Approach (CEGMA)  and found that 99.6 % of the CEGMA genes are complete in the assembly. Analysis of Benchmarking Universal Single-Copy Orthologs (BUSCO) genes  showed that the assembly contains 74 % complete and 18 % partial vertebrate BUSCO orthologues. These results suggest that the assembly is of high quality. The genome assembly of the sunfish is available at DDBJ/ENA/GenBank under the accession number MBDK00000000.
Estimation of genome size using k-mer analysis
We estimated the genome size using the k-mer method  and the formula: G = k-mer_number/k-mer_depth, where the k-mer_number is the total number of k-mers, and k-mer_depth denotes the peak frequency that occurred more than any other frequencies. For sunfish, the k-mer size was 17, k-mer_number was 20,842,591,260 and the k-mer_depth was 28, so the sunfish genome size was estimated to be 730,752,424 bp (Additional file 1: Table S3 and Figure S5).
We employed Tandem Repeats Finder (TRF)  to identify tandem repeats. RepeatMasker  and RepeatProteinMask  were used to identify and classify transposable elements (TEs) by aligning the sunfish genome sequences against a library of known repeats, Repbase , using default parameters. RepeatModeler  was used for de novo identification of repeats. All repeats obtained by various methods were combined together to form a non-redundant list of sunfish repeats (Additional file 1: Tables S5 and S6). For comparison, we also predicted repetitive sequences in other fishes using the same method (data shown in Fig. 1).
We used both homology-based and de novo methods for predicting protein-coding genes. For homology prediction, protein sets of human, fugu, Tetraodon, medaka and zebrafish were downloaded from Ensembl (release 73)  and mapped onto the sunfish genome using TBLASTN (v2.2.19)  (with E-value threshold of 1e-5). Genewise (with alignment rate threshold of 0.25) was used to generate gene models. For de novo prediction, Augustus (v2.5.5)  with the following parameters was used: uniqueGeneId = true; noInFrameStop = true; gff3 = on; strand = both, with gene model parameters trained on 1000 high-quality sunfish genes from homology-based predictions. These 1000 high-quality genes have intact gene models and span more than 90 % of the TBLASTN alignment. Finally, homology-based and de novo gene models were merged to form a comprehensive and non-redundant reference gene set using GLEAN  with the following parameters: minimum coding sequence length 150 bp and maximum intron length 10 kb.
Gene family clustering
We used TreeFam  to cluster gene families in five fish genomes (sunfish, fugu, Tetraodon, medaka and zebrafish). We set maximum gene family size as 250 and BLAST E-value cutoff as 1e-7. In total, we obtained 14,768 gene families (Fig. 1). Sunfish-specific genes are the genes in gene families that exist only in sunfish.
Phylogenetic reconstruction and divergence time estimation
The orthologous genes of seven ray-finned fishes (fugu, FUGU4; Tetraodon, TETRAODON8; stickleback, BROADS1; medaka, MEDAKA1; tilapia, Orenil1.0; zebrafish, Zv9 and spotted gar, LepOcu1) were acquired from Ensembl (release 76) . In this analysis, we used fugu as the query (each gene at > 70 % identity with the fugu gene). The Ensembl dataset was further filtered to retain only one-to-one orthologues from all seven species. Inparanoid (v4.1)  was used to identify one-to-one orthologues in the sunfish using fugu for comparison (using default settings, i.e. minimum 50 % alignment span, minimum 25 % alignment coverage, minimum BLASTP score of 40 bits, minimum inparalog confidence level of 0.05, and score exceeding 300). Both the Ensembl and Inparanoid datasets were then combined, to obtain the final dataset of 1690 one-to-one ray-finned fish orthologues (Additional file 5). Their protein sequences were aligned using Clustal Omega (v1.2.0) . Coding sequence alignments were generated based on the protein alignments using PAL2NAL (v14) . Concatenated alignments were prepared for both the protein and coding sequence alignments by concatenating alignments of all orthologues. Gaps were removed from both the concatenated alignments using Gblocks (v0.91b)  at default settings. The lengths of the trimmed protein and coding sequence alignments were 888,142 and 2,705,059 positions, respectively.
For phylogenetic analyses, we used the Maximum Likelihood (ML) method. ModelGenerator (v0.85)  was used to obtain the best-fit substitution model for each concatenated alignment. The general time reversible (GTR) model with optimization of substitution rates and Gamma model of rate heterogeneity (GTR + I + G) and JTT (Jones-Taylor-Thornton) with an estimate of proportion of invariable sites, and Gamma model of rate heterogeneity and empirical base frequencies (JTT + I + G + F) were the selected models for the coding sequence and protein alignments, respectively. We used the rapid bootstrap algorithm with a thorough ML search (‘-f a’ option) as implemented in RAxML (v8.0.26)  and 100 bootstrap replicates to generate the trees.
The sequences of 1690 single-copy gene families from eight fishes (as described above) were concatenated and preprocessed for estimating divergence times based on the topology obtained in the phylogenomic analysis. MCMCTREE  (PAML package) was used to infer divergence times based on the approximate likelihood calculation method. Two nodes were calibrated using fossil records  as follows: ancestral node of fugu and Tetraodon: 32.25 to 56.00 mya; ancestral node of zebrafish and medaka; 149.85 to 165.20 mya.
SNP calling and heterozygosity estimation
We used reads from 500 bp insert-size library to call SNPs. Firstly, we mapped sequencing reads to the assembled genome by the Burrows-Wheeler Alignment (BWA) tool  with the parameter ‘-n 4’. The average coverage depth of mapped reads was about 26 and covered about 98 % of the genome. We employed the Picard-tools  to prepare and filter alignments for SNP calling, including removing low-quality alignments, sorting alignments and merging duplication reads. Next, we used the Genome Analysis Toolkit (GATK)  to realign and recalibrate the bases that were disrupted by indel sites, and called SNPs. We filtered the SNPs that are next to another SNP within 10 bp and whose depth is less than five. The heterozygosity rate of sunfish was calculated as the number of heterozygous SNPs divided by effective genome size (genomic bases covered by at least five reads). We identified 489,862 heterozygous SNPs, and the heterozygosity rate of sunfish is 0.78 × 10−3.
Population history estimation
The pairwise sequentially Markovian coalescent (PSMC) model is a coalescent-based hidden Markov model which can be used to estimate the history of effective population sizes based on genome-wide diploid sequence data . Based on modelling of two sequences of the diploid genomes, this method can infer population size histories beyond 20,000 years. This method has been previously used for inferring demographic histories of vertebrates over a long evolutionary period [70, 71, 72]. Based on the result of SNP calling, we estimated the population size history of sunfish by PSMC . To calculate the mutation rate required, we generated LASTZ  pairwise alignments of sunfish, Tetraodon and medaka. Based on the pairwise alignments, the difference per site between sunfish and Tetraodon is 0.2255, 0.2425 for sunfish and medaka, and 0.3244 for Tetraodon and medaka. The accumulated mutations in sunfish for the nearest common ancestor (NCA) of Tetraodon and sunfish to present is (0.2255 + 0.2425 - 0.3244) / 2 = 0.0718, and the site mutation rate of sunfish is 0.0718 / 68,000,000 = ~1 × 10−9 per site per year. In addition, we set the average generation time of sunfish as 4 years, as this is its medium population doubling time . We performed PSMC with the parameters of the maximum 2 N0 coalescent time of 15, the maximum number of iterations of 30 and the pattern of parameters of ‘4 + 10 * 1 + 20 * 2 + 4 + 6’. We also performed 100 rounds of bootstrapping with the same parameters. We combined all the results and plotted the figure using the plot tool in PSMC (Additional file 1: Figure S1).
Fast-evolving genes and genes with positively selected sites
To perform the dN/dS analysis, we generated a new set of orthologues using the gene sets of sunfish, fugu, Tetraodon, medaka and zebrafish. We used BLAST to obtain the reciprocal best hits (RBHs) for each pair of species (BLAST E-value cutoff of 1e-5). Finally, we identified 10,660 teleost homologues among these five species. We employed the software PRANK-MSA (v140110)  with the following parameters: gaprate = 0.025, gapext = 0.75, to generate coding sequence alignment for each orthologous group. We then used GUIDANCE (v1.41)  to mask nucleotides of low quality to Ns under the parameters of bootstraps = 10; seqCutoff = 0.6; colCutoff = 0.93. We regarded the sites with GUIDANCE site-wise score of < 0.5 as low-quality sites.
To examine the selective constraints on the genes, we estimated the dN/dS ratio (ω) using PAML (v4.4b)  with the coding sequence alignments obtained above. Firstly, we ran the branch models (mode = 2; NSsite = 0)  to estimate the ω across the tree ‘((sunfish, (fugu, tetraodon)), medaka, zebrafish);’ with the following parameter settings: Codonfreq = 2; kappa = 2.5; initial omega = 0.2. We used three hypotheses: 1) H0 hypothesis, all branches have the same ω; 2) H1 hypothesis, the branch leading to sunfish has a different ω whereas the other branches have the same ω; 3) H2 hypothesis, all branches have independent ω. We used likelihood values and degree of freedoms of the three hypotheses to perform a likelihood-ratio test (LRT). We picked up genes whose likelihood values of H1 are significantly larger (adjusted LRT p-value of < 0.05) than H0 and likelihood values of H2 are not significantly larger than H1. The genes with larger ω values in sunfish than other branches suggest their fast evolution in sunfish. Finally, we identified 1.067 fast-evolving genes with significant false discovery rate (FDR)-corrected p-values (<0.05) in the sunfish.
In addition, we also ran the branch-site models (model = 2; NSsite = 2) to detect the genes with positively selected sites in sunfish. For null hypothesis we set ‘fix_omega = 1; omega = 1’, whereas for the alternative hypothesis we set ‘fix_omega = 0; omega = 1.5’ with the tree ‘((sunfish #1, (fugu, tetraodon)), medaka, zebrafish)’. Using an FDR-corrected LRT p-value (adjusted LRT p-value) cutoff of 0.05, we identified 1117 positively selected genes in sunfish.
BLAST, basic local alignment search tool; BMP, bone morphogenetic proteins; ECM, extracellular matrix; GH/IGF1, growth hormone/insulin-like growth factor 1; INSR, insulin receptor; LRT, likelihood-ratio test; MPT, mid-Pleistocene climate transition; mya, million years ago; PSMC, pairwise sequentially Markovian coalescent; SCPP, secretory calcium-binding phosphoproteins; SNP, single nucleotide polymorphism
This study was funded by BGI, Shenzhen, China and the Biomedical Research Council of Agency for Science, Technology and Research (A*STAR), Singapore. We would like to thank Paul J Linser, The Whitney Laboratory for Marine Bioscience, University of Florida for the sunfish blood sample and the University of Kansas Biodiversity Institute for fin clip samples of two sunfishes (KU# T2959 from California, USA and T2979 from aquarium trade specimen, locality unknown).
Availability of data and materials
Datasets supporting the results of this article are available in the GigaDB repository associated with this publication . The genome assembly of the sunfish has been deposited at DDBJ/ENA/GenBank under the accession number MBDK00000000, and raw data is available from the SRA under bioproject number PRJNA305960.
BV, SB and GZ conceived the study and designed the experiments. HP and HY performed the genome sequencing, assembly and annotation. HP, HY and CL performed the positive selection and population history analyses. VR, APL, BHT and BV analyzed the Scpp and Hox gene families. MML performed the phylogenomic analyses. JW and HY participated in discussions and provided suggestions. BV, HP and HY wrote the manuscript with input from all authors. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Ethics approval and consent to participate
Sunfish genomic DNA was extracted as per the guidelines set by the Institutional Animal Care and Use Committee (IACUC) of the Biological Resource Centre, A*STAR, Singapore.
- 2.Fraser-Brunner A. The Ocean Sunfishes (Family Molidae), vol. 1. 1951.Google Scholar
- 3.Powell DC. A fascination for fish: adventures of an underwater pioneer. 2nd ed. Berkeley: University of California Press; 2001.Google Scholar
- 4.Nakatsubo T, Hirose H. Growth of captive ocean sunfish, Mola mola. Suisan Zoshoku. 2007;55:403–7.Google Scholar
- 5.Cleland J. On the anatomy of the short sunfish (Orthragoriscus mola). Nat Hist Rev (Lond). 1862;2:170–85.Google Scholar
- 28.Lee J, Pilch PF. The insulin receptor: structure, function, and signaling. Am J Physiol Cell Physiol. 1994;266:C319–34.Google Scholar
- 51.Tarailo-Graovac M, Chen N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinform. 2009;25:04.10.01–4.10.4.Google Scholar
- 68.Picard Tools. http://broadinstitute.github.io/picard/. Accessed 30 March 2016.
- 73.Harris RS. Improved pairwise alignment of genomic DNA. Ph.D. Thesis. The Pennsylvania State University; 2007.Google Scholar
- 74.FishBase. http://www.fishbase.org. Accessed 30 March 2016.