The draft genome of the carcinogenic human liver fluke Clonorchis sinensis
Clonorchis sinensis is a carcinogenic human liver fluke that is widespread in Asian countries. Increasing infection rates of this neglected tropical disease are leading to negative economic and public health consequences in affected regions. Experimental and epidemiological studies have shown a strong association between the incidence of cholangiocarcinoma and the infection rate of C. sinensis. To aid research into this organism, we have sequenced its genome.
We combined de novo sequencing with computational techniques to provide new information about the biology of this liver fluke. The assembled genome has a total size of 516 Mb with a scaffold N50 length of 42 kb. Approximately 16,000 reliable protein-coding gene models were predicted. Genes for the complete pathways for glycolysis, the Krebs cycle and fatty acid metabolism were found, but key genes involved in fatty acid biosynthesis are missing from the genome, reflecting the parasitic lifestyle of a liver fluke that receives lipids from the bile of its host. We also identified pathogenic molecules that may contribute to liver fluke-induced hepatobiliary diseases. Large proteins such as multifunctional secreted proteases and tegumental proteins were identified as potential targets for the development of drugs and vaccines.
This study provides valuable genomic information about the human liver fluke C. sinensis and adds to our knowledge on the biology of the parasite. The draft genome will serve as a platform to develop new strategies for parasite control.
Keywords: Fatty Acid Binding Protein; Liver Fluke; Syntenic Block; Clonorchiasis; Fatty Acid Biosynthesis Pathway
DMRT1: doublesex and mab-3 related transcription factor 1
EST: expressed sequence tag
FAS: fatty acid synthase
KEGG: Kyoto Encyclopedia of Genes and Genomes
NCBI: National Center for Biotechnology Information
ORF: open reading frame
PKA: cyclic AMP-dependent protein kinase
SOX6: sex determining region Y-box 6.
Clonorchis sinensis, the oriental liver fluke, is an important food-borne parasite that causes human clonorchiasis in most Asian countries, including China, Japan, Korea, and Vietnam [1, 2, 3]. Increasing epidemiological evidence demonstrates the great socio-economic impact of this neglected tropical parasite, which afflicts more than 35 million people in Southeast Asia and approximately 15 million in China alone [1, 4]. Most clonorchiasis cases originate from the consumption of raw freshwater fish containing C. sinensis metacercariae, which excyst in the duodenum and then migrate from the common bile ducts to the peripheral intrahepatic bile ducts of their host. Although infections are often asymptomatic, repeated and chronic infections with C. sinensis can result in serious hepatobiliary diseases, including cholangitis, obstructive jaundice, hepatomegaly, fibrosis of the periportal system, cholecystitis, and cholelithiasis. Most importantly, both experimental and epidemiological evidence strongly implies that liver fluke infection is one of the most significant causative agents of bile duct cancer, cholangiocarcinoma (CCA), which is a frequently fatal tumor [7, 8, 9, 10].
The life cycle of C. sinensis is complex and similar to that of Opisthorchis viverrini, involving asexual reproduction in an aquatic snail (miracidium, sporocyst, redia, and cercaria stages) and sexual reproduction in piscivorous mammals (adult worm stage). Mammalian hosts include humans, dogs, and cats [1, 6]. C. sinensis adult worms establish themselves as parasites in the intrahepatic bile ducts and extrahepatic ducts of the liver, and they can even invade the mammalian gall bladder. Long-term parasitism by liver flukes results in chronic stimulation of the epithelial cells of the bile ducts by fluke excretory-secretory (ES) products, a variety of molecules released from parasites into the host bile environment. Proteomic studies have identified the components of C. sinensis ES products that are thought to act as stimuli for host bile duct epithelium [12, 13]. In vitro biochemical studies have indicated that ES products from liver flukes have important roles in feeding behavior, detoxification of bile components, and immune evasion. For instance, granulin-like growth factor secreted by the carcinogenic liver fluke O. viverrini was shown to induce host cell proliferation, and the proliferative activity could be blocked by antibodies against granulin. These data indicate that secreted proteins, along with many other molecules, are released by parasites to induce local cell growth. Transcriptome data sets for C. sinensis, which include substantial representation of ES products, also enable a better understanding of the mechanism of infection of this carcinogenic parasite.
Epidemiological studies in regions affected by liver flukes have shown a strong association between the incidence of CCA and the infection rate of parasites. Despite the considerable impact of liver fluke-associated hepatobiliary diseases on public health, there are currently no effective strategies to combat CCA. This study provides genomic information for the carcinogenic human liver fluke C. sinensis based on de novo sequencing, and the draft genome described will serve as a valuable platform to develop new interventions for the prevention and control of liver flukes.
Results and discussion
De novo sequencing and genome assembly
[Table: Summary of the C. sinensis genome assembly. Recoverable row label: Total length (Mb).]
The average GC content of the C. sinensis genome is 43.85%. Using non-overlapping sliding windows along the genome, we found a random distribution of sequencing depth over areas with different GC content (within a range of 30 to 60%) covering more than 99.9% of the genome sequence (Figure S2a in Additional file 1). Regions with lower (< 20%) or higher (> 60%) GC content were not found. The GC content of C. sinensis is higher than that of four other genomes that we examined (Figure S2c in Additional file 1).
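The windowed GC calculation described above can be illustrated with a minimal Python sketch. The window size, the function name `gc_content_windows` and the toy sequence are assumptions made for illustration; the text does not state the window size used for the genome.

```python
def gc_content_windows(sequence, window_size):
    """Return the GC fraction of each non-overlapping window of a sequence.

    A trailing partial window is dropped. Ambiguous bases (for example N)
    are excluded from the denominator; windows with no called bases are skipped.
    """
    sequence = sequence.upper()
    fractions = []
    for start in range(0, len(sequence) - window_size + 1, window_size):
        window = sequence[start:start + window_size]
        # Count only unambiguous bases so runs of N do not bias the estimate.
        called = sum(window.count(base) for base in "ACGT")
        if called == 0:
            continue
        gc = window.count("G") + window.count("C")
        fractions.append(gc / called)
    return fractions


# Toy input (hypothetical sequence, not fluke data).
toy_sequence = "GGCCAATTGCGCATAT"
toy_fractions = gc_content_windows(toy_sequence, 4)
print(toy_fractions)
```

Plotting per-window sequencing depth against these fractions would give the kind of depth-versus-GC comparison referred to in Figure S2a.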
To evaluate the single-base accuracy of the assembled genome, we mapped all of the trimmed reads onto the super-scaffold using Bowtie (allowing no more than three mismatches). Approximately 79% of the reads were uniquely mapped (Table S2 in Additional file 1). For more than 98% of the assembled genome, there are more than ten reads mapped for each position, and the maximum sequencing depth is 30× (Figure S2d in Additional file 1), which can provide a very high single-base accuracy. To further evaluate the assembly accuracy, 14 pairs of primers were designed to amplify specific genomic fragments. All PCR products were sequenced on an ABI3730, and the resulting sequence traces aligned to the genome with over 99.6% identity (Table S3 in Additional file 1). The assembled genome contains 88.2% of the 15,121 ESTs produced by the Sanger method that have consensus lengths of 100 bp or more (Table S4 in Additional file 1).
We called variants with the program glfSingle, which was designed for genome data from a single individual. We found 2.3 million variants (Figure S3 in Additional file 1), with a transition/transversion ratio of 2.07. The heterozygosity was approximately 0.4% for the whole genome, about three times that of Schistosoma japonicum.
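For readers unfamiliar with the transition/transversion ratio, the following Python sketch shows how it can be computed from a list of single-base substitutions. The function names and the toy input are illustrative assumptions and do not reproduce the glfSingle output format.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(substitutions):
    """Return transitions / transversions for (ref, alt) base pairs.

    Pairs that are identical or contain a non-ACGT symbol are ignored.
    """
    transitions = 0
    transversions = 0
    for ref, alt in substitutions:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt or ref not in "ACGT" or alt not in "ACGT":
            continue
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return transitions / transversions


# Toy input: two transitions (A>G, C>T) and one transversion (A>C).
toy_ratio = ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])
print(toy_ratio)
```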
Several families of repeat elements covering 0.35% of the genome were identified by comparing the genome sequence with the known repetitive sequences in the Repbase database. We further predicted C. sinensis-specific repeats de novo with the RepeatModeler software [23, 24], and found 691 different repeat families/elements, constituting 25.6% (132.2 Mb) of the genome (Table S5 in Additional file 1). According to our estimate of genome size, approximately 128 Mb (19.9%) has not been assembled; most of the unassembled sequence may consist of repetitive sequences. The proportion of repeats is comparable to that of S. japonicum (40.1%) and Schistosoma mansoni (45%). We identified both non-long terminal repeats (non-LTRs) and LTR transposons, comprising 10.34% and 1.03% of the genome, respectively. Few short interspersed repetitive elements (SINEs) were found.
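The percentages in this paragraph can be checked with a few lines of Python. The sketch assumes that the estimated genome size equals the 516 Mb assembly reported in the abstract plus the 128 Mb of unassembled sequence; that sum is an inference from the figures given here, not a value stated explicitly in this section.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


ASSEMBLED_MB = 516.0    # total assembly size reported in the abstract
REPEAT_MB = 132.2       # de novo predicted repeats
UNASSEMBLED_MB = 128.0  # estimated unassembled sequence

# Repeats as a share of the assembled genome.
repeat_share = percent(REPEAT_MB, ASSEMBLED_MB)

# Assumption: estimated genome size = assembled + unassembled sequence.
estimated_genome_mb = ASSEMBLED_MB + UNASSEMBLED_MB
unassembled_share = percent(UNASSEMBLED_MB, estimated_genome_mb)

print(f"repeat share of assembly: {repeat_share:.1f}%")
print(f"implied genome size: {estimated_genome_mb:.0f} Mb")
print(f"unassembled share: {unassembled_share:.1f}%")
```

Under that assumption the arithmetic is consistent with the reported 25.6% and 19.9%.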
Gene model annotation
[Table: General pattern of protein-coding genes of C. sinensis compared with S. mansoni and S. japonicum. Recoverable row labels: number of gene models; average gene length (bp); average protein length (bp); average exon length (bp); average number of exons; average intron length (bp); CDS proportion (%); intron proportion (%).]
To assess the completeness of our gene models, we investigated the coverage of the CEGMA set of 458 core eukaryotic genes. Most of these core genes (425; 92.8%) were found, of which 392 were aligned over more than 50% of their sequences, suggesting the completeness of the genome (Table S8 in Additional file 1).
To investigate the amount of variation in gene families between C. sinensis and other metazoans, we assigned genes into families by clustering them according to their sequence similarities (see Materials and methods). We observed a minor amount of variation in the total number of gene families when looking across C. sinensis (6,910), S. japonicum (8,898), S. mansoni (7,313) and well characterized species like Caenorhabditis elegans (10,180), Drosophila melanogaster (7,640) and Homo sapiens (8,841) (Table S9 in Additional file 1).
In addition to protein-coding genes, we identified 7 rRNA fragments, 235 tRNAs, 509 small nucleolar RNAs, 169 small nuclear RNAs, and 858 microRNA (miRNA) precursor genes in the C. sinensis genome (Table S11 in Additional file 1). To further annotate C. sinensis miRNA precursors, we mapped miRNA expression data [28, 29] to them and found that 159 miRNA precursors had evidence of expression (Additional file 2).
Phylogeny of C. sinensis
Synteny between C. sinensis and S. japonicum and S. mansoni
To investigate the long-range synteny between C. sinensis and the schistosome genomes, we selected all 79 scaffolds larger than 200 kb to perform alignments with the S. japonicum and S. mansoni genomes. Given that the average gene length of C. sinensis is about 10 kb, we chose those blocks with size larger than 10 kb as putative syntenic blocks. The largest syntenic block between C. sinensis and S. japonicum is 66 kb and the maximum gene number in one syntenic block is three. The largest syntenic block between C. sinensis and S. mansoni is 99 kb and the maximum gene number in one syntenic block is four (Additional file 3). More closely related species are needed to further understand the genome synteny of the flukes.
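The size cut-off for putative syntenic blocks can be expressed as a simple filter. In this Python sketch the block representation (scaffold identifier, start, end, with end-exclusive coordinates) and the function name are assumptions made for illustration; they do not reflect the actual output format of the alignment pipeline.

```python
def putative_syntenic_blocks(blocks, min_length=10_000):
    """Keep aligned blocks strictly longer than min_length base pairs.

    blocks: iterable of (scaffold_id, start, end) tuples, where the block
    length is end - start (hypothetical, end-exclusive coordinates).
    """
    return [block for block in blocks if block[2] - block[1] > min_length]


def largest_block_length(blocks):
    """Return the length in bp of the longest block, or 0 if there are none."""
    return max((end - start for _, start, end in blocks), default=0)


# Toy blocks (hypothetical coordinates, not taken from the assembly).
toy_blocks = [
    ("scaffold_a", 0, 8_000),
    ("scaffold_a", 20_000, 45_000),
    ("scaffold_b", 5_000, 15_000),
]
kept = putative_syntenic_blocks(toy_blocks)
print(kept, largest_block_length(kept))
```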
To investigate energy metabolism in C. sinensis, we mapped the gene models to the pathways represented in the Kyoto Encyclopedia of Genes and Genomes (KEGG). The results demonstrate that both the glycolytic pathway (Figure S5 in Additional file 4) and the Krebs cycle (Figure S6 in Additional file 4) are intact; C. sinensis can obtain energy from both aerobic and anaerobic metabolism. Although liver flukes inhabit anaerobic bile ducts, the conserved biochemical pathway of aerobic metabolism can facilitate the survival of C. sinensis juveniles in their intermediate hosts. As expected, genes encoding key enzymes required for glycolysis, such as hexokinase, enolase, pyruvate kinase and lactate dehydrogenase, were present at high copy number. We noticed that some genes for enzymes involved in energy metabolism were conspicuously absent; the loss of these metabolic enzymes in C. sinensis might relate to its parasitic lifestyle. We presume that C. sinensis adult worms might utilize exogenous glucose through the glycolytic pathway or by absorbing nutrients from hosts under anaerobic conditions. Like schistosomes, C. sinensis can ingest glucose at rates as great as 26% of its body weight per hour, with glucose being broken down into lactic acid through glycolysis. Thus, glycolytic enzymes are crucial molecules for trematode survival.
Fatty acid metabolism and biosynthesis
We discovered many gene copies encoding fatty acid binding proteins, which are thought to have a role as fatty acid transporters in Fasciola hepatica. Bile contains high levels of fatty acids, which can act as a nutrient source for parasites. The fatty acid binding proteins found in liver flukes may play an important role in the uptake of nutrients from host bile, possibly making it unnecessary for flukes to synthesize their own fatty acids endogenously. The gene encoding Niemann-Pick C1 protein (NPC1), which is involved in regulating biliary cholesterol concentration, was also identified in C. sinensis. The role of NPC1 in bile acid metabolic processes required for cholesterol absorption further indicates that C. sinensis is able to absorb lipids from its host for survival.
Proteases, kinases, and phosphatases
To gain access to their preferred location within hosts, parasites have to escape hosts' defense mechanisms. Diverse molecules and biochemical pathways have evolved to counter those defenses, including important enzymes like proteases. Particularly in liver flukes, proteases play key roles in invasion, migration and feeding/nutrition [32, 33]. Putative proteases we identified include metalloproteases, cysteine proteases, serine proteases and aspartic proteases, among others (Table S13 in Additional file 6). Among these, the largest group is the cysteine protease superfamily; these proteases have been identified as possible diagnostic antigens and vaccine candidates in S. japonicum. Only those of the cathepsin F subtype are well characterized, and these are thought to play a key role in parasite physiology and related pathobiological processes in C. sinensis [34, 35]. Other cysteine protease subtypes, such as cathepsins A, B, D, and E and even serine proteases, have not been previously recognized in C. sinensis, though they may contribute to catabolism of bilirubin and other host proteins. By comparing C. sinensis with O. viverrini, which has a similar life cycle, we were able to draw the general conclusion that serine proteases, metalloproteases and aspartic proteases may be principal players in host invasion and the progression of hepatobiliary disease [36, 37, 38].
Phosphorylation and dephosphorylation occur in all known eukaryotes through the antagonistic actions of protein phosphatases and protein kinases. Protein kinases play key roles in many eukaryotic processes, such as gene expression, metabolism, apoptosis, and cellular proliferation. We have identified many important protein kinases in C. sinensis (Table S14 in Additional file 6), including casein kinase II, serine/threonine-protein kinase, cell division protein kinase, adenylate kinase isoenzyme, pyruvate kinase, cyclin-dependent protein kinase, calcium/calmodulin-dependent protein kinase, mitogen-activated protein kinase kinase kinase, and cAMP-dependent protein kinase. Casein kinase II is a eukaryotic serine/threonine protein kinase with multiple substrates and roles in diverse cellular processes, including differentiation, gene silencing, cell proliferation, tumor suppression and translation; however, its function in trematodes remains unknown. Cyclic AMP-dependent protein kinase (PKA) is implicated in numerous processes in mammalian cells and plays an important role in parasite biology. Inhibition of Plasmodium falciparum PKA resulted in significant anti-parasitic effects. Therefore, PKA represents a promising target for the treatment of parasite infections. Calcium/calmodulin-dependent protein kinase is essential for signal transduction in cells and modulates a variety of physiological processes, such as learning and memory, metabolism and transcription. For Plasmodium gallinaceum zygotes, calcium/calmodulin-dependent protein kinase is required for the morphological changes that occur during ookinete differentiation. The mitogen-activated protein kinases (MAPKs) are highly conserved kinases involved in signal transduction and development. In general, protein kinases are promising candidates as targets for RNA interference-based treatments to prevent liver fluke infection.
Apart from protein kinases, many phosphatases were discovered in the draft genome, including glucose-6-phosphatase 3, magnesium-dependent phosphatase and protein tyrosine phosphatase (Table S15 in Additional file 6). Phosphatases are endogenous kinase inhibitors that reverse the action of kinases, and they can be classified by substrate specificity as either serine/threonine, tyrosine or dual specificity phosphatases. The physiological roles of serine/threonine protein phosphatases are numerous and have been studied extensively. Because of their critical regulatory roles in cellular processes, they have been regarded as promising targets for drug development in recent years.
Tegument and excretory-secretory products
The outermost surface of a trematode is a syncytium. For platyhelminth parasites, the tegument is generally viewed as the most susceptible target for vaccines and drugs because it is a dynamic host-interactive layer with roles in nutrition, immune evasion and modulation, pathogenesis, excretion and signal transduction [45, 46]. We characterized putative tegument proteins, including cathepsin B, epidermal growth factor receptor, glucose-6-phosphatase, glyceraldehyde-3-phosphate dehydrogenase, a calcium channel and a voltage-dependent channel subunit (Table S16 in Additional file 6). These proteins can be classified into several subtypes, such as proteases, receptors, nutrition and metabolism enzymes, channel proteins and transfer proteins. Most of these proteins have not previously been recognized in C. sinensis and may contribute to catabolism of host proteins and invasion of host tissue. Likely because of their critical roles, the genes encoding phospholipase D, phosphatidic acid phosphatase type 2A, glucose-6-phosphatase and calcium ATPase are found in high copy numbers. It is well known that phospholipase D is an important signaling molecule that increases nitric oxide synthesis and inducible nitric oxide synthase expression. Phosphatidic acid phosphatase type 2A plays a pivotal role in the control of signal transduction by lipid mediators such as phosphatidate, lysophosphatidate, and ceramide-1-phosphate. Our previous studies have revealed that some lipid metabolism enzymes, such as lysophospholipase and phospholipase A2, potentially contributed to liver fibrosis caused by C. sinensis infection. The roles of tegumental phospholipase D and phosphatidic acid phosphatase type 2A in C. sinensis pathogenesis warrant further study.
In the C. sinensis genome, we have identified some important ES products, including cortactin, aldolase, enolase, phosphoglycerate kinase, transketolase, programmed cell death 6 interacting protein, and fructose-bisphosphate aldolase (Table S17 in Additional file 6). The ES products of parasites have attracted attention because of their potential uses in the development of diagnostics, vaccines, and drug therapies. Previous studies have demonstrated the importance of ES products in many parasites, such as O. viverrini, C. sinensis, S. japonicum, S. mansoni, and Paragonimus westermani [45, 46, 47, 48, 51]. ES products comprise various proteins, the most predominant of which are proteases and detoxifying enzymes, which may serve vital roles in protecting parasites from host immune defenses. One of the ES products, enolase, is a cytosolic glycolytic enzyme that has been reported to localize on the cell surface and the tegument in helminths. The secretory enolase of S. japonicum may promote fibrinolytic activity to enable parasitic invasion and migration within the host. This enzyme could be used for vaccine and drug development applications. Similarly, fructose-bisphosphate aldolase is a conserved enzyme that was classified as a metabolic enzyme that modulates interactions between hosts and parasites. This enzyme might have important roles in C. sinensis.
Host-binding proteins and receptors
The highly co-evolved relationship between C. sinensis and its hosts depends on adaptations in the host-binding proteins and related receptors. A number of such molecules were characterized in our research (Table S18 in Additional file 6), such as fibronectin, calmodulin, plasminogen, epidermal growth factor receptor and fibroblast growth factor receptor. Fibronectin, a multifunctional protein, is well conserved across species and has multiple domains for interaction with extracellular matrix components, such as heparin and collagen. It was reported that fibronectin plays a role in activating phosphokinase A in the context of host invasion. Calmodulin has roles in the detoxification system, which has evolved sensors and responders that use Ca2+ as a messenger. Plasminogen plays important roles in processes such as fibrinolysis and the degradation of extracellular matrices, and it can enhance proteolytic activity and increase tissue damage when coupled with its receptor. One of the most well characterized plasminogen receptors in mammals is enolase, the glycolytic enzyme described above. Unexpectedly, a granulin-like growth factor was also observed (Table S18 in Additional file 6). This growth factor is a homologue of human granulin, a secreted growth factor associated with liver fluke-induced cancers. Further studies should focus on the identification of host receptors to provide therapeutic strategies for cancers. C. sinensis can live for years, sometimes decades, within the bile ducts of mammalian hosts as it develops, matures and reproduces, so it is expected that co-evolution of parasite and host proteins has occurred in the process of regulating host-parasite interactions.
Sex determination and reproduction
C. sinensis is a hermaphrodite, but the key genes responsible for sex determination are still unknown. We identified 53 genes related to sex determination, sex differentiation and sexual reproduction (Table S19 in Additional file 6). We also identified 25 genes in particular by their annotation with the Gene Ontology term 'hermaphrodite genitalia development'. That Gene Ontology annotation comes from C. elegans, a nematode that displays hermaphroditism. In addition, six genes were predicted to be related to sexual reproduction.
In C. sinensis, we also found the genes SOX6 (SRY (sex determining region Y)-box 6) and DMRT1 (doublesex and mab-3 related transcription factor 1), which are known sex determination genes in vertebrates. In mammals, SRY is thought to be a testis determination factor and a critical developmental regulator. The fact that SRY and SOX6 co-localize with splicing factors in the nucleus indicates that SOX6 may play a role in splicing of the testis-determining factor in C. sinensis development. Doublesex and mab-3 contain a zinc finger-like DNA-binding motif (DM domain) that performs several related regulatory functions. DMRT1 regulates a DM-domain-containing protein that has a conserved role in vertebrate sexual development. To date, most investigations of hermaphrodite development have focused on the nematodes, and our novel findings now provide valuable clues for biological research on the hermaphrodite phenomenon.
Liver flukes and cholangiocarcinoma
Of particular interest in this study was the identification of proteins that could contribute to carcinogenesis. Apart from the previously described granulin and thioredoxin peroxidase, fatty acid binding protein and phospholipase A2 are members of the CCA-related gene group (Table S20 in Additional file 6). Granulin in O. viverrini is defined as a proliferative growth factor and has been shown to be mitogenic at very low concentrations. The genomic results provide strong evidence that granulin is also encoded in C. sinensis, and further work will determine its significance in the process of carcinogenesis. Thioredoxin peroxidase is characterized as an antioxidant enzyme ubiquitously expressed in the tissues of the liver fluke and in epithelial cells within the host bile duct. Results suggest that thioredoxin peroxidase may play a significant role in protecting the parasite against damage and inducing inflammation in hosts. Our experiments have revealed the potential contribution of phospholipase A2 to hepatic fibrosis caused by C. sinensis infection. As an ES product, phospholipase A2 could bind to the receptor on the membrane of LX-2 cells. Fatty acid binding protein is thought to have functions in lipid transport in parasites, but whether fatty acid binding protein is involved in carcinogenesis requires further clarification.
It has been acknowledged that liver fluke-induced CCA is a multifactorial pathological process resulting from infection-induced inflammation and the release of carcinogenic substances by parasites. Both proteomic and transcriptomic approaches to the study of secreted and tegumental proteins have enhanced our understanding of the molecular mechanisms by which liver flukes establish a chronic infection, evade the host immune system and ultimately contribute to the onset of cancer. However, the intrinsic molecular mechanisms involved in these processes remain obscure. Long-term hepatobiliary damage may result from multiple factors, including mechanical irritation of the epithelial cells, DNA damage from endogenous and exogenous carcinogens, and immunopathological processes directed by ES products and tegumental proteins. Moreover, increased concentrations of N-nitroso compounds in humans infected with liver flukes may contribute to the risk of developing CCA through the alkylation or deamination of DNA. The results from our genomic study will help to test previous hypotheses and to explore more potentially important molecules associated with liver fluke-induced CCA.
This study provides the fundamental biological characterization of the carcinogenic human liver fluke C. sinensis, which has large socio-economic and public health effects in Asian countries [1, 2]. Recently, the advent of next-generation sequencing technology provided us with an unprecedented opportunity to obtain whole-genome sequence information for this neglected parasite. We report here the draft genome of C. sinensis based on DNA isolated from a single individual parasite. Briefly, our work contributes needed knowledge to decode the mechanisms underlying energy metabolism, developmental biology and pathogenesis in C. sinensis. Large pathogenic molecules involved in liver fluke-induced hepatobiliary disease have been discovered. Numerous multifunctional secreted proteases and tegumental proteins have been highlighted for further study as vaccine and drug targets. In conclusion, the results presented here characterize the genomic features of C. sinensis and reveal the evolutionary interplay between parasite and host. We believe that the discoveries made in the C. sinensis genome project will be quite valuable for the prevention and control of this liver fluke.
Materials and methods
DNA library construction and sequencing
Adult C. sinensis flukes were isolated from cat livers (Henan Province, China) and rinsed several times with phosphate-buffered saline. A single adult was chosen for genomic DNA extraction using phenol.
Two short-insert (350 bp and 500 bp) DNA libraries were constructed according to the Paired-End Sample Preparation Guide (Illumina, San Diego, CA, USA). Briefly, we nebulized 2.5 μg of DNA with compressed nitrogen gas, then polished the DNA ends and added an 'A' base to the ends of the DNA fragments. Next, the DNA adaptors (Illumina) were ligated to the above products, and the ligated products were purified on a 2% agarose gel. We excised and purified gel slices for each insert size (Qiagen Gel Extraction Kit; QIAGEN Co., Ltd, Shanghai, China). The two DNA libraries were amplified using the adaptor primers (Illumina) for 12 cycles, and fragments of approximately 450 bp and 600 bp (inserted DNA plus adaptors) were isolated from agarose gels.
We performed cluster generation on the cBot (Illumina), following the cBot User Guide. Then, we performed a paired-end sequencing run on the Genome Analyzer IIx (Illumina) according to the user guide. A total of 188.6 million raw reads (115 bp each) were obtained. FastQScreen was used to screen out Illumina adaptors and other contaminating sequences. After masking adaptor sequences and removing contaminated reads, clean reads were processed for computational analysis.
RNA library construction and sequencing
Adult C. sinensis flukes were isolated from cat livers (Guangdong Province, China) and rinsed several times with phosphate-buffered saline. Twenty flukes were pooled and total RNAs were extracted using the standard TRIZOL RNA isolation protocol (Invitrogen, Carlsbad, CA, USA).
For high-throughput sequencing, the sequencing library was constructed by following the manufacturer's instructions (Illumina). Fragments of 300 bp were excised and enriched by PCR for 18 cycles. Then, we performed a paired-end sequencing run on the Genome Analyzer IIx (Illumina) according to the user guide. After masking adaptor sequences and removing contaminated reads, a total of 31,965,154 clean paired-end reads (2 × 75 bp) were processed for scaffolding.
Sequence assembly and mapping
The genome size was estimated from the total sequencing length and the sequencing depth.
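A minimal Python sketch of this kind of estimate follows, assuming the simple relationship genome size = total sequenced bases / sequencing depth. The exact procedure used to obtain the depth is not given in this section, and the function name and toy numbers are illustrative only.

```python
def estimate_genome_size(total_sequenced_bases, sequencing_depth):
    """Estimate genome size as total sequenced bases divided by depth.

    Assumption: sequencing_depth is the average number of times each
    genomic position is covered, however that value was obtained.
    """
    if sequencing_depth <= 0:
        raise ValueError("sequencing depth must be positive")
    return total_sequenced_bases / sequencing_depth


# Toy numbers (hypothetical): 3,000 bases sequenced at 30x depth.
toy_size = estimate_genome_size(3_000, 30)
print(toy_size)
```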
Clean reads were trimmed to 103 bp to minimize problems associated with low quality ends. We used the Celera Assembler to assemble contigs and scaffolds, and we constructed super-scaffolds with the RNA-seq data using RNAPATH (ERANGE module).
We used Bowtie to align trimmed reads to the assembled genome with no more than three mismatches and generated a sequence alignment/map (SAM) file. Reads that matched repetitive sequences were filtered out. We converted the SAM file to a GLF file using SAMtools and called variants with glfSingle with the following parameters: (i) the coverage depth of a single base must be 10× to 60×; (ii) the root mean squared (RMS) mapping quality score of overlapping reads must be at least 99; and (iii) the posterior probability threshold is 0.999.
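The three calling criteria can be summarized as a single predicate. The Python sketch below is illustrative only: the `CandidateSite` record and `passes_filters` function are hypothetical and do not reflect the glfSingle command-line interface or file format, and treating the posterior threshold as inclusive (>=) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CandidateSite:
    """Hypothetical per-site summary used only for this illustration."""
    depth: int                     # number of reads covering the base
    rms_mapping_quality: float     # RMS mapping quality of overlapping reads
    posterior_probability: float   # posterior probability of the variant call


def passes_filters(site, min_depth=10, max_depth=60,
                   min_rms_mapq=99, min_posterior=0.999):
    """Apply the depth, mapping-quality and posterior thresholds from the text."""
    return (min_depth <= site.depth <= max_depth
            and site.rms_mapping_quality >= min_rms_mapq
            and site.posterior_probability >= min_posterior)


# Toy sites (hypothetical values).
good_site = CandidateSite(depth=20, rms_mapping_quality=99, posterior_probability=0.9995)
shallow_site = CandidateSite(depth=9, rms_mapping_quality=99, posterior_probability=0.9995)
print(passes_filters(good_site), passes_filters(shallow_site))
```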
Known repetitive elements were identified using RepeatMasker [69, 70] with the Repbase database [71, 72] (version: 2009-06). A de novo repeat library was also constructed by using RepeatModeler, which contains two de novo repeat finding programs (RECON and RepeatScout). We used default parameters and generated consensus sequences and classification information for each repeat family. Then, we ran RepeatMasker on the genome again using the repeat library built with RepeatModeler.
Gene model annotation
Predicted proteins from S. japonicum and S. mansoni were aligned to C. sinensis to identify conserved genes. Because GeneWise is time consuming, schistosome proteins were first aligned with the C. sinensis genome using genBlastA. Subsequently, we extracted matched genomic regions and used GeneWise to identify exon/intron boundaries.
Augustus and Genscan
C. sinensis ESTs
cDNA libraries from C. sinensis metacercaria and adults were constructed using the standard Trizol RNA isolation protocol (Invitrogen), and the two libraries yielded 9,455 and 2,696 EST sequences, respectively. We also downloaded 2,970 existing EST sequences from the NCBI dbEST database. In addition, 574,448 EST sequences were produced using the Roche 454 platform. We mapped all of the EST sequences to the C. sinensis genome with GMAP.
Integration of resources using EvidenceModeler
Gene predictions generated by Augustus and Genscan, spliced alignments of S. japonicum and S. mansoni proteins and EST alignments from C. sinensis were integrated with EvidenceModeler.
Protein domain analysis
InterProScan was run on all C. sinensis, S. japonicum and S. mansoni predicted protein sequences. Matches tagged as 'true positive' (status 'T') by InterProScan were retained. InterPro domain information for five other species (C. elegans, D. melanogaster, D. rerio, G. gallus and H. sapiens) was downloaded from Ensembl BioMart (Ensembl version 60).
We mapped the C. sinensis reference genes to KEGG pathways by BLAST (e-value < 1e-5). BLAST searches against the Swiss-Prot database and NCBI non-redundant database (e-value < 1e-5) were conducted to provide comprehensive functional annotation.
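An e-value cut-off of this kind can be applied to BLAST results after the search. The Python sketch below assumes the common 12-column tab-separated BLAST output, in which the e-value is the eleventh column; the paper does not state which output format was used, and the function name and toy identifiers are hypothetical.

```python
def filter_blast_hits(lines, max_evalue=1e-5):
    """Return (query, subject, evalue) for hits with evalue < max_evalue.

    Assumes 12-column tab-separated BLAST output:
    query, subject, %identity, length, mismatches, gap opens,
    q.start, q.end, s.start, s.end, e-value, bit score.
    """
    kept = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue
        evalue = float(fields[10])
        if evalue < max_evalue:
            kept.append((fields[0], fields[1], evalue))
    return kept


# Toy input (hypothetical identifiers and scores).
toy_lines = [
    "gene1\tprotA\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t180",
    "gene2\tprotB\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20",
    "# a comment line",
]
print(filter_blast_hits(toy_lines))
```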
Gene family construction
Genes were clustered according to sequence similarity. We selected nine species in which to analyze gene families: C. sinensis, S. japonicum, S. mansoni, C. elegans, D. melanogaster, A. gambiae, D. rerio, G. gallus and H. sapiens. Additional file 7 shows the sources of sequence data used in the present study [80, 82, 83, 84]. For each gene, the longest protein product was used for alignment purposes. The peptide sequences were first aligned to other sequences from the same genome using BLAST. Hits with e-value < 1e-10 were used for clustering by Markov clustering (the parameter -I was set to 6).
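The two preparatory steps in this paragraph, choosing the longest protein per gene and keeping only hits with e-value < 1e-10 as input for Markov clustering, can be sketched as follows. The function names, the input tuples, and the edge weight (negative log10 of the e-value, capped at 200) are assumptions made for illustration; the text does not state how edges were weighted.

```python
import math


def longest_protein_per_gene(proteins):
    """proteins: iterable of (gene_id, protein_id, sequence).

    Returns {gene_id: (protein_id, sequence)} keeping the longest sequence.
    """
    longest = {}
    for gene_id, protein_id, seq in proteins:
        if gene_id not in longest or len(seq) > len(longest[gene_id][1]):
            longest[gene_id] = (protein_id, seq)
    return longest


def abc_edges(hits, cutoff=1e-10, cap=200.0):
    """hits: iterable of (query, subject, evalue).

    Yields tab-separated 'query subject weight' lines (label/label/weight
    format) for hits below the cutoff, skipping self-hits. The weight
    -log10(e-value), capped at `cap`, is an assumption of this sketch.
    """
    for query, subject, evalue in hits:
        if query == subject or evalue >= cutoff:
            continue
        weight = cap if evalue == 0 else min(cap, -math.log10(evalue))
        yield f"{query}\t{subject}\t{weight:.1f}"


if __name__ == "__main__":
    # Hypothetical identifiers and sequences.
    proteins = [("g1", "g1.t1", "MKT"), ("g1", "g1.t2", "MKTAYI"), ("g2", "g2.t1", "MA")]
    print(longest_protein_per_gene(proteins))
    hits = [("a", "b", 1e-50), ("a", "a", 0.0), ("a", "c", 1e-3), ("b", "c", 0.0)]
    print("\n".join(abc_edges(hits)))
```

Lines produced this way could be written to a file and passed to the mcl program with its label-format input option and an inflation value of 6 (for example `mcl graph.abc --abc -I 6 -o clusters.txt`); the file names here are placeholders.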
Non-coding RNA annotation
The rRNA fragments were identified by aligning C. sinensis rRNA sequences from the NCBI Nucleotide database to the draft genome. The tRNA genes were found by running tRNAscan-SE with eukaryote parameters. Other non-coding RNAs, including miRNAs, small nuclear RNAs and H/ACA-box small nucleolar RNAs, were identified by searching the Rfam database with the software tool Infernal 1.0.
Synteny with S. japonicum and S. mansoni
Seventy-nine scaffolds longer than 200 kb were selected for pairwise genome alignment with S. japonicum and S. mansoni using BLASTZ with the following parameters: C = 2, T = 0, W = 6, H = 2000, Y = 3400, L = 6000 and K = 2200. The Chain/Net package was used for post-processing, including lavToPsl, chainMergeSort, chainPreNet, chainNet, netToAxt and axtToMaf. All three genomes were masked with RepeatMasker using the '-s' setting.
We selected nine species to construct a phylogenetic tree: C. sinensis, H. sapiens, G. gallus, D. rerio, D. melanogaster, A. gambiae, C. elegans, S. mansoni and S. japonicum. For each species, the longest transcript model was chosen to represent each gene, and genes shorter than 30 amino acids were excluded. BlastP was used to compare all orthologues of the C. sinensis protein sequences against a protein database built from the other eight species (e-value < 1E-10), and the Solar program was used to concatenate fragmentary alignments for each pair of genes. Genes that aligned with more than one-third of another gene in the same species were considered multi-copy genes and excluded from the analysis.
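The one-third rule for flagging multi-copy genes can be sketched as a coverage calculation over the alignment fragments of a gene pair. How coverage was actually measured is not stated in the text, so the definition used here (union of 1-based inclusive alignment intervals divided by gene length) and the function names are assumptions.

```python
def covered_fraction(intervals, gene_length):
    """Fraction of a gene covered by the union of alignment intervals.

    intervals: list of (start, end) pairs, 1-based and inclusive.
    Overlapping or adjacent intervals are merged before counting.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / gene_length


def is_multicopy(intervals, gene_length, threshold=1 / 3):
    """True when aligned fragments cover more than `threshold` of the gene."""
    return covered_fraction(intervals, gene_length) > threshold


if __name__ == "__main__":
    # Hypothetical fragments on a 300-residue protein.
    print(is_multicopy([(1, 50), (40, 120)], 300))   # 120 of 300 residues covered
    print(is_multicopy([(1, 30), (100, 120)], 300))  # 51 of 300 residues covered
```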
In total, 93 genes with single-copy orthologues in all species were identified. Individual multiple amino acid sequence alignments for each gene were created with CLUSTALW. Those alignments that lacked informative sites or had too many gaps were discarded. The remaining 44 genes were concatenated into a final alignment. Regions with many mismatches were also discarded to reduce alignment error. The best protein model was found by MEGA5 and used in the following analysis. The phylogenetic tree was constructed by maximum likelihood methods using both MEGA5 and PHYML, which independently reached the same topology (only the results obtained from MEGA5 are presented here). Bootstrap values were based on 1,000 replicates. Tajima's relative rate test was performed for C. sinensis and S. mansoni (or S. japonicum), with D. rerio (or any of the other five species) used as an out-group.
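For readers unfamiliar with Tajima's relative rate test, the following is a minimal sketch of its simplest form, which compares the number of sites unique to each of two ingroup sequences relative to an out-group using a chi-square statistic with one degree of freedom. The study ran the test in dedicated software; this function, its name, its handling of gaps and the toy sequences are illustrative assumptions only.

```python
import math


def tajima_relative_rate(seq1, seq2, outgroup):
    """Tajima's relative rate test on three aligned sequences.

    m1 counts sites where seq1 differs while seq2 matches the out-group;
    m2 counts sites where seq2 differs while seq1 matches the out-group.
    Sites containing a gap ('-') or 'X' are skipped (an assumption).
    Returns (m1, m2, chi2, p_value) with chi2 = (m1 - m2)^2 / (m1 + m2).
    """
    if not (len(seq1) == len(seq2) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    m1 = m2 = 0
    for a, b, o in zip(seq1, seq2, outgroup):
        if any(c in "-X" for c in (a, b, o)):
            continue
        if a != b and b == o:
            m1 += 1  # change unique to the seq1 lineage
        elif a != b and a == o:
            m2 += 1  # change unique to the seq2 lineage
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return m1, m2, chi2, p_value


if __name__ == "__main__":
    # Toy alignment, not real data.
    out = "AAAAAAAAAA"
    s1 = "CCCCCCAAAA"
    s2 = "AAAAAAAAGG"
    print(tajima_relative_rate(s1, s2, out))
```

A small p-value would indicate that the two lineages have accumulated substitutions at different rates since they diverged.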
All of the genome shotgun and transcriptome data are available in the NCBI Sequence Read Archive [SRA: 029284 and 035384]. The assembled genome and gene models are available at . The genome sequences can also be downloaded from the DNA Data Bank of Japan [DDBJ: BADR01000001-BADR01060778 (contigs) and DF126616-DF142827 (scaffolds)]. The genome is available at the NCBI [NCBI: 72781], and the sequences are also available from GenBank [GenBank: BADR00000000.1].
This work was supported by the Development Program of China (973 Program; no. 2010CB530000), the Sun Yat-sen University innovative talents cultivation program for excellent tutors and the program for detection techniques for important human parasitic diseases (no. 2008ZX1004-011).
- 2.Young ND, Jex AR, Cantacessi C, Campbell BE, Laha T, Sohn WM, Sripa B, Loukas A, Brindley PJ, Gasser RB: Progress on the transcriptomics of carcinogenic liver flukes of humans--unique biological and biotechnological prospects. Biotechnol Adv. 2010, 28: 859-870. 10.1016/j.biotechadv.2010.07.006.
- 3.Young ND, Campbell BE, Hall RS, Jex AR, Cantacessi C, Laha T, Sohn WM, Sripa B, Loukas A, Brindley PJ, Gasser RB: Unlocking the transcriptomes of two carcinogenic parasites, Clonorchis sinensis and Opisthorchis viverrini. PLoS Negl Trop Dis. 2010, 4: e719. 10.1371/journal.pntd.0000719.
- 4.Lai DH, Wang QP, Chen W, Cai LS, Wu ZD, Zhu XQ, Lun ZR: Molecular genetic profiles among individual Clonorchis sinensis adults collected from cats in two geographic regions of China revealed by RAPD and MGE-PCR methods. Acta Trop. 2008, 107: 213-216. 10.1016/j.actatropica.2008.05.003.
- 5.Kim HG, Han J, Kim MH, Cho KH, Shin IH, Kim GH, Kim JS, Kim JB, Kim TN, Kim TH, Kim TH, Kim JW, Ryu JK, Moon YS, Moon JH, Park SJ, Park CG, Bang SJ, Yang CH, Yoo KS, Yoo BM, Lee KT, Lee DK, Lee BS, Lee SS, Lee SO, Lee WJ, Cho CM, Joo YE, Cheon GJ, et al: Prevalence of clonorchiasis in patients with gastrointestinal disease: a Korean nationwide multicenter survey. World J Gastroenterol. 2009, 15: 86-94. 10.3748/wjg.15.86.
- 11.Morphew RM, Wright HA, LaCourse EJ, Woods DJ, Brophy PM: Comparative proteomics of excretory-secretory proteins released by the liver fluke Fasciola hepatica in sheep host bile and during in vitro culture ex host. Mol Cell Proteomics. 2007, 6: 963-972. 10.1074/mcp.M600375-MCP200.
- 12.Ju JW, Joo HN, Lee MR, Cho SH, Cheun HI, Kim JY, Lee YH, Lee KJ, Sohn WM, Kim DM, Kim IC, Park BC, Kim TS: Identification of a serodiagnostic antigen, legumain, by immunoproteomic analysis of excretory-secretory products of Clonorchis sinensis adult worms. Proteomics. 2009, 9: 3066-3078. 10.1002/pmic.200700613.
- 14.Smout MJ, Laha T, Mulvenna J, Sripa B, Suttiprapa S, Jones A, Brindley PJ, Loukas A: A granulin-like growth factor secreted by the carcinogenic liver fluke, Opisthorchis viverrini, promotes proliferation of host cells. PLoS Pathog. 2009, 5: e1000611. 10.1371/journal.ppat.1000611.
- 15.Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KH, Remington KA, Anson EL, Bolanos RA, Chou HH, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC, Chen L, Dunn PJ, Lai Z, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng X, Rubin GM, Adams MD, Venter JC: A whole-genome assembly of Drosophila. Science. 2000, 287: 2196-2204. 10.1126/science.287.5461.2196.
- 16.Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AW, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, et al: The diploid genome sequence of an individual human. PLoS Biol. 2007, 5: e254. 10.1371/journal.pbio.0050254.
- 19.Langmead B: Aligning short sequencing reads with Bowtie. Curr Protoc Bioinformatics. 2010, Chapter 11 (Unit 11.7).
- 25.Berriman M, Haas BJ, LoVerde PT, Wilson RA, Dillon GP, Cerqueira GC, Mashiyama ST, Al-Lazikani B, Andrade LF, Ashton PD, Aslett MA, Bartholomeu DC, Blandin G, Caffrey CR, Coghlan A, Coulson R, Day TA, Delcher A, DeMarco R, Djikeng A, Eyre T, Gamble JA, Ghedin E, Gu Y, Hertz-Fowler C, Hirai H, Hirai Y, Houston R, Ivens A, Johnston DA, et al: The genome of the blood fluke Schistosoma mansoni. Nature. 2009, 460: 352-358. 10.1038/nature08160.
- 26.Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
- 28.CateGOrizer. [http://www.animalgenome.org/bioinfo/tools/catego/]
- 33.Dvorák J, Mashiyama ST, Braschi S, Sajid M, Knudsen GM, Hansell E, Lim KC, Hsieh I, Bahgat M, Mackenzie B, Medzihradszky KF, Babbitt PC, Caffrey CR, McKerrow JH: Differential use of protease families for invasion by schistosome cercariae. Biochimie. 2008, 90: 345-358. 10.1016/j.biochi.2007.08.013.
- 34.Kang JM, Bahk YY, Cho PY, Hong SJ, Kim TS, Sohn WM, Na BK: A family of cathepsin F cysteine proteases of Clonorchis sinensis is the major secreted proteins that are expressed in the intestine of the parasite. Mol Biochem Parasitol. 2010, 170: 7-16. 10.1016/j.molbiopara.2009.11.006.
- 37.Prakobwong S, Pinlaor S, Yongvanit P, Sithithaworn P, Pairojkul C, Hiraku Y: Time profiles of the expression of metalloproteinases, tissue inhibitors of metalloproteases, cytokines and collagens in hamsters infected with Opisthorchis viverrini with special reference to peribiliary fibrosis and liver injury. Int J Parasitol. 2009, 39: 825-835. 10.1016/j.ijpara.2008.12.002.
- 38.Suttiprapa S, Mulvenna J, Huong NT, Pearson MS, Brindley PJ, Laha T, Wongkham S, Kaewkes S, Sripa B, Loukas A: Ov-APR-1, an aspartic protease from the carcinogenic liver fluke, Opisthorchis viverrini: functional expression, immunolocalization and subsite specificity. Int J Biochem Cell Biol. 2009, 41: 1148-1156. 10.1016/j.biocel.2008.10.013.
- 41.Syin C, Parzy D, Traincard F, Boccaccio I, Joshi MB, Lin DT, Yang XM, Assemat K, Doerig C, Langsley G: The H89 cAMP-dependent protein kinase inhibitor blocks Plasmodium falciparum development in infected erythrocytes. Eur J Biochem. 2001, 268: 4842-4849. 10.1046/j.1432-1327.2001.02403.x.
- 58.Suttiprapa S, Loukas A, Laha T, Wongkham S, Kaewkes S, Gaze S, Brindley PJ, Sripa B: Characterization of the antioxidant enzyme, thioredoxin peroxidase, from the carcinogenic human liver fluke, Opisthorchis viverrini. Mol Biochem Parasitol. 2008, 160: 116-122. 10.1016/j.molbiopara.2008.04.010.
- 60.Mulvenna J, Sripa B, Brindley PJ, Gorman J, Jones MK, Colgrave ML, Jones A, Nawaratna S, Laha T, Suttiprapa S, Smout MJ, Loukas A: The secreted and surface proteomes of the adult stage of the carcinogenic human liver fluke Opisthorchis viverrini. Proteomics. 2010, 10: 1063-1078.
- 61.FastQScreen. [http://www.bioinformatics.bbsrc.ac.uk/projects/fastq_screen/]
- 62.Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q, Cai Q, Li B, Bai Y, Zhang Z, Zhang Y, Wang W, Li J, Wei F, Li H, Jian M, Li J, Zhang Z, Nielsen R, Li D, Gu W, Yang Z, Xuan Z, Ryder OA, Leung FC, Zhou Y, Cao J, Sun X, Fu Y, et al: The sequence and de novo assembly of the giant panda genome. Nature. 2010, 463: 311-317. 10.1038/nature08696.
- 63.Huang S, Li R, Zhang Z, Li L, Gu X, Fan W, Lucas WJ, Wang X, Xie B, Ni P, Ren Y, Zhu H, Li J, Lin K, Jin W, Fei Z, Li G, Staub J, Kilian A, van der Vossen EA, Wu Y, Guo J, He J, Jia Z, Ren Y, Tian G, Lu Y, Ruan J, Qian W, et al: The genome of the cucumber, Cucumis sativus L. Nat Genet. 2009, 41: 1275-1281. 10.1038/ng.475.
- 68.glfSingle. [http://www.sph.umich.edu/csg/abecasis/glfTools/]
- 69.Tarailo-Graovac M, Chen N: Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics. 2009, Chapter 4 (Unit 4.10).
- 70.Chen N: Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics. 2004, Chapter 4 (Unit 4.10).
- 80.Ensembl. [http://www.ensembl.org]
- 82.Anopheles gambiae genome. [ftp://ftp.ncbi.nih.gov/genomes/Anopheles_gambiae/]
- 83.Schistosoma mansoni genome resource. [ftp://ftp.sanger.ac.uk/pub/pathogens/Schistosoma/mansoni/genome/]
- 84.Schistosoma japonicum genome resource. [http://www.chgc.sh.cn/japonicum/resource/]
- 85.Markov clustering. [http://www.micans.org/mcl/]
- 90.Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680. 10.1093/nar/22.22.4673.
- 94.Clonorchis sinensis Genome Database. [http://fluke.sysu.edu.cn]
- 95.Orphelia. [http://orphelia.gobics.de/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.