CGKB: an annotation knowledge base for cowpea (Vigna unguiculata L.) methylation filtered genomic genespace sequences
Cowpea [Vigna unguiculata (L.) Walp.] is one of the most important food and forage legumes in the semi-arid tropics because of its ability to tolerate drought and grow on poor soils. It is cultivated mostly by poor farmers in developing countries, with 80% of production taking place in the dry savannah of tropical West and Central Africa. Cowpea is largely an underexploited crop with relatively little genomic information available for use in applied plant breeding. The goal of the Cowpea Genomics Initiative (CGI), funded by the Kirkhouse Trust, a UK-based charitable organization, is to leverage modern molecular genetic tools for gene discovery and cowpea improvement. One aspect of the initiative is the sequencing of the gene-rich region of the cowpea genome (termed the genespace), recovered using methylation filtration technology, and the annotation and analysis of the resulting sequence data.
CGKB is an integrated and annotated resource for cowpea GSS with features of homology-based and HMM-based annotations, enzyme and pathway annotations, GO term annotation, toolkits, and a large number of other facilities to perform complex queries. The cowpea GSS, chloroplast sequences, mitochondrial sequences, retroelements, and SSR sequences are available as FASTA formatted files and downloadable at CGKB. This database and web interface are publicly accessible at http://cowpeagenomics.med.virginia.edu/CGKB/.
Keywords: Simple Sequence Repeat Marker; Tandem Repeat Finder; Relational Database Management System; HMMER Package; Metabolic Pathway Database
Cowpea [Vigna unguiculata (L.) Walp.] is one of the most important food and forage legumes in the semi-arid tropics and a valuable and dependable commodity for farmers and grain traders, with ~21 million acres grown worldwide and an annual production of over 3 million tons [1, 2]. It is grown mostly by poor farmers in developing countries, with 80% of production taking place in the dry savannah of tropical West and Central Africa. Despite its importance, cowpea has received relatively little attention from a research standpoint and remains to a large extent an underexploited crop where relatively large genetic gains can be made with only modest investments in both applied plant breeding and molecular genetics.
Cowpea growth and yield are constrained by a variety of biotic and abiotic factors. Insects, fungi, bacteria, parasitic plants and nematodes are the major biotic stresses, and drought, salinity and heat are among the major environmental limitations to cowpea productivity [1, 2]. One of the major goals of cowpea breeding and improvement programs is to combine resistances to numerous pests and diseases and other desirable agronomic traits, such as those governing maturity, photoperiod sensitivity, plant type, and seed quality. New opportunities for improving cowpea exist by leveraging the emerging genomic tools and knowledge gained through research on other major legume crops and model species.
The size of the cowpea nuclear genome has been estimated at 620 megabases (Mb), making it one of the smaller genomes among leguminous plants as well as among vascular plants. It is well documented that vascular plant genomes contain significant amounts of heavily methylated (hypermethylated) repetitive DNA surrounded by less methylated (hypomethylated) gene-rich regions [4, 5, 6, 7]. Cytosine methylation content is positively correlated with genome size and complexity, with 5-methylcytosine (5mC) content ranging from 5% to 25% of total cytosine, depending on the species. Whereas the bulk of DNA methylation in mammals is confined to the short symmetrical sequence 5'-CG-3', DNA methylation in plant genomes is found in three nucleotide-sequence contexts: CG and two categories of non-CG sites, symmetrical CNG and asymmetric CHH sites (where N is any nucleotide and H is A, C or T) [8, 9, 10]. At the chromosomal level, an analysis of cytosine methylation levels in Arabidopsis chromosomes showed that there is a gradual increase of methylation along the genomic region analyzed: CpG methylation in the euchromatic fraction, CpG and CpNpG methylation at the euchromatin/heterochromatin transition, and an additional asymmetrical methylation in the repeated-heterochromatic fraction. The density of 5-methylcytosine increased from the euchromatin towards the heterochromatin. The most methylated repeated family at CpG, CpNpG and asymmetrical sites is the 5S ribosomal DNA, which is highly methylated even though it is transcribed.
Methylation filtering (MF) allows for the selective cloning of hypomethylated regions of the plant nuclear genome. MF has been successfully applied to the shotgun sequencing of the genomes of several plant species, allowing an examination of the content of the gene-rich regions referred to as the genespace [12, 13, 14, 15, 16]. A pilot study was carried out to determine whether GeneThresher® methylation filtering technology could be successfully applied to analyzing the genespace of cowpea. Both methylation filtered (MF, GeneThresher® technology) and unfiltered (UF) libraries were constructed, clones were picked at random from both libraries, and the insert sequences were determined and analyzed to estimate filtering power. The gene enrichment achieved by GeneThresher® was determined by comparing the rate of gene discovery between MF and UF sequences. Detection of genes was accomplished by an NCBI-BLASTX search (parameters: -e 0.01; -b 5; -v 5) of the curated Arabidopsis protein database [13, 17]. The results of this pilot study showed that the GeneThresher® technology produced a 4.1-fold enrichment of gene-rich clones from cowpea genomic DNA libraries and estimated the size of the hypomethylated, gene-rich space of cowpea to be approximately 151 Mb. Using empirically derived results from the Orion Sorghum GeneThresher project and a simulation conducted on finished Arabidopsis sequence, we estimated that in order to sequence tag some portion of ~95% of the genes in the cowpea genome, we would need to generate ~252,000 GeneThresher sequences, assuming an average read length of 600 bp and a 151 Mb genespace. This 1× coverage of raw sequence would encompass ~67% of the predicted genespace (Timko, M.P., manuscript in preparation).
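The arithmetic behind the sequencing estimate above can be checked with a short Python sketch. The read count, read length and genespace size come from the text; the Poisson (Lander-Waterman style) coverage formula is an assumption added here for illustration and is not the empirical procedure the authors describe, which is why it gives a figure somewhat below the ~67% quoted above.

```python
import math

def fold_coverage(n_reads: int, read_len_bp: int, target_bp: int) -> float:
    """Raw fold coverage: total sequenced bases divided by target size."""
    return n_reads * read_len_bp / target_bp

def idealized_fraction_covered(coverage: float) -> float:
    """Fraction of the target hit at least once under an idealized
    Poisson model with uniformly random reads: 1 - exp(-c)."""
    return 1.0 - math.exp(-coverage)

# Figures quoted in the text.
N_READS = 252_000
READ_LEN_BP = 600
GENESPACE_BP = 151_000_000

coverage = fold_coverage(N_READS, READ_LEN_BP, GENESPACE_BP)
fraction = idealized_fraction_covered(coverage)

# Side observation only (not a method stated in the text):
# genome size divided by the fold enrichment is also close to 151 Mb.
implied_genespace_mb = 620 / 4.1

print(f"fold coverage: {coverage:.2f}x")
print(f"idealized fraction covered: {fraction:.1%}")
print(f"620 Mb / 4.1 = {implied_genespace_mb:.1f} Mb")
```

Under these assumptions 252,000 reads of 600 bp amount to about 151 Mb of raw sequence, that is, roughly 1× coverage of the estimated genespace.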
Construction and content
Data processing and analysis
The primary sequence dataset for the CGKB consists of a total of 298,848 GSS isolated by methylation filtering of the cowpea genomic DNA. The FASTA formatted cowpea sequence files generated using Phred basecalling were vector trimmed. Contaminant sequences, defined as sequences that at the time of initial annotation are believed to be derived from vector, microbial, fungal (yeast), viral or animal genomes, were removed. Chloroplast, mitochondrial and transposon/retrotransposon DNA sequences were also removed. Chloroplast, mitochondrial and transposon/retrotransposon sequences were identified by significant BLAST similarity, with expectation (E) values equal to or less than 1 × 10^-10 when compared to relevant public databases. A total of 9985 chloroplast sequences, 856 mitochondrial sequences and 2608 transposon/retrotransposon-like sequences were identified in the dataset, and these sequences are downloadable from the CGKB website. The remaining 263,425 MF nuclear sequences, with an average length of 610 bp, were subjected to the annotation and data analysis processing pipeline. We used the Perl programming language to implement data processing and analysis pipelines that incorporate various analysis algorithms. Both BLAST and Hidden Markov Model (HMM)-based algorithms are CPU intensive for genome scale data analysis. These CPU intensive data analysis pipelines were run on a computer cluster of over 30 dual-processor Apple XServes. A job management system called Vela was also created as a robust way to submit and manage large numbers of running data analysis jobs to the Portable Batch System (PBS) in our distributed Apple OSX-based computer cluster.
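The organelle and transposon screening step described above can be sketched as follows. This is a minimal illustration in Python, not the Perl pipeline used for the CGKB: it assumes BLAST results are available in the standard 12-column tabular layout (E-value in the eleventh column), and the function names, sequence IDs and hit lines are hypothetical.

```python
from typing import Dict, Iterable, List, Set

EVALUE_CUTOFF = 1e-10  # threshold quoted in the text

def queries_with_hits(tabular_lines: Iterable[str],
                      max_evalue: float = EVALUE_CUTOFF) -> Set[str]:
    """Return the query IDs having at least one hit at or below max_evalue."""
    hits: Set[str] = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            hits.add(query_id)
    return hits

def partition_reads(all_ids: Iterable[str],
                    category_hits: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Assign each read to the first category that claims it, else 'nuclear'."""
    bins: Dict[str, List[str]] = {name: [] for name in category_hits}
    bins["nuclear"] = []
    for read_id in all_ids:
        for name, ids in category_hits.items():
            if read_id in ids:
                bins[name].append(read_id)
                break
        else:
            bins["nuclear"].append(read_id)
    return bins

# Made-up tabular hits against hypothetical organelle reference databases.
chloroplast_blast = [
    "GSS0001\tcp_ref\t98.5\t600\t9\t0\t1\t600\t1200\t1799\t1e-150\t540",
    "GSS0003\tcp_ref\t80.0\t60\t12\t0\t1\t60\t50\t109\t2e-05\t48",  # too weak
]
mito_blast = [
    "GSS0002\tmt_ref\t97.0\t580\t17\t0\t1\t580\t300\t879\t3e-120\t470",
]

bins = partition_reads(
    ["GSS0001", "GSS0002", "GSS0003", "GSS0004"],
    {
        "chloroplast": queries_with_hits(chloroplast_blast),
        "mitochondrial": queries_with_hits(mito_blast),
    },
)
print(bins)
```

In this toy run GSS0003 stays in the nuclear set because its only hit has an E-value above the cutoff.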
Comparative plant genome analysis
The complete sequence is currently available for two vascular plant genomes, Arabidopsis thaliana and Oryza sativa. An international effort is also underway to sequence Medicago truncatula as the nodal species for comparative and functional legume genomics. In addition, the draft genome of the woody angiosperm Populus trichocarpa (Torr. & Gray) (black cottonwood) has been completed and is available for comparison. We performed comparative genome analysis using these four plant genomes. Each cowpea GSS was searched with BLASTX against proteomes of Medicago truncatula, Arabidopsis thaliana, Oryza sativa and Populus trichocarpa for comparative analysis and knowledge integration. BLASTX results with the Arabidopsis proteome were used for the assignments of curated Gene Ontology terms and pathways from TAIR for each annotated GSS. BLASTX results with UniProtKB-Swiss-Prot were used for data integration with the ENZYME database.
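The transfer of curated GO terms through Arabidopsis homology can be pictured as a simple lookup. The Python sketch below assumes that a best Arabidopsis hit has already been chosen for each GSS and that a locus-to-GO mapping has been loaded; the function name and all example values are illustrative and are not taken from TAIR or from the CGKB.

```python
from typing import Dict, List

def transfer_go_terms(best_arabidopsis_hit: Dict[str, str],
                      locus_to_go: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Give each GSS the GO terms curated for its best Arabidopsis hit.

    GSS whose hit has no curated GO terms are left out of the result.
    """
    return {
        gss_id: list(locus_to_go[locus])
        for gss_id, locus in best_arabidopsis_hit.items()
        if locus in locus_to_go
    }

# Illustrative inputs only.
best_hit = {"GSS0001": "AT1G01010", "GSS0002": "AT5G99999"}
go_map = {"AT1G01010": ["GO:0003677", "GO:0006355"]}

go_annotations = transfer_go_terms(best_hit, go_map)
print(go_annotations)
```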
Searches were performed with each cowpea GSS using BLASTX, with a cutoff expectation (e) value of 1e-8, against the UniProtKB-TrEMBL, UniProtKB-Swiss-Prot, NCBI GenBank Proteins, and UniProtKB-PIR (Protein Information Resource) public FASTA formatted protein databases.
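As a rough illustration of how such search results might be reduced to one annotation per GSS, the Python sketch below keeps the best hit per query at the 1e-8 cutoff and reports the percentage of matched sequences. It assumes 12-column tabular BLAST output; the tie-breaking rule (lowest E-value, then highest bit score), the names and the example lines are assumptions made here, not details given in the text.

```python
from typing import Dict, Iterable, NamedTuple

class Hit(NamedTuple):
    subject: str
    evalue: float
    bitscore: float

def best_hits(tabular_lines: Iterable[str],
              max_evalue: float = 1e-8) -> Dict[str, Hit]:
    """Keep, per query, the hit with the lowest E-value (ties: highest bit score)."""
    best: Dict[str, Hit] = {}
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query = fields[0]
        hit = Hit(fields[1], float(fields[10]), float(fields[11]))
        if hit.evalue > max_evalue:
            continue
        current = best.get(query)
        if current is None or (hit.evalue, -hit.bitscore) < (current.evalue, -current.bitscore):
            best[query] = hit
    return best

def percent_matched(best: Dict[str, Hit], total_queries: int) -> float:
    """Percentage of query sequences with at least one accepted hit."""
    return 100.0 * len(best) / total_queries

# Made-up BLASTX lines for four hypothetical GSS (GSS0004 has no hit at all).
blastx_lines = [
    "GSS0001\tP1\t70.0\t150\t45\t0\t1\t450\t10\t159\t1e-30\t120",
    "GSS0001\tP2\t85.0\t180\t27\t0\t1\t540\t5\t184\t1e-45\t170",
    "GSS0002\tP3\t40.0\t50\t30\t0\t1\t150\t1\t50\t1e-03\t40",
    "GSS0003\tP4\t60.0\t90\t36\t0\t1\t270\t20\t109\t5e-09\t60",
]

annotated = best_hits(blastx_lines)
print(annotated)
print(percent_matched(annotated, total_queries=4))
```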
HMM-based gene-modeling and domain finding
The potential domains on annotated GSS were analyzed using the HMMER package against the Pfam database [31, 32]. Possible exons and introns in each cowpea GSS were predicted using the HMM-based Genscan gene prediction program [33, 34]. Although Genscan has not been widely applied to plant systems, we found that, in combination with the HMMER package, it gives a reasonable estimate of coding potential. Additional gene prediction programs that are better optimized for use in plant systems, such as FgeneSH, are being applied, and the results will be added to the database as they become available.
Tandem repeat finding
GSS data from cowpea should be an invaluable resource for the development of molecular markers and genetic maps for comparing syntenic relationships among legume and non-legume species. Simple sequence repeat (SSR) markers are among the most popular DNA markers for plant genome analysis and marker-assisted selection in crop breeding programs. Traditionally, SSR markers were generated through screening of SSR-enriched genomic libraries, a process that was very time-consuming and expensive. Recently, in-silico methods have been developed that allow rapid discovery of potential SSR markers from plant DNA sequence (EST, genomic fragments, BAC ends) datasets. Each GSS was analyzed using the Tandem Repeats Finder program developed by Benson. The identity of the GSS containing one or more SSRs, along with information on repeat size, composition, and the primers for their amplification, were parsed and loaded into relational tables for sorting, search, and joining.
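To give a feel for what in-silico SSR discovery involves, the Python sketch below scans a sequence for perfect tandem repeats with a regular expression. This is a deliberately simplified stand-in and not the Tandem Repeats Finder algorithm, which also detects approximate repeats; the motif length range, the minimum copy number and the example sequence are assumptions chosen for illustration.

```python
import re
from typing import List, NamedTuple

class SSR(NamedTuple):
    start: int   # 0-based start position of the repeat tract
    motif: str   # repeated unit
    copies: int  # number of consecutive copies

def find_perfect_ssrs(seq: str, min_unit: int = 2, max_unit: int = 6,
                      min_copies: int = 4) -> List[SSR]:
    """Find perfect tandem repeats of units min_unit..max_unit bp long."""
    seq = seq.upper()
    found: List[SSR] = []
    for unit in range(min_unit, max_unit + 1):
        # One copy of the unit followed by at least (min_copies - 1) more.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. ATAT, AA).
            if any(motif == motif[:k] * (unit // k)
                   for k in range(1, unit) if unit % k == 0):
                continue
            found.append(SSR(match.start(), motif, len(match.group(0)) // unit))
    return sorted(found)

# Hypothetical GSS fragment containing an (AT)6 and a (GAA)5 tract.
example = "GGC" + "AT" * 6 + "CCGT" + "GAA" * 5 + "TC"
ssrs = find_perfect_ssrs(example)
for ssr in ssrs:
    print(ssr)
```

Records of this kind (sequence ID, motif, copy number, position) are what would then be loaded into relational tables for searching.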
We have also created an easy-to-use interface for the knowledge bases, mostly based on CGI written in Perl running on an Apache web server. Our user interface has several components. The most important ones are data download, sequence and library statistics, analysis toolkit, and the SSR, metabolic pathway, and annotation knowledge bases.
Cowpea GSS Knowledge bases
(1) The cowpea GSS annotation database
Table. Annotation results for a total of 263,425 GSS via a homology-based approach. Fields reported: Annotated Cowpea GSS; Distinct Accession Numbers; Annotation Database Size; Percentage of Matched Sequences.
(2) The cowpea SSR database
The interface allows users to query the database by cowpea GSS ID, consensus pattern, repeat copy number, consensus size, or richness of nucleotides in the repeats. A total of 30,877 SSRs were identified among the GSS, with 3,717 SSRs located in GSS with homology to known genes. All identified SSRs are available for downloading from the CGKB.
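A query of this kind against a relational SSR table might look like the Python sketch below. SQLite is used here only as a convenient stand-in, since the text does not name the database system behind the CGKB; the table layout, column names and rows are hypothetical.

```python
import sqlite3
from typing import List, Optional, Tuple

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE ssr (
           gss_id TEXT,
           consensus TEXT,
           consensus_size INTEGER,
           copy_number REAL,
           has_gene_homology INTEGER
       )"""
)
# Made-up rows for illustration.
conn.executemany(
    "INSERT INTO ssr VALUES (?, ?, ?, ?, ?)",
    [
        ("GSS0001", "AT", 2, 12.0, 1),
        ("GSS0002", "GAA", 3, 7.3, 0),
        ("GSS0003", "AG", 2, 5.5, 0),
        ("GSS0004", "CTT", 3, 9.0, 1),
    ],
)

def query_ssrs(db: sqlite3.Connection,
               consensus_size: Optional[int] = None,
               min_copies: Optional[float] = None,
               gene_only: bool = False) -> List[Tuple[str, str, float]]:
    """Filter SSR records by unit size, minimum copy number and gene homology."""
    clauses, params = [], []
    if consensus_size is not None:
        clauses.append("consensus_size = ?")
        params.append(consensus_size)
    if min_copies is not None:
        clauses.append("copy_number >= ?")
        params.append(min_copies)
    if gene_only:
        clauses.append("has_gene_homology = 1")
    sql = "SELECT gss_id, consensus, copy_number FROM ssr"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY gss_id"
    return db.execute(sql, params).fetchall()

dinucleotide_hits = query_ssrs(conn, consensus_size=2, min_copies=6)
genic_hits = query_ssrs(conn, gene_only=True)
print(dinucleotide_hits)
print(genic_hits)
```

Building the WHERE clause from optional filters with bound parameters keeps the query safe to expose through a web form.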
(3) The cowpea metabolic pathway database
The web-based sequence analysis toolkit
BlastAll is a local installation of NCBI BLAST program with the cowpea GSS FASTA formatted sequence database. This tool can be used to retrieve homologous overlapping cowpea sequences for sequence extension and contig building.
RetrieveAll is a program for quick cowpea GSS sequence retrieval via sequence ID or trace name.
Contig Builder is a local web-based implementation of the Phrap program for contig building. Overlapping or homologous cowpea GSS can be uploaded into this tool to extend the sequence length or make contigs based on sequence overlaps.
Multiple Sequence Alignment is a local web-based implementation of the CLUSTALW multiple sequence alignment program, used to check the quality of the overlapping GSS region when forming contigs or extending the length of GSS sequences.
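For users who download the FASTA files and wish to reproduce ID-based retrieval locally, in the spirit of the RetrieveAll tool, a minimal Python sketch is given below. It is not the CGKB implementation; the header layout (sequence ID as the first token after ">") and the example records are assumptions.

```python
from typing import Dict, Iterable, Iterator, Tuple

def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def build_index(lines: Iterable[str]) -> Dict[str, str]:
    """Index sequences by the first whitespace-delimited token of the header."""
    index: Dict[str, str] = {}
    for header, seq in parse_fasta(lines):
        parts = header.split()
        if not parts:
            continue  # record without an ID
        index[parts[0]] = seq
    return index

# Hypothetical two-record FASTA file.
fasta_text = """>GSS0001 example read
ACGTACGTAC
GGTTAACC
>GSS0002 another read
TTGACA
"""

index = build_index(fasta_text.splitlines())
print(index.get("GSS0001"))
print(index.get("GSS9999", "not found"))
```

Lookup by trace name could be added in the same way by indexing a second key parsed from the header.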
Discussion and conclusion
Genome-related public databases are an invaluable resource for the scientific community. There are two major groups of users of these resources. The first is the scientific focus group actively studying the target system or organism. Among this target audience are breeders, who can use this resource to design molecular markers for marker-assisted breeding and introgression programs in cowpea and other legumes. The second group is the broader scientific community interested in relating this specialized information to other systems/organisms. The aim of the CGKB is to provide an annotated, well-organized, and rigorously analyzed dataset of MF clone sequences as a resource for cowpea researchers and pan-legume crop specialists. We have found that comparisons to the NCBI GenBank Protein and UniProtKB-TrEMBL databases allow for the best detection of coding potential in the cowpea GSS, since these protein knowledgebases represent global collections of proteins. UniProtKB-Swiss-Prot is mainly useful for data integration with known domain and enzyme databases. Comparison of the cowpea GSS to the Arabidopsis thaliana proteome provides for comparative genome analysis and integration with plant-related GO terms and metabolic pathways from TAIR.
The structure and organization of the CGKB allow for rapid modification of data storage and retrieval and for the addition or removal of functionalities. Among the future plans for the database are (i) contig building and singlet estimation on the cowpea genomic genespace sequences; (ii) incorporation of PCR primer sequence information for SSR amplification into the SSR database, along with other data analysis tools associated with marker development for cowpea; (iii) integration with cowpea genetic mapping activities for identification of potential trait-linked markers; (iv) BLASTX analysis with the Medicago proteome to anchor cowpea GSS to the physical contig and genetic map of Medicago truncatula, together with the inclusion of additional comparative genomic analysis and syntenic relationships with other legume and non-legume species; and (v) full data integration at the physical database level with Arabidopsis and rice knowledge bases.
Availability and requirements
The CGKB is publicly available at the URL http://cowpeagenomics.med.virginia.edu/. The CGKB is published under the GNU General Public License (GPL), which implements the understandings of the Kirkhouse Trust Intellectual Property Statement.
We have chosen the GPL as the best way to ensure free and unrestricted access to the cowpea genomic data, and to subsequent discoveries resulting from use of this data. This free exchange of knowledge benefits the poor farmers of the world and promotes rapid scientific progress. Users are asked to register at the CGKB site: http://cowpeagenomics.med.virginia.edu/register.pl
We wish to thank Jianxiong Li, Bhavani Gowda, Shengcheng Han, Hongbo Zhang and Marta Bokowiec for their input during the development of the database. We also wish to thank Dr. Dawn Adelsberger for system administrative support on the Apple OSX-based computer cluster and Dr. Michael B. Black for his help in reviewing and editing of this manuscript. We also thank our colleagues at Orion Genomics, LLC, especially Arief Budiman and Joseph Bedell, for help in carrying out these studies and their discussion on the use of MF technology. This work was supported by a grant from the Kirkhouse Trust awarded to MPT.
- 13.Bedell JA, Budiman MA, Nunberg A, Citek RW, Robbins D, Jones J, Flick E, Rohlfing T, Fries J, Bradford K, McMenamy J, Smith M, Holeman H, Roe BA, Wiley G, Korf IF, Rabinowicz PD, Lakey N, McCombie WR, Jeddeloh JA, Martienssen RA: Sorghum genome sequencing by methylation filtration. PLoS Biol 2005, 3: e13. 10.1371/journal.pbio.0030013
- 16.Whitelaw CA, Barbazuk WB, Pertea G, Chan AP, Cheung F, Lee Y, Zheng L, van Heeringen S, Karamycheva S, Bennetzen JL, SanMiguel P, Lakey N, Bedell J, Yuan Y, Budiman MA, Resnick A, Van Aken S, Utterback T, Riedmuller S, Williams M, Feldblyum T, Schubert K, Beachy R, Fraser CM, Quackenbush J: Enrichment of gene-coding sequences in maize by genome filtration. Science 2003, 302: 2118–2120. 10.1126/science.1090047
- 17.Orion Genomics[http://www.oriongenomics.com/]
- 18.The Kirkhouse Trust[http://www.kirkhousetrust.org/]
- 19.The Perl Foundation[http://www.perl.org/]
- 20.Portable Batch System[http://www.openpbs.org/]
- 21.The Arabidopsis Information Resource[http://www.arabidopsis.org/]
- 22.The International Rice Genome Sequencing Project[http://rgp.dna.affrc.go.jp/IRGSP/]
- 23.The Medicago truncatula Genome Project[http://www.tigr.org/tdb/e2k1/mta1/]
- 24.Tuskan GA, DiFazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A, Schein J, Sterck L, Aerts A, Bhalerao RR, Bhalerao RP, Blaudez D, Boerjan W, Brun A, Brunner A, Busov V, Campbell M, Carlson J, Chalot M, Chapman J, Chen GL, Cooper D, Coutinho PM, Couturier J, Covert S, Cronk Q, Cunningham R, Davis J, Degroeve S, Déjardin A, dePamphilis C, Detter J, Dirks B, Dubchak I, Duplessis S, Ehlting J, Ellis B, Gendler K, Goodstein D, Gribskov M, Grimwood J, Groover A, Gunter L, Hamberger B, Heinze B, Helariutta Y, Henrissat B, Holligan D, Holt R, Huang W, Islam-Faridi N, Jones S, Jones-Rhoades M, Jorgensen R, Joshi C, Kangasjärvi J, Karlsson J, Kelleher C, Kirkpatrick R, Kirst M, Kohler A, Kalluri U, Larimer F, Leebens-Mack J, Leplé J-C, Locascio P, Lou Y, Lucas S, Martin F, Montanini B, Napoli C, Nelson DR, Nelson C, Nieminen K, Nilsson O, Pereda V, Peter G, Philippe R, Pilate G, Poliakov A, Razumovskaya J, Richardson P, Rinaldi C, Ritland K, Rouzé P, Ryaboy D, Schmutz J, Schrader J, Segerman B, Shin H, Siddiqui A, Sterky F, Terry A, Tsai C-J, Uberbacher E, Unneberg P, Vahala J, Wall K, Wessler S, Yang G, Yin T, Douglas C, Marra M, Sandberg G, Van de Peer Y, Rokhsar D: The genome of black cottonwood Populus trichocarpa (Torr. & Gray). Science 2006, 313: 1596–1604. 10.1126/science.1128691
- 25.ENZYME enzyme nomenclature database[http://ca.expasy.org/enzyme/]
- 28.FTP directory /genbank/ at ftp.ncbi.nih.gov[ftp://ftp.ncbi.nih.gov/genbank/]
- 29.The Protein Information Resource[http://pir.georgetown.edu/]
- 32.The Pfam database of protein families and HMMs[http://www.sanger.ac.uk/Software/Pfam/]
- 33.Tandem repeats finder[http://tandem.bu.edu/trf/trf.html]
- 36.The Apache Software Foundation[http://www.apache.org/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.