A database of phylogenetically atypical genes in archaeal and bacterial genomes, identified using the DarkHorse algorithm
The process of horizontal gene transfer (HGT) is believed to be widespread in Bacteria and Archaea, but little comparative data is available addressing its occurrence in complete microbial genomes. Collection of high-quality, automated HGT prediction data based on phylogenetic evidence has previously been impractical for large numbers of genomes at once, due to prohibitive computational demands. DarkHorse, a recently described statistical method for discovering phylogenetically atypical genes on a genome-wide basis, provides a means to solve this problem through lineage probability index (LPI) ranking scores. LPI scores inversely reflect phylogenetic distance between a test amino acid sequence and its closest available database matches. Proteins with low LPI scores are good horizontal gene transfer candidates; those with high scores are not.
The DarkHorse algorithm has been applied to 955 microbial genome sequences, and the results organized into a web-searchable relational database, called the DarkHorse HGT Candidate Resource http://darkhorse.ucsd.edu. Users can select individual genomes or groups of genomes to screen by LPI score, search for protein functions by descriptive annotation or amino acid sequence similarity, or select proteins with unusual G+C composition in their underlying coding sequences. The search engine reports LPI scores for match partners as well as query sequences, providing the opportunity to explore whether potential HGT donor sequences are phylogenetically typical or atypical within their own genomes. This information can be used to predict whether or not sufficient information is available to build a well-supported phylogenetic tree using the potential donor sequence.
The DarkHorse HGT Candidate database provides a powerful, flexible set of tools for identifying phylogenetically atypical proteins, allowing researchers to explore both individual HGT events in single genomes, and large-scale HGT patterns among protein families and genome groups. Although the DarkHorse algorithm cannot, by itself, provide definitive proof of horizontal gene transfer, it is a flexible, powerful tool that can be combined with slower, more rigorous methods in situations where these other methods could not otherwise be applied.
KeywordsHorizontal Gene Transfer Microbial Genome Horizontal Gene Transfer Event Individual Genome Database Match
Horizontal gene transfer (HGT) can be defined as the process by which an organism incorporates new genetic material from sources other than its parents or direct ancestors. This process is believed to be widespread in Bacteria and Archaea [1, 2], but little comparative data is available addressing its occurrence on a genome-wide basis. The exponentially increasing availability of complete microbial genome sequences should provide a powerful tool for exploring this phenomenon, but this promise has not yet been realized, due to the difficulty of obtaining consistent, reliable, quantitative HGT prediction data for multiple genomes in an automated, high-throughput pipeline.
Determining phylogenetic incongruence of individual genes by building phylogenetic trees is generally considered the most trustworthy way to prove that HGT has occurred [3, 4], but is very time consuming and computationally intensive. Although programs to automate this process have been developed [5, 6], parameter choices and data interpretation often require expert manual attention for each individual gene, as well as each genome, in order to achieve satisfactory performance . Comparisons between genes having different rates of protein evolution, as well as organisms at varying phylogenetic distances, are particularly challenging.
Alternatively, a large number of methods exist for predicting HGT by determining whether individual genes have atypical nucleic acid compositions or "signatures", compared to other sequences from the same genome. These methods are fast and automated, but suffer from high rates of false positive and false negative predictions [8, 9, 10]. They are only able to detect a limited subset of potential HGT events, which have occurred relatively recently, between organisms with widely divergent nucleic acid compositions. Also, signature-based methods are unable to provide any information about potential donor sources for the transferred material.
Recently, a new algorithm called DarkHorse has been developed for rapid, automated identification of phylogenetically atypical proteins from whole genomes, using a combination of sequence alignment, database mining, statistical, and linguistic analysis tools . This combination provides many of the advantages of phylogenetic tree-building methods, without the computational overhead. It is particularly well suited to automated, high-throughput screening of whole genomes at widely varying evolutionary distances, as well as analysis of proteins having different degrees of sequence conservation. Although this method cannot, by itself, provide definitive proof of horizontal gene transfer, it is a flexible, powerful tool that can be combined with slower, more rigorous methods in situations where these other methods could not otherwise be applied. The DarkHorse algorithm has now been implemented in a new software program, and applied to 955 sequenced bacterial and archaeal genomes, including more than 3 million predicted proteins. A searchable database of results is accessible through an Internet website interface, where users can explore HGT patterns for individual genes, genomes, or groups of genomes.
Construction and content
Software design and implementation
The DarkHorse algorithm  has been implemented as a pipeline of unix command-line Perl scripts, integrated with an underlying MySQL relational database. The software comprising this pipeline is available for download at the following website: http://darkhorse.ucsd.edu.
Program execution requires locally available copies of the NCBI Genbank nr protein sequence database  and the NCBI Taxonomy database , as well the MySQL server program . Prior to first-time program execution, a local reference database must be constructed and populated with metadata about each Genbank nr sequence, according to the schema shown in Additional file 1. This process is accomplished by an automated script, which extracts descriptive annotation and name of the source species associated with each Genbank fasta format sequence, then inserts the information into a relational database table. This table is linked to local copies of the NCBI taxonomy database names and nodes tables, and to a colon-delimited lineage string for each species, similar to those displayed on the NCBI Taxonomy Browser website . The DarkHorse software obtains lineages by joining the taxonomy names and nodes tables and recursively traversing parent nodes for each species until reaching the root of the taxonomy tree (tax_id = 1).
Once the initial database tables have been loaded, program execution proceeds as described previously . All predicted proteins in a query genome are first subjected to a non-stringent BLAST search against the Genbank nr database to identify potential protein orthologs. Search results are filtered to remove self-matches, then a set of one or more candidate orthologs is selected for each query based on a bitscore window, uniquely sized for each protein.
Ortholog candidate window sizes are calculated by combining the highest non-self bitscore for each individual query with a genome-wide heuristic called a filter threshold value. Filter threshold values, which typically range from 2% to 20%, are empirically selected for each genome based on abundance of phylogenetically related sequences in the Genbank nr database. Sparsely represented genomes receive lower threshold filter values, resulting in narrower windows, while abundantly represented genomes receive higher threshold values, corresponding to wider windows.
In practice, the lower limit of the bitscore window for each query is defined by first multiplying the highest non-self bitscore for that query by the filter threshold value, then subtracting this product from the top bitscore. As an example, the window for a query protein with a top bitscore of 1000 from a genome with threshold value of 10% would select all matches with bitscores between 900 and 1000 as candidate orthologs. Another, shorter protein from the same genome, with a top bitscore of only 500, would use a selection window of bitscores between 450 and 500. However, if these two query sequences came from a different, more poorly represented species, with a threshold filter value of 5%, the window sizes would have been 950–1000 and 475–500 respectively. Window size adjustments using this procedure have been shown to improve DarkHorse algorithm performance by removing both false positive and false negative ortholog candidates from consideration .
A unique feature of the DarkHorse algorithm is the calculation of a lineage probability index (LPI) score to identify the most phylogenetically likely match from each set of ortholog candidates. LPI scores reflect phylogenetic distance of the database match sequence from the query organism. Matches from organisms at similar phylogenetic distances receive similar LPI scores, regardless of the database abundance of their parent species.
LPI scores are calculated as described previously . Lineages associated with ortholog candidates are first split into individual component "terms", remembering the relative position of each term. An overall frequency is calculated for each individual term relative to the entire query genome. The lineage terms associated with each ortholog candidate are then recombined to calculate a weighted composite score for the candidate. This score is based on the sum of component terms, with higher weight given to more general terms (appearing closer to the left end of the original lineage string, at a higher taxonomic level). Weighted composite scores are normalized to account for differences in number of terms per lineage. For each query protein, the ortholog candidate with the highest composite LPI score is chosen as the "best" match. Finally, all LPI scores for proteins within a genome are divided by the highest score obtained for that genome, so that final scores are all normalized to values between zero and one.
Raw output from the DarkHorse program is formatted as a tab-delimited text file, which includes not only LPI scores for each query sequence, but also information on the abundance of protein relatives in the database (candidate set size), BLAST alignment quality statistics, descriptive annotation, and phylogenetic lineage of the "best" match sequence, representing the closest database relative of potential donor organisms.
Data generation procedures
The DarkHorse program was applied to 955 bacterial and archaeal genomes, and used to populate additional relational database tables according to the schema shown in Additional file 2. Genomes analyzed included both finished sequences, downloaded from NCBI Genbank , and draft sequences downloaded from the JGI Integrated Microbial Genomes website , and the JCVI Moore Marine Microbial Sequencing Project website . For each genome, an initial, low stringency BLAST search was performed for all predicted protein sequences against Genbank nr, saving as many as 500 alignments per query protein. Each saved alignment was required to cover at least 70% of both query and subject sequences, with an e-value of 1e-5 or lower. Out of 3,175,949 predicted proteins in 955 genomes, 2,809,612 had non-self BLAST matches meeting these minimum requirements. Some of the remaining 366,337 unmatched sequences may reflect inaccurate bioinformatic prediction of coding sequence boundaries, but many represent bona fide novel proteins.
Self-exclusion keywords were selected for each genome using the NCBI taxonomy database tables to identify names and taxonomy id numbers associated with that genome at the genus, species, and strain level. Organism-specific keywords were supplemented with a standardized set of control terms, designed to exclude cloning vectors, synthetic sequences, and phylogenetically unclassified sequences.
DarkHorse searches were performed using three different sets of self-exclusion keywords for each genome, corresponding to the different phylogenetic granularity settings named "strain", "species", and "genus". Taking Burkholderia cenocepacia AU 1054 as an example, strain level granularity would permit matches to organisms like Burkholderia cenocepacia PC184 and Burkholderia cenocepacia MC0-3, but exclude matches to all database entries labeled as belonging to strain AU1054. Species level granularity would exclude matches to all strains of Burkholderia cenocepacia, but still allow matches to organisms such as Burkholderia dolosa and Burkholderia xenovorans. Genus level granularity would only allow matches to proteins from organisms that were not members of genus Burkholderia. A supplemental set of searches was performed for each genome at each phylogenetic granularity excluding sequences isolated from phage or viral genomes, to provide additional information.
Filter threshold values were chosen empirically for each genome as previously described , to compensate for phylogenetic bias in the Genbank nr database. This determination is based on the number of ortholog candidates found for the most highly conserved query protein (maximum candidate set size) in each genome/phylogenetic granularity combination. The most highly conserved query proteins represent a worst-case scenario for possible misidentification of phylogenetic relatives, as multiple database matches to unrelated organisms create statistical noise that can mask the true best match. As filter threshold size increases and wider BLAST score windows are used, sudden increases in candidate set size have been found to correlate with large increases in this statistical noise, . From a practical standpoint, finding the inflection point in plots of maximum candidate set size versus filter threshold value provides a simple way to minimize this noise, which can be easily scaled up to accommodate a pipeline of hundreds or thousands of genomes.
Web search interface
Advanced search features include selection of entire taxonomic groups by lineage, as well as searches for individual proteins by annotation keywords or sequence similarity to user-selected reference proteins by amino acid BLAST query. Users can also target proteins having unusual DNA compositions, based on percent G+C for their underlying coding sequences.
One unique and powerful feature of the DarkHorse HGT Candidate database is the opportunity to explore the phylogenetic background of potential HGT donors as well as recipients. The breadth of the database allows not only query sequences, but also their database match partners to be evaluated for sequence similarity or novelty compared to taxonomically related organisms. Although the DarkHorse HGT Candidate database includes LPI calculations only for genes belonging to publicly available, sequenced genomes, it currently includes 747,660 entries where LPI scores are available for both query and match partner. This LPI score relationship is useful in predicting whether or not sufficient data will be available to build a full-scale phylogenetic tree supporting horizontal gene transfer.
A low LPI score for the match partner (potential donor) means there may be an insufficient number of phylogenetically related sequences in the database to draw any solid conclusions as to whether or not HGT has actually occurred. Selecting only those matches with reciprocal LPI relationships (low LPI query coupled with high LPI match partner) will lower sensitivity, but can be used to increase stringency, eliminating HGT candidates that cannot be easily corroborated using phylogenetic trees.
Based on the LPI score distributions shown in Figure 6, a combination of query LPI score less than 0.6 with a match partner (potential donor) LPI score greater than 0.75 is a reasonable starting place to identify well-supported HGT candidates. However, optimal cutoff points may vary for individual genomes, depending on branch lengths of the phylogenetic trees underlying their lineage descriptions, as well as phylogenetic distance between available sequenced genomes. The DarkHorse HGT Candidate web server therefore allows users to fine-tune LPI search parameters according to their individual research needs, for both query and potential donor sequences.
HGT events of different ages can be targeted by choosing different levels of phylogenetic granularity. Strain level, the narrowest granularity choice, is most useful for studying relatively recent HGT events, because it can identify proteins unique to a particular strain but absent from related strains. Species and genus settings recognize proteins that are phylogenetically atypical over a broader taxonomic range, making them suitable for exploring more ancient events.
HGT timescales can be investigated for a particular protein of interest by comparing its LPI scores at several different phylogenetic granularities, trying to find a point where the score changes from high to low. However, predicting the age of HGT events requires some caution, because phylogenetically atypical proteins (having low LPI scores) could occur for two different reasons: either gene gain by the query organism, or gene loss from its closest known relatives. In cases where the query organism is the only sequenced example at a particular taxonomic level, whether strain, species, or genus, it may not be possible determine age of gene acquisition until more data on related organisms becomes available.
Proteins as well as genomes vary widely in their relative rates of sequence variability, making automated identification of true orthologs difficult. For this reason, detailed BLAST match statistics are provided, so that users can evaluate the likelihood of true orthology in more detail for individual HGT candidates. Statistics provided include alignment length, percent identity, e-value, and bitscore, as well as percent of the query protein covered by the alignment. Search output also includes the number of non-self database matches falling within the bitscore window for each query. A high number of database matches is characteristic of conserved proteins that are well represented in database. Unusual or rapidly evolving proteins will have fewer database matches.
DNA composition statistics, including percent G+C, are sometimes used as a simple marker of foreign DNA within a genome, although there may be a wide disparity between individual genes due to other factors. Users may wish to use this information as a complement to the phylogenetic evidence provided by DarkHorse. The website search engine provides the option to display DNA composition statistics if desired. These statistics include percent G+C for each individual protein coding sequence, as well as mean and standard deviation for percent G+C of all coding sequence regions in the parent genome. Alternatively, proteins with unusual G+C content in their coding sequence DNA can be selected as part of the search process, either by z-score statistics, or by absolute percent G+C. As an example, a z-score minimum of 1.0 would select only queries whose G+C content was either higher or lower than the mean for all coding sequences in the genome by 1.0 standard deviations.
One type of horizontal gene transfer frequently described in the literature involves groups of adjacent genes called pathogenicity islands [19, 20]. To explore whether horizontal transfer candidates within a genome are adjacent or distant from each other, or located on the same chromosome, scaffold, or plasmid, the DarkHorse HGT Candidate web search engine provides the option of displaying coding sequence location coordinates. This option also includes nucleic acid scaffold id numbers and coding sequence locus ids, to facilitate cross-referencing between protein and nucleic acid sequences.
Utility and discussion
Comparison to existing databases
Currently available HGT databases
HGT prediction method
DarkHorse HGT Candidate Resource
Phylogenetic and DNA composition
The DarkHorse HGT Candidate Resource differs from these other databases in its size, speed, algorithm used to identify HGT candidates, and approach to addressing user queries. Since the DarkHorse algorithm works equally well with both draft and finished genome sequences, both types have been incorporated into the database, increasing its size and ability to perform broad, comprehensive studies that would be impossible using other tools. Despite the large database size, extensive pre-computation and efficient relational database design allows most user queries to be completed within seconds to minutes. It is intended that the DarkHorse HGT Candidate resource will be updated annually to include newly available genome sequences.
Broad statistical comparisons of horizontal gene transfer are needed across a wide diversity of genomes to understand many biological issues that cannot be addressed by other means. One fundamental unanswered question is how levels of horizontally acquired genes vary among organisms, and why? Model organisms with particularly high or low rates of HGT are needed to identify internal, genome-specific factors, as well as external, environmental factors that control the extent to which HGT occurs. These questions are of interest not only within an individual organism or taxonomic group, but also within particular environments, to understand how HGT has contributed historically to species diversification and adaptation, and predict how it may influence events in the future.
In addition to descriptive and mechanistic questions about frequency and control of HGT, it is important to understand which protein functions and families are most often transferred between unrelated organisms. The gene functions most frequently retained after horizontal transfer are likely to provide a significant selective advantage to their recipients. Antibiotic resistance is a well known example of this type, but the scope of our knowledge in this area is still very limited. The enormous and rapidly growing reservoir of sequenced microbial genomes could provide tremendous power answer these types of questions, but has not yet been effectively utilized. A comprehensive, user searchable database like the DarkHorse HGT Candidate Resource should prove an essential tool for leveraging this invaluable asset.
Availability and requirements
The resource described here is available at http://darkhorse.ucsd.edu. It is provided to academic researchers for educational, research and non-profit purposes, with no restrictions except for the demand to quote the site and reference this publication.
Those desiring to incorporate the DarkHorse algorithm, software, associated HGT candidate database, or information downloaded from the database into commercial products, or to use any of these materials for commercial purposes, should contact Technology Transfer & Intellectual Property Services, University of California, San Diego, 9500 Gilman Drive, Mail Code 0910, La Jolla, CA 92093-0910, Ph: (858) 534-5815, E-MAIL: .email@example.com.
This work was supported by the Gordon and Betty Moore Foundation, through a grant to CAMERA (Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis).
- 12.NCBI Genbank nr protein sequence database[ftp://ftp.ncbi.nih.gov/blast/db/]
- 13.NCBI taxonomy database[ftp://ftp.ncbi.nih.gov/pub/taxonomy]
- 14.MySQL database home page[http://www.mysql.com/]
- 15.NCBI Taxonomy Browser[http://www.ncbi.nlm.nih.gov/Taxonomy/]
- 16.NCBI Genbank genomes[ftp://ftp.ncbi.nih.gov/genbank/genomes/]
- 17.DOE Joint Genome Institute Integrated Microbial Genomes[http://img.jgi.doe.gov/]
- 18.J. Craig Venter Institute, Moore marine Microbial Sequencing Project[https://research.venterinstitute.org/moore/]
- 21.NCBI microbial genomes[http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi]
- 22.JCVI Comprehensive Microbial Resource[http://cmr.jcvi.org/]
- 23.Microbial Genome Database for Comparative Analysis[http://mbgd.genome.ad.jp/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.