KB-Rank: efficient protein structure and functional annotation identification via text query
The KB-Rank tool was developed to help determine the functions of proteins. A user provides a text query, and protein structures are retrieved together with their functional annotation categories. Structures and annotation categories are ranked according to their estimated relevance to the queried text. The ranking algorithm first retrieves matches between the query text and the text fields associated with the structures. The structures are then ordered by their relative content of annotations that are found to be prevalent across all the structures retrieved. An interactive web interface was implemented to navigate and interpret the relevance of the structures and annotation categories retrieved by a given search. The aim of the KB-Rank tool is to provide a means to quickly identify protein structures of interest and the annotations most relevant to the queries posed by a user. Informational and navigational searches regarding disease topics are described to illustrate the tool’s utilities. The tool is available at the URL http://protein.tcmedc.org/KB-Rank.
Keywords: Protein structural chain; Text query; Relevance ranking; Function; Disease
National Center for Biotechnology Information
Molecular Modeling Database
Universal Protein Resource
Protein Data Bank
Worldwide Protein Data Bank
Research Collaboratory for Structural Bioinformatics
Protein Data Bank in Europe
Biological Magnetic Resonance Bank
Protein Data Bank Japan
Structural Biology Knowledgebase
The Cancer Cell Map
Integrated Network Objects with Hierarchies
National Cancer Institute
Pathway Interaction Database
The Binding Database
Open Data Drug and Drug Target Database
Chemical entities of biological interest at the European Bioinformatics Institute
Chemical database of bioactive drug-like small molecules of the European Molecular Biology Laboratory
The Small Molecule Pathway Database
Single Nucleotide Polymorphisms modeled on protein structures
Online Mendelian Inheritance in Man
Structure integration with function, taxonomy and sequence
Enzyme commission to Protein Data Bank
Class architecture topology homology
Structural Classification of Protein
Protein family database
Medical Subject Headings
Knowledge Base Ranking Tool
Structure based drug design
Mitogen activated kinase 1/extracellular-signal-regulated kinase 1 kinase 1
RAS-RAF-MEK-ERK pathway
RAS-RAF-mitogen-activated protein kinase (MAPK)/extracellular signal-regulated kinase (ERK) kinase (MEK)-ERK
Protein Data Bank identification code
The ability to search for proteins of interest via text query is a standard utility of protein biomedical resources such as NCBI Protein, MMDB, UniProt, the sites created for the Protein Data Bank (PDB) by the members of the wwPDB (RCSB Protein Data Bank, PDBe, BMRB, and PDBj), and the Structural Biology Knowledgebase (SBKB). These resources offer search services over a variety of annotations. For example, NCBI Protein has curated information regarding protein sequences that is available for text query. UniProt hosts text searches over a collection of annotation records of the protein sequences, which were collected based on a review of the associations documented in the literature and/or were derived computationally. The SBKB provides searches for protein structures over summary text fields from the primary literature citations. The fields include abstracts and associated terms such as medical subject headings (MeSH terms). The wwPDB websites offer a variety of searches, including those over the collections of text fields from primary literature citations and the cross-referenced annotations from other protein databases [5, 8]. Searches for ligands contained within the protein structures have also been implemented [5, 6, 8]. With these and related protein resources, users have at their disposal a means to search for protein structures based on a collection of associated protein annotations and attributes. A recent review of protein databases and some of the associated searches that are available therein is provided by Chen et al.
The results of a text query within a protein database can be presented so that users browse either all the entries that match any of the text fields or only the entries that have matches within specified fields. For example, UniProt allows a user to retrieve matches based on all the annotations collected within the UniProt data files or to restrict the search to matches within particular annotation fields. Annotation fields in UniProt include the protein or organism name fields. Similarly, the RCSB PDB provides a list of all the protein structures found based on matches across all the available text fields, or the results for searches that are restricted to matches within particular annotation categories, such as the enzyme type or a Gene Ontology term category. Given that text searches may produce a large collection of annotations and structures to browse, the user may ask the following. Which structures are the most relevant to my query? Of the annotations retrieved, which ones are the most relevant? These questions are analogous to those commonly asked of website searches with regard to which topics and which web pages are estimated to be the most relevant to a given query. User demand to expand the utilities of web search engines has led to the development of more efficient and effective methods to retrieve the topics and web pages most relevant to a given text query.
With the goal of achieving improved efficiency and effectiveness in searches for protein structures and their associated annotation categories, a ranking tool, KB-Rank, is described. The KB-Rank tool provides a means to retrieve a list of protein structural chains and annotation categories that are relevant to the provided text query. Structural chains within each retrieved category are ranked according to their estimated relevance to the queried text. The annotation categories are also presented according to their estimated relevance. These utilities can be used to address a variety of searches that are conducted by users of protein structural databases. The tool facilitates informational searches to learn more about particular topics, e.g. the retrieval of information associated with a particular disease. An example of an informational search is one made to gain a better understanding of the pathogenic mechanisms of asthma. Navigational searches are also enabled; these provide a means to identify specific structural chains that can be used to address particular research questions. One such type of search is to find structures that may be used in a structure-based drug design protocol; for example, protein structural chains may be used in drug design strategies for the treatment of melanoma.
Materials and methods
The assembly and integration of the protein annotations from open sources is done weekly to coordinate with the release of new protein structures and to ensure that the analysis is up to date for all available structures. Annotations are mapped to protein structures at the level of the protein structural chain. A full list of protein structural chains is available from the PDB ftp site at <ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt>. The following annotations are assembled. Cellular and biochemical pathway assignments were extracted from BioCyc, CellMap, HumanCyc, INOH, and the NCI Pathway Interaction Database (PID). Small molecule associations were from BioCyc, BindingDB, HumanCyc, DrugBank, ChEBI, ChEMBL, and SMPDB. SNPs3D and OMIM [22, 23] provided disease associations. Molecular functions, biological processes, and cellular components were from the Gene Ontology (GO) classification system, as assigned in SIFTS. Enzyme classifications were from the EC2PDB database. Structural domain assignments were provided through the CATH and SCOP databases. Sequence domain assignments were identified through the Pfam database. Further structural groups were based on the jFatCat alignment algorithm [30, 31]. The FEATURE resource provided predictions of functional sites. The annotations utilized in the current study have been described previously for the purpose of the prediction of protein function, and a more complete description of their assembly is provided therein.
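As a minimal sketch of how a chain-level list might be read from a file such as pdb_seqres.txt, the example below assumes a FASTA-like layout in which each header has the form `>ID_CHAIN mol:protein length:N  NAME` and is followed by the sequence. The function name `parse_seqres` and the two sample records are hypothetical and serve only as illustration; the real file should be checked before relying on this layout.

```python
from typing import Dict, Iterable, Tuple


def parse_seqres(lines: Iterable[str]) -> Dict[Tuple[str, str], str]:
    """Map (pdb_id, chain_id) -> sequence, keeping protein chains only.

    Assumes FASTA-like headers such as '>101m_A mol:protein length:8  NAME'.
    """
    chains: Dict[Tuple[str, str], str] = {}
    key = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            entry, _, chain = fields[0].partition("_")
            # Records not labelled mol:protein (e.g. nucleic acids) are skipped.
            key = (entry.lower(), chain) if "mol:protein" in fields else None
        elif key is not None:
            # Sequence lines are appended to the current protein chain.
            chains[key] = chains.get(key, "") + line
    return chains


# Hypothetical sample records, for illustration only.
SAMPLE = [
    ">101m_A mol:protein length:8  MYOGLOBIN",
    "MVLSEGEW",
    ">1abc_B mol:na length:4  DNA",
    "ACGT",
]

chains = parse_seqres(SAMPLE)
print(chains)
```

Keying the result by (entry, chain) mirrors the chain-level granularity at which the annotations are mapped.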
Query and presentation of protein structures and annotation categories
In the equation, the variable dtf is the number of times the term appears in the document; sumdtf is the sum of (log(dtf) + 1) over all terms in the same document; U is the number of unique terms in the document; N is the total number of documents; and nf is the number of documents that contain the term. Based on the keyword match within the text fields and using equation A, structural entries (PDB IDs) are retrieved and ranked. The first 200 entries found by the text search are saved for further analysis.
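Equation A itself is not reproduced in this text. The variables listed above match the term-weight formula documented for MySQL natural-language full-text search, so the sketch below assumes that form, including the constant 0.0115. It is an illustration under that assumption, not a confirmed transcription of the authors' expression; the function names are hypothetical.

```python
import math
from typing import Dict


def term_weight(dtf: int, sumdtf: float, U: int, N: int, nf: int,
                pivot: float = 0.0115) -> float:
    """Assumed form: (log(dtf)+1)/sumdtf * U/(1+pivot*U) * log((N-nf)/nf).

    Requires dtf >= 1 and 0 < nf < N. The weight becomes negative when the
    term occurs in more than half of the documents.
    """
    local = (math.log(dtf) + 1.0) / sumdtf      # within-document term factor
    norm = U / (1.0 + pivot * U)                # document-length normalisation
    idf = math.log((N - nf) / nf)               # rarity of the term overall
    return local * norm * idf


def document_weights(term_counts: Dict[str, int], N: int,
                     doc_freq: Dict[str, int]) -> Dict[str, float]:
    """Weights of every term in one document, given collection statistics."""
    sumdtf = sum(math.log(c) + 1.0 for c in term_counts.values())
    U = len(term_counts)
    return {t: term_weight(c, sumdtf, U, N, doc_freq[t])
            for t, c in term_counts.items()}


# Toy numbers, chosen only to exercise the formula.
weights = document_weights({"asthma": 3, "cytokine": 1}, N=1000,
                           doc_freq={"asthma": 20, "cytokine": 200})
print(weights)
```

Under this form, a term that is frequent in one document but rare across the collection receives the largest weight.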
The web interface
A utility of the KB-Rank query tool is that annotation categories and structural chains are ordered and presented according to their estimated relevance to the queried text. Relevance scores are used as described in Materials and Methods. To make the interpretation of the relevance scores more visually intuitive, colors are used to indicate where each annotation category or structural chain lies within the entire range of the scores. By analogy to a traffic light, a green color indicates that a category or structural chain is most associated with the queried text, while a red color indicates that it is least relevant. Colors in between are used to indicate intermediate scores and corresponding relevance. The coloring method is comparable with that used within the protein modeling portal, where model quality for a predicted structure, rather than relevance to a text query, is similarly assessed.
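The traffic-light coloring can be sketched as a mapping from a score's position within the observed score range to a red-yellow-green gradient. The linear interpolation, the assumption that a higher score means higher relevance, and the function name `relevance_color` are illustrative choices; the text does not specify the exact color scale used by the interface.

```python
def relevance_color(score: float, lo: float, hi: float) -> str:
    """Return a hex color: red at the lowest score, green at the highest."""
    if hi <= lo:
        # Degenerate range: every item is equally relevant, so show green.
        return "#00ff00"
    # Position of the score within the range, clamped to [0, 1].
    t = min(1.0, max(0.0, (score - lo) / (hi - lo)))
    if t >= 0.5:
        # Upper half: fade from yellow (t = 0.5) to green (t = 1).
        red, green = round(255 * (1.0 - t) * 2), 255
    else:
        # Lower half: fade from red (t = 0) to yellow (t = 0.5).
        red, green = 255, round(255 * t * 2)
    return f"#{red:02x}{green:02x}00"


scores = [0.0, 5.0, 10.0]
colors = [relevance_color(s, min(scores), max(scores)) for s in scores]
print(colors)
```

Scaling to the range of the current result set, rather than to a fixed scale, keeps the full gradient in use for every query.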
User case scenarios
Based on a review of the literature, the importance of IL-4 and IL-5 in the development of asthma can be assessed. It is known that IL-4 contributes in a variety of ways to the development of asthma, one of which is the stimulation of Th0 lymphocytes to become Th2 lymphocytes [40, 44]. Th2 lymphocytes secrete other cytokines that include additional IL-4, IL-5, IL-9, and IL-13. IL-5 thereby has a secondary role in disease development as compared to IL-4 in terms of the sequence of the disease mechanism. Further, IL-4 based therapies for asthma have shown improved clinical outcomes while IL-5 based therapies have not. The results indicate the relative importance of the two cytokines in the pathogenesis of asthma, which matches the relevance ranking found by the KB-Rank search tool. The ranking of the structures by the tool thereby provides a starting point for further understanding of the disease mechanism with regard to the important protein players and their roles.
In addition to providing informational searches that utilize the ranking of the structural chains and annotation categories, navigational searches are also possible with the KB-Rank tool. In a navigational search, the purpose is to identify a particular structural chain that can be used for further investigation and research. One type of navigational search is the identification of structural chains that can be used in a structure-based drug design (SBDD) protocol and virtual screening. For that application, a user searches for a potential drug target that is particularly important to the disease of interest. Selection is further made to find those protein structures that are druggable, i.e. protein structures that have binding pockets and/or that can accommodate a drug molecule.
Upon examination of the other small molecules found from DrugBank for the query of melanoma, we see that the second molecule listed is the drug Sorafenib. It inhibits another protein along the RAS-RAF-MEK-ERK pathway, B-Raf kinase (PDB ID 3C4D, chain B). Inhibition with Sorafenib has not proven to be effective in clinical trials for melanoma [51, 52]. However, the protein structure is demonstrated to be druggable, and further inhibitors of that target have been developed and demonstrated to have effective anti-melanoma effects in humans [53, 54]. These results indicate that applications of SBDD for the B-Raf kinase target are ongoing and yielding effective results. The identification of the structures of B-Raf kinase and MEK1 with the KB-Rank tool as structures that can be used in SBDD for the treatment of melanoma demonstrates that the tool provides a point of entry for the identification of known and potential protein structural targets. The high ranks of the viable targets found, based on the text searches, illustrate the utility of the tool for that purpose.
The KB-Rank tool provides a means to attach a relevance score to structural chains and/or associated categories retrieved by a given text query. It is anticipated that as more annotations are utilized for the ranking process, e.g. through the addition of more annotations associated with the primary amino acid sequence and/or the three-dimensional structures of the protein chains, the display order will more accurately reflect the order of their relevance to the queried text. The annotation categories can be expanded within the types of annotations that have already been assembled. These types include additional three-dimensional structural characteristics, small molecule interaction assignments, functional site assignments, and cellular/biochemical pathway designations. The resultant granularity for the searches and subsequent ranking is at the chain level rather than at the level of the structural entry as found in the PDB. That has the advantage of narrowing down a search to a particular chain within an entry that has multiple chains. It also makes it possible to identify a relevant protein chain that resides within a complex that may not itself be directly relevant to the text searched.
Relationships between function and disease are anticipated topics for searches. At the first stage of the search, a text search is implemented over the summary fields extracted from PubMed abstracts of the primary citations of the protein structures and the descriptive fields of constituent domains of the structures as extracted from the Pfam database. In the second stage, an integrated set of annotations is used to categorize the functional roles of the protein structural chains, and subsequently to rank the retrieved chains by their expected relevance to the queried text. The annotations used for the final ranking need not contain a match with the queried text; they need only be prevalent in the structures retrieved by the text search. The prevalence of a given annotation within the structural chains retrieved is used as an indicator that it is relevant to the queried text. Structures with a relatively larger number of the prevalent annotations will be ranked relatively higher. Structures that have been well characterized, with a relatively larger number of any annotations, will also tend to be ranked higher. That tendency is analogous to what is found for web page ranking, where the interest level in web pages, as reflected by the number of their links and their link structure, is used to facilitate the ranking.
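The second-stage idea, ranking chains by how many prevalent annotations they carry, can be sketched as follows. The scoring rule used here (sum of the fraction of retrieved chains that share each annotation) is an assumption chosen for illustration; the text does not give the exact relevance formula. The chain and annotation labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def rank_chains(retrieved: Dict[str, Set[str]]
                ) -> Tuple[List[Tuple[str, float]], Dict[str, float]]:
    """Rank retrieved chains by the summed prevalence of their annotations."""
    n = len(retrieved)
    # Count how many retrieved chains carry each annotation.
    counts = Counter(a for anns in retrieved.values() for a in anns)
    prevalence = {a: c / n for a, c in counts.items()}
    # A chain scores higher when it holds many annotations that are common
    # across the retrieved set; no text match on the annotation is needed.
    scores = {chain: sum(prevalence[a] for a in anns)
              for chain, anns in retrieved.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked, prevalence


# Hypothetical chains and annotation labels returned by a first-stage search.
retrieved = {
    "c1": {"GO:A", "PF:B"},
    "c2": {"GO:A"},
    "c3": {"EC:X"},
}
ranked, prevalence = rank_chains(retrieved)
print(ranked)
```

As in the description above, a chain with more annotations of any kind tends to accumulate a higher score, which is the same bias noted for well-characterized structures.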
The data integration effort forms the substrate for the search tool, and connections forged between the annotations further lend utility to it. For example, UniProt entries are connected with chain entries from the PDB, and DrugBank entries are connected with UniProt entries within the integrated database that is utilized by the KB-Rank tool. As a result, in the melanoma search example, the user can identify a small molecule in DrugBank that is associated with melanoma and be provided with a relevant protein structural chain. The result demonstrates that data integration is an important component of the tool’s functionality and utility. To complete the data integration, sequence comparisons are done to map the protein chains to annotations. The mapping of entries in BindingDB to the structural chains was done by finding the corresponding sequences with greater than 90% sequence identity through sequence comparison using the BLAST program. The SIFTS resource and the UniProt data files are also utilized to provide connections between the protein structural chains and a host of sequence and functional information [3, 25, 57].
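A mapping of this kind can be sketched by filtering BLAST tabular output on percent identity. The example assumes the standard 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and keeps, for each query, the best-scoring hit above 90% identity. The identifiers and hit lines are hypothetical, and any alignment-coverage requirement the authors may have applied is not stated in the text and is not modeled here.

```python
from typing import Dict, Iterable


def map_by_identity(tabular_lines: Iterable[str],
                    min_identity: float = 90.0) -> Dict[str, str]:
    """Map each query to its best subject hit with identity > min_identity."""
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        cols = line.rstrip("\n").split("\t")
        query, subject = cols[0], cols[1]
        pident, bitscore = float(cols[2]), float(cols[11])
        if pident <= min_identity:
            continue  # the text specifies "greater than 90%"
        # Keep the hit with the highest bit score for each query.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {q: s for q, (s, _) in best.items()}


# Hypothetical hits: ligand-target sequences queried against PDB chains.
hits = [
    "\t".join(["target1", "1xyz_A", "98.5", "300", "4", "0",
               "1", "300", "1", "300", "1e-150", "540"]),
    "\t".join(["target1", "2xyz_B", "92.0", "300", "24", "0",
               "1", "300", "1", "300", "1e-120", "480"]),
    "\t".join(["target2", "3xyz_A", "85.0", "250", "37", "1",
               "1", "250", "1", "250", "1e-90", "330"]),
]
mapping = map_by_identity(hits)
print(mapping)
```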
To improve the calculation of the rank score, sequence redundancy of the protein chains was considered. Repetition of the same annotation profile due to the inclusion of chains identical in primary sequence ultimately causes such chains to be ranked unduly high in the search results. A study by Devos and Valencia demonstrated that protein chains with as high as 95% identity can have different annotation profiles. To remove chains redundant in primary sequence while limiting the loss of annotations, chains were considered redundant only if they were identical in sequence. As discussed in the “Materials and methods” section, the representatives of these redundant sequence groups were used to calculate the relevance scores. Through the removal of chains identical in primary sequence, the annotation profile matrix created for a given search became more accurately weighted and therefore generated more accurate relevance scores.
The organization of the text search of the KB-Rank tool application is user-friendly, intuitive, and interactive. As part of the web tool, computer applications access specified annotations only at the required times. The search process itself is implemented in steps that are organized in hierarchical fashion, and each step is run according to the user’s request. This organization makes the tool scalable with regard to the further addition of informative annotations from a variety of data sources.
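Grouping chains that are identical in sequence and keeping one representative per group can be sketched as below. Choosing the lexicographically smallest chain identifier as the representative is an assumption made for illustration, as are the chain identifiers and sequences; the text does not say how representatives were selected.

```python
from collections import defaultdict
from typing import Dict, List


def group_identical(chain_sequences: Dict[str, str]) -> Dict[str, List[str]]:
    """Map a representative chain to all chains sharing its exact sequence."""
    groups = defaultdict(list)
    for chain_id, sequence in chain_sequences.items():
        # Only exact sequence identity counts as redundancy here, so chains
        # that differ by even one residue keep separate annotation profiles.
        groups[sequence].append(chain_id)
    return {min(members): sorted(members) for members in groups.values()}


# Hypothetical chains: two identical copies and one single-residue variant.
chain_sequences = {
    "1aaa_A": "MKTAYIAK",
    "1aaa_B": "MKTAYIAK",
    "2bbb_A": "MKTAYIAR",
}
representatives = group_identical(chain_sequences)
print(representatives)
```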
The results of performing a search with the KB-Rank tool include an ordered list of annotation categories and an ordered list of protein structural chains within each category. As each protein structural chain is displayed, links are provided that include a redirection to the corresponding annotation page with chain specific information at the SBKB. At the corresponding page in the SBKB, annotations that are specific to the chain at a resolution below the annotation category can be retrieved. As examples, such links include more specific structural domains contained within the chains, and the differential tissue specific expression patterns that can be found through resources that are linked to the SBKB. In that way, the KB-Rank tool can be used in conjunction with the SBKB to retrieve annotations at different levels of granularity.
The KB-Rank tool provides a means to improve the efficiency and accuracy of searches to identify the protein structural chains and functional annotations relevant to a given text query. User search scenarios were described that demonstrate the tool’s utilities for informational and navigational searches. An example informational search was an examination of the protein structures and functional annotations that have a role in asthma. An example navigational search identified potential structures that may be used to further investigate potential treatments for melanoma via a structure-based drug design strategy. We demonstrate, through the illustrative examples, how annotations from different biomedical data sources were integrated to enable research. Features of the tool include a staged integration of biomedical text information and the subsequent use of annotations of protein structural chains. The tool allows the user to effectively identify protein structural chains and annotation categories given a text search regarding protein functional or disease associations.
We thank Raship Shah at Rutgers University for a template of cascading style sheets that are used in the tool’s web design. We thank John Westbrook, Margaret Gabanyi, and Helen Berman at Rutgers University who provided assistance with access to data collections of the SBKB and computer code for text search over primary citation text fields. We are grateful to Dr. Peter Karp and the rest of the BioCyc team. They provided the necessary access to the protein and pathway data for each organism. Support was provided in part by a grant from the National Institute of General Medical Sciences [grant number 5U01 GM093324-02].
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- 1. Sayers EW, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Mizrachi I, Ostell J, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Shumway M, Sirotkin K, Souvorov A, Starchenko G, Tatusova TA, Wagner L, Yaschenko E, Ye J (2009) Database resources of the national center for biotechnology information. Nucleic Acids Res 37:D5–D15
- 3. Magrane M, Consortium U (2011) UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford) 2011:bar009
- 6. Velankar S, Alhroub Y, Alili A, Best C, Boutselakis HC, Caboche S, Conroy MJ, Dana JM, Van Ginkel G, Golovin A, Gore SP, Gutmanas A, Haslam P, Hirshberg M, John M, Lagerstedt I, Mir S, Newman LE, Oldfield TJ, Penkett CJ, Pineda-Castillo J, Rinaldi L, Sahni G, Sawka G, Sen S, Slowley R, Sousa da Silva AW, Suarez-Uruena A, Swaminathan GJ, Symmons MF, Vranken WF, Wainwright M, Kleywegt GJ (2010) PDBe: Protein Data Bank in Europe. Nucleic Acids Res 39:D402–D410
- 8. Kinjo AR, Yamashita R, Nakamura H (2010) PDBj Mine: design and implementation of relational database interface for Protein Data Bank Japan. Database (Oxford) 2010:baq021
- 9. Gabanyi MJ, Adams PD, Arnold K, Bordoli L, Carter LG, Flippen-Andersen J, Gifford L, Haas J, Kouranov A, McLaughlin WA, Micallef DI, Minor W, Shah R, Schwede T, Tao YP, Westbrook JD, Zimmerman M, Berman HM (2011) The Structural Biology Knowledgebase: a portal to protein structures, sequences, functions, and methods. J Struct Funct Genomics 12:45–54
- 14. Fukuda K (2008) INOH pathway database: curation, annotation, integration. InterOntology08 1:47–50
- 19. Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40(Database issue):D1100–D1107
- 24. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25–29
- 34. Berman HM, Westbrook JD, Gabanyi MJ, Tao W, Shah R, Kouranov A, Schwede T, Arnold K, Kiefer F, Bordoli L, Kopp J, Podvinec M, Adams PD, Carter LG, Minor W, Nair R, Baer JL (2008) The protein structure initiative structural genomics knowledgebase. Nucleic Acids Res 37:D365–D368
- 35. Pachev AS (2007) Understanding MySQL internals. Sebastopol, CA, O’Reilly, Beijing
- 44. Levine SJ, Wenzel SE (2010) Narrative review: the role of Th2 immune pathway modulation in the treatment of severe asthma and its phenotypes. Ann Intern Med 152:232–237
- 46. Pitt WR, Higueruelo AP, Groom CR (2009) Structural bioinformatics in drug discovery. In: Gu J, Bourne PE (eds) Structural bioinformatics, 2nd edn. Wiley-Blackwell, Hoboken
- 53. Ji Z, Flaherty KT, Tsao H (2011) Targeting the RAS pathway in melanoma. Trends Mol Med 18:27–35
- 54. Page L, Brin S, Motwani R, Winograd T (1998) The PageRank citation ranking: bringing order to the web. Stanford Digital Libraries Working Paper