Constructing patch-based ligand-binding pocket database for predicting function of proteins
Many solved tertiary structures of unknown function lack global sequence and structural similarity to proteins of known function. In such cases, functional clues can often be obtained by predicting the small ligand molecules that bind to these proteins.
In our previous work, we developed an alignment-free, local surface-based pocket comparison method named Patch-Surfer, which predicts ligand molecules that are likely to bind to a protein of interest. Given a query pocket in a protein, Patch-Surfer searches a database of known pockets and finds those similar to the query. Here, we have extended the database of ligand binding pockets for Patch-Surfer to cover diverse types of binding ligands.
Results and conclusion
We selected 9393 representative pockets with 2707 different ligand types from the Protein Data Bank. We tested Patch-Surfer on the extended pocket database to predict the binding ligands of 75 non-homologous proteins, each of which binds one of seven different ligands. Patch-Surfer achieved an average enrichment factor of over 20.0 at the top 0.1 percent of the ranked database. The results did not depend on the sequence similarity of the query protein to proteins in the database, indicating that Patch-Surfer can identify correct pockets even in the absence of known homologous structures in the database.
Keywords: Enrichment Factor; Ligand Molecule; Flavin Adenine Dinucleotide; Nicotinamide Adenine Dinucleotide; Surface Patch
An increasing number of structures of uncharacterized proteins have been solved by structural genomics projects. As of June 2011, there are 3321 structures of unknown function in the Protein Data Bank (PDB). Elucidating the function of these proteins is an important task for bioinformatics. To predict protein function from structure, we have recently developed an alignment-free local pocket surface comparison method for predicting the type of ligand that is likely to bind to a query protein. The algorithm, named Patch-Surfer, represents a binding pocket as a combination of segmented surface patches, each of which is characterized by its shape, electrostatic potential, hydrophobicity, and concaveness. A query pocket, represented as a group of patches, is compared with a database of pockets of known binding ligand molecules, and the binding ligand prediction is made by summarizing the similar pockets retrieved from the database. Representing a pocket by a set of patches was shown to be effective in tolerating differences in global pocket shape while capturing local similarity between pockets. The shape and the physicochemical properties of surface patches are represented using the 3D Zernike descriptor (3DZD), a series expansion of a mathematical 3D function. In this work, we constructed a large database of ligand binding pockets, which contains a diverse set of pockets. We evaluated the performance of Patch-Surfer on the database in terms of the enrichment factor of correct ligand binding pockets retrieved from the database for query pockets.
The Patch-Surfer method for binding ligand prediction
n defines the range of l, and a 3DZD is a series of invariants (Eqn. 3), one for each pair of n and l, where n ranges from 0 to the specified order. We use order n = 15 (72 invariants) in the local surface patch comparison. The shape and the concaveness are each represented by a vector of 72 invariant values, while the vectors for the electrostatic potential and the hydrophobicity have 144 invariants each.
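The count of 72 invariants at order 15 can be checked with a short sketch. It assumes the standard 3D Zernike index constraint, which is not spelled out in the text above: for each n, l runs from 0 to n with (n - l) even. The function name is hypothetical.

```python
def count_3dzd_invariants(order: int) -> int:
    """Count (n, l) index pairs of a 3D Zernike descriptor up to `order`.

    Assumes the usual 3D Zernike constraint: 0 <= l <= n and (n - l) even.
    """
    return sum(
        1
        for n in range(order + 1)
        for l in range(n + 1)
        if (n - l) % 2 == 0
    )


print(count_3dzd_invariants(15))  # 72, matching the descriptor length quoted above
```

Under this constraint each n contributes floor(n/2) + 1 invariants, so orders 0 through 15 sum to 72.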
Next, the query pocket is compared to known pockets stored in the database. In the database, each pocket is also represented as a set of surface patches. For example, ATP binding pockets are represented with, on average, 29.5 patches. Given the query pocket and a pocket in the database, the pocket comparison process first identifies similar patches between the two pockets using a modified bipartite matching algorithm. Two options were tested for the matching stage: the first matches all patches, while the second matches only those patch pairs whose distance is below a predefined threshold. The similarity of the two pockets is then measured as a linear combination of scoring terms computed over the matched patches.
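The two matching options can be illustrated with a small sketch. It is not the modified bipartite matching used by Patch-Surfer: a simple greedy one-to-one matching on Euclidean distance between descriptor vectors stands in for it, and the function names and the `threshold` parameter are hypothetical.

```python
import math


def descriptor_distance(u, v):
    """Euclidean distance between two patch descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def match_patches(query, target, threshold=None):
    """Greedy one-to-one patch matching (a stand-in, not the authors' algorithm).

    With threshold=None, patches are paired until the smaller pocket is
    exhausted. With a threshold, only pairs whose distance does not exceed
    it are kept. Returns a list of (query_index, target_index, distance).
    """
    candidates = sorted(
        (descriptor_distance(q, t), i, j)
        for i, q in enumerate(query)
        for j, t in enumerate(target)
    )
    used_query, used_target, matches = set(), set(), []
    for dist, i, j in candidates:
        if threshold is not None and dist > threshold:
            break  # candidates are sorted, so all remaining pairs are farther
        if i in used_query or j in used_target:
            continue
        used_query.add(i)
        used_target.add(j)
        matches.append((i, j, dist))
    return matches


# Toy 2D "descriptors"; real patches would use 72- or 144-value vectors.
query_patches = [[0.0, 0.0], [1.0, 1.0]]
target_patches = [[1.0, 1.1], [0.1, 0.0], [5.0, 5.0]]
all_pairs = match_patches(query_patches, target_patches)
close_pairs = match_patches(query_patches, target_patches, threshold=0.05)
```

In this toy case the unthresholded call pairs both query patches with their nearest targets, while the strict threshold leaves no pairs, which is the behavioural difference between the two options.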
Constructing database of representative ligand binding pockets
Representative pockets were selected as follows. A list of 5,438 non-redundant protein structures complexed with ligand molecules, extracted from the PDB, was obtained from the Protein-Small-Molecule DataBase (http://compbio.cs.toronto.edu/psmdb/downloads/CPLX_25_0.85_7HA.list). From this list, we first removed all ligands that consist of fewer than 7 heavy atoms. Then, two ligands that bind to the same protein were grouped together if any pair of atoms, one from each ligand, was closer than 4.0 Å. We further filtered out ligands that are closer than 1.4 Å to the protein, because such ligands bind covalently to the protein. Ligand molecules with no protein heavy atom within 3.5 Å were also removed, as they do not physically interact with the protein. Finally, we obtained 9,393 pocket structures that bind 2,707 different types of ligand molecules.
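The selection steps above can be sketched in a few lines. This is only an illustration under stated assumptions: each ligand is a list of heavy-atom (x, y, z) coordinates in Å, grouping is transitive (single linkage), and the two distance filters are applied to each grouped ligand as a whole. All names are hypothetical; the cutoffs are the ones quoted in the text.

```python
import math

MIN_HEAVY_ATOMS = 7    # ligands with fewer heavy atoms are dropped
GROUPING_CUTOFF = 4.0  # Å, ligands closer than this are merged
COVALENT_CUTOFF = 1.4  # Å, closer than this is treated as covalent
CONTACT_CUTOFF = 3.5   # Å, farther than this is treated as non-interacting


def min_distance(atoms_a, atoms_b):
    """Smallest distance between any atom of A and any atom of B."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def group_ligands(ligands):
    """Merge ligands of one protein that come within GROUPING_CUTOFF."""
    groups = []
    for ligand in ligands:
        merged, untouched = [ligand], []
        for group in groups:
            if any(min_distance(ligand, m) < GROUPING_CUTOFF for m in group):
                merged.extend(group)
            else:
                untouched.append(group)
        groups = untouched + [merged]
    return groups


def select_pockets(ligands, protein_atoms):
    """Apply the size, grouping, and distance filters in the order described."""
    large = [lig for lig in ligands if len(lig) >= MIN_HEAVY_ATOMS]
    kept = []
    for group in group_ligands(large):
        atoms = [atom for lig in group for atom in lig]
        closest = min_distance(atoms, protein_atoms)
        if closest < COVALENT_CUTOFF or closest > CONTACT_CUTOFF:
            continue
        kept.append(group)
    return kept


# Toy example: two protein atoms and four candidate ligands.
protein = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
lig_ok = [(1.5 * i, 3.0, 0.0) for i in range(7)]             # 3.0 Å away
lig_small = [(0.0, 2.0, 20.0)] * 3                           # too few atoms
lig_far = [(1.5 * i, 0.0, 50.0) for i in range(7)]           # no contact
lig_covalent = [(100.0 + 1.5 * i, 1.0, 0.0) for i in range(7)]  # 1.0 Å away
pockets = select_pockets([lig_ok, lig_small, lig_far, lig_covalent], protein)
```

Only `lig_ok` survives in this toy case. A real pipeline would read coordinates from PDB entries and would need to decide whether the distance filters apply before or after grouping, which the text does not specify.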
Obtaining weighting factors for scoring function
The average and the standard deviation are used to normalize the difference in the distribution of the four properties.
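One plausible reading of this normalization is a z-score transform applied to each property's distances before they are combined. The sketch below assumes that reading; the function name and sample values are hypothetical, and the weighting factors themselves are not reproduced here.

```python
from statistics import mean, pstdev


def zscore(values):
    """Rescale values to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant input carries no signal
    return [(v - mu) / sigma for v in values]


# Hypothetical raw patch distances for one property on its own scale.
shape_distances = [1.0, 2.0, 3.0]
normalized = zscore(shape_distances)
```

After this step, distances from properties with very different raw ranges become comparable, so a single set of weights can be applied across them.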
We tested the performance of Patch-Surfer using a test dataset that consists of 75 protein pockets, each of which binds one of the following seven ligands: adenosine monophosphate (AMP) (9 pockets), adenosine-5'-triphosphate (ATP) (14 pockets), flavin adenine dinucleotide (FAD) (10 pockets), flavin mononucleotide (FMN) (6 pockets), alpha- or beta-D-glucose (GLC) (5 pockets), heme (HEM) (16 pockets), and nicotinamide adenine dinucleotide (NAD) (15 pockets).
The enrichment factor (EF) at the top x percent of the ranked database is defined as

$$EF_x = \frac{N_x^P / N_x}{T^P / T^{DB}}$$

where $T^P$ is the total number of pockets that bind ligand type P in a database of size $T^{DB}$, $N_x^P$ is the number of pockets for ligand type P ranked within the top x percent by the database search method (Patch-Surfer), and $N_x$ is the total number of retrieved pockets ranked in the top x percent of the database.
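As a worked illustration, the enrichment factor can be computed as the fraction of correct pockets in the top x percent divided by their fraction in the whole database. The sketch below follows the quantities defined above; the function name, the rounding used to turn x percent into a pocket count, and the toy ranking are assumptions.

```python
def enrichment_factor(ranked_ligands, ligand, top_percent):
    """Enrichment factor of `ligand` in the top `top_percent` of a ranking.

    ranked_ligands: ligand type of every database pocket, best match first.
    """
    t_db = len(ranked_ligands)           # database size
    t_p = ranked_ligands.count(ligand)   # pockets binding the ligand overall
    if t_p == 0:
        raise ValueError("ligand type is absent from the database")
    # Assumption: the top x percent is rounded, with at least one pocket kept.
    n_x = max(1, round(t_db * top_percent / 100.0))
    n_x_p = ranked_ligands[:n_x].count(ligand)  # correct pockets in the top
    return (n_x_p / n_x) / (t_p / t_db)


# Toy ranking: 1000 pockets, 10 bind ATP, and 5 of them rank in the top 10.
ranking = ["ATP"] * 5 + ["other"] * 5 + ["ATP"] * 5 + ["other"] * 985
ef_top1 = enrichment_factor(ranking, "ATP", top_percent=1.0)
```

Here half of the top 1 percent binds ATP against a background of 1 percent, giving an enrichment of about 50. A random ranking would give a value near 1.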
Results and discussion
Pocket retrieval results
Effect of the sequence identity to the enrichment factor
Binding ligand prediction examples
We constructed a large database of representative ligand binding pockets for Patch-Surfer. The high enrichment factor (EF) achieved by Patch-Surfer shows that the method is able to retrieve pockets that bind the same ligand from the large database, even in the absence of homologous proteins in the database. We are currently building a web server for easy access to Patch-Surfer.
This work is supported by the National Institute of General Medical Sciences of the National Institutes of Health (R01GM075004, R01GM097528). DK also acknowledges grants from NSF (DMS0800568, EF0850009, IIS0915801).
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 2, 2012: Proceedings from the Great Lakes Bioinformatics Conference 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S2
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.