Prediction of dinucleotide-specific RNA-binding sites in proteins
- 3.2k Downloads
Regulation of gene expression, protein synthesis, replication and assembly of many viruses involve RNA–protein interactions. Although some successful computational tools have been reported to recognize RNA binding sites in proteins, the problem of specificity remains poorly investigated. After the nucleotide base composition, the dinucleotide is the smallest unit of RNA sequence information and many RNA-binding proteins simply bind to regions enriched in one dinucleotide. Interaction preferences of protein subsequences and dinucleotides can be inferred from protein-RNA complex structures, enabling a training-based prediction approach.
We analyzed basic statistics of amino acid-dinucleotide contacts in protein-RNA complexes and found their pairing preferences could be identified. Using a standard approach to represent protein subsequences by their evolutionary profile, we trained neural networks to predict multiclass target vectors corresponding to 16 possible contacting dinucleotide subsequences. In the cross-validation experiments, the accuracies of the optimum network, measured as areas under the curve (AUC) of the receiver operating characteristic (ROC) graphs, were in the range of 65-80%.
Dinucleotide-specific contact predictions have also been extended to the prediction of interacting protein and RNA fragment pairs, which shows the applicability of this method to predict targets of RNA-binding proteins. A web server predicting the 16-dimensional contact probability matrix directly from a user-defined protein sequence was implemented and made available at: http://tardis.nibio.go.jp/netasa/srcpred.
KeywordsDinucleotide Accessible Surface Area Prediction Score Specific Contact Fragment Pair
Protein-RNA interactions are involved in various regulatory and constitutional cellular functions. Role for RNA in cellular defense and developmental regulation, which involves its interaction with proteins, has also been reported [1, 2]. Most of these biological functions involve an accurate identification of specific recognition sites in proteins and RNA. Yet, protein-RNA interactions are far less understood than protein–DNA and protein-protein interactions, partly because fewer structures are available and also likely because these interactions are more complex than the others. In contrast to the intuitively similar protein-DNA interactions, nucleic acid structures in protein-RNA complexes are diverse, resulting in a wider range of mechanisms for protein-RNA interactions [3, 4] and hence their prediction is more difficult.
Several computational approaches have been developed to identify RNA-binding proteins and RNA-binding sites. Statistical potentials and docking scoring functions derived from databases of complex crystal structures effectively identified RNA-binding sites in proteins [5, 6, 7]. Prediction of RNA-protein interactions from sequences or structures using machine learning methods have also been successfully attempted [8, 9, 10, 11, 12, 13, 14, 15]. However, none of these methods addresses the crucial issue of specific interactions i.e. the same protein interacts with some RNA sequences and not others. Knowledge of exact interaction partners and the mechanism of recognition are essential for an understanding and ability to control RNA-level regulatory processes. Hence, an extension of machine learning methods to address this problem is of crucial significance.
We recently addressed the issue of specific recognition of protein and nucleotide subsequences for protein-DNA interactions by looking at the dinucleotide specific contact preferences of amino acids in proteins, using sequence and evolutionary contexts, and applied this information to predict specific contacts from sequence . As a natural extension of this approach to RNA-binding proteins, and also because some RNA-binding proteins are shown to bind regions largely characterized by enrichment in one dinucleotide , we explored the possibility of predicting amino acid dinucleotide contacts. We start with a statistical analysis of protein-RNA contacts and identify patterns of contact preferences that contrast or resemble protein-DNA contacts. The prediction model takes a matrix of sequence- and evolutionary profile-derived features and simulates the RNA-interacting state of every amino acid residue in a given sequence. The resulting 16-dimensional multi-class output neural network was optimized by an early-stopping algorithm and five-fold-out cross-validation. Results indicate that specific RNA-binding contacts can be predicted with accuracy levels similar to those achieved in DNA-binding proteins. We applied dinucleotide-specific contact predictions to identify full-length RNA sequence targets of proteins, with promising success.
In summary, the proposed method can be used in two ways: (1) to improve the performance of RNA-binding sites on proteins, as prediction scores will be modulated by the dinucleotide composition of target RNA sequences and one can look for prediction scores that correspond to the dinucleotide enriched in a known target and (2) predicted binding sites in proteins can be converted to a 16-dimensional dinucleotide score, which works like a pseudo dinucleotide composition itself and candidate target RNA sequences can thereby be scored by comparing observed dinucleotide compositions in candidate target sequences and thereby use them to predict targets of RNA-binding proteins.
Data set (PRNA160)
The dataset and annotations have been adopted from our previous work on predicting RNA-binding proteins from their electric moments . The data preparation procedure is summarized here as follows.
All protein-RNA complexes in the Protein Data Bank (PDB) , which have an annotations in SCOR database were selected . We eliminated redundancy using BLASTCLUST  by obtaining clusters at a 25% sequence identity and selecting the chain with the highest number of RNA contacts from each of the 160 clusters obtained at this threshold. A final list of 160 protein chains and their structural classes, was used to generate and evaluate the models, and is provided as a supplementary Table S1 (Additional file 1).
The contact profiles for the protein sequences were computed from protein-RNA complex structures in the compiled data set. We defined a contact when the distance between any atom in the protein and any atom of the RNA dinucleotide is less than 3.5Å. The contact class of all 16 dinucleotide sequence elements is written in alphabetically order viz. AA, AC, AG, AU, CA, CC, CG, CU, GA, GC, GG, GU, UA, UC, UG and UU.
Contact statistics and random docking
Where i, and j refer to 20 unique residues and dinucleotides and ASA refers to the absolute accessible surface area, computed by the NACCESS program . Coordinates for each protein chain were isolated from the complex structure before computing the ASAs of amino acid residues. RNA-chains were isolated similarly for dinucleotide ASAs. ASA(i) refers to the total accessible surface area of all residues of a giventype in a given data set and N is the total number of observed contacts of all residue-dinucleotide pairs.
Where i and j have the same meaning as in equation (1) i.e. they take values from 1 to 20 and 1 to 16 respectively. Eij is computed as in (1) and Oij is the corresponding number of contacts in all complexes (note that (1) is for one complex but (2) refers to all complexes taken together). Chi-squared values from observed and expected number of contacts are converted to p-values using standard look-up tables. Chi-squared values computed above denote statistical significance of both positive and negative deviations from expected numbers of counts. To indicate the exclusion or enrichment of a residue-dinucleotide pair in the interface, signed chi-squared values were obtained by multiplying these values by -1 if the observed number of contacts was less than the expected counts.
Two types of encoding schemes of single protein sequences are widely prevalent and have been adopted for this work. A first scheme represented an amino acid residue and its neighbours by 21-bit-sparse-encoded binary vectors i.e. by taking all the 21 components to be zero except for the one identifying a given residue, which is set as 1. The first 20 components resemble the amino acid type and the last one is labelling terminal positions, without neighbours. A second scheme represented each residue by its evolutionary profile. As previously described by us and others [24, 25], the evolutionary profile of a residue, represented as Position-Specific Scoring Matrices (PSSMs), was computed by using the PSI-BLAST program and against NCBI's NR database for each protein sequence. PSSMs were generated by three iterations of PSI-BLAST with default parameters of blastpgp. Prior to neural network training, a vector representing global amino acid composition (GAC) of each protein sequence was concatenated to the feature vector of each residue. Sliding windows of sizes ranging from 1 to 8 sequence neighbours were centred on each residue. Different combinations of feature matrices were tested and the best performing combination was retained.
Artificial neural networks
Three layered (one hidden layer) fully connected neural networks trained by back propagation were simulated using the SNNS software package . The number of input nodes was equal to the number of training features and 16 output nodes represented all possible 16 RNA dinucleotide contacts. Different network architectures were optimized for every feature matrix varying the number of nodes in the hidden layer. Optimum networks were chosen by a five-fold cross-validation scheme according to an early stopping criterion. The data was divided into five sets of proteins and for each training cycle, three subsets were combined to form the training data. Out of the two subsets left out, one was used to determine the stopping point of training. All the reported performance scores were computed on the third (left out) sets. Shuffling the training and test data sets during five cycles of cross-validation ensured that each protein had been used for evaluating performance. Numerical values between 0 and 1 returned by the neural network were transformed into a binary state of binding or non-binding by varying cut-off values during receiver operating characteristic (ROC) analysis. For the purpose of constructing a web server, a precision score (for each of the 16 contact classes) was computed for a prediction score at a given position used as a cut-off. This score corresponds to the true positive probability of that contact class.
Applicability to scanning interacting protein-RNA fragments
From the protein-RNA complexes in the PRNA160 dataset, pairs of protein and RNA sequence fragments of fixed lengths were generated by using a sliding window on both. Subsequently, all combinations of protein-RNA fragment pairs in a given complex were generated by matching each protein fragment to all RNA fragments. Fragment pairs were assigned a binary class according to the contacting information from each complex structure (0 or 1). A fragment pair was labelled as 1 if any of the atoms from the protein fragment was in contact with any of the atoms of the RNA fragment at a threshold of 3.5Å atom-atom distance; otherwise it was labeled as 0. A second neural network model was built to identify the positive class (label=1) from the negative class (label=0). The protein fragments were encoded in fragment feature vectors built by concatenation of the predicted 16 dinucleotide scores of each residue in a given fragment, while the RNA sequence fragments were encoded using dinucleotide compositions. The protein and RNA feature vectors for each fragment pair in a given complex were concatenated forming a fragment-pair feature matrix per complex that was used to train a new neural network. We used the same training algorithm described above but in this case the neural networks predict a one-dimensional binary vector encoding the “paired” or “non-paired” state of the protein-RNA fragment pairs.
Neural networks are trained to return a vector of 16 dinucleotide-binding prediction scores for a protein residue between 0 and 1, which are transformed to binary states of binding or non-binding by choosing a cut-off (for the fragment-pair prediction, only one score is obtained instead of 16, and is treated in an obviously similar way). For each dinucletide type, the occurrence of a contact is considered positive (P) and negative (N) otherwise. True (T) indicates that the states of predicted and observed contact are identical and false (F) otherwise. The notation TP, TN, FP, FN combines these labels to return the number of data points (residues) in that category. These values correspond to a cut-off at which the neural network outputs are transformed into binary predictions. The predicted binding scores are transformed into binary predictions by using different cut-offs yielding sensitivity and specificity over the entire score range. ROC graphs for "1-specificity" (false positive rate) values versus corresponding sensitivity (true positive rate) values are plotted. This graph depicts the variations of the increment in false positive rate with an increase in the true positive rate. Throughout this work, the performance of the models was measured by the total AUC for the ROC.
Results and discussion
Dinucleotide contact statistics
A simple look at the number of contacts with RNA revealed that on an average 1.7% of the residues in the RNA-binding proteins are in contact with a given dinucleotide in the complexes (data not shown). This result differs from our previous report on DNA-binding proteins, which bind to DNA with an average of 1% of the residues, reflecting a larger contact surface in protein-RNA complexes. In contrast to the protein-DNA complexes, in which residues have remarkable preferences for particular dinucleotides such as AG/CT, AC/GT and AT/AT, the contacts in protein-RNA complexes are more homogeneously distributed among all dinucleotides (detailed data not shown). However, contacts with CG (2.2%), GA (2.2%), GG (2.2%), CC (2.1%) and AG (2.0%) were slightly more abundant than dinucleotides UA (1.5%), AU (1.2%) and UU (1.0%). This result illustrates the greater complexity of RNA recognition in comparison to DNA. This is expected, because Protein-RNA interactions involve the matching of a variety of conformational arrangements rather than recognition of canonical helix configurations, as in the case of protein-DNA complexes .
Prediction performance using sequence neighbours
What leads to the predictability of specific contacts from PSSMs?
PSSMs, as used in this work enumerate the residue substitution profiles during evolution at each alignment position for a given protein sequence. The evolutionary patterns of amino acid substitutions at given positions of RBPs represent constrains imposed by the requirement of specific interactions. In our previous study on dinucleotide-specific DNA binding-site prediction , we also found that the amino acid substitution pattern of a residue and its neighbours (encoded in the PSSM) was well-defined during evolution and could be correlated to the ability of the residue to interact with particular DNA dinucleotides. Our findings for protein-RNA interactions suggest that, although more complex and varied than protein-DNA interactions, the substitution patterns of the functional residues in RNA-binding proteins are also significantly constrained during evolution.
Predictability of specific contact across RNA functional classes
In order to improve the model performance for the different RNA functional classes, we implemented functional class-wise predictors of dinucleotide-specific RNA binding sites in proteins. In this case, neural networks were trained independently with complexes from a single functional class. After trying all combinations of sequence and evolutionary features for each RNA class, the prediction performance only partially improved for proteins binding to viral RNA. Interestingly in this case, the optimum neural network was trained with evolutionary information of the interacting residue and GAC of the whole protein sequence without considering any contribution from neighbouring residues. Despite the low accuracy (AUC<55%) yielded for AU dinucleotide (Figure S1 in Additional file 3), the AUC values for the rest of the dinucleotides ranged from 63-88%. It is noteworthy that CU, GA, and UG dinucleotides exhibited improved accuracies of about 85%, 81% and 88%. The data shortage on viral protein-RNA crystal structures is likely another cause of the underperformance observed for this RNA functional class, which could be overcome when more crystal structures of viral RNA complexes become available.
Application to the prediction of binding protein-RNA fragment pairs
SRC PRED web server
A web server which takes FASTA-formatted sequences as input and predicts 16-dimensional vectors representing all possible dinucleotides for each residue position has been implemented and made available at http://tardis.nibio.go.jp/netasa/srcpred. One of the problems in working with this type of web servers is to interpret prediction scores. In this regard, we developed a strategy by using the raw prediction score at each position as a cutoff and determined the corresponding precision in the benchmarking data sets. Precision scores at this cutoff represents the probability that the position being in the positive class i.e. binding residue-dinucleotide pair. The web server automatically returns these probability scores for all residue positions and provides a graphical prediction highlighted according to the range of these scores.
The paper shows that single amino-acid sequences and evolutionary profiles can predict RNA dinucleotide-specific contacts with accuracy somewhat lower than the prediction of general RNA-binding sites. Specific contacts in different RNA functional classes can be successfully predicted from a single comprehensive model but viral RNA complexes performance was slightly poorer. The best prediction accuracies, measured by the area under the ROC graphs, were ~80% for the general model. In addition, we showed that the calculated residue-wise prediction scores, used in combination with dinucleotide compositions correctly identified about 70% of protein-RNA fragment pairs at complex interfaces. This study will provide us with a better understanding and more accurate predictions of the specific base-amino acid interactions in protein-RNA complexes.
The work was partially supported by a fusion grant for interdisciplinary research by IFREC to YK, DS, SA and MF and by a grants-in-aid (kaken-hi) award from the JSPS, Japan to SA.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 13, 2011: Tenth International Conference on Bioinformatics – First ISCB Asia Joint Conference 2011 (InCoB/ISCB-Asia 2011): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S13.
- 23.Hubbard SJ, Thornton JM: NACCESS. London: Department of Biochemistry and Molecular Biology, University College London; 1993.Google Scholar
- 26.Zell A, Mache N, Hubner R, Mamier G, Vogt M, Herrmann K, Schmalzl M, Sommer T, Hatzigeorgiou A, Doring S, et al.: SNNS: Stuttgart neural-network simulator; Tech Rep 3/93 Inst. Parallel Distributed High-Performance Syst., Univ. Stuttgart, Germany. Stuttgart; 1993.Google Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.