# Extract interaction detection methods from the biological literature

- 3.5k Downloads
- 6 Citations

## Abstract

### Background

Considerable efforts have been made to extract protein-protein interactions from the biological literature, but little work has been done on the extraction of interaction detection methods. It is crucial to annotate the detection methods in the literature, since different detection methods shed different degrees of reliability on the reported interactions. However, the diversity of method mentions in the literature makes the automatic extraction quite challenging.

### Results

In this article, we develop a generative topic model, the Correlated Method-Word model (*CMW* model) to extract the detection methods from the literature. In the *CMW* model, we formulate the correlation between the different methods and related words in a probabilistic framework in order to infer the potential methods from the given document. By applying the model on a corpus of 5319 full text documents annotated by the *MINT* and *IntAct* databases, we observe promising results, which outperform the best result reported in the *BioCreative II* challenge evaluation.

### Conclusion

From the promising experiment results, we can see that the *CMW* model overcomes the issues caused by the diversity in the method mentions and properly captures the in-depth correlations between the detection methods and related words. The performance outperforming the baseline methods confirms that the dependence assumptions of the model are reasonable and the model is competent for the practical processing.

### Keywords

Latent Dirichlet Allocation Related Word Topic Factor Latent Topic Variational Inference## Background

### Interaction detection method extraction

The study of protein interactions is one of the most pressing biological problems. In the literature mining community, considerable efforts have been made to automatically extract the protein-protein interactions (*PPI*) from the literature [1, 2, 3] and some practical systems have been put into use [4, 5].

Nevertheless, little work has been done to automatically extract the interaction detection methods from the literature. The detection methods available to identify protein interactions vary in their level of resolution and the confidence of reliability. Therefore, it is important to identify such detection methods in order to validate the reported interactions. Some interaction databases, such as *MINT* [6] and *IntAct* [7], require the interaction entries to be experimentally confirmed. However, manually annotating the detection methods in the literature is time-consuming: on average, the curation of a manuscript takes up 2–3 hours of an expert curator [8]. Therefore, there is great practical demand of automatically extracting the detection methods from the literature.

The first critical assessment of detection method extraction was carried out by the *BioCreative II* challenge evaluation [9]. But only two groups (out of sixteen) submitted their results.

The diversity of method mentions in the literature is the major obstacle precluding the automatic extraction. In the real situation, different authors prefer different words and phrases to describe the same methods. For example, the detection method "*two hybrid*" (MI:0018) has 7 related synonyms, e.g. "*2-hybrid*", "*2 H* ", "*2 h*", "*classical two hybrid*", "*Gal4 transcription regeneration*", "*two-hybrid*", "*yeast two hybrid*", and one exact synonym, e.g. "*2 hybrid*", in the MI ontology [10] definition (it includes the terms describing the interaction detection methods). Although the ontology has already included so many different descriptions, biologists would just mention "*yeast 2-h*", which is not included in the ontology, in their manuscripts.

*BioCreative II*challenge evaluation. The matching performance is demonstrated in Table 1.

String matching performance.

| | | |
---|---|---|---|

740 Full Texts | 0.090 | 0.107 | 0.098 |

As Table 1 illustrates, the poor recall performance confirms the serious diversity, and the inferior precision stems from the simple matching algorithm, which does not take the context into consideration, since most of the matched names are not the exact methods applied in the document but the background knowledge. In this sense, the rigid dictionary-based matching strategy fails to address the practical problem.

Another straightforward solution is to treat the extraction issue as a classification problem – for each detection method in the ontology definition, a set of binary classifiers are built to make yes/no decisions [11, 12]. But the traditional discriminative classifiers make little attempt to uncover the probabilistic structure and the correlation within both input and output spaces. In the biological domain, ignoring the correlation within both methods and words would hinder the performance since there are intrinsic relations.

In another point of view, from the perspective of involvement of domain experts, some approaches achieved acceptable results on the small data set. In Rinaldi's work [13], they invited the biologists to summarize the keywords and patterns for the extraction task and manually refined the patterns according to the performance. Obviously, this manner is not suitable for the large-scale data processing and its flexibility is not desirable.

### Generative topic model

Nowadays, in the machine learning community, the generative topic model is receiving more and more attentions. Latent Dirichlet Allocation (*LDA*) [14] is one of the most typical models. *LDA* reduces the complex process of producing a document into a small number of simple probabilistic steps and thus specifies a probability distribution over all possible documents. Using standard statistical techniques, one can invert the process and infer the set of latent topics responsible for generating a given set of documents [15].

*LDA*-like topic models are rapidly developed into quite different domains. Xing Wei [16] introduced the *LDA* model into information retrieval system and improved the retrieval performance; David Mimno [17] proposed the Author-Persona-Topic model to formulate the expertise of authors based on their publications; Fei-Fei Li [18] advanced a hierarchical generative model to classify natural scene in an unsupervised manner.

The advantages of the generative topic models are: 1) it would be easy to postulate complex latent structures responsible for a set of observations; 2) the correlation between different factors could be easily exploited by introducing the latent topic variables.

In this article, in order to extract the detection methods from the biological literature, we propose to formulate the correlation between the detection methods and related word occurrences in a probabilistic framework. In particular, we assume the applied methods are governed by a set of latent topics and the corresponding word descriptions are also influenced by the same topic factors, which characterize the correlation between the methods and related words. Under this setting, we appeal to the generative topic model to capture such latent correlations and infer the potential methods from the observed words by the statistic inference technique.

The intuitive notion behind the proposed model is that: different documents contain informative commonality in the descriptions of the same methods, therefore we propose to discover the common usage patterns for the desired methods from the latent correlations between the methods and related words. This manner is somehow analogous to the idea that to extract templates from the overlapping of different method descriptions. But the diversity in the method mentions brings the traditional template generation algorithms with low support and low confidence problems. Furthermore, when there are multiple methods in one document, the traditional approach would fail to figure out the latent correlations. In contrast, the generative model deals naturally with the missing data and provides a more feasible and theoretical framework.

The paper is organized as follows: in the Methods section, we present detailed descriptions about the proposed model and discuss the inference and parameter estimation procedures for the model; in the Results section, we perform extensive experiments to validate the proposed model; and in the Conclusions section, we would conclude the work and demonstrate our contributions in this paper.

## Methods

### Correlated Method-Word model

*CMW*model) to extract the detection methods from the biological literature. The

*CMW*model is depicted in Figure 1 with graphical representation. In the standard graphical model formalism [19]: nodes represent the random variables and edges indicate the possible dependence. The joint probability can be obtained from the graph by taking the product of the conditional distribution of nodes given their parents, see Eq(1).

The model can be viewed in the terms of generative process that, the author should first select a set of topics for his/her manuscripts (e.g. physical protein-protein interactions); under different kind of topics, there are different choices of detection methods to confirm the findings (e.g. *pull down* to confirm protein interactions); the selected methods are represented by the particular word occurrences (e.g. descriptions of the experiment conditions, properties and materials), which are also governed by the selected topics. Therefore, the correlations between the detection methods and related words are characterized by the latent topic factors; and from the observed words, we are able to infer the potential methods in the given document according to such correlations.

Formally, we define a corpus consists of *D* documents, *E* methods and *V* words, and a given document consists of *N* methods and *M* words. To simplify the model, we have assumed the topic size *k* is known and fixed on the whole corpus. In the given document *d*, we denote *θ* as the document-specific topic distribution; **z** = {*z*_{1}, *z*_{2}, *z*_{3},..., *z*_{ N }} as the particular discrete topic assignments for each method; **y** = {*y*_{1}, *y*_{2}, *y*_{3},..., *y*_{ M }} as the indexing variables to indicate which topic factor generates the corresponding word and *ϵ* as the method distribution under the topics. These are the latent variables. *e* = {*e*_{1}, *e*_{2}, *e*_{3},..., *e*_{ N }} and *w* = {*w*_{1}, *w*_{2}, *w*_{3},..., *w*_{ M }} are the observed methods and words in document *d*. Besides, *α* and *η* are the parameters of *k*-dimensional and *E*-dimensional Dirichlet distributions that postulate the topic and method prior distributions on the corpus and *β* is a *k* × *V* matrix, which represents the word distribution under topics. These are the model parameters.

*α*,

*β*,

*η*), the

*CMW*model assumes the following generative process of the methods and related words in one document:

- 1.
Sample topic proportion

*θ*from the Dirichlet distribution:*θ*~*Dir*(*α*) - 2.
For each method

*e*_{ n }, n ∈ {1, 2, 3,..., N}: - a.
Sample topic factor

*z*_{ n }from the multinomial distribution :*z*_{ n }~*Mul*(*θ*) - b.
Sample method

*e*_{ n }from the multinomial distribution conditioned on*z*_{ n }:*e*_{ n }~*p*(*e*_{ n }|*ϵ*,*z*_{ n }) - 3.
For each related word

*w*_{ m }, m ∈ {1, 2, 3,..., M}: - a.
Sample indexing variable

*y*_{ m }from the Uniform distribution conditioned on N:*y*_{ m }~*Unif*(1, 2, 3,...,*N*) - b.
Sample word

*w*_{ m }from the multinomial distribution conditioned on ${z}_{{y}_{m}}$:*w*_{ m }~*p*(*w*_{ m }|*β*, ${z}_{{y}_{m}}$)

Our basic notion about each component of this model is that, the discrete occurrences of detection methods and related words in the given document are governed by the topic-specific distributions (e.g. matrix *ϵ* and *β*) respectively. We use such conditional distribution to bridge the correlation between the methods and word occurrences: under different topics, there are different choices of detection methods and the corresponding word descriptions. To formulate this notion in a probabilistic framework, we follow the general settings in the *LDA* model that we assume the document-specific topic proportion *θ* is drawn from the *k*-dimensional Dirichlet distribution *Dir*(*α*), which determines the topic mixture proportion. Especially, we treat the parameter of method's multinomial distribution *ϵ* as a *k* × *E* matrix (one row represents for each mixture component), and, to avoid over-fitting caused by the unbalanced and sparse method occurrences, we assume that each row of *ϵ* is independently drawn from the *E*-dimensional Dirichlet distribution: *ϵ*_{ i }~*Dir*(*η*), which smooths the method distribution under each topic. Each row of matrix *β* represents the particular word distribution under the topics. Besides, since the correlation between the methods and word occurrences is underlying (a document usually associates with multiple detection methods), we use the indexing variable **y** to indicate such latent structure between them.

*CMW*model is illustrated in Figure 2.

The traditional approach (the left panel of Figure 2) simply assumes the relation between the detection methods and related words is determined by the direct mapping. On the contrary, the *CMW* model (the right panel of Figure 2) formulates the relationship within a more throughout consideration: via the latent topic factors, word occurrences are formulated as a finite mixture under particular methods, so that they are not restricted to any methods and multiple words could contribute to the same method. This framework is more suitable and robust to deal with the diversity in the method descriptions. Furthermore, the discriminative classification algorithms assume the methods are independent in prior and the words are also independent when observing the given methods. Thus they would neglect the latent patterns within both methods and words. But in the *CMW* model, different topics govern dissimilar methods and words occurrences, embedding the correlation not only between different methods but also within the related words (see the Correlation between methods and words section and the Methods correlation analysis section for the detailed experiment results).

Efficient dimensional decomposition is explicitly implemented: *V*-dimensional word space and *E*-dimensional method space are mapped into the *k*-dimensional topic space, in which it will be easier for us to reveal the latent correlations between the detection methods and the variant word occurrences.

### Inference and parameter estimation

#### Variational inference

*CMW*model, we need to compute the posterior distribution of the methods in a given document, that is:

Unfortunately, this posterior distribution is intractable: the couples between the continuous variable *θ* and discrete variable *β*, *ϵ* induce a combinatorial number of terms, making it impossible to efficiently get the exact inference result.

Although the exact inference is intractable, there are a wide variety of approximate inference algorithms can serve for the propose, including: expectation propagation [20], variational inference [21] and Markov chain Monte Carlo (*MCMC*) [22] etc. For computational efficiency, we develop a variational inference procedure to approximate the lower bound of the desired posterior distribution of methods in a given document.

where the Dirichlet parameters *γ*, *σ* and the Multinomial parameters *ϕ*, *λ* are free variational parameters.

The meaning of the above variational distribution is that: we discard the dependence among the latent variables by assuming they are independently drawn from the respective distributions. In that case, the aim of the variational inference is to find the optimal variational parameters which could minimize the *Kullback-Leibler* (*KL*) divergence between the variational distribution and the true posterior distribution.

- 1.Dirichlet parameter
*γ*:${\gamma}_{i}={\alpha}_{i}+{\displaystyle \sum _{n=1}^{N}{\varphi}_{ni}}$(3) - 2.Multinomial parameter
*ϕ*:$\mathrm{log}{\varphi}_{ni}\propto {\displaystyle \sum _{m=1}^{M}{\lambda}_{mn}{w}_{m}^{s}{\beta}_{is}}+[\psi ({\gamma}_{i})-\psi ({\displaystyle \sum _{n=1}^{k}{\gamma}_{t}})]+{e}_{n}^{j}[\psi ({\sigma}_{ij})-\psi ({\displaystyle \sum _{t=1}^{E}{\sigma}_{it}})]$(4)

*m*th word in the document is the

*s*th one in the vocabulary, and ${e}_{n}^{j}$ means the

*n*th method in the document is the

*j*th method in the list.

- 3.Multinomial parameter
*λ*:$\mathrm{log}{\lambda}_{mn}\propto {\displaystyle \sum _{i=1}^{k}{\varphi}_{ni}{w}_{m}^{s}{\beta}_{is}}$(5) - 4.Dirichlet parameter
*σ*:${\sigma}_{ij}={n}_{j}+{\displaystyle \sum _{d=1}^{D}{\displaystyle \sum _{n=1}^{{N}_{d}}{\varphi}_{dni}{e}_{dn}^{j}}}$(6)

These update equations are invoked repeatedly until the relative change in *KL* is small (< 0.0001%).

*p*(

**e**|

**w**

*α*,

*β*,

*η*) to infer the potential methods in a given document as follows:

#### Parameter estimation

*CMW*model. This time, we are looking for the optimal model parameters to tighten the lower bound of likelihood and obtain the following update equations:

- 1.Update the Dirichlet parameter
*α*by the Newton-Raphson algorithm:$\frac{\partial L(\alpha )}{\partial {\alpha}_{i}}={\displaystyle \sum _{d=1}^{D}\left\{\psi ({\displaystyle \sum _{t=1}^{k}{\alpha}_{t}})-\psi ({\alpha}_{i})+\psi ({\gamma}_{di})-\psi ({\displaystyle \sum _{t=1}^{k}{\gamma}_{dt}})\right\}}$(8)$\frac{{\partial}^{2}L(\alpha )}{\partial {\alpha}_{i}\partial {\alpha}_{j}}=D\left\{{\psi}^{\prime}({\displaystyle \sum _{t=1}^{k}{\alpha}_{t}})-\delta (i,j){\psi}^{\prime}({\alpha}_{i})\right\}$(9)

*δ*(

*i, j*) = 1 when

*j*=

*k*, otherwise 0.

- 2.Update the Dirichlet parameter
*η*by the Newton-Raphson algorithm:$\frac{\partial L(\eta )}{\partial {\eta}_{j}}={\displaystyle \sum _{i=1}^{k}\left\{\psi ({\displaystyle \sum _{t=1}^{E}{\eta}_{t}})-\psi ({\eta}_{j})+\psi ({\sigma}_{ij})-\psi ({\displaystyle \sum _{t=1}^{E}{\sigma}_{it}})\right\}}$(10)$\frac{{\partial}^{2}L(\eta )}{\partial {\eta}_{i}\partial {\eta}_{j}}=k\left\{{\psi}^{\prime}({\displaystyle \sum _{t=1}^{E}{\eta}_{t}})-\delta (i,j){\psi}^{\prime}({\eta}_{i})\right\}$(11) - 3.Update the Multinomial parameter
*β*:${\beta}_{js}\propto {\displaystyle \sum _{d=1}^{D}{\displaystyle \sum _{n=1}^{{N}_{d}}{\displaystyle \sum _{m=1}^{{M}_{d}}{\lambda}_{dmn}{w}_{dm}^{s}{\varphi}_{dnj}}}}$(12)

These update equations correspond to find the maximum likelihood estimation with the expected sufficient statistics for each document taken under the variational posterior.

*EM*procedure to find the optimal parameters as follows:

- 1.
(

*E-Step*) For each document in the training corpus, optimizing the variational parameters (*γ*,*ϕ*,*λ*,*σ*) according to equations (3) – (6); - 2.
(

*M-Step*) Maximizing the resulting lower bound on the variational likelihood on the whole corpus with respect to the model parameters (*α*,*β*,*η*) according to equations (8) – (12).

The *E-Step* and *M-Step* are repeated until the bound on the likelihood converges (relative change in likelihood is less than 0.001%). The convergency rate of the process depends on the size of parameters in the model, (e.g. number of words, methods and topics). In our experiments (3000 words, 115 methods and up to 500 topics), the algorithm terminates in less than 30 iterations in all the cases.

## Results and discussion

We collect 5319 full-text documents from *PubMed* [23] with method annotations from another two public curated interaction databases: *MINT* and *IntAct*. We perform the following pre-processions on the data set: 1) parsing the HTML file; 2) converting the words into lower cases; 3) removing a standard list of 400 stop words, punctuations, and the terms occur less than 50 times; 4) stemming the words to its root by *Porter Stemming* [24]. We utilize the *macro-precision*, *macro-recall* and *macro-Fscore* [25] to evaluate the performance in average.

### Test corpora

*pull down*" (MI:0096) occurs 2040 times while there are 57 methods (49.6% of all) occurs less than 10 times. Figure 3 demonstrates the unbalanced method distribution on the whole corpus.

We can discover from Figure 3: 1) the 5 dominate detection methods, i.e. *pull down* (MI:0096), *2 hybrid* (MI:0018), *coip* (MI:0019), *anti tag coip* (MI:0007) and *anti bait coip* (MI:0006), take up nearly 59.3% occurrences in the whole corpus; 2) 86.1% (99 out of 115) methods occur in less than 10% documents. In this case, smoothing the estimated parameters is essential to achieve better performance.

### Feature selection

The *CMW* model is proposed to capture the correlation between methods and the "related" words. However, no curations explicitly annotate which words or sentences are related to the curated methods. So we employ *χ*^{2} statistic [26] to select the most relevant feature words from the whole text.

*t*'s

*χ*

^{2}value associating with the method

*e*is calculated according to the following equation:

where *A* is the number of times *t* co-occurs with *e*, *B* is the number of times *t* occurs without *e*, *C* is the number of times *e* occurs without *t*, *D* is the number of times neither *e* or *t* occurs, and *N* is the total number of documents.

*χ*

^{2}statistic, we approximate the dependence between word

*t*and method

*e*, so that we can preserve the words, which are the most relevant to the method descriptions, by the following formulation:

where *p*(*e*_{ i }) is the prior probability of method *e*_{ i }.

In the following experiments, we select the top 3000 terms to build up the feature set according to Eq(13).

### Effect of topic factors

*CMW*model. The perplexity on a set of testing documents is calculated as follows:

where *D* is the set of testing documents and *N*_{ d }is the number of methods in the document *d*.

Better generalization capability is indicated by a lower perplexity over the held-out testing samples. We held out 20% of collection for the testing purpose and used the remaining 80% to train the model, in accordance with 5-fold cross-validation.

*CMW*model gets improved with more topic factors. Since with more topic factors the documents could be partitioned into finer segments, more precise correlations between the methods and words could be captured. But as the number of topics exceeds a limit, the model becomes too specific (higher perplexity). Therefore we could conclude that the topic factors could be treated as the discriminate granularity of the model, that is it operates as a tradeoff between the generality and specificity. Besides, as the number of topic factors increase, there will be more parameters to be estimated (linearly increase with the number of topics), so that more training data is needed to obtain the reliable parameters. In this sense, when the number of topic factors exceeds a limit, the quality of the estimated parameters decreases and hampers the prediction power.

Besides understanding the impact of the number of topic factors on the generalization capability, we would be more interested in their explicit effect on the extraction performance. Here, we evaluate the precision and recall performance of the model under different number of topic factors. We use the same data set partition as in Figure 4.

*CMW*model.

### Extraction performance

Since there is few work to compare with, we employ the well studied *Naïve Bayes*, *KNN* and *SVM* as the baseline methods to evaluate the capability of the proposed *CMW* model. We choose *Naïve Bayes* because it is the simplest generative model with complete independence assumptions, and *KNN* model could exploit the heterogeneity among the similar documents. These are the two basic notions in the *CMW* model. Besides, *SVM* model is the most powerful discriminative model for the classification task with decent performance [11]. All the baseline models are operating on the same feature set as the *CMW* model employs.

*Naïve Bayes*model, we estimate the posterior probability of the methods in a given document by Eq(14). We use a pre-estimated threshold to retrieval the most probable methods.

In the *KNN* model, we make the prediction by ranking the candidate methods in the union of the unlabeled sample's *k*-nearest labeled neighbors, and weight the candidate methods by the similarity between the desired unlabeled sample and its neighbors.

In the *SVM* model, we follow Boutell's strategy [12] to train a set of binary classifiers for each method and predict the unknown methods by the classifiers' output. We use *SV M*^{ light }[27] toolkit to implement a linear kernel *SVM* model with the default parameters.

We perform comparisons on different proportions of the data used for training. In this comparison, we set the size of topics in the *CMW* model to be 250 and *k* in the *KNN* model to be 37.

*CMW*model improves rapidly. The reason for this phenomenon is that in the

*CMW*model, there are

*E*+

*k*(

*V*+ 1) parameters to be estimated, when the training set is not large enough, most of the parameters cannot be fully estimated, which would directly hinder the performance of the model.

One thing we should note is that, since the data set is unbalanced, we should attend the retrieval performance on the minor methods as well. In the method-level evaluation, the baseline models only retrieve most of the major methods (e.g. the top 5 methods) but ignoring the other minor ones, while the *CMW* model exhibits superior retrieve power. We demonstrate the coverage performance of each model on the testing set to compare their retrieval capability.

*CMW*model possesses better retrieval capability than all the baseline methods when the training set is large enough. We contribute the nice coverage performance to the smoothing factor introduced to the method distribution. Because the whole corpus is sparse and unbalanced, the minor methods possess little proportion in the training set. However, the baseline models do not take the sparseness into account, so that they fail to retrieve the minor ones from the testing cases. In contrast, the

*CMW*model attends the smoothing issue and overcomes the sparseness.

*BioCreative II*challenge evaluation [13]. To compare with their approach, we operate the

*CMW*model on the same testing corpus (300 full text documents) and set the topic size to be 300 according to the result in the previous section. The

*CMW*model achieved competitive results (

*F-Score*improved 12.4%), illustrated in Table 2.

Comparison with BioCreative II best result.

| | | |
---|---|---|---|

| 0.506 | 0.522 | 0.483 |

| | | |

| | | |

Here, we briefly conclude the performance of the *CMW* model. The extraction performance outperforms the discriminative baseline methods confirms that the dependence assumptions in the proposed *CMW* model are reasonable. Besides, the traditional discriminative classifiers fail to model the correlation within either the methods or the related words, while in the biological domain such correlations convey important domain dependent information. In this sense, the major advantage of the *CMW* model is that it properly exploits such informative correlations to reinforce the extraction performance. The improvements against the manually revised templates approach validate that the *CMW* model does exploit more precise and general patterns for the desired methods from the large-scale statistics, confirming the reasonable underlying semantic structure from another perspective.

### Correlation between methods and words

*CMW*model, we utilize the method-specific distribution over words by the conditional distribution

*p*(

*w*|

*e*) to retrieval the most relevant terms under each desired method:

where *D* is the set of documents associating with the desired method *e* and *M*_{ d }is the number of words in the document *d*.

*structure"*, "

*crystal"*, "

*helix"*are gathered to

*x-ray*, and "

*yeast"*, "

*two-hybrid"*, "

*site"*are gathered to

*two hybrid*. These are the informative terms in the MI ontology definition of these methods. From this result, we could discover that the

*CMW*model properly selects suitable "indicators" for the given methods. From another perspective, since these "indicators" are organized in a probability framework and accordingly contribute to the desired methods, the

*CMW*model could better overcome the issue caused by the diversity in the method mentions. The reasonable word distribution under methods confirms that the

*CMW*model captures the in-depth correlation between the methods and related words from the literature.

Top 20 relevant terms for methods.

| |
---|---|

(MI:0114) | structure, crystal, residue, molecule, model, site, form, interface, chain, contact, bond, hydrogen, helix, pp, record, helical, window, surface, linker, segment |

(MI:0018) | yeast, two-hybrid, interact, assay, fusion, system, plasmid, clone, cdna, screen, bait, sequence, acid, amino, encode, site, pp, record, domain, plant |

(MI:0096) | gst, fusion, glutathione, pull-down, assay, interact, bead, buffer, wash, yeast, scopus, min, incubate, two-hybrid, antibody, pp, record, system, plasmid, sequence |

(MI:0007) | record, pp, cite, yeast, antibody, strain, panel, anti-flag, saccharomyces, flag, cerevisia, growth, blot, western, flag-tagg, gene, grow, medline, ha, anti-ha |

(MI:0006) | control, buffer, pp, record, isi, bait, cancer, antibody, extract, c-terminus, bead, sirna, tumor, stain, gene, yeast, sds, luciferase, embo, cdna |

(MI:0019) | antibody, pp, record, extract, yeast, domain, sequence, expression, blot, cdna, clone, activity, luciferase, growth, transfect, acid, fusion, sirna, mmedta, link |

### Methods correlation analysis

By the *CMW* model, we map different methods into the latent topic space, where we are able to analyze the relationship between the different methods. Meanwhile, there are intrinsic inherit relationships between the methods, defined in the MI ontology and organized as a concept hierarchy.

*ϵ*by column as follows:

where *ϵ* _{ i }is the *i* th column of *ϵ* matrix.

Recall that, each row of the multinomial parameter *ϵ* is the method distribution under a particular topic, so that each column of *ϵ* represents a method in the topic space. By normalizing *ϵ* by column, we can represent the different methods over the latent topic factors.

Based on this representation, we employ an accumulative clustering algorithm to perform the hierarchical clustering and utilize a visualization tool *gCluto* [28] to demonstrate the captured "*pedigree*" tree. (We only illustrate part of the clustering result because of the page limit.)

*CMW*model captures the proper correlations not only between the methods and words but also among the different methods. The traditional discriminative classifiers are not able to figure out such relationships.

### Classify irrelevant documents

Although the *CMW* model is proposed to address the extraction problem in documents with at least one detection method, in most situation, the curators don't know whether the document is *PPI* related or experimentally confirmed beforehand. So it is necessary to evaluate the model's capability to classify the irrelevant documents.

*PubMed*, none of which are annotated by

*MINT*nor

*IntAct*. These documents are taken as the irrelevant documents. Meanwhile, we randomly select another 1000 documents from the evaluation corpus as relevant documents. In Eq (17), we define the relevance score of each document by the posterior probability of the most potential method in that document as follows:

This measurement indicates the maximum probability of a document containing at least one interaction detection method.

*CMW*model possesses the capability to reject the irrelevant documents before extracting.

## Conclusion

In this paper, we propose a generative probabilistic model, the Correlated Method-Word model, to automatically extract the interaction detection methods from the biological literature. This problem is not well studied by the previous researches. By introducing the latent topic factors, the proposed model formulates the correlation between the detection methods and related words in a probabilistic framework in order to infer the potential methods from the observed words.

In our experiments, the proposed *CMW* model achieved competitive performance against the other well-studied discriminative classifiers on a corpus of 5319 full text documents. And it outperforms the best result reported in the *BioCreative II* challenge evaluation (*F-Score* improved 12.4%). From the promising results, we could see that the proposed *CMW* model overcomes the diversity in the method descriptions and appropriately solve the detection method extraction issue. Furthermore, the model captures the in-depth relationship not only between the methods and related words (see the Correlation between methods and words section), but also among the different methods (see the Methods correlation analysis section). Most of the discriminative classifiers fail to exploit such relations. The competitive performance confirms that the dependence assumptions in the model are reasonable and it is necessary to model the correlation between the different methods and words in the detection method extraction issue.

Our contributions in this paper lie in: 1) propose a generative probabilistic model with proper underlying semantics for the detection method extraction issue, and the model achieves promising performance; 2) properly model the correlation between the detection methods and related words in the biological literature, which captures the in-depth relationship not only between the methods and related words but also among the different methods.

The *CMW* model is now integrating to our *ONBIRES* system [5] to provide on-line service. And in the future work, we are planning to associate the extracted methods with the annotated interaction pairs and retrieve the evidence sentences in the documents, which would provide a more throughout annotation of the protein interactions in the biological literature.

## Notes

### Acknowledgements

This work was supported by the Chinese Natural Science Foundation under grant No. 60572084, National High Technology Research and Development Program of China (863 Program) under No. 2006AA02Z321, as well as Tsinghua Basic Research Foundation under grant No. 052220205 and No. 053220002.

This article has been published as part of *BMC Bioinformatics* Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1

### References

- 1.Daraselia N, Yuryev A, Egorov S, Novichkova S, Nikitin A, Mazo I: Extracting human protein interactions from MEDLINE using a full-sentence parser.
*Bioinformatics*2004, 20(5):604–611.CrossRefPubMedGoogle Scholar - 2.Ono T, Hishigaki H, Tanigami A, Takagi T: Automated extraction of information on protein protein interactions from the biological literature.
*IEEE Intelligent Systems*2001, 17(2):155–161.Google Scholar - 3.Huang M, Zhu X, Hao Y, Payan DG, Qu K, Li M: Discovering patterns to extract protein protein interactions from full texts.
*Bioinformatics*2004, 20(18):3604–3612.CrossRefPubMedGoogle Scholar - 4.Blaschke C, Valencia A: The Frame-Based Module of the SUISEKI Information Extraction System.
*Bioinformatics*2002, 17(2):14–20.Google Scholar - 5.Huang M, Zhu X, Ding S, Yu H, Li M: ONBIRES: ONtology-based BIological Relation Extraction System.
*Proceedings of the Fourth Asia Pacific Bioinformatics Conference*2006, 327–336.Google Scholar - 6.Molecular INTeraction database Home[http://mint.bio.uniroma2.it/mint/Welcome.do]
- 7.
- 8.Chatr-aryamontri A, Ceol A, Licata L, Cesareni G: Annotating molecular interactions in the MINT database.
*Proceedings of the second biocreative challenge evaluation workshop*2007, 55–59.Google Scholar - 9.Krallinger M, Leitner F, Valencia A: Assessment of the Second BioCreative PPI task: Automatic Extraction of Protein-Protein Interactions.
*Proceedings of the second biocreative challenge evaluation workshop*2007, 41–54.Google Scholar - 10.S O, L MP, H H, R A: The use of common ontologies and controlled vocabularies to enable data exchange and deposition for complex proteomic experiments.
*Pacific Symposium on Biocomputing*2005, 186–196.Google Scholar - 11.Joachims T: Text categorization with support vector machines: learning with many relevant features.
*European Conference on Machine Learning*1998, 137–142.Google Scholar - 12.Matthew R, Boutell XS, Luo Jiebo, M Brown C: Learning multi-label scene classification.
*Pattern Recognition*2004, 37(9):1757–1771.CrossRefGoogle Scholar - 13.Rinaldi F, Kappeler T, Kaijurand K: OntoGene In Biocreative II.
*Proceedings of the second biocreative challenge evaluation workshop*2007, 193–198.Google Scholar - 14.Blei DM, Ng AY, Jordan MI: Latent Dirichlet Allocation.
*The Journal of Machine Learning Research*2003, 3(2–3):993–1022.Google Scholar - 15.Steyvers M, Griffiths T: Probabilistic Topic Models. In
*Handbook of Latent Semantic Analysis*Edited by: Landauer TK, McNamara DS, Dennis S, Kintsch W, Routledge. 2007, 424–440.Google Scholar - 16.Wei X, Croft W: LDA-based document models for ad-hoc retrieval.
*Proceedings of the 29th annual international ACM SIGIR*2006, 178–185.Google Scholar - 17.Mimno D, McCallum A: Expertise modeling for matching papers with reviewers.
*Proceedings of the 13th ACM SIGKDD*2007, 500–509.Google Scholar - 18.Li FF, Perona P: A Bayesian Hierarchical Model for Learning Natural Scene Categories.
*2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition*2005, 524–531.Google Scholar - 19.Buntine WL: Operations for Learning with Graphical Models.
*Journal of Artificial Intelligence Research*1994, 2: 159–225.Google Scholar - 20.Minka TP: Expectation Propagation for approximate Bayesian inference.
*Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence*2001, 362–369.Google Scholar - 21.Attias H: A variational Bayesian framework for graphical models.
*Advances in Neural Information Processing Systems*2000, 209–215.Google Scholar - 22.Andrieu C, de Freitas N, Doucet A, Jordan MI: An Introduction to MCMC for Machine Learning.
*Machine Learning*2003, 50(1–2):5–43.CrossRefGoogle Scholar - 23.
- 24.Porter M: An algorithm for suffix stripping.
*Program*1980, 14(3):130–137.CrossRefGoogle Scholar - 25.Chai KMA, Chieu HL, Ng HT: Bayesian online classifiers for text classification and filtering.
*SIGIR '02: Proceedings of the 26th annual international ACM SIGIR*2002, 97–104.CrossRefGoogle Scholar - 26.Yang Y, OPedersen J: A Comparative Study on Feature Selection in Text Categorization.
*Proceedings of the Fourteenth International Conference on Machine Learning*1997, 412–420.Google Scholar - 27.Thorsten J:
*Learning to Classify Text Using Support Vector Machines*. Heidelberg, Germany: Springer; 2002.Google Scholar - 28.Matt Rasmussen, gCluto Home[http://www-users.cs.umn.edu/mrasmus/gcluto/index.shtml]

## Copyright information

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.