The comparative analysis of statistics, based on the likelihood ratio criterion, in the automated annotation problem

Leontovich, Andrey M; Tokmachev, Konstantin Y; van Houwelingen, Hans C

doi:10.1186/1471-2105-9-31


Methodology article
Open access
Published: 22 January 2008

Volume 9, article number 31, (2008)

BMC Bioinformatics




Abstract

Background

This paper discusses the problem of automated annotation. It is a continuation of the previous work on the A⁴-algorithm (Adaptive algorithm of automated annotation) developed by Leontovich and others.

Results

A number of new statistics for the automated annotation of biological sequences are introduced. All these statistics are based on the likelihood ratio criterion.

Conclusion

Some of the statistics yield a prediction quality that is significantly higher (up to 1.5 times higher) in comparison with the results obtained with the A⁴-procedure.


Background

Many biological databanks, both those dealing with protein sequences (e.g., SWISS-PROT) and those dealing with nucleotide sequences (e.g., GenBank), contain not only the primary structures of sequences (i.e., sequences of letters: amino acids or nucleotides), but also information about the functions and properties of these sequences. This information is stored in the so-called description fields of the sequences. There are different types of description fields: KW (KeyWords), DE (Descriptions), ..., FT (Feature Table), ...; elements of description fields are referred to as words. Words from KW, DE, ... fields describe a sequence as a whole, while words from FT fields correspond to certain positions (letters) of a sequence.

The automated annotation problem can be described as follows. Consider a biological sequence (referred to as a query sequence) with known primary structure (i.e. letter sequence) but unknown properties and functions (i.e., description fields). The task is to determine functions and properties of this sequence (in other words, to restore its description fields) on the basis of the primary structure. The annotation should be fully automated. This is the subject of the current paper.

There are two main approaches to the solution of this problem. The first approach (it can be called a static one) uses a fixed protein classification, specified beforehand, that groups proteins according to similarity in structure and/or function: for a query protein, a related group (superfamily) is sought on the basis of primary structures, and the properties/functions of this group are extended to the query protein. An example of this approach is described in the paper by W. Fleischmann et al. [1]; it uses the protein classification (more than 1000 families) stored in the Prosite databank.

The second approach (it can be called a dynamic or an adaptive one) does not use protein classification. Instead, a "dynamic" collection of bank sequences that are similar to a query sequence is generated, and then common properties/functions of these bank sequences are extended to the query sequence. One of the first examples of this approach was described by M.A. Andrade et al. [2]. In that paper the prediction was based on a so-called word reliability function, a function depending on the degree of similarity between a query sequence and the corresponding bank sequences. In other annotation procedures based on the dynamic approach, prediction was performed either in a "naive" way (all properties/functions of similar proteins were extended to a query protein) or using stochastic methods (only the properties/functions that are most frequent in the collection of similar proteins were extended); see [3, 4].

The current paper uses the dynamic approach. This paper is a sequel to paper [5] that describes the A⁴ algorithm (the Adaptive Algorithm of Automated Annotation), so results of the paper [5] are constantly used here. The A⁴ algorithm is based on a stochastic approach. More precisely, it is based on the notion of transfer probabilities. Transfer probabilities are the probabilities of word transfer (extension) from description fields of one sequence to description fields of another sequence; they depend on the measure of similarity between sequences. Transfer probabilities are evaluated on the basis of word transfer frequencies in the found collection of sequences similar to a query sequence. For each word from description fields of sequences included in the collection the prediction of the fact that this word belongs (or does not belong) to the description field of a query sequence is performed; this prediction is based on transfer probabilities.

In the current paper we introduce and analyze a number of new statistics for the prediction. All of them are based on the likelihood ratio criterion [6] (which is the most powerful criterion). As in [5], all these statistics are evaluated using transfer probabilities. Two approaches to statistics definition are introduced: a "discrete" approach and a "continuous" approach. A detailed analysis and comparison of introduced statistics are performed and the best statistics are selected.

The emphasis is on a precise description of the way these statistics can be constructed using well-known concepts from statistical decision theory.

The current A⁴ algorithm uses SPKW as a language of annotation. Of course, it is possible to use GO terms in A⁴ as well. That would facilitate comparison with other approaches. However, at the current stage of our research we test and choose "the best decision making algorithm", not "the best annotation terminology". The choice of annotation vocabulary hardly matters for the problem of finding the optimal "decision making algorithm".

Results

Algorithm description

Generation of a collection of similar sequences

First, we introduce some notation. For the sake of brevity we write "a word ω belongs to a sequence π" instead of "a word ω belongs to description fields of a sequence π". If ω is a KW-type word (to be definite, further in the paper we consider only amino-acid sequences and KW-type words), we write ω ∈ KW[π].

The application of an annotation procedure to an unannotated amino-acid query sequence (i.e., the prediction of description fields) starts with generating a collection of sequences similar to this query sequence with known description fields. These similar sequences are selected from a certain databank that contains annotated amino-acid sequences, e.g., SWISS-PROT.

There exist different approaches to the generation of a collection of similar sequences (see [7]). These collections can be generated on the basis of global alignments between a query sequence and bank sequences (global alignments can be constructed, e.g., by the CLUSTAL procedure) using an identity percentage or, more generally, a similarity percentage as a similarity measure. Another variant is to use local alignments (i.e., alignments of the most similar fragments of the compared sequences, see [7]) instead of global alignments. Local alignments can be constructed, for example, by the well-known BLAST procedure [7, 8], in which the sum of weights or the corresponding e-value serves as a measure of similarity between fragments (and thus between the compared sequences). Other alignment procedures are also acceptable; the choice of alignment procedure does not play a critical role.

Since we build on the A⁴ procedure, we briefly summarize that approach. In the A⁴ procedure a collection of similar sequences is generated on the basis of local alignments of a special type. These alignments are constructed by the DotHelix procedure ([9]), in which the "power" (sum of weights divided by the root of the length of the local alignment, see [9, 10] for details) serves as a similarity measure. Each local alignment constructed by DotHelix procedure is a chain of closely located gapless local alignments. Local alignments that are generated during the construction of a collection of similar sequences are referred to as primary local alignments.

Each sequence from a collection of similar sequences π₁,...,π_n can have several corresponding primary local alignments, but for the sake of simplicity we assume that each similar sequence π_i has exactly one corresponding primary local alignment, the one with the maximum similarity measure. Let μ_i denote the similarity measure (power) of the primary alignment that corresponds to π_i. The value of μ_i characterizes the measure of similarity between the fragments that constitute this alignment; at the same time μ_i can be treated as a measure of similarity between the whole query sequence π₀ and the whole similar sequence π_i. We assume that similar sequences are ordered in such a way that μ₁ ≥ μ₂ ≥...≥ μ_n.

The exact stochastic formulation of the problem

Let π₀ be an unannotated amino-acid query sequence, π₁,...,π_n be a collection of sequences similar to π₀, and ω be a word that belongs to some similar sequences (i.e., ω ∈ KW[π_i] for at least one similar sequence π_i). The task is to predict whether this word ω belongs to the query sequence π₀ or not.

Let us put ξ_i = 1 if ω ∈ KW[π_i], and ξ_i = 0 if ω ∉ KW[π_i] (i = 1,...,n).

We also put ξ₀ = 1 if ω ∈ KW[π₀], and ξ₀ = 0 if ω ∉ KW[π₀].

The variables ξ_i can be treated as random variates. Actually, they depend on ω; for the sake of brevity we write ξ_i instead of ξ_i(ω).

In this notation the problem can be stated as follows. Measures of similarity μ_i between the query sequence π₀ and similar sequences π_i and values of random variates ξ_i, i = 1,...,n, are given. The task is to determine whether the word ω belongs to the query sequence π₀ or not. In other words, two hypotheses are considered, H₁ : ξ₀ = 1 (i.e., ω ∈ KW[π₀]), and H₀ : ξ₀ = 0 (i.e., ω ∉ KW[π₀]), and the task is to construct a procedure that allows one to decide which hypothesis is true on the basis of ξ = (ξ₁,...,ξ_n). As announced in the introduction, we base our procedures on the likelihood ratio. Let us recall Bayes' theorem, which can be written as

\frac{P {H_{1} | ξ}}{P {H_{0} | ξ}} = \frac{P {ξ_{0} = 1 | ξ}}{P {ξ_{0} = 0 | ξ}} = \frac{P {ξ_{0} = 1, ξ} / P {ξ}}{P {ξ_{0} = 0, ξ} / P {ξ}} = \frac{P {ξ | ξ_{0} = 1}}{P {ξ | ξ_{0} = 0}} \cdot \frac{P {ξ_{0} = 1}}{P {ξ_{0} = 0}} .

Here the left part is the posterior odds, that is, the ratio of the a posteriori probabilities of hypotheses H₁ and H₀ (a posteriori means that the values ξ₁,...,ξ_n are known). It is equal to the product of the likelihood ratio $\frac{P {ξ | ξ_{0} = 1}}{P {ξ | ξ_{0} = 0}}$ (i.e., the ratio of the probabilities that the set of values ξ = ξ(ω) is realized for the word ω given the conditions ξ₀ = 1 and ξ₀ = 0 respectively) and the prior odds, that is, the ratio of the a priori probabilities of hypotheses H₁ and H₀.

Statistical decision theory tells us that the optimal prediction procedure should be based on the statistic $\frac{P {H_{1} | ξ}}{P {H_{0} | ξ}}$ or, equivalently, on the likelihood ratio. For any statistic a threshold value should be specified for the procedure: if the value of the statistic is greater than the threshold, hypothesis H₁ is accepted, otherwise hypothesis H₀ is accepted. The threshold value should be selected in such a way that the total number of incorrect predictions (i.e., the sum of the number of type 1 errors and the number of type 2 errors) is minimal. If the prior odds are equal to 1, then a threshold value of 1 should be selected for the likelihood ratio; the total number of errors is then (theoretically) minimal. The assumption that the prior odds equal 1 is not natural: the number of considered words that do not belong to a query sequence is much greater (approximately 8 times greater) than the number of considered words that belong to a query sequence. However, the statistics obtained under the assumption that the prior odds equal 1 and under the assumption that the prior odds differ from 1 but are constant (i.e., do not depend on the word ω) are equivalent. Essentially these are the same statistics; only the threshold value changes: the value $\frac{P {ξ_{0} = 0}}{P {ξ_{0} = 1}}$ should be taken instead of 1 (as noted, this value is approximately 8 in our data). Therefore, we assume from now on that the prior odds equal 1. Thus all considered statistics are based on the likelihood ratio

\frac{P {ξ | ξ_{0} = 1}}{P {ξ | ξ_{0} = 0}}

(1)
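The equivalence between working with prior odds of 1 and working with constant prior odds can be checked numerically. The following is a minimal sketch; the function names and the illustrative prior odds of 1/8 (taken from the "approximately 8 times" remark above) are assumptions made for this example only.

```python
def accept_h1(likelihood_ratio: float, prior_odds: float = 1.0) -> bool:
    """Accept H1 when the posterior odds (likelihood ratio times prior odds) exceed 1."""
    return likelihood_ratio * prior_odds > 1.0


def lr_threshold(prior_odds: float) -> float:
    """Threshold on the likelihood ratio that is equivalent to posterior odds > 1."""
    return 1.0 / prior_odds


# Illustrative value: absent words are about eight times more frequent than present ones.
prior = 1.0 / 8.0
threshold = lr_threshold(prior)  # the likelihood ratio must exceed this value
```

With constant prior odds the statistic itself is unchanged; only the cut-off moves from 1 to P{ξ₀ = 0}/P{ξ₀ = 1}.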

Assumption of independence of variables ξ_i. Transfer probabilities

By virtue of equation (1), we need to estimate the conditional probabilities P{ξ|ξ₀ = ε}, where ε = 1 or 0, in order to calculate the likelihood ratio. Our derivation of these estimates uses the assumption that the variables ξ_i, i = 1,...,n, are mutually independent. This assumption is certainly false: in reality the variables ξ_i are dependent, and the dependence is considerably strong. Nevertheless, in our definition of the likelihood ratio statistic (and the statistic that is the logarithm of the likelihood ratio) we use the independence assumption. Since the variables ξ_i are not independent, one cannot assert that the obtained statistics are the most powerful, but these statistics can still be quite good. In decision theory, this approach is known as the naive Bayes procedure.

The independence of the variables ξ_i implies the equality

\begin{matrix} P {ξ | ξ_{0} = ε} = \prod_{i = 1}^{n} P {ξ_{i} | ξ_{0} = ε}, & ε = 1, 0 \end{matrix}

(2)

Each variable ξ_i has exactly two possible values: 1 and 0. Thus, everything is reduced to the following four conditional probabilities:

P{ξ_i = 1|ξ₀ = 1}, P{ξ_i = 0|ξ₀ = 1}, P{ξ_i = 1|ξ₀ = 0}, P{ξ_i = 0|ξ₀ = 0}.

(3)

In addition, it is clear that

\begin{array}{l} P {ξ_{i} = 1 | ξ_{0} = 1} + P {ξ_{i} = 0 | ξ_{0} = 1} = 1, \\ P {ξ_{i} = 1 | ξ_{0} = 0} + P {ξ_{i} = 0 | ξ_{0} = 0} = 1, \end{array}

(4)

so actually everything is reduced to two conditional probabilities P{ξ_i = 1|ξ₀ = 1}, P{ξ_i = 1|ξ₀ = 0}.

Conditional probabilities (3) are a special case of conditional probabilities of the type P{ξ_i = ε₁|ξ_j = ε₂}, where ε₁, ε₂ = 1 or 0, and i, j = 0,1,...,n (the special case above corresponds to j = 0). We call all these conditional probabilities transfer probabilities and denote them by

P {ξ_{i} = ε_{1} | ξ_{j} = ε_{2}} = p_{ε_{1} | ε_{2}} .

Transfer probabilities depend on i, j (and certainly on the word ω):

p_{ε_{1} | ε_{2}} = p_{ε_{1} | ε_{2}} (i, j; ω) .

Conditional probabilities play a central role in our procedure. According to equation (4), it suffices to explain how transfer probabilities p_1|1, p_1|0 are evaluated.

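We suppose that transfer probabilities satisfy the following assumptions ("axioms"), whose meaning is intuitive.

Assumption 1) Transfer probabilities depend on the indices i, j only through the measure of similarity between the corresponding sequences:

p_{ε_{1} | ε_{2}} (i, j; ω) = p_{ε_{1} | ε_{2}} (μ_{i j}; ω),

(5)

where μ_ij is the measure of similarity between sequences π_i, π_j (i, j = 0,1,...,n).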

In particular, if one of these sequences is the query sequence π₀ and μ_j is the measure of similarity between π_j and π₀, we have P{ξ_j = 1|ξ₀ = 1} = p_1|1(μ_j), P{ξ_j = 1|ξ₀ = 0} = p_1|0(μ_j).

Assumption 2) Transfer probabilities (for an arbitrary fixed word ω) depend on the similarity measure μ monotonically: the probability p_1|1(μ) increases (does not decrease) and the probability p_1|0(μ) decreases (does not increase) as μ increases.

Assumption 3) The inequality

p_1|1(μ) > p_1|0(μ)

(6)

always holds (if μ > 0).

Transfer probabilities are evaluated on the basis of the measure of similarity between similar sequences for which it is known whether ω ∈ KW[π_i], using the so-called isotonic regression procedure (see [11]). (In [5] this procedure was referred to as the monotonization procedure.) The results of this procedure are piecewise-constant monotonic functions of the similarity measure μ that increase (do not decrease) for the probabilities p_1|1(μ) and decrease (do not increase) for the probabilities p_1|0(μ).

We briefly recall the isotonic regression problem. Suppose we have two sets of numbers (i.e., a set of points in the plane) x_i, y_i, i = 1,...,n, for which x₁ ≤ x₂ ≤...≤ x_n. The task is to find values z₁, z₂,...,z_n, z₁ ≤ z₂ ≤...≤ z_n, that minimize the deviation $\sum_{i = 1}^{n} {(z_{i} - y_{i})}^{2}$ .

This is the monotone-increasing isotonic regression problem. The monotone-decreasing isotonic regression problem is similar; the only difference is that here z₁ ≥ z₂ ≥...≥ z_n.

The isotonic regression procedure constructs a monotonic number sequence z₁,...,z_n (whereas in linear regression the values of z_i are linearly expressed in terms of x_i: z_i = αx_i + β).

An algorithm for the solution of the isotonic regression problem can be easily constructed. We do not describe it here. We only note that each z_i is the mean value of {y_j} over a window of variable length L_i: $z_{i} = (\sum_{j = t (i)}^{t (i) + L_{i} - 1} y_{j}) / L_{i}$ . For different indices these windows either coincide or do not overlap, and the i-th window contains i.

We also note that the values of x_i are not essential in the isotonic regression problem; only the order in which the values y_i are arranged is essential.
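One standard way to solve this problem is the pool-adjacent-violators algorithm. The sketch below is an illustration under that assumption and is not taken from the paper or from [5]; the function names are hypothetical. Each fitted value is the mean of a block (window) of consecutive y values, in line with the remark above.

```python
def isotonic_increasing(y):
    """Least-squares non-decreasing fit to y (pool adjacent violators)."""
    blocks = []  # each block is [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge the last two blocks while their means violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)  # block mean repeated over the window
    return fitted


def isotonic_decreasing(y):
    """Least-squares non-increasing fit, obtained by negating the values."""
    return [-z for z in isotonic_increasing([-v for v in y])]


# Values y_i listed in ascending order of x_i (here 0/1 indicators).
fit = isotonic_increasing([1, 0, 1, 1, 0, 1])
```

For 0/1 input the fitted values are block frequencies, which is what allows them to be read as probability estimates.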

To obtain the transfer probabilities using isotonic regression we proceed as follows.

For the evaluation of p_1|1 we consider pairs of similar sequences π_i, π_j that satisfy the condition ω ∈ KW[π_i]. Let μ_ij denote the measure of similarity between sequences π_i, π_j. We put ξ_ij = 1 if the word ω belongs to the sequence π_j, and put ξ_ij = 0 if ω does not belong to π_j (recall that the word ω belongs to π_i). Then we apply the monotone-increasing isotonic regression procedure to the collection of points (μ_ij, ξ_ij), where the μ_ij are in ascending order. The resulting values are the estimates of the transfer probabilities p_1|1(μ_ij).

Similarly, applying the monotone-decreasing isotonic regression procedure to the collection of points (μ_ij, ξ_ij) that correspond to pairs of similar sequences π_i, π_j for which ω does not belong to π_i, we obtain values of the transfer probabilities p_1|0(μ_ij). As we noted, the probabilities p_1|1, p_1|0 are supposed to satisfy condition (6). Therefore, if it turns out that the functions p_1|1(μ), p_1|0(μ) obtained after the application of the isotonic regression procedure do not satisfy this condition for some values of μ, then we consider that the probabilities p_1|1, p_1|0 are not defined for these values of μ. Hence, it is possible that transfer probabilities are defined only for sufficiently large values of μ, but not for all μ. In particular, it is possible that for some words there are no values of μ such that inequality (6) holds. These words are referred to as degenerate words.
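The construction of the input points can be sketched as follows. This is only an illustration: the function name, the use of ordered pairs (i, j) with i ≠ j, and the toy similarity matrix are assumptions, and the order among points with equal similarity is left arbitrary.

```python
def transfer_pairs(mu, xi, source_value):
    """Collect points (mu_ij, xi_ij) for ordered pairs (i, j), i != j,
    whose first sequence has indicator xi[i] == source_value."""
    n = len(xi)
    points = [
        (mu[i][j], xi[j])
        for i in range(n)
        for j in range(n)
        if i != j and xi[i] == source_value
    ]
    # Ascending similarity; ties keep the order in which they were collected.
    return sorted(points, key=lambda point: point[0])


# Hypothetical pairwise similarity matrix for three similar sequences and one word.
mu = [[0.0, 5.0, 2.0],
      [5.0, 0.0, 3.0],
      [2.0, 3.0, 0.0]]
xi = [1, 1, 0]  # the word belongs to the first two sequences only
points_11 = transfer_pairs(mu, xi, 1)  # input for the increasing fit (p_1|1)
points_10 = transfer_pairs(mu, xi, 0)  # input for the decreasing fit (p_1|0)
```

The second coordinates of each list, taken in this order, are the values to which the isotonic regression is applied.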
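The check of inequality (6) and the resulting notion of a degenerate word can be sketched as below. The sketch assumes, for illustration only, that both estimated step functions have already been evaluated on a common grid of similarity values; the names and numbers are hypothetical.

```python
def defined_similarities(mu_grid, p11, p10):
    """Similarity values at which p_1|1(mu) > p_1|0(mu), i.e. where the
    transfer probabilities are considered defined."""
    return [m for m, a, b in zip(mu_grid, p11, p10) if a > b]


def is_degenerate(mu_grid, p11, p10):
    """A word is degenerate if the inequality holds for no similarity value."""
    return len(defined_similarities(mu_grid, p11, p10)) == 0


# Hypothetical estimates on a common grid of similarity values.
mu_grid = [1.0, 2.0, 3.0, 4.0]
p11 = [0.2, 0.2, 0.6, 0.9]  # non-decreasing in mu
p10 = [0.5, 0.3, 0.3, 0.1]  # non-increasing in mu
usable = defined_similarities(mu_grid, p11, p10)
```

Because p_1|1 does not decrease and p_1|0 does not increase, once the inequality holds at some μ it holds for all larger μ, which is why the defined region consists of sufficiently large similarity values.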

Statistics description

All statistics considered in this paper are based on the likelihood ratio criterion, that is, on formula (1) under the assumption of independence of the variables ξ_i (formula (2)).

Two approaches are used in the definitions of these statistics. One of them can be called a "discrete" approach; the other can be called a "continuous" approach.

The discrete approach is based directly on formulae (1), (2) and the definition of transfer probabilities. It follows from (5) that the following formulae for conditional probabilities hold:

P{ξ_i|ξ₀ = 1} = p_1|1(μ_i), if ξ_i = 1,

P{ξ_i|ξ₀ = 1} = p_0|1(μ_i) = 1 - p_1|1(μ_i), if ξ_i = 0,

and hence

P {ξ_{i} | ξ_{0} = 1} = (1 - p_{1 | 1} (μ_{i})) \cdot {(\frac{p_{1 | 1} (μ_{i})}{1 - p_{1 | 1} (μ_{i})})}^{ξ_{i}} .

(7)

Similarly,

P {ξ_{i} | ξ_{0} = 0} = (1 - p_{1 | 0} (μ_{i})) \cdot {(\frac{p_{1 | 0} (μ_{i})}{1 - p_{1 | 0} (μ_{i})})}^{ξ_{i}} .

(8)

Relations (7), (8) together with (1), (2) imply the following formula for the logarithm of the likelihood ratio:

T^{(1)} (ξ; ω) = T^{(1)} (ξ) = \ln (\prod_{i = 1}^{n} \frac{P {ξ_{i} | ξ_{0} = 1}}{P {ξ_{i} | ξ_{0} = 0}}) = α_{0} + \sum_{i = 1}^{n} α_{i} \cdot ξ_{i},

(9)

where

\begin{matrix} α_{0} = \sum_{i = 1}^{n} α_{0 i}, & α_{0 i} = \ln \frac{1 - p_{1 | 1} (μ_{i})}{1 - p_{1 | 0} (μ_{i})} < 0 \end{matrix}

(10)

α_{i} = \ln \frac{p_{1 | 1} (μ_{i}) (1 - p_{1 | 0} (μ_{i}))}{p_{1 | 0} (μ_{i}) (1 - p_{1 | 1} (μ_{i}))} > 0.

(11)

One can see that the statistic T⁽¹⁾ can be expressed as a linear combination of the ξ_i.
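Formulae (9)–(11) can be checked against a direct evaluation of the log-likelihood ratio. The following sketch uses hypothetical function names and made-up probabilities, and assumes that all transfer probabilities lie strictly between 0 and 1 (estimates equal to 0 or 1 would make the logarithms undefined and would need separate handling).

```python
import math


def t1_statistic(xi, p11, p10):
    """T(1) = alpha_0 + sum_i alpha_i * xi_i, following formulae (9)-(11)."""
    alpha0 = sum(math.log((1 - a) / (1 - b)) for a, b in zip(p11, p10))
    alphas = [math.log(a * (1 - b) / (b * (1 - a))) for a, b in zip(p11, p10)]
    value = alpha0 + sum(al * x for al, x in zip(alphas, xi))
    return value, alpha0, alphas


def log_likelihood_ratio(xi, p11, p10):
    """Direct evaluation of ln prod_i P{xi_i | xi_0 = 1} / P{xi_i | xi_0 = 0}."""
    total = 0.0
    for x, a, b in zip(xi, p11, p10):
        total += math.log(a / b) if x == 1 else math.log((1 - a) / (1 - b))
    return total


# Hypothetical indicators and transfer probabilities p_1|1(mu_i), p_1|0(mu_i).
xi = [1, 0, 1]
p11 = [0.8, 0.6, 0.7]
p10 = [0.2, 0.3, 0.1]
t1, alpha0, alphas = t1_statistic(xi, p11, p10)
direct = log_likelihood_ratio(xi, p11, p10)
```

The two evaluations agree up to rounding, and the signs of α₀ and α_i follow from p_1|1 > p_1|0.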

As we noted, theoretically the best threshold value for the statistic T⁽¹⁾ is equal to ln 1 = 0. However, that is only theoretical. As the assumption of independence of the variables ξ_i is incorrect and the statistic T⁽¹⁾ is only a rough estimate of the logarithm of the likelihood ratio, the best threshold value is not necessarily equal to zero (and in practice it is not).

Let us put $η = η (ω) = \frac{T^{(1)} - α_{0}}{\sum_{i = 1}^{n} α_{i}}$ . From the definition of η and relations (9)–(11) it follows that

η = \sum_{i = 1}^{n} {\hat{a}}_{i} ξ_{i},

(12)

where

{\hat{a}}_{i} = \frac{\ln \frac{p_{1 | 1} (μ_{i}) (1 - p_{1 | 0} (μ_{i}))}{p_{1 | 0} (μ_{i}) (1 - p_{1 | 1} (μ_{i}))}}{\sum_{i = 1}^{n} \ln \frac{p_{1 | 1} (μ_{i}) (1 - p_{1 | 0} (μ_{i}))}{p_{1 | 0} (μ_{i}) (1 - p_{1 | 1} (μ_{i}))}} > 0.

(13)

We note that

\sum_{i = 1}^{n} {\hat{a}}_{i} = 1.

(14)

The variable η(ω) can be treated as a second statistic. Its values lie between 0 and 1. The statistics T⁽¹⁾ and η are linearly related, and hence for a fixed word ω these statistics are equivalent. However, the coefficients α₀ and α_i (i = 1,...,n) are different for different words ω, so the relation between the thresholds used in the prediction is different for different words; consequently, the statistics T⁽¹⁾ and η are not equivalent over the whole set of words. (We will see later that if thresholds are well chosen, then η leads to better results than T⁽¹⁾.)

The other approach to the definition of statistics used in the annotation procedure starts from a linear combination $η = \sum_{i = 1}^{n} a_{i} ξ_{i}$ of the ξ_i, as in formula (12). Here the coefficients a_i are not necessarily given by (13), but should be positive and satisfy relation (14). This approach uses the assumption that η has a normal distribution. This assumption is certainly not correct (if only because the inequality 0 ≤ η ≤ 1 always holds). Nevertheless, we use this assumption (and, similarly to the discrete approach, the assumption of independence of the variables ξ_i(ω)). (As the normal distribution is continuous, this approach can be called "continuous".)
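A small sketch of formulae (12)–(14): the weights are the log-odds-ratio coefficients normalized to sum to 1, so η is a weighted fraction of the similar sequences that carry the word. The function name and the numbers are hypothetical, and the probabilities are assumed to lie strictly between 0 and 1 with p_1|1 > p_1|0.

```python
import math


def eta_statistic(xi, p11, p10):
    """Return eta = sum_i a_i * xi_i with weights a_i from formula (13)."""
    alphas = [math.log(a * (1 - b) / (b * (1 - a))) for a, b in zip(p11, p10)]
    total = sum(alphas)
    weights = [al / total for al in alphas]  # positive, summing to 1
    return sum(w * x for w, x in zip(weights, xi)), weights


# Hypothetical transfer probabilities for three similar sequences.
p11 = [0.8, 0.6, 0.7]
p10 = [0.2, 0.3, 0.1]
eta, weights = eta_statistic([1, 0, 1], p11, p10)
eta_all, _ = eta_statistic([1, 1, 1], p11, p10)
eta_none, _ = eta_statistic([0, 0, 0], p11, p10)
```

The extreme values 0 and 1 are reached when the word belongs to none or to all of the similar sequences.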

Thus, we consider a random variable

η = \sum_{i = 1}^{n} a_{i} ξ_{i}

(15)

and assume that it has a normal distribution. Denote by

\begin{matrix} M_{1} η = M (η | ξ_{0} = 1), D_{1} η = D (η | ξ_{0} = 1) = σ_{1}^{2}, \\ M_{0} η = M (η | ξ_{0} = 0), D_{0} η = D (η | ξ_{0} = 0) = σ_{0}^{2} \end{matrix}

the conditional expectations and variances (dispersions) of η given that ξ₀ = 1 or 0 respectively. We have

M₁η = ∑a_i M(ξ_i|ξ₀ = 1) = ∑a_i p_1|1(μ_i), M₀η = ∑a_i p_1|0(μ_i).

Further, the assumption of independence of ξ_i(ω) implies that

\begin{matrix} D_{1} η = σ_{1}^{2} = \sum_{i} a_{i}^{2} \cdot p_{1 | 1} (μ_{i}) \cdot (1 - p_{1 | 1} (μ_{i})), & D_{0} η = σ_{0}^{2} = \sum_{i} a_{i}^{2} \cdot p_{1 | 0} (μ_{i}) \cdot (1 - p_{1 | 0} (μ_{i})) . \end{matrix}

(We note that this is the only place where the independence of the ξ_i is used; therefore, the assumption of independence of the ξ_i is not as essential in this approach as it was in the definition of the statistic T⁽¹⁾.) It follows from the assumption of a normal distribution that in the cases ξ₀ = 1 and 0 the variables

\frac{η - M_{1} η}{σ_{1}} and \frac{η - M_{0} η}{σ_{0}} have standard normal distribution N(0,1)

(16)

As above, we use the logarithm of the likelihood ratio as a statistic and in addition assume that the ratio of a priori hypothesis probabilities is equal to 1. Let us denote this statistic by T⁽²⁾. Relation (16) implies that

\begin{matrix} T^{(2)} (λ) = \ln \frac{P {λ \leq η \leq λ + d λ | ξ_{0} = 1}}{P {λ \leq η \leq λ + d λ | ξ_{0} = 0}} \\ = \ln \frac{(1 / (\sqrt{2 π} \cdot σ_{1})) \cdot \exp {- \frac{1}{2} \cdot {((λ - M_{1} η) / σ_{1})}^{2}}}{(1 / (\sqrt{2 π} \cdot σ_{0})) \cdot \exp {- \frac{1}{2} \cdot {((λ - M_{0} η) / σ_{0})}^{2}}} \\ = \frac{1}{2} \cdot {{((λ - M_{0} η) / σ_{0})}^{2} - {((λ - M_{1} η) / σ_{1})}^{2}} + \ln (σ_{0} / σ_{1}) . \end{matrix}

Another variant of this statistic, the statistic ${\hat{T}}^{(2)}$, was used in the paper [5]. The statistic ${\hat{T}}^{(2)}$ is the logarithm of the ratio of the type 1 error probability to the type 2 error probability:
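The closed form of T⁽²⁾ can be evaluated from the conditional means and variances given above. The sketch below uses hypothetical names and numbers; it relies on the same normality and independence assumptions as the text and assumes that both variances are positive.

```python
import math


def conditional_moments(a, p):
    """Mean and standard deviation of eta = sum_i a_i * xi_i for independent
    indicators xi_i with P{xi_i = 1} = p_i."""
    mean = sum(ai * pi for ai, pi in zip(a, p))
    variance = sum(ai * ai * pi * (1 - pi) for ai, pi in zip(a, p))
    return mean, math.sqrt(variance)


def t2_statistic(lam, a, p11, p10):
    """Log ratio of the two normal densities at eta = lam (closed form)."""
    m1, s1 = conditional_moments(a, p11)
    m0, s0 = conditional_moments(a, p10)
    z1 = (lam - m1) / s1
    z0 = (lam - m0) / s0
    return 0.5 * (z0 * z0 - z1 * z1) + math.log(s0 / s1)


# Hypothetical weights (summing to 1) and transfer probabilities.
a = [0.5, 0.3, 0.2]
p11 = [0.8, 0.6, 0.7]
p10 = [0.2, 0.3, 0.1]
value_high = t2_statistic(0.72, a, p11, p10)  # lam at the mean under xi_0 = 1
value_low = t2_statistic(0.21, a, p11, p10)   # lam at the mean under xi_0 = 0
```

In this example the statistic is positive near M₁η and negative near M₀η, as one would expect from a log-likelihood ratio.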

{\hat{T}}^{(2)} (λ) = \ln \frac{P^{(1)} (λ)}{P^{(2)} (λ)} = \ln \frac{Φ ((λ - M_{1} η) / σ_{1})}{1 - Φ ((λ - M_{0} η) / σ_{0})},

(17)

where Φ(x) is the cumulative normal distribution function: $Φ (x) = \frac{1}{\sqrt{2 π}} \int_{- \infty}^{x} e^{\frac{- z^{2}}{2}} d z$ .
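Formula (17) can be evaluated with the error function from the Python standard library. This is a minimal sketch; the function names and the numerical values of the means and standard deviations are assumptions chosen for illustration.

```python
import math


def phi(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def t2_hat(lam, m1, s1, m0, s0):
    """Logarithm of the ratio of type 1 to type 2 error probabilities at lam."""
    p_type1 = phi((lam - m1) / s1)        # P{eta <= lam | xi_0 = 1}
    p_type2 = 1.0 - phi((lam - m0) / s0)  # P{eta > lam | xi_0 = 0}
    return math.log(p_type1 / p_type2)


# Hypothetical conditional means and standard deviations of eta.
m1, s1, m0, s0 = 0.72, 0.26, 0.21, 0.25
values = [t2_hat(lam, m1, s1, m0, s0) for lam in (0.3, 0.5, 0.7)]
```

Since the numerator increases and the denominator decreases with λ, the statistic is a strictly increasing function of λ for fixed means and variances.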

The statistics η and ${\hat{T}}^{(2)}$ are equivalent for an arbitrary fixed word ω (although not equivalent over the whole set of words). At the same time, the statistics η and T⁽²⁾ are not necessarily equivalent, as the dependence of T⁽²⁾ on η may turn out not to be monotonic. Moreover, it is never monotonic for all values of η. However, we are interested only in values 0 ≤ η ≤ 1, and usually (though not always) the dependence of T⁽²⁾ on η is monotonic for these values of η, and in this case the statistics η and T⁽²⁾ are equivalent (for a fixed word ω). Furthermore, even in the case where the dependence is not monotonic, the monotonicity is violated only for values of η close either to 0 or to 1, and the lack of monotonicity can be disregarded.

We still have to discuss the choice of the coefficients a_i in (15). One of the variants was described above: formula (13) can be applied. Another variant was introduced in [5]. This variant can be described as follows.

As above, we assume that the variate η = ∑a_i ξ_i has a normal distribution. Each set of coefficients a_i and each threshold value λ have corresponding theoretical frequencies of type 1 and type 2 errors P⁽¹⁾(λ), P⁽²⁾(λ). The idea is to take the set of coefficients {a_i} that gives the minimum sum P⁽¹⁾(λ) + P⁽²⁾(λ), where λ is the best threshold value for this coefficient set. However, this is analytically very cumbersome, so the following simplification was implemented. For a fixed set of coefficients {a_i} we select the threshold value λ such that the frequency of type 1 errors is equal to the frequency of type 2 errors: P⁽¹⁾(λ) = P⁽²⁾(λ). Then we find the set of coefficients that minimizes these frequencies P⁽¹⁾(λ) = P⁽²⁾(λ). Here the search for the coefficient set can be reduced to a conditional extremum problem. The A⁴ algorithm uses an iterative procedure for the solution of this problem (see [5]).
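Under the normal model the equal-error threshold has a closed form: setting Φ((λ − M₁η)/σ₁) = 1 − Φ((λ − M₀η)/σ₀) gives (λ − M₁η)/σ₁ = (M₀η − λ)/σ₀, hence λ = (M₁η·σ₀ + M₀η·σ₁)/(σ₀ + σ₁), and the common error frequency is Φ(−(M₁η − M₀η)/(σ₀ + σ₁)). The sketch below only checks this derivation numerically; it does not reproduce the iterative coefficient search of [5], and the names and numbers are hypothetical.

```python
import math


def phi(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def equal_error_threshold(m1, s1, m0, s0):
    """Threshold lam at which the type 1 and type 2 error frequencies coincide."""
    return (m1 * s0 + m0 * s1) / (s0 + s1)


def equal_error_rate(m1, s1, m0, s0):
    """Common value of the two error frequencies at the equal-error threshold."""
    return phi(-(m1 - m0) / (s0 + s1))


# Hypothetical conditional means and standard deviations of eta.
m1, s1, m0, s0 = 0.72, 0.26, 0.21, 0.25
lam = equal_error_threshold(m1, s1, m0, s0)
p_type1 = phi((lam - m1) / s1)
p_type2 = 1.0 - phi((lam - m0) / s0)
```

In this form, minimizing the common error frequency over the coefficients amounts to maximizing (M₁η − M₀η)/(σ₀ + σ₁).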

In order to optimize the automated annotation procedure, that is to increase the prediction quality, certain modifications of the procedure were introduced. For each statistic (η, T⁽¹⁾, T⁽²⁾, ${\hat{T}}^{(2)}$ ) many variants (up to 36), including the variant that was described above, were considered – each variant corresponds to a certain combination of these modifications. Some variants really turned out to be better than the variants described above.

A simplified scheme of modifications (and thereby of statistic variants) can be described as follows.

1) Which primary local alignments are considered? In the variant of the statistics described above, only one primary local alignment (the one with the maximum power) was considered for each similar sequence. Alternatively, all constructed local alignments with sufficiently high power can be considered, as was done in [5]. In this case the indices i in (9), (12), (15) correspond not to individual similar sequences, but to the primary local alignments of these sequences.
2) Are the lengths of primary local alignments taken into consideration? In the variant described above the lengths of primary local alignments were not taken into consideration, but these lengths can be considered as well. In this case the indices i in (9), (12), (15) correspond not to similar sequences or primary local alignments, but to individual positions of these primary local alignments (as in the case of FT-type words). The total number of variates ξ_i is then equal to the overall length of all primary local alignments. For these variants of the statistics, long local alignments turn out to be more significant than short local alignments with the same power.
3) How are the coefficients a_i calculated? The coefficients a_i in (15) can be calculated using different methods: either formula (13) can be used, or the iterative method described in [5] can be applied. (For the statistic T⁽¹⁾ the coefficients a_i are always calculated using formula (13).)

As there are 3 modifications, the total number of basic variants (in the described simplified scheme) equals 2³ = 8 (and for the statistic T⁽¹⁾ it equals 4).

So, variants of the statistics η, T⁽¹⁾, T⁽²⁾, ${\hat{T}}^{(2)}$ are considered in this paper. Moreover, for the purpose of comparison (as in the paper [5]) the simple statistic

q = q (ω) = \frac{\sum_{i = 1}^{n} ξ_{i}}{n}

(18)

(the frequency of occurrence of a word ω in the collection of similar sequences) is also considered.

Finally we quote a scheme of the A⁴ algorithm in Figure 1 (this scheme is essentially taken from [5]). A short description of each stage can be found in [5]. Here we only note that the most time-consuming stage is the first one – the generation of a collection of similar sequences. We also note that in the current investigation regions were not determined, and the prediction was performed for the whole query sequence.

Testing results

Testing scheme

A collection of 518 sequences, randomly selected from SWISS-PROT databank, was generated. Note that only initially annotated sequences (i.e., sequences whose description fields were not obtained by the extension from similar sequences) were selected. The prediction was performed for KW-type words. All selected sequences were divided into two groups. The first group contained 210 sequences. This group was used during the "learning stage": the procedure was applied to all sequences from the group, and values of procedure parameters (including optimal threshold values) that minimize the total number of errors for the first group were selected. The remaining 308 sequences were used for the "main testing" that was performed using parameter values obtained on the basis of "learning" results. A list of all these 308 sequences is given in Additional file 1.

The total number of KW-type words in the description fields of the 308 selected sequences, used as query sequences, was equal to 1176 (a positive prediction is the desired one for these words). As noted, a collection of similar sequences was generated for each query sequence; each collection contained 100 sequences. Then the list of all KW-type words from the description fields of the similar sequences was formed. The majority of these words belonged to only one similar sequence. All such words are degenerate: transfer probabilities are not defined for them. At the same time nearly all these words did not belong to the query sequence. Hence it was sensible to perform prediction only for words that belonged to at least two similar sequences. For all these words transfer probabilities were evaluated, non-degenerate words (i.e., words with defined transfer probabilities) were determined, and the prediction was performed (for both degenerate and non-degenerate words).

Table 1 shows certain characteristics of sequences used for the testing.

Table 1 Certain characteristics of sequences used for the testing


The average number of words considered per query sequence was 30, with a range from 3 to 67.

Final results

Testing results are presented in Table 2.

Table 2 Testing results for different basic variants of the statistics.

The table contains testing results for the basic statistics η, T⁽¹⁾, T⁽²⁾, ${\hat{T}}^{(2)}$ as well as for the statistics q and S_And, which are included in the table for the purpose of comparison (see the Discussion section). Each line of the table corresponds to a certain variant of the studied statistics. The first column shows which statistic and which variant of this statistic corresponds to a line. Variants are given in square brackets: the first number equals 0 if lengths of primary local alignments are not considered, and 1 otherwise; the second number equals 0 if the coefficients a_i are calculated using formula (13), and 1 if the coefficients a_i are calculated using the iterative procedure (for the statistic T⁽¹⁾ the second number is always 0). The second column contains the number N¹ of type 1 errors (i.e., the number of cases when the prediction for a word that belongs to the description fields of a query sequence is negative). The third column contains the number N² of type 2 errors (i.e., the number of cases when the prediction for a word that does not belong to the description fields of a query sequence is positive). The fourth column contains the total number of errors N^all = N¹ + N². The next columns contain the sums P⁽¹⁾ + P⁽²⁾ and P⁽¹⁾ + P⁽⁺⁾, where P⁽¹⁾ is the proportion of type 1 errors, P⁽²⁾ is the proportion of type 2 errors, and P⁽⁺⁾ is the ratio of false positive predictions to the total number of positive predictions: $P^{(1)} = \frac{N^{1}}{n_{q}}$, $P^{(2)} = \frac{N^{2}}{n_{all} - n_{q}}$, $P^{(+)} = \frac{N^{2}}{n_{+}}$. Here n_all is the total number of words for which the prediction is performed, i.e., the total number of KW-type words in the description fields of all sequences similar to at least one query sequence; n_q is the number of words (from the list of these n_all words) that belong to query sequences; n₊ is the total number of words for which the prediction is positive. In our case n_all = 9236, n_q = 1176, and n_all - n_q = 8060; the value of n₊ depends on the threshold evaluated for the given version of the statistic at the "learning stage". Lines are ordered according to the prediction quality: higher lines correspond to statistic variants with a lower total number of errors. For a fixed statistic, the line that corresponds to the best variant of this statistic (i.e., the variant that leads to the lowest total number of errors) is marked. Recall that the statistic ${\hat{T}}^{(2)}$ [1,0] is exactly the statistic that was considered in [5] (note that it is the best variant for the statistic ${\hat{T}}^{(2)}$).
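The quantities in the last columns follow directly from the error counts, as the short sketch below shows. The values n_all = 9236 and n_q = 1176 are those quoted above, whereas the error counts 120 and 80 are hypothetical placeholders and are not entries of Table 2. The sketch also assumes that every word receives either a positive or a negative prediction, so that n₊ equals the number of correctly predicted query words plus the number of type 2 errors; the function name `error_proportions` is illustrative.

```python
from typing import Tuple

N_ALL = 9236  # words for which a prediction is made (from the text)
N_Q = 1176    # of these, words that belong to query sequences (from the text)


def error_proportions(n1: int, n2: int, n_q: int = N_Q,
                      n_all: int = N_ALL) -> Tuple[float, float, float]:
    """Return (P1, P2, P_plus) for given numbers of type 1 and type 2 errors."""
    # Positive predictions = correctly predicted query words + false positives.
    n_plus = (n_q - n1) + n2
    p1 = n1 / n_q                # proportion of type 1 errors
    p2 = n2 / (n_all - n_q)      # proportion of type 2 errors
    p_plus = n2 / n_plus if n_plus else 0.0  # share of wrong positive predictions
    return p1, p2, p_plus


# Hypothetical error counts, chosen only to illustrate the arithmetic.
p1, p2, p_plus = error_proportions(n1=120, n2=80)
quality_p1_p2 = p1 + p2        # sum reported in the fifth column
quality_p1_pplus = p1 + p_plus  # sum reported in the sixth column
```

Because n_all - n_q is almost seven times larger than n_q, the same number of type 2 errors contributes far less to P⁽²⁾ than to P⁽⁺⁾, which is why the two sums can rank procedures differently.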

The results could also have been presented as a confusion table as laid out in Table 3, but doing so for all variants would take a lot of space.

Table 3 Layout of the confusion table for the results of Table 2.

The testing results showed that the first modification (consideration of all primary local alignments or only one primary local alignment with the maximum power) did not significantly affect the prediction quality, so results are only presented for one case (all local alignments are considered, similarly to [5]).

Note that the results presented in Table 2 correspond to the whole totality of words, including degenerate words (although the statistics η, T⁽¹⁾, T⁽²⁾, ${\hat{T}}^{(2)}$, T^(nik) were evaluated only for non-degenerate words; for degenerate words the prediction was performed on the basis of the statistic q, i.e., on the basis of word frequency). In particular, the total number of errors includes errors for degenerate words. It is worth noting that the total number of errors for degenerate words, with the best choice of the threshold q₀ for the frequency q(ω), turned out to be surprisingly small: only 13 (whereas the number of errors for the whole totality of words was 220); these included twelve type 1 errors and one type 2 error (recall that the prediction was performed for 2029 degenerate words, and it was incorrect in only 13 cases). Such a small number of errors can apparently be explained by the fact that for degenerate words the frequency q(ω) is nearly always close to either 1 or 0; otherwise, in almost all cases the transfer probabilities p_1|1, p_1|0 can be defined for at least some values of the similarity measure μ, and hence the word ω turns out to be non-degenerate.

We also note that along with type 1 and type 2 errors other errors can occur, as it is possible that some words from description fields of a query sequence do not belong to description fields of similar sequences and hence the prediction is not performed for these words at all. However, we used a large number of similar sequences (100) for the prediction, so such errors were extremely rare (only two words for the whole test set of 518 sequences, whereas the number of words for which the prediction was performed equaled 9236). Hence, in our case these errors can be disregarded.

Statistical analysis showed that the precision of the evaluation of N¹, N², N^all is reasonable: it can be checked that the relative precision of N^all (the standard error of ln(N^all)) is on the order of 7–10%, which is in line with the relative precision of a Poisson random variable, given by $1 / \sqrt{N^{all}}$. However, since the "methods" (the statistics and their variants) are compared on the same sequences, the standard error for the comparison between methods is much smaller, owing to the high correlation of results for the same sequence, and is on the order of 2–4%. This implies that "methods" for which N^all differs by more than 10% can be considered significantly different.
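The Poisson figure quoted above is simple arithmetic and can be reproduced with a short sketch. It assumes only that the error count behaves like a Poisson variable, so that its standard deviation is the square root of its mean and its relative standard error (approximately the standard error of its logarithm) is one over that square root. The function name `poisson_relative_se` is illustrative; 220 is the total number of errors quoted earlier for the whole totality of words, and 100 is an arbitrary round number for comparison.

```python
import math


def poisson_relative_se(n_errors: float) -> float:
    """Relative standard error sqrt(N) / N = 1 / sqrt(N) of a Poisson count N."""
    return 1.0 / math.sqrt(n_errors)


# Roughly 0.067 (about 7%) for 220 errors and exactly 0.1 (10%) for 100 errors.
rel_se_220 = poisson_relative_se(220)
rel_se_100 = poisson_relative_se(100)
```

This bound treats each method in isolation. It does not capture the gain from comparing methods on the same sequences, which is why the standard error of a comparison between methods is smaller than these values.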

This shows that the size of the experiment, with 210 randomly selected sequences in the learning stage and 308 sequences in the testing stage, is large enough to obtain valid statements about the accuracy of the proposed methodology and allows statistical comparisons of different statistics and variants.

When the set of tested sequences is fixed, the total number of errors N^all is an objective characteristic of prediction quality, and the optimal threshold values (selected during the "learning stage") provide exactly the minimum of the total number of errors. However, if different testing results (based on different sets of query sequences) are compared, then absolute numbers of errors cannot be treated as a measure of procedure quality, and relative quantities (proportions) should be considered instead. Usually the sum P⁽¹⁾ + P⁽²⁾ of the proportions of type 1 and type 2 errors is used as a quality measure; values of this sum are presented in the fifth column of Table 2. However, in our situation the number n_all - n_q of words that do not belong to query sequences is much larger than n_q, and for the majority of these words it is obvious that they do not belong to a query sequence. Consequently, for optimal parameter values the proportion of type 2 errors P⁽²⁾ is small, considerably smaller than P⁽¹⁾. Thus the quantity P⁽¹⁾ + P⁽²⁾ is not representative in our case (see [5]). The ratio P⁽⁺⁾ of the number of wrong positive predictions to the total number of positive predictions is more representative than P⁽²⁾. Hence, it is natural to measure procedure quality by the sum P⁽¹⁾ + P⁽⁺⁾. Exactly this quality measure was used in [5].

In medical decision making (diagnostic testing) the terms sensitivity (sens = 1 - P⁽¹⁾) and specificity (spec = 1 - P⁽²⁾) are frequently used to quantify the accuracy of a procedure, while the quantity 1 - P⁽⁺⁾ is known as the positive predictive value. Alternative terminology is discussed in [12]. The ROC curve, which plots sens against 1 - spec, is a popular way of showing the overall performance of a diagnostic test without specifying a cut-off value. It is also used and discussed in [12], but with different labels for the axes. Figure 2 contains ROC curves for the best variants of the statistics (i.e., for the variants that were marked in Table 2) applied to the non-degenerate words.
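How such a curve is obtained from the values of a statistic can be shown with a minimal sketch. It assumes the rule "predict positive when the statistic is at least the cut-off" and sweeps the cut-off over all observed values; the function name `roc_points` and the toy data are hypothetical and do not reproduce Figure 2.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float],
               labels: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (1 - spec, sens) pairs, from the strictest cut-off to the loosest."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # cut-off above every score: nothing is predicted
    for cutoff in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, y in zip(scores, labels) if y and s >= cutoff)
        false_pos = sum(1 for s, y in zip(scores, labels) if not y and s >= cutoff)
        # sens = 1 - P1 and 1 - spec = P2 in the notation of the text.
        points.append((false_pos / n_neg, true_pos / n_pos))
    return points


# Hypothetical statistic values and true word membership.
toy_scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8, 0.9]
toy_labels = [False, False, True, False, True, True, True]
curve = roc_points(toy_scores, toy_labels)
```

Each point corresponds to one cut-off value, so a single line of Table 2 (one threshold fixed at the learning stage) corresponds to a single point on the curve of the respective statistic.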

Discussion

Overall comparison of the statistics T⁽¹⁾, η, T⁽²⁾, ${\hat{T}}^{(2)}$

Mean values of N^all (where averaging is performed over all variants presented in Table 2) for the different statistics are the following: 241 for η, 310 for T⁽¹⁾, 320 for T⁽²⁾, and 356 for ${\hat{T}}^{(2)}$ (recall that differences on the order of 10% can be considered significant). These numbers show that the statistics can clearly be ordered with respect to prediction quality: the best statistic is η, then comes T⁽¹⁾, then T⁽²⁾, and finally ${\hat{T}}^{(2)}$. For the comparison of the statistics η and T⁽¹⁾, as well as for the comparison of the statistics T⁽²⁾ and ${\hat{T}}^{(2)}$, this conclusion is obvious. For the comparison of the statistics T⁽¹⁾ and T⁽²⁾ it seems to be less obvious, but the consideration of variants in which the coefficients a_i are calculated using formula (13) (i.e., variants [0,0] and [1,0]; recall that the statistic T⁽¹⁾ has only such variants) clearly shows that the statistic T⁽¹⁾ is considerably better than T⁽²⁾. The same conclusion can be drawn from the comparison of the ROC curves presented in Fig. 2.

Testing results show that the prediction quality for the best variant of the statistic η (see the first line of Table 2) is much higher (higher by 50%) in comparison with results of [5] (see the line of Table 2 that corresponds to ${\hat{T}}^{(2)}$ [0,1]). That is the most striking finding of the current paper.

It is interesting to see that the statistic η turned out to be better than T⁽¹⁾: the latter would be expected to be better as it equals the logarithm of the likelihood ratio (in reality, it does not equal this logarithm because the ξ_i are not independent). We suppose that this fact can be explained as follows. Recall that T⁽¹⁾ and η are equivalent for an arbitrary fixed word ω. These statistics turn out not to be equivalent only when they are compared on the whole totality of words. It is worth noting that the range of values of the statistic T⁽¹⁾ (i.e., the difference between the values of T⁽¹⁾ in the cases ξ₁ = ... = ξ_n = 0 and ξ₁ = ... = ξ_n = 1) depends significantly on the word ω: for some words these values vary from -700 to 700, for others from -0.01 to 0.01. In principle, the optimal threshold value is different for different words. As the ranges of T⁽¹⁾ values vary substantially, it is probable that the optimal thresholds also vary substantially. The optimal threshold for the whole totality of words is a certain mean of the thresholds over individual words ω. Since the optimal thresholds differ significantly between words, the quality of the mean is low: for certain words (e.g., words with a small range of T⁽¹⁾) this mean is completely unrepresentative. At the same time, the range of values of η is the same for all words (these values lie between 0 and 1). Consequently, the difference between optimal thresholds for individual words is probably not so significant, and the mean gives better quality for individual words in comparison with T⁽¹⁾.

From the point of view of prediction quality, the statistic T⁽²⁾ turned out to be worse than T⁽¹⁾. This is not surprising, because in deriving the formula for T⁽²⁾ we made, along with the incorrect assumption of independence of the ξ_i, the incorrect assumption that η is normally distributed. This consideration applies to ${\hat{T}}^{(2)}$ as well. Since the cut-off points for these statistics are determined empirically in the training set and validated in the test set, the violation of the normality assumption does not invalidate the procedures as such, but might affect their performance.

Effect of procedure modifications

The next issue is the dependence of prediction quality on procedure modifications. One can see that for a fixed statistic the prediction quality significantly depends on the choice of the variant. Thus, the introduction of modifications turned out to be effective.

It is interesting that for all statistics for which the coefficients a_j can be calculated in different ways (these are the statistics η, T⁽²⁾, ${\hat{T}}^{(2)}$) the prediction quality was better when these coefficients were calculated using the iterative procedure.

At the same time, the dependence of prediction quality on the modification related to the consideration of lengths of primary local alignments is more intricate: for the statistic η, and to a lesser extent for the statistics T⁽¹⁾, T⁽²⁾, the results are better if lengths of primary local alignments are considered, but for the statistic ${\hat{T}}^{(2)}$ the results are better when lengths of primary local alignments are not considered. The reasons for this are currently not clear to us.

As noted above, the dependence of the results on the modification dealing with the number of primary local alignments considered for similar sequences is not essential. (However, we note that for nearly all variants described above, as well as for variants that were not described, the prediction quality is better when all primary local alignments are considered.)

Comparison with other procedures

For the purpose of comparison, the simple statistic q(ω) (the frequency of word occurrence in the list of similar sequences, see (18)) was also considered, using a threshold of q = 0.422. As expected, q(ω) gave essentially worse results in comparison with T⁽¹⁾, η, T⁽²⁾, ${\hat{T}}^{(2)}$ (see Table 2).

Furthermore, we compared our results with those of a procedure that predicts every word for which at least one of the similar sequences containing the word has a power above a certain threshold. Such an approach was used by Andrade et al. [2]. This is the statistic S_And already mentioned in Table 2. It is defined as S_And(ω) = max μ_j, where μ_j is the measure of similarity between the sequences π₀ and π_j, and the maximum is taken over those j for which ξ_j(ω) = 1 (i.e., the word belongs to these sequences). In our application similarity is measured by the power value (see [9, 10]), as in the other statistics in this paper, while [2] used the E-value. That makes the cut-offs hard to compare. The results for this statistic are given in Table 2. The results were better than for the statistic q(ω), but worse than for any of our statistics.
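The two comparison statistics are simple enough to be written down in a few lines. The sketch below follows the definition of S_And given above; for q(ω) it assumes the plain fraction of similar sequences whose description fields contain the word, which is our reading of "frequency of word occurrence" rather than a reproduction of formula (18). The function names and the toy values of μ_j and ξ_j are hypothetical; only the threshold 0.422 is taken from the text.

```python
from typing import Optional, Sequence


def s_and(mu: Sequence[float], xi: Sequence[int]) -> Optional[float]:
    """Maximum similarity mu_j over similar sequences j with xi_j = 1.

    Returns None when no similar sequence contains the word.
    """
    values = [m for m, x in zip(mu, xi) if x == 1]
    return max(values) if values else None


def q_frequency(xi: Sequence[int]) -> float:
    """Fraction of similar sequences that contain the word (assumed form of q)."""
    return sum(xi) / len(xi)


# Toy example with four similar sequences; mu holds hypothetical power values.
mu = [12.0, 30.5, 7.2, 18.0]
xi = [1, 0, 1, 1]  # xi_j = 1 if the word belongs to similar sequence j

s_value = s_and(mu, xi)      # 18.0: the most similar sequence lacks the word
q_value = q_frequency(xi)    # 0.75
predicted_by_q = q_value >= 0.422
```

The example shows the difference between the two rules: S_And looks only at the single best-matching sequence that carries the word, whereas q ignores the similarity values altogether.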

Surprisingly, it turned out that the prediction quality is better when only some of the similar sequences (e.g., 10 out of 100) are used for the evaluation of a statistic (but not for the evaluation of transfer probabilities!). This means that a large part of the similar sequences only increases the "noise". (See [13] for details.)

We would also like to mention the work [14] on FunCat categories, which contains a set of about 7,500 well-annotated proteins and provides a benchmark for different methods of automated annotation. It would be interesting to apply our procedure to this database, but that has not been done yet.

We did not attempt at this stage to apply our approach to GO annotation. An interesting direction for further research would be to switch to GO data and to compare our approach with other approaches in the literature.

There is a similarity with the approach of Kajan et al. [15]. They also base their procedures on the likelihood ratio, but use approximations based on maximal similarity (or minimal distance), while we attempt to estimate the likelihood ratio from pairwise comparisons within the set of similar sequences.

There is also an interesting relation with the GOPET tool presented in [16], which is based on earlier work by the same authors [17]. These authors use a number of characteristics of the set of similar sequences found for a term (word), such as the maximal E-value, the frequency of the term, etc., as coordinates of a decision-making space. They use more features, while we concentrate on the similarity. That might be an advantage for their method. On the other hand, we try to use the information from all similar sequences in an optimal way, relying on statistical decision theory by means of transfer probabilities and the concept of likelihood ratio. It is an interesting topic of further research to compare the two methods.

So far we only predicted the presence of a key word. It is quite a challenge to obtain a true prediction of function.

Conclusion

The main conclusion of the paper is that the introduction of the concept of likelihood ratio coming from statistical decision theory is very helpful in the development of automated annotation procedures. We obtained a substantial improvement when compared with our previous results.

We are sure that there is room for further improvement. Issues for further research are the size of the set of similar sequences and the combination of different statistics into a "super-predictor".

References

1. Fleischmann W, Moller S, Gateau A, Apweiler R: A novel method for automatic functional annotation of proteins. Bioinformatics 1999, 15: 228–233. 10.1093/bioinformatics/15.3.228
2. Andrade MA, Brown NP, Leroy C, Hoersch S, de Daruvar A, Reich C, Franchini A, Tamames J, Valencia A, Ouzounis C, Sander C: Automated genome sequence analysis and annotation. Bioinformatics 1999, 15: 391–412. 10.1093/bioinformatics/15.5.391
3. Kretschmann E, Fleischmann W, Apweiler R: Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. Bioinformatics 2001, 17: 920–926. 10.1093/bioinformatics/17.10.920
4. Hegyi H, Gerstein M: Annotation transfer for genomics: Measuring functional divergence in multi-domain proteins. Genome Research 2001, 11: 1632–1640. 10.1101/gr.183801
5. Leontovich AM, Brodsky LI, Drachev VA, Nikolaev VK: Adaptive algorithm of automated annotation. Bioinformatics 2002, 18: 838–846. 10.1093/bioinformatics/18.6.838
6. Cox DR, Hinkley DV: Theoretical Statistics. London: Chapman and Hall; 1974.
7. Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis. Probabilistic models of proteins and nucleic acids. Cambridge University Press; 1998.
8. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology 1990, 215: 403–410.
9. Leontovich AM, Brodsky LI, Gorbalenya AE: Construction of the full local similarity map for 2 biopolymers. Biosystems 1993, 30: 57–63. 10.1016/0303-2647(93)90062-H
10. Altschul SF, Erickson BW: Optimal sequence alignment using affine gap costs. Bulletin of Mathematical Biology 1986, 48: 603–616.
11. Barlow RE, Bartholomew JM, Bremner JM, Brunk HD: Statistical Inference Under Order Restrictions. New York: John Wiley & Sons; 1972.
12. Baldi P, Brunak S, Chauvin Y, Andersen CAF, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 2000, 16: 412–424. 10.1093/bioinformatics/16.5.412
13. Leontovich AM, Tokmachev KY: Methods for improving the quality of prediction in the process of automatic annotating A(4). Biofizika 2006, 51: 593–601.
14. Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Guldener U, Mannhaupt G, Munsterkotter M, Mewes HW: The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res 2004, 32: 5539–5545. 10.1093/nar/gkh894
15. Kajan L, Kertesz-Farkas A, Franklin D, Ivanova N, Kocsor A, Pongor S: Application of a simple likelihood ratio approximant to protein classification. Bioinformatics 2006, 22: 2865–2869. 10.1093/bioinformatics/btl512
16. Vinayagam A, del Val C, Schubert F, Eils R, Glatting KH, Suhai S, König R: GOPET: A tool for automated predictions of Gene Ontology terms. BMC Bioinformatics 2006, 7: 161. 10.1186/1471-2105-7-161
17. Vinayagam A, König R, Moormann J, Schubert F, Eils R, Glatting KH, Suhai S: Applying Support Vector Machines for Gene ontology based gene function prediction. BMC Bioinformatics 2004, 5: 116. 10.1186/1471-2105-5-116

Acknowledgements

The authors thank V.K. Nikolaev, A.E. Gorbalenya, V.V. Galatenko and I.V. Antonov for valuable comments and discussions. The work of A.M. Leontovich and K.Y. Tokmachev was partially supported through the Joint Program in Bioinformatics between Leiden University Medical Center and Moscow State University, administered through the CRDF GAP1473 grant to A.E. Gorbalenya. A.M. Leontovich also acknowledges support of the EU grant FP6 IP Vizier LSHG-CT-2004-511960 and grants RFBR 06-07-89143a and RFBR 06-01-00454.

Author information

Authors and Affiliations

Belozersky Institute of Physico-Chemical Biology, Moscow State University, Moscow, 119899, Russia
Andrey M Leontovich & Konstantin Y Tokmachev
Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Post zone S5-P, 2300 RC Leiden, The Netherlands
Hans C van Houwelingen

Corresponding author

Correspondence to Hans C van Houwelingen.

Additional information

Authors' contributions

A.M. Leontovich and H.C. van Houwelingen developed the approach. A.M. Leontovich developed the methodology and the algorithms in cooperation with K.Y. Tokmachev and H.C. van Houwelingen. K.Y. Tokmachev carried out all computations.

Electronic supplementary material

12859_2007_2016_MOESM1_ESM.xls

Additional file 1: Testing sequences. The file contains the list of the 308 sequences used in the testing phase. (XLS 30 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Leontovich, A.M., Tokmachev, K.Y. & van Houwelingen, H.C. The comparative analysis of statistics, based on the likelihood ratio criterion, in the automated annotation problem. BMC Bioinformatics 9, 31 (2008). https://doi.org/10.1186/1471-2105-9-31

Received: 27 February 2007
Accepted: 22 January 2008
Published: 22 January 2008
DOI: https://doi.org/10.1186/1471-2105-9-31