Fast index based algorithms and software for matching position specific scoring matrices
 10k Downloads
 57 Citations
Abstract
Background
In biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs in nucleotide as well as amino acid sequences. Searching with PSSMs in complete genomes or large sequence databases is a common, but computationally expensive task.
Results
We present a new nonheuristic algorithm, called ESAsearch, to efficiently find matches of PSSMs in large databases. Our approach preprocesses the search space, e.g., a complete genome or a set of protein sequences, and builds an enhanced suffix array that is stored on file. This allows the searching of a database with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, we present a variant operating on sequences recoded according to a reduced alphabet. We also address the problem of noncomparable PSSMscores by developing a method which allows the efficient computation of a matrix similarity threshold for a PSSM, given an Evalue or a pvalue. Our method is based on dynamic programming and, in contrast to other methods, it employs lazy evaluation of the dynamic programming matrix. We evaluated algorithm ESAsearch with nucleotide PSSMs and with amino acid PSSMs. Compared to the best previous methods, ESAsearch shows speedups of a factor between 17 and 275 for nucleotide PSSMs, and speedups up to factor 1.8 for amino acid PSSMs. Comparisons with the most widely used programs even show speedups by a factor of at least 3.8. Alphabet reduction yields an additional speedup factor of 2 on amino acid sequences compared to results achieved with the 20 symbol standard alphabet. The lazy evaluation method is also much faster than previous methods, with speedups of a factor between 3 and 330.
Conclusion
Our analysis of ESAsearch reveals sublinear runtime in the expected case, and linear runtime in the worst case for sequences not shorter than $\mathcal{A}$^{ m } + m  1, where m is the length of the PSSM and $\mathcal{A}$ a finite alphabet. In practice, ESAsearch shows superior performance over the most widely used programs, especially for DNA sequences. The new algorithm for accurate onthefly calculations of thresholds has the potential to replace formerly used approximation approaches. Beyond the algorithmic contributions, we provide a robust, well documented, and easy to use software package, implementing the ideas and algorithms presented in this manuscript.
Keywords
Suffix Reading Depth Score Threshold Suffix Tree Amino Acid PairBackground
Position specific scoring matrices (PSSMs) have a long history in sequence analysis (see [1]). A high PSSMscore in some region of a sequence often indicates a possible biological relationship of this sequence to the family or motif characterized by the PSSM. There are several databases utilizing PSSMs for function assignment and annotation, e.g., PROSITE [2], PRINTS [3], BLOCKS [4], EMATRIX [5], JASPAR [6], or TRANSFAC [7]. While these databases are constantly improved, there are only few improvements in the programs searching with PSSMs. E.g., the programs FingerPrintScan [8], BLIMPS [4], and MatInspector[9] still use a straightforward $\mathcal{O}$(mn)time algorithm to search a PSSM of length m in a sequence of length n. In [10] the authors presented a method based on Fourier transformation. A different method introduced in [11] employs data compression. To the best of our knowledge there is no software available implementing these two methods. The most advanced program in the field of searching with PSSMs is EMATRIX [12], which incorporates a technique called lookahead scoring. The lookahead scoring technique is also employed in the suffix tree based method of [13]. This method performs a limited depth first traversal of the suffix tree of the set of target sequences. This search updates PSSMscores along the edges of the suffix tree. Lookahead scoring allows to skip subtrees of the suffix tree that do not contain any matches to the PSSM. Unfortunately, the method of [13] has not found its way into a widely available and robust software system. In [14], the development of new, more efficient algorithms for searching with PSSMs is considered an important problem, which still needs better solutions.
In this paper, we present a new, nonheuristic algorithm for searching with PSSMs. With any nonheuristic PSSM searching algorithm, the performance in terms of sensitivity and specificity solely depends on the used PSSM and threshold, i.e. given a PSSM and threshold, all these algorithms give exactly the same results. For the generation of PSSMs from aligned sequences, numerous different methods, were described in literature over the last decades [1, 5, 15, 16, 17]. Rather than improving PSSMs, we focus on improvements in terms of time and space efficiency when searching with PSSMs, independently of their underlying generation method. The overall structure of our proposed search algorithm is similar to the method of [13]. However, instead of suffix trees we use enhanced suffix arrays, a data structure which is as powerful as suffix trees (cf. [18]). Enhanced suffix arrays provide several advantages over suffix trees, which make them more suitable for searching with PSSMs:

While suffix trees require about 12n bytes in the best available implementation (cf. [19]), the enhanced suffix array used for searching with PSSMs only needs 9n bytes of space.

While the suffix tree is usually only computed in main memory, the enhanced suffix array is computed once and stored on file. Whenever a PSSM is to be searched, the enhanced suffix array is mapped into main memory which requires no extra time.

While the depth first traversal of the suffix tree suffers from the poor locality behavior of the data structure (cf. [20]), the enhanced suffix array provides optimal locality, because when searching with PSSMs it is sequentially scanned from left to right.
One of the algorithmic contributions of this paper is a new technique that allows to skip parts of the enhanced suffix array containing no matches to the PSSM. Due to the skipping, our algorithm achieves an expected running time that is sublinear in the size of the search space (i.e., the size of the nucleotide or protein database). As a consequence, our algorithm scales very well for large data sizes.
Since the running time of our algorithm increases with the size of the underlying alphabet, we developed a filtering technique, utilizing alphabet reduction, that achieves better performance especially on sequences/PSSMs over the amino acid alphabet.
When searching with a PSSM, it is important to determine a suitable threshold for a PSSMmatch. Usually, the user prefers to specify a significance threshold (i.e., an Evalue or a pvalue) which has to be transformed into an absolute score threshold for the PSSM under consideration. This can be done by computing the score distribution of the PSSM, using wellknown dynamic programming (DP, for short) methods, e.g., [12, 21, 22, 23]. Unfortunately, these methods are not fast enough for large PSSMs. For this reason, we have developed a new, lazy evaluation algorithm that only computes a small fraction of the complete score distribution. Our algorithm speeds up the computation of the threshold by factor of at least 3, compared to standard DP methods. This makes our algorithm applicable for onthefly computations of the score thresholds.
The new algorithms described in this paper are implemented as part of the PoSSuM software distribution. This is available free of charge for noncommercial research institutions. For details, see [24]. Parts of this contribution appeared as [25] in proceedings of GCB2004.
Results
PSSMs and lookahead scoring: LAsearch
A simple algorithm for the PSSM matching problem slides along the sequence and computes sc (w, M) for each w = S [j..j + m  1], j ∈ [0, n  m]. The running time of this algorithm is $\mathcal{O}$(mn). It is used e.g., in the programs FingerPrintScan [8], BLIMPS [4], MatInspector[9], and MATCH [17].
In [12], lookahead scoring is introduced to improve the simple algorithm. Lookahead scoring allows to stop the calculation of sc (w, M) when it is clear that the given overall score threshold th cannot be achieved. To be more precise, we define $pfxs{c}_{d}\left(w,M\right):={\displaystyle {\sum}_{h=0}^{d}M\left(h,w[h]\right)}$, max_{ d }:= max{M(d, a)  a ∈ $\mathcal{A}$}, and ${\sigma}_{d}:={\displaystyle {\sum}_{h=d+1}^{m1}ma{x}_{h}}$ for any d ∈ [0, m  1]. pfxsc_{ d }(w, M) is the prefix score of depth d. σ_{ d }is the maximal score that can be achieved in the last m  d  1 positions of the PSSM. Let th_{ d }:= th  σ_{ d }be the intermediate threshold at position d. The correctness of lookahead scoring, not shown in [12], is implied by the following Lemma:
 (1)
pfxsc_{ d }(w, M) ≥ th_{ d }for all d ∈ [0, m  1],
 (2)
sc(w, M) ≥ th.
Proof: (1)⇒(2): Suppose that (1) holds. Then ${\sigma}_{m1}={\displaystyle {\sum}_{h=m}^{m1}ma{x}_{h}=0}$ and
$sc\left(w,M\right)={\displaystyle \sum _{h=0}^{m1}M\left(h,w[h]\right)=}pfxs{c}_{m1}\left(w,M\right)\ge t{h}_{m1}=th{\sigma}_{m1}=th.$
(2)⇒(1): Suppose that (2) holds. Let d ∈ [0, m  1]. Then
$\begin{array}{l}sc\left(w,M\right)={\displaystyle \sum _{h=0}^{m1}M\left(h,w[h]\right)}={\displaystyle \sum _{h=0}^{d}M\left(h,w[h]\right)}+{\displaystyle \sum _{h=d+1}^{m1}M\left(h,w[h]\right)}\\ =pfxs{c}_{d}\left(w,M\right)+{\displaystyle \sum _{h=d+1}^{m1}M\left(h,w[h]\right)}\end{array}$
Hence sc(w, M) ≥ th implies $pfxs{c}_{d}\left(w,M\right)+{\displaystyle {\sum}_{h=d+1}^{m1}M\left(h,w[h]\right)}\ge th$. Since M(h, w[h]) ≤ max_{ h }for h ∈ [0, m  1], we conclude
$\sum _{h=d+1}^{m1}M\left(h,w[h]\right)}\le {\displaystyle \sum _{h=d+1}^{m1}ma{x}_{h}}={\sigma}_{d$
and hence
$pfxs{c}_{d}\left(w,M\right)\ge th{\displaystyle \sum _{h=d+1}^{m1}M\left(h,w[h]\right)}\ge th{\sigma}_{d}=t{h}_{d}.$
The Lemma suggests a necessary condition for a PSSMmatch which can easily be exploited: When computing sc(w, M) by scanning w from left to right, one checks for d = 0,1,..., m  1, if the intermediate threshold th_{ d }is achieved. If not, the computation can be stopped. See Figure 1 for an example of intermediate thresholds and their implications.
The lookahead scoring algorithm (herein after called LAsearch) runs in $\mathcal{O}$(kn) time, where k is the average number of PSSMpositions per sequence position actually evaluated. In the worst case, k ∈ $\mathcal{O}$(m), which leads to the worst case running time of $\mathcal{O}$(mn), not better than the simple algorithm. However, k is expected to be much smaller than m, leading to considerable speedups in practice.
Our reformulation of lookahead scoring and its implementation is the basis for improvements and evaluation in the subsequent sections.
PSSM searching using enhanced suffix arrays: ESAsearch
lcp is a table in the range 0 to n such that lcp[0] := 0 and lcp[i] is the length of the longest common prefix of S_{suf[i  1]}and S_{suf[i]}, for i ∈ [1, n]. See Figure 2 for an example. Table lcp can be computed in linear time given table suf [27]. In practice PSSMs are used to model relatively short, local motifs and hence do not exceed length 255. For searching with PSSMs we therefore do not access values in table lcp larger than 255, and hence we can store lcp in n bytes.
skp is a table in the range 0 to n such that skp[i] := min({n + 1} ∪ {j ∈ [i + 1, n]  lcp[i] > lcp[j]}). In terms of suffix trees, skp[i] denotes the lexicographically next leaf that does not occur in the subtree below the branching node corresponding to the longest common prefix of S_{suf[i  1]}and S_{suf[i]}. Figure 2 shows this relation. Table skp can be computed in $\mathcal{O}$(n) time given suf and lcp. For the algorithm to be described we assume that the enhanced suffix array for S has been precomputed.
In a suffix tree, all substrings of S of a fixed length m can be scored with a PSSM by a depth first traversal of the tree. Using lookahead scoring, one can skip certain subtrees that do not contain matches to the PSSM. Since suffix trees have several disadvantages (see the introduction), we use enhanced suffix arrays to search PSSMs. Like in other algorithms on enhanced suffix arrays (cf. [18]), one simulates a depth first traversal of the suffix tree by processing the arrays suf and lcp from left to right. To incorporate lookahead scoring while searching we must be able to skip certain ranges of suffixes in suf. To facilitate this, we use table skp. We will now make this more precise.
For i ∈ [0, n], let v_{ i }= S_{suf[i]}, l_{ i }= min{m, v_{ i }}  1, and d_{ i }= max({1} ∪ {d ∈ [0, l_{ i }] pfxsc_{ d }(v_{ i }, M) ≥ th_{ d }}). Now observe that d_{ i }= m  1 ⇔ pfxsc_{m1}(v_{ i }, M) ≥ th_{m1}⇔ sc (v_{ i }, M) ≥ th. Hence, M matches at position j = suf[i] if and only if d_{ i }= m  1. Thus, to solve the PSSM searching problem, it suffices to compute all i ∈ [0, n] satisfying d_{ i }= m  1. We compute d_{ i }along with C_{ i }[d] = pfxsc_{ d }(v_{ i }, M) for any d ∈ [0, d_{ i }]. d_{0} and C_{0} are easily determined in $\mathcal{O}$(m) time. Now let i ∈ [1, n] and suppose that d_{i1}and C_{i1}[d] are determined for d ∈ [0,d_{i1}]. Since v_{i1}and v_{ i }have a common prefix of length lcp[i], we have C_{ i }[d] = C_{i1}[d] for all d ∈ [0, lcp[i]  1]. Consider the following cases:

If d_{i1}+ 1 ≥ lcp[i], then compute C_{ i }[d] for d ≥ lcp[i] while d ≤ l_{ i }and C_{ i }[d] ≥ th_{ d }. We obtain d_{ i }= d.

If d_{i1}+ 1 < lcp[i], then let j be the minimum value in the range [i + 1, n + 1] such that all suffixes v_{ i }, v_{i+1},...,v_{j1}have a common prefix of length d_{i1}+ 1 with v_{i1}. Due to the common prefix we have pfxsc_{ d }(v_{i1}, M) = pfxsc_{ d }(v_{ r }, M) for all d ∈ [0, d_{i1}+ 1] and r ∈ [i, j  1]. Hence d_{i1}= d_{ r }for r ∈ [i, j  1]. If d_{i1}= m  1, then there are PSSM matches at all positions suf[r] for r ∈ [i, j  1]. If d_{i1}<m  1, then there are no PSSM matches at any of these positions. That is, we can directly proceed with index j. We obtain j by following a chain of entries in table skp: compute a sequence of values j_{0} = i, j_{1} = skp[j_{0}],...,j_{ k }= skp[j_{k1}] such that d_{i1}+ 1 < lcp[j_{1}],...,d_{i1}+ 1 < lcp[j_{k1}], and d_{i1}+ 1 ≥ lcp[j_{ k }]. Then j = j_{ k }.
We illustrate the ideas of algorithm ESAsearch, formally described above, with the following example. Let M be a PSSM of length m = 2 over alphabet $\mathcal{A}$ = {a, c} with M(0, a) = 1, M(0, c) = 3, M(1, a) = 3, and M(1, c) = 2. For a given threshold of th = 6 we obtain intermediate thresholds th_{0} = 3 and th_{1} = 6. To search with M in the enhanced suffix array for sequence S = caaaaccacac as given in Figure 2, we start processing the enhanced suffix array suf top down by scoring the first suffix S_{suf[0]} = aaaaccacac$ with M from left to right. For the first character of S_{suf[0]} we obtain a score of pfxsc_{0}(S_{suf[0]},M)= M(0, a) = 1 which is below the first intermediate threshold th_{0}= 3. Hence we set d_{0} = 1 and notice that we can skip all suffixes of S that start with character 'a'. Further on, with a lookup in lcp[1] = 3 we find that S_{suf[1]} and S_{suf[0]} share a common prefix of length 3 and d_{0} + 1 = 1 + 1 < lcp[1] = 3 (second case described above). The next suffix that may match M with th = 6 is S_{suf[6]} = caaaaccacac$. Suffixes S_{suf[1]}, S_{suf[2]},... S_{suf[5]} can be skipped since they all share a common prefix with S_{suf[0]} of at least length 1. That is, they begin all with character 'a' and would also miss the first intermediate threshold th_{0} = 3 when scored. We find S_{suf[6]} by following a chain of entries in table skp: skp[1] = 2, skp[2] = 3, and skp[3] = 6. When scoring S_{suf[6]} we compute pfxsc_{0}(S_{suf[6]}, M) = M(0,c) = 3 and pfxsc_{1}(S_{suf[6]}, M) = M(0, c) + M(1, a) = 6 and store them for reuse in C[0] and C[1]. Since d_{6} = 1 = m  1 = 1 holds, we report suf[6] = 0 with score sc (S_{suf[6]}, M) = pfxsc_{1}(S_{suf[6]},M) = 6 as a matching position. With lookups in lcp[7] = 2 and lcp[8] = 3 we notice that S_{suf[7]} and S_{suf[8]} share a common prefix of at least two characters with S_{suf[6]}. Hence we report suf[7] = 6 and suf[8] = 8 with score C[1] = 6 as further matching positions. We proceed with the scoring of S_{suf[9]}. Since lcp[9] = 1 holds, we obtain the score for the first character 'c' from array C with pfxsc_{0}(S_{suf[9]}, M) = C[0]. After scoring the second character of S_{suf[9]}, pfxsc_{1}(S_{suf[9]}, M) = 5 <th_{1} = 6 holds and we miss the second intermediate threshold and continue with the next suffix. The last two suffixes S_{suf[10]} and S_{suf[11]} in suf do not have to be considered since their lengths are smaller than to m = 2 (not counting the sentinel character $) and therefore they cannot match M. We end up with matching positions 0, 6, and 8 of M in S with match score 6. To find these matches, we processed the enhanced suffix array suf top down and scored suffixes from left to right, facilitating the additional information given in tables lcp and skp to avoid rescoring of characters of common prefixes of suffixes and to skip suffixes that cannot match M for the given threshold.
Analysis
The C_{ i }arrays can be stored in a single $\mathcal{O}$(m) space array C as any step i only needs the C_{ i }specific to that step. C_{ i }solely depends on C_{i1}, and C_{ i }[0..d  1] = C_{i1}[0..d  1] holds for a certain d<m, i.e., the first d entries in C_{ i }are known from the previous step, and thus C can be organized as a stack. No other space (apart from the space for the enhanced suffix array) depending on input size is required for ESAsearch, leading to an $\mathcal{O}$(m) space complexity.
The worst case for ESAsearch occurs, if th ≤ sc_{min}(M) (M matches at each position in S), and no suffix of S shares a common prefix with any other suffix. In this case lookahead scoring does not give any speedup and every suffix must be read up to depth to m, leading to an $\mathcal{O}$(nm) worst case time complexity. This is not worse but also not better than the complexity for LAsearch. Next we show that, independent of the chosen threshold th, the overall worst case running time boundary for ESAsearch drops to $\mathcal{O}$(n + m) under the assumption that
n ≥ $\mathcal{A}$^{ m } + m  1 (1)
holds.
The shorter the common prefixes of the neighboring suffixes, the slower ESAsearch runs. Thus to analyze the worst case, we have to consider sequences containing as many different substrings of some length q as possible. Observe that a sequence can contain at most $\mathcal{A}$^{ q } different substrings of length q > 0, independent of its length. To analyze the behavior of ESAsearch on such a sequence, we introduce the concept of suffixintervals on enhanced suffix arrays, similar to lcpintervals as used in [18].
 1.
lcp[i] < ℓ
 2.
lcp[j + 1] < ℓ
 3.
lcp[k] ≥ ℓ for all k ∈ {x  i + 1 ≤ x ≤ j}
An lcpinterval, or ℓinterval, with lcpvalue ℓ ∈ {0,..., n} is a suffixinterval ℓ  [i, j] with i <j and lcp[k] = ℓ for at least one k ∈ {i + 1,..., j}.
Every lcpinterval ℓ  [i, j] of an enhanced suffix array for text S corresponds to an internal node v in a suffix tree for S, and the length of the string spelled out by the edge labels on the path from the root node to v is equal to ℓ. Leaves are represented as singleton intervals, ℓ  [i, j] with i = j. We say that suffixinterval ℓ  [i, j] embeds suffixinterval ℓ^{+}  [k, l], if and only if ℓ^{+} > ℓ, i ≤ k <l ≤ j, and if there is no suffixinterval ℓ'  [r, s] with ℓ < ℓ' < ℓ^{+} and i ≤ r ≤ k <l ≤ s ≤ j. As an example for ℓsuffixintervals, consider the enhanced suffix array given in Figure 2. [0, 5] is a 1suffixinterval, because lcp[0] = 0 < 1, lcp[5 + 1] = 0 < 1, and lcp[k] ≥ 1, for all k, 1 ≤ k ≤ 5. Suffixinterval 2[3,5] is embedded in 1[0,5], but 3[0,1] is not. Consider an enhanced suffix array of a sequence which contains all possible substrings of length q. There are $\mathcal{A}$ 1suffixintervals, $\mathcal{A}$^{2} 2suffixintervals, and so on. Consequently, up to depth q, there are a total of
${E}_{q}={\displaystyle \sum _{i=1}^{q}{\left\mathcal{A}\right}^{i}}=\frac{{\left\mathcal{A}\right}^{q+1}\left\mathcal{A}\right}{\left\mathcal{A}\right1}\left(2\right)$
ℓsuffixintervals (1 ≤ ℓ ≤ q). This corresponds to the number of internal nodes and leaves in a suffix tree, which is atomic up to at least depth q under our assumptions.
Since we are considering sequences that contain all possible substrings of length q, there are $\mathcal{A}$^{ d }dsuffixintervals at any depth d, 1 ≤ d ≤ q. Let d[i, j] be a dsuffixinterval. We know that pfxsc_{ d }(v_{ i }, M) is a partial sum of pfxsc_{ q }(v_{ i }, M), and because v_{ i }[0.. d  1] = v_{i + 1}[0..d  1] = ... = v_{ j }[0.. d  1], pfxsc_{ d }(v_{ i }, M) is also a partial sum of pfxsc_{ q }(v_{ k }, M) for i ≤ k ≤ j. That is, after ESAsearch has calculated pfxsc_{ d }(v_{ i }, M) at depth d, at any suffixinterval (d + 1)  [r, s] embedded in d[i, j] it suffices to only calculate the "rest" of pfxsc_{ q }(v_{ k }, M). At any depth d, the algorithm calculates pfxsc_{d+1}(v_{ r }, M) = pfxsc_{ d }(v_{ i }, M) + M(d, v_{ r }[d]), meaning that all prefix scores at depth d + 1 in a dsuffixinterval can be computed from the prefix scores at depth d by $\mathcal{A}$ matrix lookups and additions as there are $\mathcal{A}$ embedded (d + 1)suffixintervals. There are $\mathcal{A}$^{ d } dsuffixintervals at depth d. Hence, it takes ESAsearch a total of $\mathcal{A}$^{ d } ·$\mathcal{A}$ matrix lookups and additions to advance from depth d to d + 1, and thus we conclude that the algorithm requires a total of $\mathcal{O}$(E_{ q }) operations to compute all scores for all substrings of length q.
Suppose that ESAsearch has read suffix v_{ i }in some step up to depth q  1 such that character v_{ i }[q  1] is the last one read. If lcp[i + 1] ≥ q holds, then the algorithm has found a suffixinterval q[i, j] with a yet unknown right boundary j, otherwise j = i. ESAsearch reports all suf[k] with k ∈ [i, j] as matching positions by scanning over table lcp starting at position i until lcp[k] < lcp[i] (such that it finds j = k  1), and continues with suffix v_{ k }at depth lcp[k]. Hence processing such a suffixinterval requires one matrix lookup and addition to compute the score, and j  i + 1 steps to report all matches and find suffix v_{ k }. Since suffixintervals do not overlap, the total length of all suffixintervals at depth q can be at most n, so the total time spent on reporting matches is bounded by n.
 1.
If p = 0 (⇒ m = q), then the time required to calculate all match scores is in $\mathcal{O}$(E_{ q }) as discussed above.
 2.
If p < 0 (⇒ m<q), then none of the msuffixintervals are singletons since we assumed that the sequence under consideration contains all possible substrings of length q, i.e., there must be suffixes sharing a common prefix of length m, and the time required to calculate all match scores is in $\mathcal{O}$(E_{ m }).
 3.
If p > 0 (⇒ m > q), then every msuffixinterval can be a singleton, and all prefix scores for the PSSM prefix of length q are calculated in $\mathcal{O}$(E_{ q }) time. However, the remaining scores for the pending substrings of length p must be computed for every suffix longer than q, taking $\mathcal{O}$(np) additional time, and leading to a total $\mathcal{O}$(E_{ q }+ np) worst case time complexity for computing all match scores.
Note that a text containing $\mathcal{A}$^{ q } different substrings must have a certain length, which must be at least $\mathcal{A}$^{ q }. In fact, a minimum length text that contains all strings of length q has length n = $\mathcal{A}$^{ q } + q  1. It represents a de Bruijn sequence[28] without wraparound, i.e., a de Bruijn sequence with its first q  1 characters concatenated to its end. Since a de Bruijn sequence without wraparound represents the minimum length worst case, we infer from Equation (2) that E_{ q }∈ $\mathcal{O}$(n). Hence, if m = q, then it takes $\mathcal{O}$(n) time to calculate all match scores. If m <q, then E_{ m }<E_{ q }and thus it takes sublinear time. If m > q, it takes $\mathcal{O}$(n + np) time.
We summarize the worst case running time of ESAsearch for preprocessing a PSSM M of length m, searching with M, and reporting all matches with their match scores, as
Hence, the worst case running time is $\mathcal{O}$(n + m) for p ≤ 0, implying that this time complexity holds for any PSSM of length m and threshold on any text of length n ≥ $\mathcal{A}$^{ m } + m  1, as already stated in Inequality (1).
In practice, large numbers of suffixes can be skipped if the threshold is stringent enough, leading to a total running time sublinear in the size of the text, regardless of the relation between n and m. ESAsearch reads a suffix up to depth m unless an intermediate score falls short of an intermediate threshold, and skips intervals with the same or greater lcp if this happens. Right boundaries of skipped suffixintervals are found quickly by following the chain of skipvalues (see function skipchain in Figure 4). It are these jumps that make ESAsearch superior in terms of running time to LAsearch in practice. The best case is indeed $\mathcal{O}$($\mathcal{A}$) which occurs whenever there is no score in the first row of the PSSM that is greater than th_{0}.
Performance improvements via alphabet transformations
Inequality (1) provides the necessary condition for $\mathcal{O}$(n + m) worst case running time. We now assume that m in Inequality (1) identifies not the length of a PSSM, but the threshold dependent expected reading depth for some PSSM. We denote this expected depth by m*(th) ≤ m and continue denoting the PSSM's length by m. As seen before, for PSSMs with length m, such that p = m  m*(th), the worst case running time is $\mathcal{O}$(n + n·max {0, p} + m), but the expected running time is $\mathcal{O}$(n + m), as on average we expect p ≤ 0. Inequality (1) with m substituted by m*(th) implies ${\mathrm{log}}_{\left\mathcal{A}\right}$ (n) ≥ m*(th). That is, to achieve linear worst case running time for the amino acid alphabet, m*(th) needs to be very small. For instance, if n = 20^{7}, then the search time is guaranteed to be linear in n only for PSSMs with a maximum length of 7, and expected to be linear for PSSMs with expected reading depth of 7. Observe that for $\mathcal{A}$ = 4, m*(th) needs to be smaller or equal to 15 to achieve linear or sublinear running times. This provides the motivation to reduce the alphabet size by transforming $\mathcal{A}$ into a reduced size $\stackrel{\_}{\mathcal{A}}$ such that $\stackrel{\_}{\mathcal{A}}$ < $\mathcal{A}$.
We now describe how to take advantage of reduced alphabets as fast filters in the ESAsearch algorithm. Let $\mathcal{A}$ = {a_{0}, a_{1},..., a_{ k }} and $\stackrel{\_}{\mathcal{A}}$ = {b_{0}, b_{1},..., b_{ l }} be two alphabets, and Φ : $\mathcal{A}$ → $\stackrel{\_}{\mathcal{A}}$ a surjective function that maps a character a ∈ $\mathcal{A}$ to a character b ∈ $\stackrel{\_}{\mathcal{A}}$. We call Φ^{1}(b) the character class corresponding to b. For a sequence S = s_{1}s_{2} ... s_{ n }∈ $\mathcal{A}$^{ n } we denote the transformed sequence with $\stackrel{\_}{S}$ = Φ(s_{1}) Φ(s_{2}) ... Φ(s_{ n }) ∈ $\stackrel{\_}{\mathcal{A}}$^{ n }. Along with the transformation of the sequence, we transform a PSSM such that we have a one to one relationship between the columns in the PSSM and the characters in $\stackrel{\_}{\mathcal{A}}$. We define the transformed PSSM $\stackrel{\_}{M}$ of M as follows:
Definition 2 Let M be a PSSM of length m over alphabet $\mathcal{A}$, and Φ : $\mathcal{A}$ → $\stackrel{\_}{\mathcal{A}}$ a surjective function. The transformed PSSM $\stackrel{\_}{M}$ is defined as a function $\stackrel{\_}{M}$: [0, m  1] × $\stackrel{\_}{\mathcal{A}}$ → ℝ with
MS := {j ∈ [0, n  m]  sc (S[j..j + m  1], M) ≥ th}
Now observe that we can use matches of $\stackrel{\_}{M}$ on $\stackrel{\_}{S}$, for the computation of matches of M on S, since MS ⊆ $\stackrel{\_}{MS}$. We prove that MS ⊆ $\stackrel{\_}{MS}$ holds for all th ∈ [sc_{min} (M), sc_{max}(M)] by proving the more general statement given in the following Lemma.
Lemma 2 sc (w, M) ≤ sc ($\stackrel{\_}{w}$, $\stackrel{\_}{M}$) holds for all w ∈ $\mathcal{A}$^{ m }.
Proof:
$\begin{array}{l}sc\left(w,M\right)={\displaystyle \sum _{i=0}^{m1}M\left(i,w[i]\right)\le}{\displaystyle \sum _{i=0}^{m1}\mathrm{max}\left\{M\left(i,a\right)a\in {\Phi}^{1}\left(\Phi \left(w[i]\right)\right)\right\}}\\ ={\displaystyle \sum _{i=0}^{m1}\stackrel{\_}{M}}\left(i,\Phi \left(w[i]\right)\right)=sc\left(\stackrel{\_}{w},\stackrel{\_}{M}\right).\end{array}$
Thus the following implications follow directly

sc (w, M) ≥ th ⇒ sc ($\stackrel{\_}{w}$, $\stackrel{\_}{M}$) ≥ th

i ∈ MS ⇒ i ∈ $\stackrel{\_}{MS}$
and we conclude: MS ⊆ $\stackrel{\_}{MS}$ holds for th ∈ [sc_{min} (M), sc_{max} (M)].
 (1)
Transform S into $\stackrel{\_}{S}$ and build the enhanced suffix array for $\stackrel{\_}{S}$;
 (2)
Construct $\stackrel{\_}{M}$ from M;
 (3)
Compute $\stackrel{\_}{MS}$ by searching with $\stackrel{\_}{M}$ on the enhanced suffix array of $\stackrel{\_}{S}$ using algorithm ESAsearch;
 (4)
For each i ∈ $\stackrel{\_}{MS}$ rescore match with σ = sc (S[i..i + m  1], M), and report i and σ if and only if σ ≥ th.
As a further consequence of Definition 2 the maximum score values in each row of M and $\stackrel{\_}{M}$ and thus the intermediate thresholds remain unchanged in the transformation process. Unfortunately the necessary PSSM transformation accompanying alphabet size reduction affects the expected reading depth m*(th) in such a way that it increases with more degraded alphabets, and therefore reduces the expected performance improvement. Due to maximization according to Equation (3) the matrix values in $\stackrel{\_}{M}$ increase and we expect a decreased probability of falling short of an intermediate threshold early. Observe that there is a tradeoff between increased expected reading depth and increased lcpinterval sizes at low reading depths. Therefore it is desirable to minimize the effect of maximization by grouping PSSM columns with similar score values, i.e., highly correlated columns. Since PSSMs reflect the properties of the underlying multiple alignment, we expect correlations of PSSM columns according to biologically motivated symbol similarities. Hence character correlation is the motivation for our alphabet reduction strategy.
Reduced amino acid alphabets
${c}_{a,b}:=\frac{{\displaystyle {\sum}_{i=1}^{20}{Y}_{a,i}{Y}_{b,i}}}{\left({\displaystyle {\sum}_{i=1}^{20}{Y}_{a,i}^{2}}\right)\left({\displaystyle {\sum}_{i=1}^{20}{Y}_{b,i}^{2}}\right)}$
and amino acid pairs can be iteratively grouped together according to their correlations, starting with the most correlated pairs, until all the amino acids are divided into the desired number of groups.
Finding an appropriate threshold for PSSM searching: LazyDistrib
Probabilities and expectation values
The results of PSSM searches strongly depend on the choice of an appropriate threshold value th. A small threshold may produce a large number of false positive matches without any biological meaning, whereas meaningful matches may not be found if the threshold is too stringent. PSSMscores are not equally distributed and thus scores of two different PSSMs are not comparable. It is therefore desirable to let the user define a significance threshold instead. The expected number of matches in a given random sequence database (Evalue) is a widely accepted measure of the significance. We can compute the Evalue for a known background distribution and length of the database by exhaustive enumeration of all substrings. However, the time complexity of such a computation is $\mathcal{O}$($\mathcal{A}$^{ m }m) for a PSSM of length m. If the values in M are integers within a certain range [r_{ min }, r_{ max }] of size R = r_{ max } r_{ min }+ 1, then dynamic programming (DP) methods (cf. [12, 21, 22]) allow to compute the probability distribution (and hence the Evalue) in $\mathcal{O}$(m^{2}R$\mathcal{A}$) time.
In practice the probability distribution is often not exactly, or completely calculated due to concerns of speed. E.g., in the EMATRIX system [12] score thresholds are calculated and stored for probability values in the interval π = 10^{1}, 10^{2},..., 10^{40} only. Consequently, the user can only specify one of these pvalue cutoffs. For the calculation of the pvalue from a determined match score, EMATRIX uses loglinear interpolation on the stored thresholds. A different, commonly used strategy to derive a continuous distribution function uses the extreme value distribution as an approximation [31, 32, 33] of high scoring matches.
Even though it is widely accepted that highscoring local alignment score distributions of the popular position independent scoring systems PAM and BLOSUM can be well approximated by an extreme value distribution, this cannot be generalized for arbitrary PSSMs.
Calculation of exact PSSM score distributions
While recent publications focus on the computation of the complete probability distribution, what is required specifically for PSSM matching, is computing a partial cumulative distribution corresponding to an Evalue resp. pvalue specified by the user. Therefore, we have developed a new "lazy" method to efficiently compute only a small fraction of the complete distribution.
We formulate the problem we solve w.r.t. Evalues and pvalues: Given a user specified Evalue η, find the minimum threshold Tmin_{ E }(η, M), such that the expected number of matches of M in a random sequence of given length is at most η. Given a user specified pvalue π, find the minimum threshold $Tmi{n}_{\mathcal{P}}$(π, M), such that the probability that M matches a random string of length m is at most π. The threshold Tmin_{ E }(η, M) can be computed from $Tmi{n}_{\mathcal{P}}$(π, M) according to the equation Tmin_{ E }(π·(n  m + 1), M) = $Tmi{n}_{\mathcal{P}}$(π, M). Hence we restrict on computing $Tmi{n}_{\mathcal{P}}$(π, M).
Since all strings of length m have a score between sc_{min} (M) and sc_{max}(M), we conclude $Tmi{n}_{\mathcal{P}}$(1, M) = sc_{min} (M) and $Tmi{n}_{\mathcal{P}}$(0, M) > sc_{max}(M). To explain our lazy evaluation method, we first consider existing methods based on DP.
Evaluation with dynamic programming
We assume that at each position in sequence S, the symbols occur independently, with probability f (a) = (1/n)·{i ∈ [0, n  1] S[i] = a}. Thus a substring w of length m in S occurs with probability ${\prod}_{i=0}^{m1}f\left(w[i]\right)$ and the probability of observing the event sc (w, M) = t is $\mathbb{P}\left[sc(w,M)=t\right]={\displaystyle {\sum}_{w\in {\mathcal{A}}^{m}:sc(w,M)=t}{\displaystyle {\prod}_{i=0}^{m1}f(w[i])}}$. We obtain $Tmi{n}_{\mathcal{P}}$(π, M) by a lookup in the distribution:
If the values in the PSSM M are integers in a range of width R, dynamic programming allows to efficiently compute the probability distribution. The dynamic programming aspect becomes more obvious by introducing for each k ∈ [0, m  1] the prefix PSSM M_{ k }: [0, k] × $\mathcal{A}$ → ℕ defined by M_{ k }(j, a) = M (j, a) for j ∈ [0, k] and a ∈ $\mathcal{A}$.
Corresponding distributions Q_{ k }(t) for k ∈ [0, m  1] and t ∈ [sc_{min} (M_{ k }), sc_{max} (M_{ k })], and Q_{1}(t), are defined by
$\begin{array}{c}{Q}_{1}(t):=\{\begin{array}{ll}1\hfill & \text{if}t=0\hfill \\ 0\hfill & \text{otherwise}\hfill \end{array}\\ {Q}_{k}(t):={\displaystyle \sum _{a\in \mathcal{A}}{Q}_{k1}}(tM(k,a))f(a)\end{array}$
If we allow for floating point scores that are rounded to ε decimal places, the time and space requirement increases by a factor of 10^{ ε }. Conversely, if all integer scores share a greatest common divisor z, the matrix should be canceled down by z.
Restricted probability computation
In step d we compute Q (sc_{max} (M)  d) where all intermediate scores contributing to sc_{max} (M)  d have to be considered. In analogy to lookahead scoring, in each row j of M we avoid all intermediate scores below the intermediate threshold th_{ j }because they do not contribute to Q(sc_{max} (M)  d). The algorithm stops if the cumulated probability for threshold sc_{max} (M)  d exceeds the given pvalue π and we obtain $Tmi{n}_{\mathcal{P}}$(π, M) = sc_{max} (M)  d + 1.
Lazy evaluation of the permuted matrix
The restricted computation strategy performs best if there are only few iterations (i.e., $Tmi{n}_{\mathcal{P}}$(π, M) is close to sc_{max}(M)) and in each iteration step the computation of Q_{ k }(t) can be skipped in an early stage, i.e., for small values of k. The latter occurs to be more likely if the first rows of M contain strongly discriminative values leading to the exclusion of the small values by comparison with the intermediate thresholds. An example of this situation is given in Figure 1. Since Q_{ k }(t) is invariant to the permutation of the rows of M, we can sort the rows of M such that the most discriminative rows come first. We found that the difference between the largest two values of a row is a suitable measure for the level of discrimination since a larger difference increases the probability to remain below the intermediate threshold. Since the rows of M are scanned several times, we save time by initially sorting each row in order of descending score. We divide the computation steps where the step d computes Q(sc_{max}(M)  d): In step d = 0 only the maximal scores max_{ i }, i ∈ [0, m  1] in each row have to be evaluated.
$\begin{array}{ccc}{Q}_{k}(thd)& =& Pbu{f}_{k}(thd)+\\ {\displaystyle \sum _{a\mathcal{A}:M(k,a)\ge {\mathrm{max}}_{k}d}{Q}_{k1}(thdM(k,a))f(a)}\\ Pbu{f}_{k}(thd)& :=& {\displaystyle \sum _{a\in \mathcal{A}:M(k,a)<{\mathrm{max}}_{k}d}{Q}_{k1}(thdM(k,a))f(a)}\end{array}$
In the present implementation, the algorithm assumes independently distributed symbols. The algorithm can be extended to an order dMarkov model (w.r.t. the background alphabet distribution). This increases the computation time by a factor of $\mathcal{A}$^{ d }.
Implementation and computational results
We implemented LAsearch, ESAsearch, both capable to handle reduced alphabets, and LazyDistrib in C. The program was compiled with the GNU C compiler (version 3.1, optimization option 03). All measurements were performed on a 8 CPU Sun UltraSparc III computer running at 900 MHz, with 64 GB main memory (using only one CPU and a small fraction of the memory). Enhanced suffix arrays were constructed with the program mkvtree, see [34].
Performed experiments and experimental input.
Exp. 1  Exp. 2  Exp. 3  Exp. 4  Exp. 5  Exp. 6  

# searched sequences  59,021  30,964  19,111  1 (H.s. Chr. 6)  19,111  19,111 
total length  20.2 MB  37.2 MB  4.3 MB  162.9 MB  4.3 MB  4.3 MB 
sequence source  see [13]  DBTSS 5.1  RCSB PDB  Sanger V1. 4  RCSB PDB  RCSB PDB 
sequence type/PSSM type  protein  DNA  protein  DNA  protein  protein 
# PSSMs  4,034  220  11,411  576  28,337  10,931 
PSSM source  see [13]  MatInspector  PRINTS 38  TRANSFAC Prof. 6.2  BLOCKS 14.1  PRINTS 38 
avg. length of PSSMs  29.74  14.21  17.32  13.33  26.3  17.37 
index construction (sec)  41  146  10.2  586  10.2  10.2 
mdc (sec)  1960    1486    11871  1486 
MatInspector  x  
FingerPrintScan  x  
Blimps  x  
DN00  x  
LAsearch  x  x  x  x  x  
ESAsearch  x  x  x  x  x  x 
ESAsearch (reduced $\mathcal{A}$)  x 
Results of Experiments 1–5.
Experiment 1: 4,034 PSSMs in 20.2 MB protein sequences  

pvalue  DN00 (total time)  DN00 (search)  LAsearch  ESAsearch +41 sec. 
10^{10}  65,808  64,939  39,839  41,813 
10^{20}  38,773  37,706  23,786  24,378 
10^{30}  21,449  20,362  14,111  13,084 
10^{40}  9,606  8,533  8,067  5,374 
Experiment 2: 220 PSSMs in 37.2 MB DNA  
MSS  MatInspector  LAsearch  ESAsearch +32 sec.  
0.80  12,773  3,605  202  
0.85  12,567  3,189  108  
0.90  12,487  2,818  53  
0.95  12,445  2,356  12  
1.00  12,429  885  1  
Experiment 3: 11,411 PSSMs in 4.3 MB protein sequences  
Evalue  FingerPrintScan  LAsearch  ESAsearch +10.2 sec.  
10^{10}  4,733  3,423  1.244  
10^{20}  4,710  486  52  
10^{30}  4,706  27  10  
Experiment 4: 576 PSSMs in 162.9 MB DNA  
MSS  LAsearch  ESAsearch +586 sec.  
0.85  18,446  318  
0.90  16,376  150  
0.95  13,764  50  
1.00  5,294  1  
Experiment 5: 28,337 PSSMs in 4.3 MB protein sequences  
rawth  Blimps  LAsearch  ESAsearch +10.2 sec.  
945  271:30:16  16:03:12  11:35:58 
Running times of the LazyDistrib algorithm.
pvalue  simple DP  LazyDistrib  speedup factor 

10^{10}  1,486  485.8  3 
10^{20}  1,486  92.5  95 
10^{30}  1,486  8.9  166 
10^{40}  1,486  4.5  330 
PoSSuM software distribution
Our software tool PoSSuMsearch implements all algorithms and ideas presented in this work, namely Simplesearch, LAsearch, ESAsearch and LazyDistrib. A user can search for PSSMs in enhanced suffix arrays built by mkvtree from the Vmatch package, as well as on plain sequence data in FASTA, GENBANK, EMBL, or SWISSPROT format. The search algorithm can be chosen from the command line.
PSSMs are specified in a simple plain text format, where one file may contain multiple PSSMs. The alphabet a PSSM refers to, and alphabet character to PSSM column assignments can be specified on a perPSSM basis for most flexible alphabet support. All implemented algorithms support alphabet transformations. PSSMs can contain integer as well as floating point scores. To prevent rounding errors for integer based PSSMs, PoSSuMsearch uses integer arithmetics for these, resulting in an additional speedup on most CPU architectures. Searching on the reverse strand of nucleotide sequences is implemented by PSSM transformation according to WatsonCrick base pairing. Hence it is sufficient to build the enhanced suffix array for one strand only. This can then be used to search both strands.
The cutoff can be specified as pvalue, Evalue, MSS (matrix similarity score), or raw score threshold. If only the best matches with the highest scores need to be known, then PoSSuMsearch can be asked to report only the k highest scoring matches without even specifying an explicit cutoff. To do so, the search algorithms dynamically adapt the threshold during the search. When using p or Evalues, the score threshold is determined by either the lazy dynamic programming algorithm introduced in this contribution, or read from file that stores the complete precalculated probability distribution. Background distributions can be specified arbitrarily by the user, or determined from a given sequence database. We provide a tool, PoSSuMdist, to generate a compressed file containing the complete precalculated probability distribution for a set of PSSMs.
PSSM matches can be sorted by specifying a list of sort keys, like pvalue, match score, sequence number, and so on. The output formats of PoSSuMsearch print out all available information about a match, either in a human readable format, tab delimited, or in machine readable, XMLbased CisML [35]. PoSSuMsearch as well as PoSSuMdist support multithreading for a further reduction of running time on multi CPU machines.
The PoSSuM software distribution includes the searching tool PoSSuMsearch itself, and additional tools to determine character frequencies from sequence data, for probability distribution calculation, and PSSM format converters for TRANSFAC, BLOCKS, PRINTS, and EMATRIX style PSSMs.
Discussion and conclusion
We have presented a new nonheuristic algorithm for searching with PSSMs, achieving expected sublinear running time. Our analysis of ESAsearch shows that for sequences not shorter than $\mathcal{A}$^{ m } + m  1 a linear runtime in the worst case is achieved. It shows superior performance over the most widely used programs, especially for DNA sequences. The enhanced suffix array, on which the method is based, requires only 9n bytes. This is a space reduction of more than 45 percent compared to the 17n bytes implementation of [13]. Further on, we developed a systematic concept for alphabet reduction, especially useful on amino acid sequences and PSSMs for gaining additional speedup. Our third main contribution is a new algorithm for the efficient calculation of score thresholds from user defined Evalues and pvalues. The algorithm allows for accurate onthefly calculations of thresholds, and has the potential to replace formerly used approximation approaches. Beyond the algorithmic contributions, we provide a robust, well documented, and easy to use software package, implementing the ideas and algorithms presented in this manuscript.
Availability
The PoSSuM software distribution and its documentation is available precompiled for different operating systems and architectures on [24]. A version of mkvtree is included. A web based version of PoSSuMsearch is available under the same URL.
Notes
Acknowledgements
The authors thank Sven Rahmann and three anonymous reviewers for comments on the manuscript, Alexander Kel from Biobase GmbH Germany for providing the TRANSFAC PSSMs used in the benchmark experiments, and Jan Krüger for setting up the web interface and integrating PoSSuMsearch in our local webservice environment. M.B. and R.H. were supported by the International NRW Graduate School in Bioinformatics and Genome Research.
Supplementary material
References
 1.Gribskov M, McLachlan M, Eisenberg D: Profile Analysis: Detection of Distantly Related Proteins. Proc Nat Acad Sci USA 1987, 84: 4355–4358. 10.1073/pnas.84.13.4355PubMedCentralCrossRefPubMedGoogle Scholar
 2.Hulo N, Sigrist C, Le Saux V, LangendijkGenevaux PS, Bordoli L, Gattiker A, De Castro E, Bucher P, Bairoch A: Recent improvements to the PROSITE database. Nud Acids Res 2004, 32: 134–137. 10.1093/nar/gkh044CrossRefGoogle Scholar
 3.Attwood TK, Bradley P, Flower DR, Gaulton A, Maudling N, Mitchell AL, Moulton G, Nordle A, Paine K, Taylor P, Uddin A, Zygouri C: PRINTS and its automatic supplement, prePRINTS. Nucl Acids Res 2003, 31: 400–402. 10.1093/nar/gkg030PubMedCentralCrossRefPubMedGoogle Scholar
 4.Henikoff J, Greene E, Pietrokovski S, Henikoff S: Increased Coverage of Protein Families with the Blocks Database Servers. Nucl Acids Res 2000, 28: 228–230. 10.1093/nar/28.1.228PubMedCentralCrossRefPubMedGoogle Scholar
 5.Wu T, NevillManning C, Brutlag D: Minimalrisk scoring matrices for sequence analysis. J Comp Biol 1999, 6(2):219–235.CrossRefGoogle Scholar
 6.Sandelin A, Alkema W, Engstrom P, Wasserman W, Lenhard B: JASPAR: an openaccess database for eukaryotic transcription factor binding profiles. Nucl Acids Res 2004, 32: D91D94. 10.1093/nar/gkh012PubMedCentralCrossRefPubMedGoogle Scholar
 7.Matys V, Fricke E, Geffers R, Gößiling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, KelMargoulis OV, Kloos DU, Land S, LewickiPotapov B, Michael H, Munch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E: TRANSFAC(R): transcriptional regulation, from patterns to profiles. Nucl Acids Res 2003, 31: 374–378. 10.1093/nar/gkg108PubMedCentralCrossRefPubMedGoogle Scholar
 8.Scordis P, Flower D, Attwood T: FingerPRINTScan: intelligent searching of the PRINTS motif database. Bioinformatics 1999, 15(10):799–806. 10.1093/bioinformatics/15.10.799CrossRefPubMedGoogle Scholar
 9.Quandt K, Frech K, Wingender E, Werner T: MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide data. Nucl Acids Res 1995, 23: 4878–4884.PubMedCentralCrossRefPubMedGoogle Scholar
 10.Rajasekaran S, Jin X, Spouge J: The Efficient computation of Position Specific Match Scores with the Fast Fourier Transformation. J Comp Biol 2002, 9: 23–33. 10.1089/10665270252833172CrossRefGoogle Scholar
 11.Freschi V, Bogliolo A: Using sequence compression to speedup probabilistic profile matching. Bioinformatics 2005, 21(10):2225–2229. 10.1093/bioinformatics/bti323CrossRefPubMedGoogle Scholar
 12.Wu T, NevillManning C, Brutlag D: Fast Probabilistic Analysis of Sequence Function using Scoring Matrices. Bioinformatics 2000, 16(3):233–244. 10.1093/bioinformatics/16.3.233CrossRefPubMedGoogle Scholar
 13.Dorohonceanu B, NevillManning C: Accelerating Protein Classification Using Suffix Trees. In Proc. of the International Conference on Intelligent Systems for Molecular Biology. Menlo Park, CA: AAAI Press; 2000:128–133.Google Scholar
 14.Gonnet H: Some string matching problems from Bioinformatics which still need better solutions. Journal of Discrete Algorithms 2004, 2(1):3–15. 10.1016/S15708667(03)000625CrossRefGoogle Scholar
 15.Tatusov R, Altschul S, Koonin E: Detection of conserved segments in proteins: Iterative scanning of sequence databases with alignment blocks. Proc Nat Acad Sci USA 1994, 91(25):12091–12095. 10.1073/pnas.91.25.12091PubMedCentralCrossRefPubMedGoogle Scholar
 16.Henikoff J, Henikoff S: Using substitution probabilities to improve positionspecific scoring matrices. Bioinformatics 1996, 12(2):135–143.CrossRefGoogle Scholar
 17.Kel A, Gößling E, Reuter I, Cheremushkin E, KelMargoulis O, Wingender E: MATCH: a tool for searching transcription factor binding sites in DNA sequences. Nucl Acids Res 2003, 31(13):3576–3579. 10.1093/nar/gkg585PubMedCentralCrossRefPubMedGoogle Scholar
 18.Abouelhoda M, Kurtz S, Ohlebusch E: Replacing Suffix Trees with Enhanced Suffix Arrays. Journal of Discrete Algorithms 2004, 2: 53–86. 10.1016/S15708667(03)000650CrossRefGoogle Scholar
 19.Kurtz S: Reducing the Space Requirement of Suffix Trees. Software – Practice and Experience 1999, 29(13):1149–1171. Publisher Full Text 10.1002/(SICI)1097024X(199911)29:13<1149::AIDSPE274>3.0.CO;2OCrossRefGoogle Scholar
 20.Giegerich R, Kurtz S: A Comparison of Imperative and Purely Functional Suffix Tree Constructions. Science of Computer Programming 1995, 25(2–3):187–218. 10.1016/01676423(95)000038CrossRefGoogle Scholar
 21.Staden R: Methods for calculating the probabilities for finding patterns in sequences. Comp Appl Biosci 1989, 5: 89–96.PubMedGoogle Scholar
 22.Rahmann S: Dynamic programming algorithms for two statistical problems in computational biology. In Proc. of the 3rd Workshop of Algorithms in Bioinformatics (WABI). LNCS 2812, Springer Verlag; 2003:151–164.CrossRefGoogle Scholar
 23.Rahmann S, Müller T, Vingron M: On the Power of Profiles for Transcription Factor Binding Site Detection. Statistical Applications in Genetics and Molecular Biology 2003., 2(1):Google Scholar
 24.Beckstette M, Homann R, Giegerich R, Kurtz S: PoSSuM software distribution.2006. [http://bibiserv.techfak.unibielefeld.de/possumsearch/]Google Scholar
 25.Beckstette M, Strothmann D, Homann R, Giegerich R, Kurtz S: PoSSuMsearch: Fast and Sensitive Matching of Position Specific Scoring Matrices using Enhanced Suffix Arrays. In Proc. of the German Conference on Bioinformatics. Volume P53. GI Lecture Notes in Informatics; 2004:53–64.Google Scholar
 26.Kärkkäinen J, Sanders P: Simple Linear Work Suffix Array Construction. In Proceedings of the 13th International Conference on Automata, Languages and Programming. Springer; 2003.Google Scholar
 27.Kasai T, Lee G, Arimura H, Arikawa S, Park K: Lineartime LongestCommonPrefix Computation in Suffix Arrays and its Applications. In 12th Annual Symposium on Combinatorial Pattern Matching (CPM2001). Volume 2089. SpringerVerlag, New York: Lecture Notes in Computer Science; 2001:181–192.CrossRefGoogle Scholar
 28.de Bruijn N: A Combinatorial Problem. Koninklijke Nederlands Akademie van Wetenschappen Proceedings 1946, 49: 758–764.Google Scholar
 29.Li T, Fan K, Wang J, Wang W: Reduction of protein sequence complexity by residue grouping. Protein Engineering 2003, 16(5):323–330. 10.1093/protein/gzg044CrossRefPubMedGoogle Scholar
 30.Murphy LR, Wallqvist A, Levy R: Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Engineering 2000, 13(3):149–152. 10.1093/protein/13.3.149CrossRefPubMedGoogle Scholar
 31.Castillo G: Extreme Value Theory in Engineering. Academic Press; 1988.Google Scholar
 32.Embrechts P, Klüppelberg C, Mikosch T: Modelling Extremal Events. Springer; 1997.CrossRefGoogle Scholar
 33.Goldstein L, Waterman M: Approximations to profile score distributions. J Comp Biol 1994, 1: 93–104.CrossRefGoogle Scholar
 34.Kurtz S: The Vmatch large scale sequence analysis software.2005. [http://www.vmatch.de/]Google Scholar
 35.Haverty P, Weng Z: CisML: an XMLbased format for sequence motif detection software. Bioinformatics 2004, 20(11):1815–1817. 10.1093/bioinformatics/bth162CrossRefPubMedGoogle Scholar
 36.Weeks D, Eskandari S, Scott D, Sachs G: A H+gated urea channel: the link between Helicobacter pylori urease and gastric colonization. Science 2000, 287: 482–485. 10.1126/science.287.5452.482CrossRefPubMedGoogle Scholar
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.