Pattern Matching and Consensus Problems on Weighted Sequences and Profiles
Abstract
We study pattern matching problems on two major representations of uncertain sequences used in molecular biology: weighted sequences (also known as position weight matrices, PWM) and profiles (scoring matrices). In the simple version, in which only the pattern or only the text is uncertain, we obtain efficient algorithms with theoretically provable running times using a variation of the lookahead scoring technique. We also consider a general variant of the pattern matching problems in which both the pattern and the text are uncertain. Central to our solution is a special case where the sequences have equal length, called the consensus problem. We propose algorithms for the consensus problem parameterised by the number of strings that match one of the sequences. As our basic approach, a careful adaptation of the classic meet-in-the-middle algorithm for the knapsack problem is used. On the lower bound side, we prove that our dependence on the parameter is optimal up to lower-order terms conditioned on the optimality of the original algorithm for the knapsack problem. Therefore, we make an effort to keep the lower-order terms of the complexities of our algorithms as small as possible.
Keywords
Weighted sequence · Position weight matrix · Profile matching · Multichoice Knapsack
1 Introduction
A weighted sequence X of length 4 over the alphabet Σ = {a, b}
X[1]  X[2]  X[3]  X[4]
\(\pi _{1}^{(X)}(\mathtt {a})= 1/2\)  \(\pi _{2}^{(X)}(\mathtt {a})= 1\)  \(\pi _{3}^{(X)}(\mathtt {a})= 3/4\)  \(\pi _{4}^{(X)}(\mathtt {a})= 0\)
\(\pi _{1}^{(X)}(\mathtt {b})= 1/2\)  \(\pi _{2}^{(X)}(\mathtt {b})= 0\)  \(\pi _{3}^{(X)}(\mathtt {b})= 1/4\)  \(\pi _{4}^{(X)}(\mathtt {b})= 1\)
1.1 Weighted Pattern Matching and Profile Matching
First of all, we study the standard variants of pattern matching problems on weighted sequences and profiles, in which only the pattern or the text is an uncertain sequence. In the most popular formulation of the Weighted Pattern Matching problem, we are given a weighted sequence of length n, called a text, a solid (standard) string of length m, called a pattern, both over an alphabet of size σ, and a threshold probability \(\frac {1}{z}\). We are asked to find all positions in the text where the fragment of length m represents the pattern with probability at least \(\frac {1}{z}\). Each such position is called an occurrence of the pattern in the text; we also say that the fragment of the text and the pattern match. The Weighted Pattern Matching problem can be solved in \(O(\sigma n \log m)\) time via the Fast Fourier Transform [7]. The average-case complexity of the WPM problem has also been studied and a number of fast algorithms have been presented for certain values of weight ratio \(\frac {z}{m}\) [4, 5]. An indexing variant of the problem has also been considered [1, 2, 13, 14, 16, 19]; here, one is to preprocess a weighted text to efficiently answer pattern matching queries. The most efficient index [2] for a constant-sized alphabet uses O(nz) space, takes O(nz) time to construct and answers queries in optimal O(m + occ) time, where occ is the number of occurrences reported. A more general indexing data structure, which assumes z = O(1), was presented in [6]. A streaming variant of the Weighted Pattern Matching problem was considered very recently in [23].
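To make the problem statement concrete, the following is a minimal brute-force sketch of Weighted Pattern Matching, not one of the algorithms discussed in this paper. It assumes the usual convention that the match probability of a fragment is the product of the per-position letter probabilities; the names `match_probability` and `naive_wpm` and the dictionary-based representation are illustrative choices. The data is the weighted sequence X from the introductory example.

```python
from fractions import Fraction

def match_probability(window, pattern):
    """Product of the probabilities of the pattern's letters, position by position."""
    prob = Fraction(1)
    for position, letter in zip(window, pattern):
        prob *= position.get(letter, Fraction(0))  # letters not stored have probability 0
    return prob

def naive_wpm(text, pattern, z):
    """Return all 1-based positions where the pattern occurs with probability >= 1/z."""
    m = len(pattern)
    threshold = Fraction(1, z)
    return [i + 1 for i in range(len(text) - m + 1)
            if match_probability(text[i:i + m], pattern) >= threshold]

# The weighted sequence X of length 4 over {a, b} from the introductory example.
X = [
    {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    {"a": Fraction(1)},
    {"a": Fraction(3, 4), "b": Fraction(1, 4)},
    {"b": Fraction(1)},
]

print(naive_wpm(X, "ab", 4))
```

This sketch takes time proportional to nm; exact rational arithmetic is used only to avoid rounding at the threshold.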
In the classic Profile Matching problem, the pattern is an m × σ profile, the text is a solid string of length n, and our task is to find all positions in the text where the fragment of length m has score at least Z. A naïve approach to the Profile Matching problem works in O(nm + mσ) time. A broad spectrum of heuristics improving this algorithm in practice is known; for a survey, see [22]. However, all these algorithms have the same worst-case running time. One of the principal heuristic techniques, coming in different flavours, is lookahead scoring that consists in checking if a partial match could possibly be completed by the highest scoring letters in the remaining positions of the scoring matrix and, if not, pruning the naïve search. The Profile Matching problem can also be solved in \(O(\sigma n \log m)\) time via the Fast Fourier Transform [24].
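The lookahead scoring heuristic described above can be sketched as follows. This is one simple flavour of the idea, written under the assumption that the score of a fragment is the sum of the matrix entries selected by its letters; the function name and the list-of-dictionaries representation of the profile are hypothetical.

```python
def profile_matching_lookahead(profile, text, Z):
    """Naive profile matching with lookahead pruning; returns 1-based positions."""
    m = len(profile)
    # best_rest[j] = maximum score obtainable from rows j..m-1 of the profile
    best_rest = [0] * (m + 1)
    for j in range(m - 1, -1, -1):
        best_rest[j] = best_rest[j + 1] + max(profile[j].values())
    occurrences = []
    for p in range(len(text) - m + 1):
        score = 0
        for j in range(m):
            if score + best_rest[j] < Z:
                break  # even the heaviest completion cannot reach Z: prune
            score += profile[j][text[p + j]]
        else:
            # all m rows were scored without pruning
            if score >= Z:
                occurrences.append(p + 1)
    return occurrences

# A toy 2 x 2 profile over {a, b} with integer scores (illustrative data only).
P = [{"a": 3, "b": 1}, {"a": 0, "b": 2}]
print(profile_matching_lookahead(P, "abba", 3))
```

The pruning does not change the worst-case running time, which remains proportional to nm.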
Our results
As the first result, we show how the lookahead scoring technique combined with a data structure for answering longest common extension (LCE) queries in a string can be applied to obtain simple and efficient algorithms for the standard pattern matching problems on uncertain sequences. For a weighted sequence, by R we denote the size of its list representation. In the case that σ = O(1), which often occurs in molecular biology applications, we have R = O(n). In the Profile Matching problem, we set M as the number of strings that match the scoring matrix with score above Z. In general M ≤ σ^{m}; however, we may assume that for practical data this number is actually much smaller. We obtain the following results:
Theorem 1.1
Profile Matching can be solved in \(O(m\sigma + n \log M)\) time.
Theorem 1.2
Weighted Pattern Matching can be solved in \(O(R+n \log z)\) time.
1.2 Profile Consensus and Multichoice Knapsack
Along the way to our most involved contribution, we study Profile Consensus, a consensus problem on uncertain sequences. Specifically, we are to check for the existence of a string that matches two scoring matrices, each above threshold Z. The Profile Consensus problem is essentially equivalent to the well-known Multichoice Knapsack problem (also known as the Multiple Choice Knapsack problem). In this problem, we are given n classes C_{1},…,C_{n} of at most λ items each (N items in total), each item c characterised by a value v(c) and a weight w(c). The goal is to select one item from each class so that the sums of values and of weights of the items are below two specified thresholds, V and W. (In the more intuitive formulation of the problem, we require the sum of values to be above a specified threshold, but here we consider an equivalent variant in which both parameters are symmetric.) This problem generalises the (binary) Knapsack problem, in which we have λ = 2. The Multichoice Knapsack problem is widely used in practice, but most research concerns approximation or heuristic solutions; see [17] and references therein. As far as exact solutions are concerned, the classic meet-in-the-middle approach by Horowitz and Sahni [12], originally designed for the (binary) Knapsack problem, immediately generalises to an \(O^{*}(\lambda ^{\lceil \frac {n}{2}\rceil })\)-time^{1} solution for Multichoice Knapsack.
Several important problems can be expressed as special cases of the Multichoice Knapsack problem using folklore reductions (see [17]). This includes the Subset Sum problem, which, for a set of n integers, asks whether there is a subset summing up to a given integer Q, and the k-Sum problem which, for k classes of λ integers, asks to choose one element from each class so that the selected integers sum up to zero. These reductions give immediate hardness results for the Multichoice Knapsack problem and thus yield the same consequences for Profile Consensus. For the Subset Sum problem, as shown in [9, 11], the existence of an O^{∗}(2^{εn})-time solution for every ε > 0 would violate the Exponential Time Hypothesis (ETH) [15, 20]. Moreover, the O^{∗}(2^{n/2}) running time, achieved in [12], has not been improved yet despite much effort. The 3-Sum conjecture [10] and the more general k-Sum conjecture state that the 3-Sum and k-Sum problems cannot be solved in O(λ^{2−ε}) time and \(O(\lambda ^{\lceil \frac {k}{2} \rceil (1-\varepsilon )})\) time, respectively, for any ε > 0.
Our results
Theorem 1.3
Multichoice Knapsack can be solved in \(O(N+\sqrt {a\lambda }\log A)\) time.
Note that a ≤ A ≤ λ^{n} and thus the running time of our algorithm for Multichoice Knapsack is bounded by \(O(N+n\lambda ^{(n + 1)/2}\log \lambda )\). Up to lower order terms (i.e., the factor \(n\log \lambda =(\lambda ^{(n + 1)/2})^{o(1)}\)), this matches the time complexities of the fastest known solutions for both Subset Sum (also binary Knapsack) and 3Sum. Our parameters identify a new measure of difficulty for the Multichoice Knapsack problem. The main novel part of our algorithm for Multichoice Knapsack is an appropriate (yet intuitive) notion of ranks of partial solutions.
1.3 Weighted Consensus and General Weighted Pattern Matching
Analogously to the Profile Consensus problem, we define the Weighted Consensus problem. In the Weighted Consensus problem, given two weighted sequences of the same length, we are to check if there is a string that matches each of them with probability at least \(\frac {1}{z}\). A routine to compare user-entered weighted sequences with existing weighted sequences in the database is used, e.g., in JASPAR,^{2} a well-known database of PWMs. Finally, we study a general variant of pattern matching on weighted sequences. In the General Weighted Pattern Matching (GWPM) problem, both the pattern and the text are weighted. In the most common definition of the problem (see [3, 13]), we are to find all fragments of the text that give a positive answer to the Weighted Consensus problem with the pattern. The authors of [3] proposed an algorithm for the GWPM problem based on the weighted prefix table that works in \(O(n z^{2} \log z + n\sigma )\) time. Solutions to these problems can be applied in transcriptional regulation: motif and regulatory module finding; and annotation of regulatory genomic regions.
Our results
For a weighted sequence, by λ let us denote the maximal number of letters with score at least \(\frac {1}{z}\) at a single position (thus \(\lambda \le \min (\sigma ,z)\)). Our algorithm for the Multichoice Knapsack problem (covered in Section 1.2) yields time complexities \(O(R+\sqrt {z\lambda }\log z)\) and \(O(n\sqrt {z\lambda }\log z)\) for Weighted Consensus and GWPM, respectively. Using a tailor-made solution based on the same scheme, we obtain faster procedures as specified below.
Theorem 1.4
The General Weighted Pattern Matching problem can be solved in \(O(n\sqrt {z \lambda } (\log \log z+\log \lambda ))\) time, and the Weighted Consensus problem can be solved in \(O(R +\sqrt {z \lambda } (\log \log z+\log \lambda ))\) time.
In particular, we obtain the following result for the practical case of σ = O(1).
Corollary 1.5
General Weighted Pattern Matching over a constant-sized alphabet can be solved in \(O(n \sqrt {z} \log \log z)\) time.
We also provide a simple reduction from Multichoice Knapsack to Weighted Consensus, which lets us transfer the negative results to the GWPM problem.
Theorem 1.6
 1. O^{∗}(z^{ε}) time for every ε > 0, unless the exponential time hypothesis (ETH) fails;
 2. O^{∗}(z^{0.5−ε}) time for some ε > 0, unless there is an O^{∗}(2^{(0.5−ε)n})-time algorithm for the Subset Sum problem;
 3. \(\tilde {O}(R+z^{0.5}\lambda ^{0.5-\varepsilon })\) time for some ε > 0 and for n = O(1), unless the 3-Sum conjecture fails.
For the higher-order terms, our complexities match the conditional lower bounds; therefore, in the proofs of Theorems 1.3 and 1.4 we put significant effort to keep the lower-order terms of the complexities as small as possible.
Finally, we analyse the complexity of the Multichoice Knapsack and General Weighted Pattern Matching problems in case of a large λ. This is a theoretical study that shows a possibility of improvement of the complexity for instances that do not originate from the Subset Sum and k-Sum problems.
Theorem 1.7
For every positive integer k = O(1), the Multichoice Knapsack problem can be solved in \(O(N+ (a^{\frac {k + 1}{2k + 1}}+\lambda ^{k})\log A (\frac {\log A}{\log \lambda })^{k})\) time.
Theorem 1.8
If λ^{2k−1} ≤ z ≤ λ^{2k+1} for some positive integer k = O(1), then the Weighted Consensus problem can be solved in \(O(R+(z^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log \lambda )\) time, and the GWPM problem can be solved in \(O(n(z^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log \lambda )\) time.
A preliminary version of this research appeared as [18].
1.4 Structure of the Paper
We start with Preliminaries, where we recall basic notions on classic strings and formalise the model of computation. The following four sections describe our algorithms: in Section 3 for Profile Matching; in Section 4 for Weighted Pattern Matching; in Section 5 for Profile Consensus; and in Section 6 for Weighted Consensus and GWPM. In Section 7 we present conditional lower bounds for the GWPM problem based on the special cases of Multichoice Knapsack. Finally, in Section 8 we perform a multivariate analysis of Profile Consensus and GWPM and present improved solutions in the case that \(\frac {\log a}{\log \lambda }\) is a constant other than an odd integer.
2 Preliminaries
Let Σ = {1,…,σ} be an alphabet. A string S over Σ is a finite sequence of letters from Σ. By Σ^{m} we denote the set of strings of length m over Σ. We denote the length of S by |S| and, for 1 ≤ i ≤ |S|, the i-th letter of S by S[i]. By S[i..j] we denote the string S[i]⋯S[j] called a factor of S (if i > j, then the factor is an empty string). A factor is called a prefix if i = 1 and a suffix if j = |S|. For two strings S and T, we denote their concatenation by S ⋅ T (ST in short).
For a string S of length n, by LCE(i,j) = lcp(S[i..n],S[j..n]) we denote the length of the longest common prefix of suffixes S[i..n] and S[j..n]. This value lets us easily determine the longest common prefix \(\mathit {lcp}(S[i\mathinner {..} i^{\prime }],S[j\mathinner {..} j^{\prime }])\) of any two factors starting at positions i and j, respectively. The following fact specifies a well-known efficient data structure answering LCE queries; see [8] for details.
Fact 2.1
Let S be a string of length n over an integer alphabet of size σ = n^{O(1)}. After O(n)-time preprocessing, in O(1) time one can compute LCE(i,j) for any indices i,j.
The Hamming distance between two strings X and Y of the same length, denoted by d_{H}(X,Y ), is the number of positions where the strings differ.
2.1 Model of Computations
For problems on weighted sequences, we assume the word-RAM model with word size \(w = {\Omega }(\log n + \log z)\) and integer alphabet of size σ = n^{O(1)}. We consider the log-probability model of representations of weighted sequences, that is, we assume that probabilities in the weighted sequences and the threshold probability \(\frac {1}{z}\) are all of the form \(c^{\frac {p}{2^{dw}}}\), where c and d are constants and p is an integer that fits in a constant number of machine words. Additionally, the probability 0 has a special representation. The only operations on probabilities in our algorithms are multiplications and divisions, which can be performed exactly in O(1) time in this model. Our solutions to the Multichoice Knapsack problem only assume the word-RAM model with word size \(w={\Omega }(\log S+\log a)\), where S is the sum of integers in the input instance; this does not affect the O^{∗} running time.
3 Profile Matching
3.1 Solution to Profile Matching
For a scoring matrix P, the heavy string of P, denoted H(P), is constructed by choosing at each position the heaviest letter, that is, the letter with the maximum score (breaking ties arbitrarily). In other words, H(P) is a string that matches P with the maximum score.
Observation 3.1
If Score(S,P) ≥ Z for a string S of length m and an m × σ scoring matrix P, then \(d_H(\textbf {H}(P),S) \le \left \lfloor {\log \textbf {M}_{Z}(P)} \right \rfloor \).
Proof
Let d = d_{H}(H(P),S). We can construct 2^{d} strings of length |S| that match P with a score of at least Z by taking either of the letters S[j] or H(P)[j] at each position j such that S[j] ≠ H(P)[j]; replacing S[j] by the heaviest letter H(P)[j] never decreases the score. Hence, \(2^{d} \le \textbf {M}_{Z}(P)\), which concludes the proof. □
We obtain the following result.
Theorem 1.1
Profile Matching can be solved in \(O(m\sigma + n \log M)\) time.
Proof
Let us bound the time complexity of the presented algorithm. The heavy string \(P^{\prime }\) can be computed in O(mσ) time. The data structure for lcp-queries in \(P^{\prime }T\) can be constructed in O(n + m) time by Fact 2.1. Finally, for each position p in the text T we will consider at most \(\left \lfloor {\log M} \right \rfloor + 1\) mismatches between \(P^{\prime }\) and T, as afterwards the score \(s^{\prime }\) drops below Z due to Observation 3.1. □
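The proof above suggests the following shape for the algorithm: start from the score of the heavy string, jump from mismatch to mismatch, and stop as soon as the score drops below Z. Below is a hedged sketch of this idea, not the authors' exact pseudocode. In particular, `lce_naive` is a character-by-character stand-in for the constant-time LCE structure of Fact 2.1, so this sketch does not achieve the stated running time; all function names are illustrative.

```python
def heavy_string(profile):
    """Pick a maximum-score letter at every position (ties broken arbitrarily)."""
    return [max(row, key=row.get) for row in profile]

def lce_naive(a, i, b, j):
    """Length of the longest common prefix of a[i:] and b[j:] (naive stand-in for O(1) LCE)."""
    k = 0
    while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
        k += 1
    return k

def profile_matching(profile, text, Z):
    """Return 1-based positions where the text fragment has score >= Z."""
    m = len(profile)
    heavy = heavy_string(profile)
    top = sum(row[c] for row, c in zip(profile, heavy))  # score of the heavy string
    occurrences = []
    for p in range(len(text) - m + 1):
        score, j = top, 0
        while score >= Z:
            j += lce_naive(heavy, j, text, p + j)  # skip positions that agree with the heavy string
            if j >= m:
                occurrences.append(p + 1)
                break
            # mismatch at offset j: swap the heavy letter's score for the text letter's score
            score += profile[j][text[p + j]] - profile[j][heavy[j]]
            j += 1
    return occurrences

P = [{"a": 3, "b": 1}, {"a": 0, "b": 2}]  # toy profile, illustrative data only
print(profile_matching(P, "abba", 3))
```

By Observation 3.1, the inner loop processes only logarithmically many mismatches (in M) per position before the score falls below Z.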
4 Weighted Pattern Matching
A weighted sequence X = X[1]⋯X[n] of length |X| = n over alphabet Σ is a sequence of sets of pairs of the form \(X[i] = \{(j,\ \pi ^{(X)}_{i}(j))\ :\ j \in {\Sigma }\}\). Here, \(\pi _{i}^{(X)}(j)\) is the occurrence probability of the letter j at the position i ∈ {1,…,n}. These values are non-negative and sum up to 1 for a given i. For all our algorithms, it is sufficient that the probabilities sum up to at most 1 for each position. Also, the algorithms sometimes produce auxiliary weighted sequences with sum of probabilities being smaller than 1 on some positions.
We denote the maximum number of letters occurring at a single position of the weighted sequence (with nonzero probability) by λ and the total size of the representation of a weighted sequence by R. The standard representation consists of n lists with up to λ elements each, so R = O(nλ). However, the lists can be shorter in general. Also, if the threshold probability \(\frac {1}{z}\) is specified, at each position of a weighted sequence it suffices to store letters with probability at least \(\frac {1}{z}\), and clearly there are at most z such letters for each position. This reduction can be performed in linear time, so we shall always assume that λ ≤ z. Moreover, the assumption that Σ is an integer alphabet of size σ = n^{O(1)} lets us assume without loss of generality that the entries \((j,\pi ^{(X)}_{i}(j))\) in the lists representing X[i] are ordered by increasing j: if this is not the case, we can simultaneously sort these lists in linear time.
4.1 Weighted Sequences versus Profiles
As shown below, profiles and weighted sequences are essentially equivalent objects.
Fact 4.1
 1. Given a weighted sequence X of length n over an alphabet of size σ and a probability \(\frac {1}{z}\), one can construct in O(nσ) time an n × σ profile P and a threshold Z such that M_{Z}(P) = M_{z}(X).
 2. Given an m × σ profile P and a threshold Z, one can construct in O(mσ) time a weighted sequence X and a probability \(\frac {1}{z}\) such that M_{z}(X) = M_{Z}(P).
Proof
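Given a weighted sequence X, one can construct an equivalent profile P setting \(P[i,s]=\log \pi ^{(X)}_{i}(s)\) for each position i and character s. If \(\pi ^{(X)}_{i}(s)= 0\), we set \(P[i,s]=-\infty \) (which can be replaced by a sufficiently small, that is, very negative, finite value after we fix the threshold Z). The profile P satisfies M_{Z}(P) = M_{z}(X) for \(Z = -\log z\), since a string matches X with probability at least \(\frac {1}{z}\) exactly when the sum of the logarithms of its letter probabilities is at least \(-\log z\).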
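The construction in the first part of the proof can be illustrated with a small sketch. It assumes base-2 logarithms and floating-point arithmetic, whereas the paper works in the exact log-probability model, so comparisons at the threshold may be affected by rounding for general inputs. The function names are hypothetical, and the brute-force counters serve only to compare the numbers of matching strings on a toy input.

```python
import math
from itertools import product

def weighted_to_profile(X, alphabet, z):
    """Build a profile P with P[i][s] = log2(pi_i(s)) and the threshold Z = -log2(z)."""
    profile = [
        {s: (math.log2(pos[s]) if pos.get(s, 0) > 0 else -math.inf) for s in alphabet}
        for pos in X
    ]
    return profile, -math.log2(z)

def count_matches_profile(profile, alphabet, Z):
    """Brute-force count of strings with score >= Z."""
    return sum(
        1 for S in product(alphabet, repeat=len(profile))
        if sum(row[c] for row, c in zip(profile, S)) >= Z
    )

def count_matches_weighted(X, alphabet, z):
    """Brute-force count of strings with match probability >= 1/z."""
    return sum(
        1 for S in product(alphabet, repeat=len(X))
        if math.prod(pos.get(c, 0) for pos, c in zip(X, S)) >= 1 / z
    )

# The weighted sequence from the introductory example; zero-probability letters are omitted.
X = [{"a": 0.5, "b": 0.5}, {"a": 1.0}, {"a": 0.75, "b": 0.25}, {"b": 1.0}]
P, Z = weighted_to_profile(X, "ab", 4)
print(count_matches_profile(P, "ab", Z), count_matches_weighted(X, "ab", 4))
```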
In the light of Fact 4.1, it may seem that the results for profiles and weighted sequences should coincide. However, we use different parameters to study the complexity of the algorithmic problems in these models: for profiles this is the number M_{Z}(P) of matching strings, while for weighted sequences this is the inverse z of the threshold probability \(\frac {1}{z}\). These parameters are related by the following observation:
Observation 4.2
A weighted sequence X satisfies M_{z}(X) ≤ z for every threshold.
However, the bound M_{z}(X) ≤ z is not tight in general, which gives more power to algorithms parameterised by z. Moreover, z is a part of the input (as opposed to M_{Z}(P) for profiles). Furthermore, it is natural to consider a common threshold probability \(\frac {1}{z}\) for multiple weighted sequences, e.g., factors of a weighted text T as in Weighted Pattern Matching.
A more technical difference lies in the representation of profiles and weighted sequences, which we have chosen consistently with the literature. A profile is stored as a dense m × σ matrix, while in a weighted sequence of the same length we do not explicitly keep entries with \(\pi ^{(X)}_{i}(s)= 0\), so the input size R can be smaller than m ⋅ σ. This allows for faster algorithms—because reading the input takes less time—but at the same time poses some challenges—because \(\pi ^{(X)}_{i}(s)\) cannot be accessed in constant time, unless σ = O(1) or we allow randomisation. This is illustrated below in case of the Weighted Pattern Matching problem and also in Section 6.
4.2 Solution to Weighted Pattern Matching
The approach from our solution to Profile Matching can be used for Weighted Pattern Matching. In a natural way, we extend the notion of a heavy string to weighted sequences. This lets us restate Observation 3.1 in the language of probabilities instead of scores.
Observation 4.3
If a string P matches a weighted sequence X of the same length with probability at least \(\frac {1}{z}\), then \(d_H(\textbf {H}(X),P) \le \left \lfloor {\log z} \right \rfloor \).
Implementation for large alphabets

 • finding the letter with the maximum probability at a given position,
 • computing the probability of a given letter at a given position.
To implement the latter operation for an arbitrary character, we store each T[j] in a weight-balanced binary tree [21], with the weight of \((s,\pi ^{(T)}_{j}(s))\) equal to \(\pi ^{(T)}_{j}(s)\). As a result, any \(\pi ^{(T)}_{j}(s)\) can be retrieved in \(O(-\log \pi ^{(T)}_{j}(s))=O(\log z)\) time. During the course of the p-th step of the algorithm, \(\alpha ^{\prime }\) is a product of some probabilities including all the retrieved probabilities \(\pi ^{(T)}_{j}(s)\) with \(s \ne T^{\prime }[j]\). The while loop is executed only when \(\alpha ^{\prime }\ge \frac {1}{z}\), so the product of these probabilities (excluding the one retrieved in the final iteration) is at least \(\frac {1}{z}\). Consequently, the overall retrieval time in the p-th step is \(O(\log z)\).
This way, we can implement the algorithm in \(O(R+n \log z)\) time.
Theorem 1.2
Weighted Pattern Matching can be solved in \(O(R+n \log z)\) time.
Remark 4.4
In the same complexity one can solve GWPM with a solid text.
5 Profile Consensus and Multichoice Knapsack
For a fixed instance of Multichoice Knapsack, we say that S is a partial choice if |S ∩ C_{i}| ≤ 1 for each class. The set D = {i : |S ∩ C_{i}| = 1} is called its domain. For a partial choice S, we define \(v(S) = {\sum }_{c \in S} v(c)\) and \(w(S) = {\sum }_{c \in S} w(c)\).
5.1 Profile Consensus versus Multichoice Knapsack
As shown below, Profile Consensus and Multichoice Knapsack are essentially equivalent problems.
Fact 5.1
 1. Consider an instance of Profile Consensus with two m × σ profiles P, Q and a common threshold Z. In O(mσ) time one can construct an equivalent instance of Multichoice Knapsack with m classes of σ items each, A_{V} = M_{Z}(P), and A_{W} = M_{Z}(Q).
 2. Consider an instance of Multichoice Knapsack with n classes of at most λ items each. In O(nλ) time one can construct an equivalent instance of Profile Consensus with two n × λ profiles P, Q and a common threshold Z such that M_{Z}(P) = A_{V} and M_{Z}(Q) = A_{W}.
Proof
Given an instance (P,Q,Z) of the Profile Consensus problem, we construct an equivalent instance of Multichoice Knapsack with m classes of σ items each, denoted c_{i,j} for 1 ≤ i ≤ m and 1 ≤ j ≤ σ, each with value v(c_{i,j}) = −P[i,j] and weight w(c_{i,j}) = −Q[i,j]. We set both thresholds to V = W = −Z. It is straightforward to verify that the constructed instance satisfies the required conditions.
This construction is easily reversible if V = W and the size of each class is λ. In general, we add dummy items (with infinite or very large weight and value), decrease the weight of each item by \(\frac {1}{n}(W-V)\), and decrease the weight threshold to V. □
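The forward reduction from the proof (negate all scores and the threshold) is short enough to sketch directly. The brute-force checkers below are included only to confirm on a toy input that both instances have the same answer; all names and the example matrices are illustrative, with letters encoded as column indices.

```python
from itertools import product

def consensus_to_knapsack(P, Q, Z):
    """Item c_{i,j} gets value -P[i][j] and weight -Q[i][j]; both thresholds become -Z."""
    classes = [[(-p, -q) for p, q in zip(row_p, row_q)] for row_p, row_q in zip(P, Q)]
    return classes, -Z, -Z

def knapsack_feasible(classes, V, W):
    """Brute force: is there one item per class with total value <= V and total weight <= W?"""
    return any(
        sum(v for v, _ in choice) <= V and sum(w for _, w in choice) <= W
        for choice in product(*classes)
    )

def consensus_exists(P, Q, Z):
    """Brute force: is there a string scoring at least Z in both profiles?"""
    m, sigma = len(P), len(P[0])
    return any(
        sum(P[i][s] for i, s in enumerate(S)) >= Z
        and sum(Q[i][s] for i, s in enumerate(S)) >= Z
        for S in product(range(sigma), repeat=m)
    )

P = [[3, 1], [0, 2]]
Q = [[1, 3], [2, 0]]
for Z in (3, 4):
    classes, V, W = consensus_to_knapsack(P, Q, Z)
    print(Z, consensus_exists(P, Q, Z), knapsack_feasible(classes, V, W))
```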
The only technical difference between Multichoice Knapsack and Profile Consensus is that the profiles are stored as dense m × σ matrices while the classes in Multichoice Knapsack can be of different size so the input size N can be smaller than the number of classes n times the bound λ on the class size.
Below, we formulate our results in the more established language of Multichoice Knapsack.
5.2 Overview of the Solution
The classic O(2^{n/2})-time solution to the Knapsack problem [12] is based on a meet-in-the-middle approach. The set D = {1,…,n} is partitioned into two domains D_{1},D_{2} of size roughly n/2, and for each D_{i}, all partial choices S are generated and ordered by v(S). This reduces the problem to an instance of Multichoice Knapsack with two classes, which is solved using a folklore linear-time solution (described for completeness in Section 5.5).
The meet-in-the-middle approach to Knapsack generalises directly to a solution to Multichoice Knapsack. The partition may be chosen as to balance the number of partial choices in each domain, and so the worst-case time complexity is \(O(\sqrt {Q\lambda })\), where \(Q={\prod }_{i = 1}^{n} |C_{i}|\) is the number of choices.
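A minimal sketch of the generalised meet-in-the-middle scheme follows. It makes two simplifications that are assumptions of this sketch rather than part of the described solution: the classes are split by index at the midpoint instead of balancing the numbers of partial choices, and the final two-class step uses sorting, prefix minima and binary search instead of a linear scan. Items are (value, weight) pairs and the function names are illustrative.

```python
from bisect import bisect_right
from itertools import product

def partial_choices(classes):
    """All (value, weight) sums obtained by picking one item from each given class."""
    return [(sum(v for v, _ in c), sum(w for _, w in c)) for c in product(*classes)]

def mitm_multichoice_knapsack(classes, V, W):
    """Decide whether some choice has total value <= V and total weight <= W."""
    half = len(classes) // 2
    left = partial_choices(classes[:half])
    right = sorted(partial_choices(classes[half:]))  # ordered by value
    right_values = [v for v, _ in right]
    prefix_min_w, best_w = [], float("inf")
    for _, w in right:
        best_w = min(best_w, w)
        prefix_min_w.append(best_w)  # lightest partial choice among the first k by value
    for v1, w1 in left:
        k = bisect_right(right_values, V - v1)  # number of right halves fitting the value budget
        if k > 0 and w1 + prefix_min_w[k - 1] <= W:
            return True
    return False

classes = [[(1, 5), (4, 1)], [(2, 2), (3, 0)], [(0, 4), (5, 1)]]  # toy instance
print(mitm_multichoice_knapsack(classes, 7, 5))
```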
Our aim in this section is to replace Q with the parameter a (which never exceeds Q). The overall running time is going to be \(O(N+\sqrt {a\lambda }\log A)\).
 (1) How to partition the set of classes?
 (2) In what order should the partial choices be generated?
 (3) How many partial choices should be generated, given that the value of the parameter a is not known in advance?
A natural idea to deal with question (2) is to consider only partial choices with small values v(S) or w(S). This is close to our actual solution, which is based on a notion of ranks of partial choices that we introduce in Section 5.3.
Finally, to tackle question (3), we generate the partial choices batchwise until either a solution is found or we can certify that it does not exist. The idea of this step is presented also in Section 5.3, while the generation procedure is detailed in Section 5.4. While dealing with these issues, a careful implementation is required to avoid several further extra factors in the running time.
In the end, we show that the number of partial choices that need to be generated is indeed \(O(\sqrt {a\lambda })\). Our final solution to Multichoice Knapsack is presented in Section 5.6 without the instance size reduction and in Section 5.8 using the reduction.
5.3 Ranks of Partial Choices
For a partial choice S, we define rank_{v}(S) as the number of partial choices \(S^{\prime }\) with the same domain for which \(v(S^{\prime })\le v(S)\). We symmetrically define rank_{w}(S). For simplicity, if c ∈ C_{i}, we denote rank_{v}(c) = rank_{v}({c}) and rank_{w}(c) = rank_{w}({c}). Ranks are introduced as an analogue of match probabilities in weighted sequences. Probabilities are multiplicative, while for ranks we have submultiplicativity:
Fact 5.2
If S = S_{1} ∪ S_{2} is a decomposition of a partial choice S into two disjoint subsets, then rank_{v}(S_{1})rank_{v}(S_{2}) ≤ rank_{v}(S) (and same for rank_{w}).
Proof
Let D_{1} and D_{2} be the domains of S_{1} and S_{2}, respectively. For all partial choices \(S^{\prime }_{1}\) over D_{1} and \(S^{\prime }_{2}\) over D_{2} such that \(v(S^{\prime }_{1}) \le v(S_{1})\) and \(v(S^{\prime }_{2}) \le v(S_{2})\), we have \(v(S^{\prime }_{1} \cup S^{\prime }_{2})=v(S^{\prime }_{1})+v(S^{\prime }_{2})\le v(S)\). Hence, \(S^{\prime }_{1}\cup S^{\prime }_{2}\) must be counted while determining rank_{v}(S). □
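The definition of ranks and the submultiplicativity of Fact 5.2 can be checked by brute force on a small instance. The sketch below is purely illustrative: `rank_v` enumerates all partial choices over a domain, which is exponential in the domain size, and the representation of a partial choice as a tuple of item indices is an assumption of this example.

```python
from itertools import product

def rank_v(classes, domain, picks):
    """Number of partial choices over `domain` whose value is at most that of `picks`.

    domain: tuple of class indices; picks: index of the chosen item for each class in domain.
    Items are (value, weight) pairs.
    """
    value = sum(classes[i][k][0] for i, k in zip(domain, picks))
    return sum(
        1
        for other in product(*(range(len(classes[i])) for i in domain))
        if sum(classes[i][k][0] for i, k in zip(domain, other)) <= value
    )

classes = [[(1, 5), (4, 1)], [(2, 2), (3, 0)], [(0, 4), (5, 1)]]  # toy instance
n = len(classes)
full = tuple(range(n))
for picks in product(*(range(len(c)) for c in classes)):
    for j in range(n + 1):  # split S into a part over classes 0..j-1 and a part over j..n-1
        r1 = rank_v(classes, full[:j], picks[:j])
        r2 = rank_v(classes, full[j:], picks[j:])
        assert r1 * r2 <= rank_v(classes, full, picks)
print(rank_v(classes, full, (1, 0, 0)))
```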
For 0 ≤ j ≤ n, let L_{j} be the list of partial choices with domain {1,…,j} ordered by value v(S), and for ℓ > 0 let L_{j}[ℓ] be the ℓ-th element of L_{j}. Analogously, for 1 ≤ j ≤ n + 1, we define R_{j} as the list of partial choices over {j,…,n} ordered by v(S), and for r > 0, R_{j}[r] as the r-th element of R_{j}. If any of the partial choices L_{j}[ℓ], R_{j}[r] does not exist, we assume that its value is \(\infty \).
The following two observations yield a decomposition of each choice into a single item and two partial solutions of a small rank. Observe that we do not need to know A_{V} in order to check if the ranks are sufficiently large.
Lemma 5.3
Let ℓ and r be positive integers such that v(L_{j}[ℓ]) + v(R_{j+1}[r]) > V for each 0 ≤ j ≤ n. For every choice S with v(S) ≤ V, there is an index j ∈ {1,…,n} and a decomposition S = L ∪ {c} ∪ R such that v(L) < v(L_{j−1}[ℓ]), c ∈ C_{j}, and v(R) < v(R_{j+1}[r]).
Proof
Let S = {c_{1},…,c_{n}} with c_{i} ∈ C_{i} and, for 0 ≤ i ≤ n, let S_{i} = {c_{1},…,c_{i}}. If v(S_{n−1}) < v(L_{n−1}[ℓ]), we set L = S_{n−1}, c = c_{n}, and R = ∅, satisfying the claimed conditions.
Otherwise, we define j as the smallest index i such that v(S_{i}) ≥ v(L_{i}[ℓ]), and we set L = S_{j−1}, c = c_{j}, and R = S ∖ S_{j}. The definition of j implies v(L) < v(L_{j−1}[ℓ]) and v(L ∪ {c}) ≥ v(L_{j}[ℓ]). Moreover, we have v(L ∪ {c}) + v(R) = v(S) ≤ V < v(L_{j}[ℓ]) + v(R_{j+1}[r]), and thus v(R) < v(R_{j+1}[r]). □
Fact 5.4
Let ℓ,r > 0. If v(L_{j}[ℓ]) + v(R_{j+1}[r]) ≤ V for some j ∈ {0,…,n}, then ℓ ⋅ r ≤ A_{V}.
Proof
Let L and R be the ℓ-th and r-th entry in L_{j} and R_{j+1}, respectively. Note that v(L ∪ R) ≤ V implies rank_{v}(L ∪ R) ≤ A_{V} by definition of A_{V}. Moreover, rank_{v}(L) ≥ ℓ and rank_{v}(R) ≥ r (the inequalities may be strict due to ties). Now, Fact 5.2 yields the claimed bound. □
5.4 Generating Partial Choices of Small Rank
Note that L_{j} can be obtained by interleaving |C_{j}| copies of L_{j−1}, where each copy corresponds to extending the choices from L_{j−1} with a different item. If we were to construct L_{j} having access to the whole L_{j−1}, we could apply the following standard procedure. For each c ∈ C_{j}, we maintain an iterator on L_{j−1} pointing to the first element S on L_{j−1} for which S ∪ {c} has not yet been added to L_{j}. The associated value is v(S ∪ {c}). All iterators initially point at the first element of L_{j−1}. Then the next element to append to L_{j} is always S ∪ {c} corresponding to the iterator with minimum value. Having processed this partial choice, we advance the iterator (or remove it, once it has already scanned the whole L_{j−1}). This process can be implemented using a binary heap H_{j} as a priority queue, so that initialisation requires O(|C_{j}|) time and outputting a single element takes \(O(\log |C_{j}|)\) time. Each partial choice S ∈ L_{j} is stored in O(1) space using a pointer to a partial choice \(S^{\prime } \in \textbf {L}_{j-1}\) such that \(S=S^{\prime } \cup \{c\}\) for some c ∈ C_{j}.
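The offline procedure described above can be sketched with Python's `heapq` module. This is an illustration under several assumptions: only values are tracked (weights are omitted), each partial choice is stored as a triple (value, index of its parent in the previous list, item index), the previous list is non-empty, and the function name `extend_list` is hypothetical.

```python
import heapq

def extend_list(prev, item_values):
    """Build L_j from L_{j-1}.

    prev: list of triples (value, parent_index, item_index) sorted by value (L_{j-1}).
    item_values: values v(c) of the items of class C_j.
    Returns L_j in the same format, sorted by value.
    """
    # One iterator per item c, initially pointing at the first element of prev.
    heap = [(prev[0][0] + v, c, 0) for c, v in enumerate(item_values)]
    heapq.heapify(heap)  # linear-time heap construction
    result = []
    while heap:
        value, c, pos = heapq.heappop(heap)  # iterator with the minimum associated value
        result.append((value, pos, c))       # O(1) space: pointer to the parent plus the item
        if pos + 1 < len(prev):              # advance the iterator, or drop it at the end of prev
            heapq.heappush(heap, (prev[pos + 1][0] + item_values[c], c, pos + 1))
    return result

L0 = [(0, None, None)]            # the empty choice with value 0
L1 = extend_list(L0, [1, 4])
L2 = extend_list(L1, [2, 3])
print([value for value, _, _ in L2])
```

Popping and re-pushing performs two heap operations per output element; replacing the top in place (for example with `heapq.heapreplace`) would be a natural refinement.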
For i ≥ 0, let \(\textbf {L}^{(i)}_{j}\) be the prefix of L_{j} of length \(\min (i,|\textbf {L}_{j}|)\) and \(\textbf {R}^{(i)}_{j}\) be the prefix of R_{j} of length \(\min (i, |\textbf {R}_{j}|)\). A technical transformation of the procedure stated above leads to an online algorithm that constructs the prefixes \(\textbf {L}^{(i)}_{j}\) and \(\textbf {R}^{(i)}_{j}\), as shown in the following lemma. Along with each reported partial choice S, the algorithm also computes w(S).
Lemma 5.5
After O(N)-time initialisation, one can compute L_{1}[i],…,L_{n}[i] knowing \(\textbf {L}^{(i-1)}_{1},\ldots ,\textbf {L}^{(i-1)}_{n}\) in \(O(n\log \lambda )\) time. Symmetrically, one can construct R_{1}[i],…,R_{n}[i] from \(\textbf {R}^{(i-1)}_{1},\ldots ,\textbf {R}^{(i-1)}_{n}\) in the same time complexity.
Proof
Our online algorithm is going to use the same approach as the offline computation of lists \(\textbf {L}_{j}^{(i)}\). The order of computations will be different, though.
At each step, for j = 1 to n we shall extend lists \(\textbf {L}^{(i1)}_{j}\) with a single element (unless the whole L_{j} has already been generated) from the top of the heap H_{j}. We keep an invariant that each iterator in H_{j} always points to an element that is already in \(\textbf {L}^{(i1)}_{j1}\) or to L_{j− 1}[i]: the first element that has not been yet added to L_{j− 1}, which is represented by the top of the heap H_{j− 1}.
We initialise the heaps as follows: we introduce H_{0} which represents the empty choice ∅ with v(∅) = 0. Next, for j = 1,…,n we build the heap H_{j} representing C_{j} iterators initially pointing to the top of H_{j− 1}. The initialisation takes O(N) time in total since a binary heap can be constructed in time linear in its size.
At each step, the lists \(\textbf {L}^{(i-1)}_{j}\) are extended for consecutive values j from 1 to n. Since \(\textbf {L}^{(i-1)}_{j-1}\) is extended before \(\textbf {L}^{(i-1)}_{j}\), by the invariant, all iterators in H_{j} point to the elements of \(\textbf {L}^{(i)}_{j-1}\) while we compute L_{j}[i]. We take the top of H_{j} and move it to \(\textbf {L}^{(i)}_{j}\). Next, we advance the corresponding iterator and update its position in the heap H_{j}. After this operation, the iterator might point to the top of H_{j−1}. If H_{j−1} is empty, this means that the whole list L_{j−1} has already been generated and traversed by the iterator. In this case, we remove the iterator.
This way we indeed simulate the previous offline solution. A single phase makes O(1) operations on each heap H_{j}. The running time is bounded by \(O({\sum }_{j} \log |C_{j}|)=O(n\log \lambda )\) at each step of the algorithm. □
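The online procedure from the proof of Lemma 5.5 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `online_prefix_lists`, the representation of a class as a plain list of values, and the choice to store only values (rather than pointers to shorter partial choices) are assumptions made for illustration. All classes are assumed to be non-empty.

```python
import heapq

def online_prefix_lists(classes, steps):
    """classes[j-1] holds the values v(c) of the items of class C_j.
    Returns L, where L[j] lists the values of the first `steps` partial
    choices over C_1..C_j in nondecreasing order of value."""
    n = len(classes)
    L = [[] for _ in range(n + 1)]
    # Heap entries are (value, item index, position in L[j-1]).
    # H[0] holds only the empty choice, whose value is 0.
    H = [[(0, -1, -1)]]
    for j in range(1, n + 1):
        base = H[j - 1][0][0]  # value of the (future) first element of L[j-1]
        heap = [(base + v, c, 0) for c, v in enumerate(classes[j - 1])]
        heapq.heapify(heap)    # linear-time heap construction
        H.append(heap)
    for _ in range(steps):
        for j in range(n + 1):
            if not H[j]:
                continue       # L[j] has been generated completely
            value, c, pos = heapq.heappop(H[j])
            L[j].append(value)
            if j == 0:
                continue
            nxt = pos + 1
            v_c = classes[j - 1][c]
            if nxt < len(L[j - 1]):
                heapq.heappush(H[j], (L[j - 1][nxt] + v_c, c, nxt))
            elif H[j - 1]:
                # nxt is the element L[j-1] receives next: the top of H[j-1]
                heapq.heappush(H[j], (H[j - 1][0][0] + v_c, c, nxt))
            # otherwise L[j-1] is exhausted and the iterator is dropped
    return L
```

Each step performs a constant number of heap operations per class, which matches the \(O({\sum }_{j} \log |C_{j}|)\) bound of the lemma. Storing the pair (item index, position) next to each value would give the O(1)-space pointer representation of partial choices.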
5.5 Multichoice Knapsack for n = 2 Classes
Let us recall the final processing of the meet-in-the-middle solution to the Knapsack problem [12]. We formulate it in terms of Multichoice Knapsack with two classes.
An item c ∈ C_{j} is irrelevant if there is another item \(c^{\prime }\in C_{j}\) that dominates c, i.e., such that \(v(c) \ge v(c^{\prime })\) and \(w(c) \ge w(c^{\prime })\). Observe that removing an irrelevant item leads to an equivalent instance of the Multichoice Knapsack problem, and it may only decrease the parameters A_{V} and A_{W}.
Lemma 5.6
The Multichoice Knapsack problem can be solved in O(N) time if n = 2 and the elements c of C_{1} and C_{2} are sorted by v(c).
Proof
Since the items of C_{1} and C_{2} are sorted by v(c), a single scan through these items lets us remove all irrelevant elements. Next, for each c_{1} ∈ C_{1} we compute c_{2} ∈ C_{2} such that v(c_{2}) ≤ V − v(c_{1}) but otherwise v(c_{2}) is largest possible. As we have removed irrelevant elements from C_{2}, this item also minimises w(c_{2}) among all elements satisfying v(c_{2}) ≤ V − v(c_{1}). Hence, if there is a feasible solution containing c_{1}, then {c_{1},c_{2}} is feasible. If we process elements c_{1} by nondecreasing values v(c_{1}), the values v(c_{2}) do not increase, and thus the items c_{2} can be computed in O(N) time in total. □
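The two-pointer scan of Lemma 5.6 can be sketched as follows. This is a hedged illustration, not code from the paper: items are modelled as `(v, w)` tuples, both input lists are assumed to be sorted by nondecreasing `v`, and the names `prune_dominated` and `two_class_knapsack` are hypothetical.

```python
def prune_dominated(items):
    """items: (v, w) pairs sorted by nondecreasing v.
    Keeps a sublist in which w strictly decreases; every dropped item
    is dominated by a kept one."""
    kept = []
    for v, w in items:
        if not kept or w < kept[-1][1]:
            kept.append((v, w))
    return kept

def two_class_knapsack(C1, C2, V, W):
    """Returns a feasible pair (c1, c2) with total value <= V and total
    weight <= W, or None if no such pair exists."""
    A, B = prune_dominated(C1), prune_dominated(C2)
    k = len(B) - 1
    for v1, w1 in A:                      # nondecreasing v1
        while k >= 0 and B[k][0] > V - v1:
            k -= 1                        # the bound V - v1 only shrinks
        if k < 0:
            break                         # no item of B fits any more
        v2, w2 = B[k]                     # largest value that fits, hence lightest
        if w1 + w2 <= W:
            return (v1, w1), (v2, w2)
    return None
```

The pointer `k` only moves left, so the scan is linear in the total number of items once the lists are sorted.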
5.6 Multichoice Knapsack Parameterised by a
Combining the procedures of Lemmas 5.5 and 5.6 with the combinatorial results of Section 5.3, we obtain the first algorithm for Multichoice Knapsack parameterised by a.
Proposition 5.7
Multichoice Knapsack can be solved in \(O(n(\lambda +\sqrt {a\lambda })\log \lambda )\) time.
Proof
Below, we give an algorithm working in \(O(n(\lambda +\sqrt {A_{V}\lambda })\log \lambda )\) time. The final solution runs it in parallel on the original instance and on the instance with v and V swapped with w and W , waiting until at least one of them terminates.
We increment an integer r starting from 1, maintaining \(\ell =\left \lceil {\frac {r}{\lambda }} \right \rceil \) and the lists \(\textbf {L}_{j}^{(\ell )}\) and \(\textbf {R}_{j + 1}^{(r)}\) for 0 ≤ j ≤ n, as long as v(L_{j}[ℓ]) + v(R_{j+ 1}[r]) ≤ V for some j (or until all the lists have been completely generated). By Fact 5.4, we stop at \(r=O(\sqrt {A_{V} \lambda })\) and due to Lemma 5.5, the process takes \(O(n\sqrt {A_{V} \lambda }\log \lambda )\) time.
According to Lemma 5.3, every feasible solution S admits a decomposition S = L ∪{c}∪ R with \(L\in \textbf {L}_{j-1}^{(\ell )}\), c ∈ C_{j}, and \(R\in \textbf {R}_{j + 1}^{(r)}\) for some index j. We consider all possibilities for j. For each of them we will reduce searching for S to an instance of the Multichoice Knapsack problem with 2 classes of \(O(\sqrt {A_{V}\lambda })\) items. By Lemma 5.6, these instances can be solved in \(O(n\sqrt {A_{V}\lambda })\) time in total.
The items of the jth instance are going to belong to classes \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j}\) and \(\textbf {R}_{j + 1}^{(r)}\), where \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j} = \{L\cup \{c\} : L\in \textbf {L}_{j-1}^{(\ell )} , c\in C_{j}\}\). The set \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j}\) is constructed by merging |C_{j}| ≤ λ sorted lists, each of size \(\ell =O(1+\sqrt {A_{V}/\lambda })\). This takes \(O((\lambda +\sqrt {A_{V}\lambda })\log \lambda )\) time, which results in \(O(n(\lambda +\sqrt {A_{V} \lambda })\log \lambda )\) time over all indices j.
Clearly, each feasible solution of the constructed instances represents a feasible solution of the initial instance, and by Lemma 5.3, every feasible solution of the initial instance has its counterpart in one of the constructed instances. □
5.7 Preprocessing to Reduce Instance Size
In order to improve the running time for Multichoice Knapsack, we develop two reductions and run them as preprocessing to the procedure of Proposition 5.7. First, we observe that items c with rank_{v}(c) > A_{V} or rank_{w}(c) > A_{W} cannot belong to any feasible solution. Moreover their removal results in λ ≤ a, which lets us hide the \(O(n\lambda \log \lambda )\) term in the running time. Our second reduction decreases the number of classes n to \(O\left (\frac {\log A}{\log \lambda }\right )\). For this, we repeatedly remove irrelevant items (as defined in Section 5.5) and merge small classes into their Cartesian product (so that the class sizes are more balanced).
For each class C_{i}, let \(v_{\min }(i) = \min \{v(c) : c\in C_{i}\}\). Also, let \(V_{\min } = {\sum }_{i = 1}^{n} v_{\min }(i)\); note that \(V_{\min }\) is the smallest possible value v(S) of a choice S. We symmetrically define \(w_{\min }(i)\) and \(W_{\min }\).
Lemma 5.8
Given an instance I of the Multichoice Knapsack problem, one can compute in O(N) time an equivalent instance \(I^{\prime }\) with \(N^{\prime } \le N\), \(n^{\prime }=n\), \(A_{V}^{\prime }= A_{V}\), \(A_{W}^{\prime }= A_{W}\), and \(\lambda ^{\prime }\le \min (\lambda ,a)\).
Proof
From each class C_{i} we remove all items c such that \(V_{\min }+v(c)-v_{\min }(i)>V\) or \(W_{\min }+w(c)-w_{\min }(i)>W\). Afterwards, for each item c ∈ C_{i} one can obtain a choice S such that c ∈ S and v(S) ≤ V (or w(S) ≤ W) by choosing the elements with the minimal value (minimal weight, respectively) in all the remaining classes. □
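The filtering rule from this proof fits in a few lines. The sketch below assumes that each class is a non-empty list of `(v, w)` tuples; the function name `remove_hopeless_items` is hypothetical.

```python
def remove_hopeless_items(classes, V, W):
    """Drops every item c of class C_i for which even the cheapest
    completion exceeds a bound, i.e. V_min + v(c) - v_min(i) > V or
    W_min + w(c) - w_min(i) > W."""
    v_min = [min(v for v, _ in C) for C in classes]
    w_min = [min(w for _, w in C) for C in classes]
    V_min, W_min = sum(v_min), sum(w_min)
    return [
        [(v, w) for v, w in C
         if V_min + v - v_min[i] <= V and W_min + w - w_min[i] <= W]
        for i, C in enumerate(classes)
    ]
```

Both passes over the input are linear in the total number of items N.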
Our second preprocessing consists of several steps. First, we quickly reduce the number of classes to \(n=O(\log A)\).
Lemma 5.9
Given an instance I of the Multichoice Knapsack problem, one can compute in linear time an equivalent instance \(I^{\prime }\) with \(N^{\prime }\le N\), \(A_{V}^{\prime }\le A_{V}\), \(A_{W}^{\prime }\le A_{W}\), \(\lambda ^{\prime }\le \lambda \), and \(n^{\prime } \le 2\log A\).
Proof
Observe that if a class C_{i} contains an item c for which both \(v(c)=v_{\min }(i)\) and \(w(c)=w_{\min }(i)\), then we can greedily include it in the solution S. Hence, we can remove such a class, setting \(V := V - v_{\min }(i)\) and \(W := W - w_{\min }(i)\). We execute this reduction rule exhaustively, which clearly takes O(N) time in total and may only decrease the parameters A_{V} and A_{W}. After the reduction, the minima \(v_{\min }(i)\) and \(w_{\min }(i)\) must be attained by distinct items of every class C_{i}.
We shall prove that now we can either find out that A ≥ 2^{n/2} or that we are dealing with a NO-instance. To decide which case holds, let us define Δ_{V}(i) as the difference between the second smallest value in the multiset {v(c) : c ∈ C_{i}} and \(v_{\min }(i)\). We set \({\Delta }_{V}^{\text {mid}}\) as the sum of the \(\left \lceil {\frac {n}{2}} \right \rceil \) smallest values Δ_{V}(i) for 1 ≤ i ≤ n; we define \({\Delta }_{W}^{\text {mid}}\) analogously.
Claim 1
If \(V_{\min } + {\Delta }_{V}^{\text {mid}} \le V\), then A_{V} ≥ 2^{n/2}; if \(W_{\min } + {\Delta }_{W}^{\text {mid}} \le W\), then A_{W} ≥ 2^{n/2}; otherwise, we are dealing with a NO-instance.
Proof
First, assume that \(V_{\min } + {\Delta }_{V}^{\text {mid}} \le V\). This means that there is a choice S with v(S) ≤ V containing at least \(\frac {n}{2}\) items c such that rank_{v}(c) ≥ 2. Hence, Fact 5.2 yields \(\text{rank} _{v}(S)\ge 2^{\left \lceil {n/2} \right \rceil }\) and consequently A_{V} ≥ 2^{n/2}, as claimed. Symmetrically, if \(W_{\min } + {\Delta }_{W}^{\text {mid}} \le W\), then A_{W} ≥ 2^{n/2}.
Now, suppose that there is a feasible solution S. As no class contains a single item minimising both v(c) and w(c), there are at least \(\left \lceil {\frac {n}{2}} \right \rceil \) classes for which S contains an item not minimising v(c), or at least \(\left \lceil {\frac {n}{2}} \right \rceil \) classes for which S contains an item not minimising w(c). Without loss of generality, we assume that the former holds. Let D be the set of at least \(\left \lceil {\frac {n}{2}} \right \rceil \) classes i satisfying the condition. If c ∈ C_{i} does not minimise v(c), then \(v(c)\ge v_{\min }(i)+{\Delta }_{V}(i)\). Consequently, \(V\ge v(S) \ge V_{\min } + {\sum }_{i\in D} {\Delta }_{V}(i)\). However, observe that \( {\sum }_{i\in D} {\Delta }_{V}(i) \ge {\Delta }_{V}^{\text {mid}}\), so \(V \ge V_{\min } + {\Delta }_{V}^{\text {mid}}\), as claimed. □
The conditions from the claim can be verified in O(N) time using a linear-time selection algorithm to compute \({\Delta }_{V}^{\text {mid}}\) and \({\Delta }_{W}^{\text {mid}}\). If any of the first two conditions holds, we return the instance obtained using our reduction. Otherwise, we output a dummy NO-instance. □
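The reduction of Lemma 5.9 can be sketched as below. This is an illustrative simplification rather than the authors' procedure: classes are lists of `(v, w)` tuples, the function name `reduce_number_of_classes` is hypothetical, and sorting is used in place of linear-time selection, so the sketch runs in O(N log N) time instead of O(N).

```python
from math import ceil

def reduce_number_of_classes(classes, V, W):
    """Returns (classes, V, W) after the greedy rule, or None when the
    test of Claim 1 certifies a NO-instance."""
    rest = []
    for C in classes:
        v_min = min(v for v, _ in C)
        w_min = min(w for _, w in C)
        if (v_min, w_min) in C:          # one item minimises both: take it
            V, W = V - v_min, W - w_min
        else:
            rest.append(C)               # minima attained by distinct items
    half = ceil(len(rest) / 2)

    def passes(coord, bound):
        total_min, deltas = 0, []
        for C in rest:
            vals = sorted(item[coord] for item in C)
            total_min += vals[0]
            deltas.append(vals[1] - vals[0])   # second smallest minus smallest
        return total_min + sum(sorted(deltas)[:half]) <= bound

    if passes(0, V) or passes(1, W):
        return rest, V, W
    return None
```

When the function returns an instance, Claim 1 guarantees that the number of remaining classes is at most \(2\log A\).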
In the improved reduction we use two basic steps. The first one is expressed in the following lemma.
Lemma 5.10
Consider a class of items in an instance of the Multichoice Knapsack problem. In linear time, we can remove some irrelevant items from the class so that the resulting class C satisfies \(\max (\text{rank} _{v}(c),\text{rank} _{w}(c)) > \frac {1}{3} |C|\) for each item c ∈ C.
Proof
First, note that using a linear-time selection algorithm, we can determine for each item c whether \(\text{rank} _{v}(c)\le \frac {1}{3}|C|\) and whether \(\text{rank} _{w}(c)\le \frac {1}{3}|C|\). If there is no item satisfying both conditions, we keep C unaltered. Otherwise, we have an item which dominates at least \(|C|-\text{rank} _{v}(c)-\text{rank} _{w}(c) \ge \frac {1}{3}|C|\) other items. We scan through all items in C and remove those dominated by c. Next, we repeat the algorithm. The running time of a single phase is clearly linear, and since |C| decreases geometrically, the total running time is also linear. □
The second reduction step decreases the number of classes by replacing two distinct classes C_{i}, C_{j} with their Cartesian product C_{i} × C_{j}, assuming that the value (weight) of a pair (c_{i},c_{j}) is the sum of values (weights) of c_{i} and c_{j}. This clearly leads to an equivalent instance of the Multichoice Knapsack problem, does not alter the parameters A_{V}, A_{W}, and decreases n. On the other hand, N and λ may increase; the latter happens only if |C_{i}|⋅|C_{j}| > λ.
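As a small illustration of this second step, the following sketch merges two classes of `(v, w)` tuples into their Cartesian product. The function name `merge_classes` and the extra index pair kept with each merged item (so that a solution can be mapped back to the original classes) are assumptions added for this example.

```python
def merge_classes(Ci, Cj):
    """Cartesian product of two classes; a pair's value (weight) is the
    sum of the values (weights) of its components. The trailing index
    pair (a, b) records which original items were combined."""
    return [(vi + vj, wi + wj, (a, b))
            for a, (vi, wi) in enumerate(Ci)
            for b, (vj, wj) in enumerate(Cj)]
```

The merged class has |C_{i}|⋅|C_{j}| items, which is why the rule is applied only to small classes.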
These two reduction rules let us implement our preprocessing procedure.
Lemma 5.11
Given an instance I of the Multichoice Knapsack problem, one can compute in \(O(N+\lambda \log A)\) time an equivalent instance \(I^{\prime }\) with \(A_{V}^{\prime }\le A_{V}\), \(A_{W}^{\prime }\le A_{W}\), \(\lambda ^{\prime }\le \lambda \), and \(n^{\prime }=O\left (\frac {\log A}{\log \lambda }\right )\).
Proof
First, we apply Lemma 5.9 to make sure that \(n\le 2\log A\) and \(N = O(\lambda \log A)\). We may now assume that λ ≥ 3^{6}, as otherwise we already have \(n = O\left (\frac {\log A}{\log \lambda }\right )\).
Throughout the algorithm, whenever there are two distinct classes of size at most \(\sqrt {\lambda }\), we shall replace them with their Cartesian product. This may happen only n − 1 times, and a single execution takes O(λ) time, so the total running time needed for this part is \(O(\lambda \log A)\).
Furthermore, for every class that we get in the input instance or obtain as a Cartesian product, we apply Lemma 5.10. The total running time spent on this is also \(O(\lambda \log A)\).
Having exhaustively applied these reduction rules, we are guaranteed that we have \(\max (\text{rank} _{v}(c),\text{rank} _{w}(c))>\frac {1}{3}\sqrt {\lambda }\ge \lambda ^{\frac {1}{3}}\) for items c from all but one class. Without loss of generality, we assume that the classes satisfying this condition are C_{1},…,C_{k}.
Recall that \(v_{\min }(i)\) and \(w_{\min }(i)\) are defined as minimum values and weights of items in class C_{i} and that \(V_{\min }\) and \(W_{\min }\) are their sums over all classes. For 1 ≤ i ≤ k, we define Δ_{V}(i) as the difference between the \(\left \lceil {\lambda ^{\frac {1}{3}}}\right \rceil \)th smallest value in the multiset {v(c) : c ∈ C_{i}} and \(v_{\min }(i)\). Next, we define \({\Delta }_{V}^{\text {mid}}\) as the sum of the \(\left \lceil {\frac {k}{2}} \right \rceil \) smallest values Δ_{V}(i). Symmetrically, we define Δ_{W}(i) and \({\Delta }_{W}^{\text {mid}}\). We shall prove a claim analogous to that in the proof of Lemma 5.9.
Claim 2
If \(V_{\min } + {\Delta }_{V}^{\text {mid}}\le V\), then \(A_{V} \ge \lambda ^{\frac {1}{6} k}\); if \(W_{\min } + {\Delta }_{W}^{\text {mid}}\le W\), then \(A_{W} \ge \lambda ^{\frac {1}{6} k}\); otherwise, we are dealing with a NO-instance.
Proof
First, suppose that \(V_{\min } + {\Delta }_{V}^{\text {mid}}\le V\). This means that there is a choice S with v(S) ≤ V which contains at least \(\frac {k}{2}\) items c with \(\text{rank} _{v}(c)\ge \lambda ^{\frac {1}{3}}\). By Fact 5.2, the rank of this choice is at least \(\lambda ^{\frac {1}{6} k}\), so \(A_{V} \ge \lambda ^{\frac {1}{6} k}\), as claimed. The proof of the second case is analogous.
Now, suppose that there is a feasible solution S = {c_{1},…,c_{n}}. For 1 ≤ i ≤ k, we have \(\text{rank} _{v}(c_{i})\ge \lambda ^{\frac {1}{3}}\) or \(\text{rank} _{w}(c_{i}) \ge \lambda ^{\frac {1}{3}}\). Consequently, \(\text{rank} _{v}(c_{i})\ge \lambda ^{\frac {1}{3}}\) holds for at least \(\left \lceil {\frac {k}{2}} \right \rceil \) classes or \(\text{rank} _{w}(c_{i})\ge \lambda ^{\frac {1}{3}}\) holds for at least \(\left \lceil {\frac {k}{2}} \right \rceil \) classes. Without loss of generality, we assume that the former holds. Let D be the set of (at least \(\left \lceil {\frac {k}{2}} \right \rceil \)) classes i satisfying the condition. For each i ∈ D, we clearly have \(v(c_{i})\ge v_{\min }(i)+{\Delta }_{V}(i)\), while for each i∉D, we have \(v(c_{i})\ge v_{\min }(i)\). Consequently, \(V\ge v(S) \ge V_{\min } + {\sum }_{i\in D} {\Delta }_{V}(i) \ge V_{\min } + {\Delta }_{V}^{\text {mid}}\). Hence, \(V \ge V_{\min } + {\Delta }_{V}^{\text {mid}}\), which concludes the proof. □
The condition from the claim can be verified using a linear-time selection algorithm: first, we apply it for each class to compute Δ_{V}(i) and Δ_{W}(i), and then, globally, to determine \({\Delta }_{V}^{\text {mid}}\) and \({\Delta }_{W}^{\text {mid}}\). If one of the first two conditions holds, we return the instance obtained through the reduction. It satisfies \(A \ge \lambda ^{\frac {1}{6} k}\), i.e., \(n \le 1+k \le 1 + 6\frac {\log A}{\log \lambda }\). Otherwise, we construct a dummy NO-instance. □
5.8 Main Result
We apply the preprocessing of the previous section to arrive at our final algorithm.
Theorem 1.3
Multichoice Knapsack can be solved in \(O(N+\sqrt {a\lambda }\log A)\) time.
Proof
Before running the algorithm of Proposition 5.7, we apply the reductions of Lemmas 5.8 and 5.11. With this order of reductions, we already have λ ≤ a during the execution of Lemma 5.11, so the \(O(\lambda \log A)\) term is dominated by \(O(\sqrt {a\lambda }\log A)\). □
6 Weighted Consensus and General Weighted Pattern Matching
Due to Facts 4.1 and 5.1, the Weighted Consensus problem is essentially equivalent to Multichoice Knapsack. The only difference is that we study Multichoice Knapsack with respect to unknown parameters a and A, whereas in Weighted Consensus we know the parameter z. By Observation 4.2, these values for equivalent instances satisfy a ≤ A ≤ z, so Theorem 1.3 immediately yields:
Proposition 6.1
Weighted Consensus can be solved in \(O(R+\sqrt {z\lambda }\log z)\) time.
In Sections 6.2 and 6.3 we show that the \(O(\log z)\) term can be reduced to \(O(\log \lambda + \log \log z)\). Such an improvement is possible because the bound a ≤ A ≤ z is not tight in general.
In the case of the GWPM problem, it is more useful to provide an oracle that finds witness strings that correspond to the respective occurrences of the pattern. Such an oracle, given \(i \in \mathit {Occ}_{\frac {1}{z}}(P,T)\), computes a string that matches both P and T[i..i + m − 1].
6.1 Reduction to Weighted Consensus on Short Sequences
The GWPM problem can clearly be reduced to n − m + 1 instances of Weighted Consensus, one for each alignment of the pattern with the text. This leads to a naïve \(O(nR + n\sqrt {z\lambda }\log z)\)-time algorithm. In this subsection, we remove the first term in this complexity.
Our solution applies the tools developed in Section 4 for Weighted Pattern Matching and uses an observation that is a consequence of Observation 4.3.
Observation 6.2
If X and Y are weighted sequences that match with threshold \(\frac {1}{z}\), then \(d_H(\textbf {H}(X),\textbf {H}(Y)) \le 2\left \lfloor {\log z} \right \rfloor \). Moreover, there exists a consensus string S such that S[i] = H(X)[i] = H(Y )[i] unless H(X)[i]≠H(Y )[i].
Proof
The fact that \(X \approx _{\frac {1}{z}} Y\) means that there exists a string P such that \(P \approx _{\frac {1}{z}} X\) and \(P \approx _{\frac {1}{z}} Y\). Let the set A_{1} represent the positions of mismatches between H(X) and P and the set A_{2} represent the positions of mismatches between H(Y ) and P. By Observation 4.3, \(|A_{1}|,|A_{2}| \le \left \lfloor {\log z} \right \rfloor \). Let A be the set of mismatches between H(X) and H(Y ). We have \(A \subseteq A_{1} \cup A_{2}\) and thus \(|A| \le 2\left \lfloor {\log z} \right \rfloor \). Finally, observe that for each i ∈ (A_{1} ∪ A_{2}) ∖ A we may replace P[i] with H(X)[i] = H(Y )[i] to obtain a string S such that \(S \approx _{\frac {1}{z}} X\) and \(S \approx _{\frac {1}{z}} Y\) and S[i] = H(X)[i] = H(Y )[i] unless i ∈ A. □
The algorithm starts by computing \(P^{\prime }=\textbf {H}(P)\) and \(T^{\prime }=\textbf {H}(T)\) and the data structure for lcp-queries in \(P^{\prime }T^{\prime }\). We try to match P against every factor T[p..p + m − 1] of the text. Following Observation 6.2, we check if \(d_H(T^{\prime }[p\mathinner {..} p+m-1], P^{\prime }) \le 2\left \lfloor {\log z} \right \rfloor \). If not, then we know that no match is possible. Otherwise, let D be the set of positions of mismatches between \(T^{\prime }[p\mathinner {..} p+m-1]\) and \(P^{\prime }\). Assume that we store \(\alpha = {\prod }_{j = 1}^{m} \pi ^{(T)}_{p+j-1}(T^{\prime }[p+j-1])\) and \(\beta = {\prod }_{j = 1}^{m} \pi ^{(P)}_{j}(P^{\prime }[j])\). Now, we only need to check what happens at the positions in D. If D = ∅, it suffices to check if \(\alpha \ge \frac {1}{z}\) and \(\beta \ge \frac {1}{z}\).
Otherwise, we construct two weighted sequences X and Y by selecting only the positions from D in T[p..p + m − 1] and in P. In O(|D|) time we can compute \(\alpha ^{\prime }={\prod }_{j\notin D} \pi ^{(T)}_{p+j-1}(T^{\prime }[p+j-1])\) and \(\beta ^{\prime } = {\prod }_{j \notin D} \pi ^{(P)}_{j}(P^{\prime }[j])\). We multiply the probabilities of all letters at the first position in X by \(\alpha ^{\prime }\) and in Y by \(\beta ^{\prime }\). It is clear that \(X\approx _{\frac {1}{z}} Y\) if and only if \(T[p\mathinner {..} p+m-1]\approx _{\frac {1}{z}} P\).
Thus, we reduced the GWPM problem to at most n − m + 1 instances of the problem of Weighted Consensus for sequences of length \(O(\log z)\). If we memorise the solutions to all those instances together with the underlying sets of mismatches D, we can also implement the oracle for the GWPM problem with O(m)-time queries. We obtain the following reduction.
Lemma 6.3
The GWPM problem and the computation of its oracle can be reduced in \(O(R + (n-m + 1)\log z)\) time to at most n − m + 1 instances of the Weighted Consensus problem for weighted sequences of length \(O(\log z)\).
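The per-alignment step behind this reduction can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not taken from the paper: weighted sequences are lists of dictionaries mapping letters to `Fraction` probabilities, positions are 0-based, z is a positive integer, the logarithm is binary, ties for the heavy letter are broken by dictionary order, and mismatches are found by a direct O(m) scan instead of lcp-queries. The name `reduce_alignment` and the returned status strings are hypothetical.

```python
from fractions import Fraction
from math import prod

def heavy_string(seq):
    """The most probable letter at every position."""
    return [max(dist, key=dist.get) for dist in seq]

def reduce_alignment(T, P, p, z):
    """Reduces matching P against T[p..p+m-1] to a short consensus instance.
    Returns (status, X, Y, D) with status 'reject', 'accept' or 'consensus'."""
    m = len(P)
    window = T[p:p + m]
    Th, Ph = heavy_string(window), heavy_string(P)
    D = [j for j in range(m) if Th[j] != Ph[j]]
    if len(D) > 2 * (z.bit_length() - 1):      # 2 * floor(log2 z)
        return "reject", None, None, D
    outside = [j for j in range(m) if Th[j] == Ph[j]]
    alpha_rest = prod((window[j][Th[j]] for j in outside), start=Fraction(1))
    beta_rest = prod((P[j][Ph[j]] for j in outside), start=Fraction(1))
    if not D:
        ok = alpha_rest * z >= 1 and beta_rest * z >= 1
        return ("accept" if ok else "reject"), None, None, D
    # Keep only the mismatching positions and fold the probability of the
    # agreeing heavy letters into the first kept position.
    X = [dict(window[j]) for j in D]
    Y = [dict(P[j]) for j in D]
    X[0] = {a: q * alpha_rest for a, q in X[0].items()}
    Y[0] = {a: q * beta_rest for a, q in Y[0].items()}
    return "consensus", X, Y, D
```

In the `consensus` case, the pair (X, Y) is the short instance handed to a Weighted Consensus solver, and D is what the oracle needs to expand a consensus string of (X, Y) back into a witness of length m.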
By Proposition 6.1, each of the resulting instances of Weighted Consensus can be solved in \(O(\lambda \log z + \sqrt {z\lambda }\log z)=O(\sqrt {z\lambda }\log z)\) time (due to z ≥ λ).
Proposition 6.4
The GWPM problem can be solved in \(O(n\sqrt {z\lambda }\log z)\) time. An oracle for the GWPM problem using \(O(n \log z)\) space and supporting queries in O(m) time can be computed within the same time complexity.
In the remainder of this section, we design a tailor-made solution which lets us improve the \(O(\log z)\) factors in Propositions 6.1 and 6.4 to \(O(\log \log z + \log \lambda )\).
6.2 Reduction to Short Dissimilar Weighted Consensus
In the SDWC problem, we further require an ordering of letters according to their probabilities. This assumption is trivial if σ = O(1); otherwise, we use the preprocessing of Section 5.7 to expedite sorting. The following result refines Lemma 6.3.
Lemma 6.5
The GWPM problem and the computation of its oracle can be reduced in \(O(R + (n-m + 1)\lambda \log z)\) time to at most n − m + 1 instances of SDWC.
Proof
The reduction of Section 6.1 in \(O(R + (n-m + 1)\log z)\) time results in n − m + 1 dissimilar instances of length at most \(2\log z\). However, the characters are not ordered by nonincreasing probabilities. Before we sort them, we apply Lemma 5.11 in order to reduce the length to \(O(\frac {\log z}{\log \lambda })\); this takes \(O(\lambda \log z)\) time. Note that both removing irrelevant characters and merging two positions into their Cartesian product preserve the property that the probabilities at each position sum up to at most one, so the resulting instance of Multichoice Knapsack can be interpreted back as an instance of Weighted Consensus. Finally, we sort the probabilities in \(O(\lambda \log \lambda )\) time per position, i.e., in \(O(\lambda \log z)\) time per instance of SDWC. □
6.3 Solving Short Dissimilar Weighted Consensus
6.3.1 Overview
We follow the same general meet-in-the-middle scheme as the algorithm for Multichoice Knapsack presented in Proposition 5.7. The latter relies on Lemma 5.3, whose analogue in terms of weighted sequences and probabilities is much simpler.
Observation 6.6

\(\P (L, X[1\mathinner {..} |L|])\ge \frac {1}{z_{\ell }}\),

c is a single letter,

\(\P (R, X[n-|R|+ 1\mathinner {..} n])\ge \frac {1}{z_{r}}\).
Motivated by this formulation, we employ a notion of \(\frac {1}{z}\)-solid prefixes of a weighted sequence X, that is, strings S such that \(S \approx _{\frac {1}{z}} X[1\mathinner {..} |S|]\), and a symmetric notion of \(\frac {1}{z}\)-solid suffixes. By Observation 4.2, the number of \(\frac {1}{z}\)-solid prefixes of a weighted sequence X of length n is at most nz. A direct application of the approach of Proposition 5.7, using solid prefixes and suffixes as partial choices, would result in generating up to nz_{ℓ} solid prefixes and nz_{r} solid suffixes of X. Recall that, in the case of SDWC, \(n = O(\log z)\).
However, \(\frac {1}{z}\)-solid prefixes have more structure than prefix partial choices of rank at most z. We exploit this structure by introducing a notion of light \(\frac {1}{z}\)-solid prefixes, that is, \(\frac {1}{z}\)-solid prefixes that end with a non-heavy letter in X; these are the key ingredient in our solution. We show that the number of light \(\frac {1}{z}\)-solid prefixes of X is at most z. Our algorithm for SDWC applies this fact to limit the number of generated \(\frac {1}{z_{\ell }}\)-solid prefixes and \(\frac {1}{z_{r}}\)-solid suffixes to z_{ℓ} and z_{r}, respectively.

In Section 6.3.2 (corresponds to Section 5.3) we show the O(z) bound on the number of light \(\frac {1}{z}\)-solid prefixes (or suffixes) and prove a decomposition property for them that is similar to Observation 6.6 (but more complex).

Section 6.3.3 (corresponds to Section 5.4) contains an algorithm for generating light \(\frac {1}{z^{\prime }}\)-solid prefixes of X that are simultaneously \(\frac {1}{z}\)-solid prefixes of Y. Intuitively, light solid prefixes of a given length k ≤ n can be obtained from light solid prefixes of any length smaller than k by extending them with any character. This gives O(nλ) lists of solid prefixes to be merged by probabilities, which multiplies the complexity by \(O(\log (n\lambda )) = O(\log \log z + \log \lambda )\).

Section 6.3.4 (corresponds to Section 5.5) shows how to compute a solution based on sorted lists of common solid prefixes and suffixes of lengths summing up to n.

Section 6.3.5 (corresponds to Section 5.6) implements the meet-in-the-middle approach. Because of the more complicated decomposition property, this part of the algorithm is the most complex. It consists of \(O(\log n)=O(\log \log z)\) phases.
6.3.2 Combinatorics of Light Solid Prefixes (Counterpart of Section 5.3)
We define a light \(\frac {1}{z}\)-solid prefix of a weighted sequence X as a \(\frac {1}{z}\)-solid prefix S of length k such that k = 0 or S[k]≠H(X)[k].
We say that a string P is a maximal \(\frac {1}{z}\)-solid prefix of a weighted sequence X if P is a \(\frac {1}{z}\)-solid prefix of X and no string \(P^{\prime } = Ps\), for s ∈ Σ, is a \(\frac {1}{z}\)-solid prefix of X. Maximal solid prefixes have the following simple property, originally due to Amir et al. [1].
Fact 6.7 ([1])
A weighted sequence has at most z maximal \(\frac {1}{z}\)-solid prefixes, that is, \(\frac {1}{z}\)-solid prefixes which cannot be extended to any longer \(\frac {1}{z}\)-solid prefix.
Fact 6.7 lets us bound the number of light solid prefixes.
Fact 6.8
A weighted sequence has at most z different light \(\frac {1}{z}\)-solid prefixes.
Proof
We show a pair of inverse mappings between the set of maximal \(\frac {1}{z}\)-solid prefixes of a weighted sequence X and the set of light \(\frac {1}{z}\)-solid prefixes of X. If P is a maximal \(\frac {1}{z}\)-solid prefix of X, then we obtain a light \(\frac {1}{z}\)-solid prefix by removing all trailing letters of P that are heavy letters at the corresponding positions in X. For the inverse mapping, we extend each light \(\frac {1}{z}\)-solid prefix by heavy letters as long as the prefix is \(\frac {1}{z}\)-solid. □
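The correspondence used in this proof can be checked by brute force on small inputs. The sketch below is only an illustration under stated assumptions: a weighted sequence is a list of dictionaries from single-character letters to `Fraction` probabilities, ties for the heavy letter are broken by dictionary order, and all function names are hypothetical. It enumerates every solid prefix explicitly, so it is exponential in general and unrelated to the efficient generation procedure of Section 6.3.3.

```python
from fractions import Fraction

def solid_prefixes(X, z):
    """All strings S with P(S, X[1..|S|]) >= 1/z, by exhaustive extension."""
    found, frontier = {""}, [("", Fraction(1))]
    for dist in X:
        frontier = [(s + a, p * q) for s, p in frontier
                    for a, q in dist.items() if p * q * z >= 1]
        found.update(s for s, _ in frontier)
    return found

def light_and_maximal(X, z):
    """Returns the heavy string of X together with the sets of light and
    of maximal 1/z-solid prefixes."""
    heavy = [max(dist, key=dist.get) for dist in X]
    solid = solid_prefixes(X, z)
    light = {s for s in solid if not s or s[-1] != heavy[len(s) - 1]}
    maximal = {s for s in solid
               if len(s) == len(X) or not any(s + a in solid for a in X[len(s)])}
    return heavy, light, maximal

def strip_heavy(s, heavy):
    """Removes trailing letters that are heavy at their positions."""
    while s and s[-1] == heavy[len(s) - 1]:
        s = s[:-1]
    return s
```

Applying `strip_heavy` to every maximal solid prefix yields exactly the set of light solid prefixes, which is the mapping described in the proof of Fact 6.8.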
With this notion and its symmetric counterpart, light \(\frac {1}{z}\)-solid suffixes, we can state a stronger version of Observation 6.6. Note that this is where the dissimilarity is crucial.
Lemma 6.9

L is a light \(\frac {1}{z_{\ell }}\)-solid prefix of U,

c is a single letter,

all letters of C are heavy in V,

R is a light \(\frac {1}{z_{r}}\)-solid suffix of V.
Proof
We set L as the longest proper prefix of S which is a \(\frac {1}{z_{\ell }}\)-solid prefix of both X and Y, and we define k := |L|. Note that L is a light \(\frac {1}{z_{\ell }}\)-solid prefix of X or Y, because H(X) and H(Y ) are dissimilar. If k = n − 1, we conclude the proof by setting c = S[n] and taking both C and R to be empty strings.
Otherwise, we have \(\P (S[1\mathinner {..} k + 1],V[1\mathinner {..} k + 1])<\frac {1}{z_{\ell }}\) for V = X or V = Y. Since \(\P (S,V)\ge \frac {1}{z}\) and z_{ℓ} ⋅ z_{r} ≥ z, this implies \(\P (S[k + 2\mathinner {..} n],V[k + 2\mathinner {..} n])\ge \frac {1}{z_{r}}\), i.e., that S[k + 2..n] is a \(\frac {1}{z_{r}}\)-solid suffix of V. We set c = S[k + 1], C as the longest prefix of S[k + 2..n] composed of letters heavy in V, and R as the remaining suffix of S[k + 2..n]. Then R is clearly a light \(\frac {1}{z_{r}}\)-solid suffix of V. □
6.3.3 Generating Solid Prefixes (Counterpart of Section 5.4)
We say that a string P is a common \(\frac {1}{z}\)-solid prefix (suffix) of weighted sequences X and Y if it is a \(\frac {1}{z}\)-solid prefix (suffix) of both X and Y. Let \((X,Y,\frac {1}{z})\) be an instance of the SDWC problem. A standard representation of a common \(\frac {1}{z}\)-solid prefix P of length k of X and Y is a triple (P,p_{1},p_{2}) such that p_{1} and p_{2} are the probabilities \(p_{1} = \P (P,X[1\mathinner {..} k])\) and \(p_{2} = \P (P,Y[1\mathinner {..} k])\).
If σ is constant, the string P can be directly represented using \(O(\log z)\) bits due to \(|P|=O(\log z)\). Otherwise, P is written using variable-length encoding so that a letter that occurs at a given position with probability p in X has a representation that consists of \(O(\log \frac {1}{p})\) bits. For every position i, the encoding can be constructed by assigning subsequent integer identifiers to letters according to the nonincreasing order of \(\pi _{i}^{(X)}(c)\). Note that an instance of the SDWC problem provides us with the desired sorted order of letters. This lets us store a \(\frac {1}{z}\)-solid prefix using \(O(\log z)\) bits: we concatenate the variable-length representations of its letters and we store a bit mask of size \(O(\log z)\) that stores the delimiters between the representations of single letters.
In either case, our assumptions on the model of computations imply that the standard representation takes constant space. Moreover, constant time is sufficient to extend a common \(\frac {1}{z}\)-solid prefix by a given letter. An analogous representation can also be used to store common \(\frac {1}{z}\)-solid suffixes.
The following observation describes longer light solid prefixes in terms of shorter ones.
Observation 6.10
Let P be a nonempty light \(\frac {1}{z}\)-solid prefix of X. If one removes its last letter and then removes all the trailing letters which are heavy at the respective positions in X, then a shorter light \(\frac {1}{z}\)-solid prefix of X is obtained.
We build upon Observation 6.10 to derive an efficient algorithm for generating light solid prefixes.
Lemma 6.11
Let \((X,Y,\frac {1}{z})\) be an instance of the SDWC problem and let \(z^{\prime }\le z\). The standard representations of all common \(\frac {1}{z}\)-solid prefixes of X and Y being light \(\frac {1}{z^{\prime }}\)-solid prefixes of X, sorted first by their length and then by the probabilities in X, can be generated in \(O(z^{\prime } (\log \log z+\log \lambda )+\log ^{2} z)\) time.
Proof
For k ∈{0,…,n}, let B_{k} be a list of the requested solid prefixes of length k sorted by their probabilities p_{1} in X. Fact 6.8 guarantees that \({\sum }_{k = 0}^{n} |\textbf {B}_{k}| \le z^{\prime }\).
We compute the lists B_{k} for subsequent lengths k. We start with B_{0} containing the empty string with its probabilities p_{1} = p_{2} = 1. To compute B_{k} for k > 0, we use Observation 6.10. For a given i ∈{0,…,k − 1}, we iterate over all elements (P,p_{1},p_{2}) of B_{i} ordered by the nonincreasing probabilities p_{1} and try to extend each of them by the heavy letters in X at positions i + 1,…,k − 1 and by the letter s at position k. We process the letters s ordered by \({\pi }_{k}^{(X)}(s)\), ignoring the first one (H(X)[k]) and stopping as soon as we do not get a \(\frac {1}{z^{\prime }}\)-solid prefix of X.
Let us analyse the time complexity of the kth step of the algorithm. If an element (P,p_{1},p_{2}) and letter s that we consider satisfy \(p^{\prime }_{1} \ge \frac {1}{z^{\prime }}\), this accounts for a new light \(\frac {1}{z^{\prime }}\)-solid prefix of X. Hence, in total (over all steps) we consider \(O(z^{\prime })\) such elements. Note that some of these elements may be discarded due to the condition on \(p^{\prime }_{2}\).
For each inspected element (P,p_{1},p_{2}), we also consider at most one letter s for which \(p^{\prime }_{1}\) is not sufficiently large. If this is not the only letter considered for this element, such a candidate can be charged to the previously considered letter. The opposite situation may happen once for each list B_{i}, which may give O(k) additional operations in the kth step, \(O(\log ^{2} z)\) in total.
Thanks to the order in which the lists are considered, we can store products of probabilities \({\prod }_{j=i + 1}^{k-1} \pi ^{(X)}_{j}(X^{\prime }[j])\), \({\prod }_{j=i + 1}^{k-1}\pi ^{(Y)}_{j}(X^{\prime }[j])\) and factors \(X^{\prime }[i + 1\mathinner {..} k-1]\) so that the representation of each subsequent light \(\frac {1}{z^{\prime }}\)-solid prefix of length k is computed in O(1) time. Finally, the merging step in the kth phase takes \(O(|\textbf {B}_{k}|\log (k\lambda )) = O(|\textbf {B}_{k}| (\log \log z+\log \lambda ))\) time if a binary heap of O(kλ) elements is used.
6.3.4 Merging Solid Prefixes with Suffixes (Counterpart of Section 5.5)
Next, we provide an analogue of Lemma 5.6.
Lemma 6.12
Let L and R be lists containing, for some k ∈{0,…,n}, standard representations of common \(\frac {1}{z}\)-solid prefixes of length k and common \(\frac {1}{z}\)-solid suffixes of length n − k of X and Y, respectively. If the elements of the lists are sorted according to non-decreasing probabilities in X and Y, respectively, one can check in O(|L| + |R|) time whether the concatenation of any \(\frac {1}{z}\)-solid prefix from L and \(\frac {1}{z}\)-solid suffix from R yields a consensus string S for X and Y.
Proof
First, we filter out dominated elements of the lists, i.e., elements (P,p_{1},p_{2}) such that there exists another element \((P^{\prime },p_{1}^{\prime },p_{2}^{\prime })\) with \(p_{1}^{\prime } \ge p_{1}\) and \(p_{2}^{\prime } \ge p_{2}\). This can be done in linear time. After this operation, the list R is ordered according to non-increasing probabilities in X, so we reverse it; now both lists are ordered with respect to the non-decreasing probabilities in X.
For every element (P,p_{1},p_{2}) of L, we compute the leftmost element \((P^{\prime },p^{\prime }_{1},p^{\prime }_{2})\) of R such that \(p_{1} p^{\prime }_{1} \ge \frac {1}{z}\). This element maximises \(p^{\prime }_{2}\) among all elements satisfying the latter condition. Hence, it suffices to check if \(p_{2} p^{\prime }_{2} \ge \frac {1}{z}\), and if so, report the result \(S=PP^{\prime }\). As the lists are ordered by p_{1} and \(p^{\prime }_{1}\), respectively, all such elements can be computed in O(|L| + |R|) total time. □
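The proof of Lemma 6.12 can be illustrated by the following minimal Python sketch. It is not the authors' implementation: the names `pareto_front` and `find_consensus` are hypothetical, elements are plain triples (string, probability in X, probability in Y), and, for brevity, the lists are sorted explicitly instead of relying on the presorted input assumed by the lemma (so the sketch runs in O(n log n) rather than linear time).

```python
from fractions import Fraction

def pareto_front(items):
    """Drop dominated triples; survivors have p1 increasing and p2 decreasing."""
    front = []
    best_p2 = None
    # Scan from the largest p1 downwards and keep only strict improvements in p2.
    for s, p1, p2 in sorted(items, key=lambda t: (t[1], t[2]), reverse=True):
        if best_p2 is None or p2 > best_p2:
            front.append((s, p1, p2))
            best_p2 = p2
    front.reverse()
    return front

def find_consensus(prefixes, suffixes, z):
    """Return some prefix+suffix with probability >= 1/z in both X and Y, or None."""
    threshold = Fraction(1, z)
    left = pareto_front(prefixes)
    right = pareto_front(suffixes)
    j = len(right)  # right[j:] are the suffixes compatible in X with the current prefix
    for s, p1, p2 in left:
        # p1 only grows, so the leftmost compatible suffix can only move to the left.
        while j > 0 and p1 * right[j - 1][1] >= threshold:
            j -= 1
        if j < len(right):
            t, q1, q2 = right[j]  # leftmost compatible suffix maximises q2
            if p2 * q2 >= threshold:
                return s + t
    return None

half, quarter, one = Fraction(1, 2), Fraction(1, 4), Fraction(1)
prefixes = [("a", half, quarter), ("b", quarter, half)]
suffixes = [("a", half, one), ("b", one, half)]
print(find_consensus(prefixes, suffixes, 4))  # bb
print(find_consensus(prefixes, suffixes, 2))  # None
```

The pointer `j` never moves to the right, which mirrors the two-pointer argument behind the O(|L| + |R|) bound once the lists are already sorted.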
6.3.5 Merge-in-the-Middle Implementation (Counterpart of Section 5.6)
In this section, we solve the SDWC problem based on Lemma 6.9. We generate all candidates for L ⋅ c and R using Lemma 6.11, and we apply a divide-and-conquer procedure to fill this with C. Our procedure works for fixed U,V ∈{X,Y }; the algorithm repeats it for all four choices.
Let L_{i} denote a list of all common \(\frac {1}{z}\)-solid prefixes of X and Y obtained by extending a light \(\frac {\sqrt {\lambda }}{\sqrt {z}}\)-solid prefix of U of length i − 1 by a single letter s at position i, and let R_{i} denote a list of all common \(\frac {1}{z}\)-solid suffixes of X and Y of length n − i + 1 that are light \(\frac {1}{\sqrt {z\lambda }}\)-solid suffixes of V. We assume that the lists L_{i} and R_{i} are sorted according to the probabilities in U and V, respectively. We assume that L_{n+ 1} = ∅, whereas R_{n+ 1} contains only a representation of an empty string.
The following lemma shows how to compute the lists L_{i} and R_{i} and bounds their total size. In case of σ = O(1) it is a direct consequence of Lemma 6.11. Otherwise, one needs to exercise caution when computing the lists L_{i}.
Lemma 6.13
The total size of lists L_{i} and R_{i} for i ∈{1,…,n + 1} is \(O(\sqrt {z \lambda })\); they can be computed in \(O(\sqrt {z\lambda } (\log \log z+\log \lambda ))\) time.
Proof
An \(O(\sqrt {z\lambda } (\log \log z+\log \lambda ))\)-time computation of the lists R_{i} is directly due to Lemma 6.11. As for the lists L_{i}, we first compute in \(O\left (\frac {\sqrt {z}}{\sqrt {\lambda }}(\log \log z+\log \lambda )\right )\) time the lists of all light \(\frac {\sqrt {\lambda }}{\sqrt {z}}\)-solid prefixes of U, sorted by the lengths of strings and then by the probabilities in U, again using Lemma 6.11. Then for each length i − 1 and for each letter s at the ith position, we extend all these prefixes by a single letter. This way we obtain λ lists for a given i − 1 that can be merged according to the probabilities in U to form the list L_{i}. Generation of the auxiliary lists takes \(O\left (\frac {\sqrt {z}}{\sqrt {\lambda }}\cdot \lambda \right )=O(\sqrt {z\lambda })\) time in total, and merging them using a binary heap takes \(O(\sqrt {z\lambda }\log \lambda )\) time. This way we obtain an \(O(\sqrt {z\lambda } (\log \log z+\log \lambda ))\)-time algorithm. □
Let \(\textbf {L}^{*}_{a,b}\) be a list of common \(\frac {1}{z}\)-solid prefixes of X and Y of length b obtained by taking a common \(\frac {1}{z}\)-solid prefix from L_{i} for some i ∈{a,…,b} and extending it by b − i letters that are heavy at the respective positions in V. Similarly, \(\textbf {R}^{*}_{a,b}\) is a list of common \(\frac {1}{z}\)-solid suffixes of length n − a + 1 obtained by taking a common \(\frac {1}{z}\)-solid suffix from R_{i} for some i ∈{a,…,b} and prepending it by i − a letters that are heavy in V. Again, we assume that each of the lists \(\textbf {L}^{*}_{a,b}\) and \(\textbf {R}^{*}_{a,b}\) is sorted according to the probabilities in U and V, respectively.
A basic interval is an interval [a,b] represented by its endpoints 1 ≤ a ≤ b ≤ n + 1 such that 2^{j} divides a − 1 and \(b=\min (n + 1,a + 2^{j} - 1)\) for some integer j called the layer of the interval. For every \(j = 0,\ldots ,\left \lceil {\log (n + 1)} \right \rceil \), there are \({\Theta }\left (\frac {n}{2^{j}}\right )\) basic intervals in the jth layer and they are pairwise disjoint.
Example 6.14
For n = 7, the basic intervals are [1,1], …, [8,8], [1,2], [3,4], [5,6], [7,8], [1,4], [5,8], [1,8].
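The definition of basic intervals can be checked mechanically. The short Python sketch below (the function name `basic_intervals` is hypothetical) enumerates them layer by layer directly from the definition, i.e., with 2^{j} dividing a − 1 and b = min(n + 1, a + 2^{j} − 1), and reproduces the list of Example 6.14 for n = 7.

```python
def basic_intervals(n):
    """Return the basic intervals within [1, n + 1], grouped by layer j = 0, 1, ..."""
    layers = []
    j = 0
    while True:
        step = 2 ** j
        # Left endpoints a satisfy: 2^j divides a - 1 and a <= n + 1.
        layer = [(a, min(n + 1, a + step - 1)) for a in range(1, n + 2, step)]
        layers.append(layer)
        if len(layer) == 1:  # the top layer consists of [1, n + 1] only
            break
        j += 1
    return layers

for j, layer in enumerate(basic_intervals(7)):
    print(j, layer)
# 0 [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8)]
# 1 [(1, 2), (3, 4), (5, 6), (7, 8)]
# 2 [(1, 4), (5, 8)]
# 3 [(1, 8)]
```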
Lemma 6.15
The total size of the lists \(\textbf {L}^{*}_{a,b}\) and \(\textbf {R}^{*}_{a,b}\) for all basic intervals [a,b] is \(O(\sqrt {z\lambda }\log \log z)\) and they can all be constructed in \(O(\sqrt {z\lambda }(\log \log z+\log \lambda ))\) time.
Proof
We compute all the lists \(\textbf {L}^{*}_{a,b}\) and \(\textbf {R}^{*}_{a,b}\) for basic intervals [a,b] of subsequent layers \(j = 0,\ldots ,\left \lceil {\log (n + 1)} \right \rceil \). For j = 0, we have \(\textbf {L}^{*}_{a,a} = \textbf {L}_{a}\) and \(\textbf {R}^{*}_{a,a} = \textbf {R}_{a}\). All these lists can be computed in \(O(\sqrt {z\lambda }(\log \log z+\log \lambda ))\) time via Lemma 6.13.
Thus, we can compute \(\textbf {L}^{*}_{a,b}\) in time proportional to the sum of lengths of \(\textbf {L}^{*}_{a,c}\) and \(\textbf {L}^{*}_{c + 1,b}\). (Note that the necessary products of probabilities can be computed in \(O(n) = O(\log z)\) total time.) For every \(j = 1,\ldots ,\left \lceil {\log n} \right \rceil \), the total length of the lists from the jth layer does not exceed the total length of the lists from the (j − 1)th layer. By Lemma 6.13, the lists at the 0th layer have size \(O(\sqrt {z\lambda })\). The conclusion follows from the fact that \(\log n = O(\log \log z)\). □
Finally, we are ready to apply a divideandconquer approach to solve the SDWC problem:
Lemma 6.16
The SDWC problem can be solved in \(O(\sqrt {z\lambda } (\log \log z + \log \lambda ))\) time.
Proof
The algorithm goes along Lemma 6.9, considering all choices of U and V . For each of them, we proceed as follows.
First, we compute the lists L_{i}, R_{i} for all i = 1,…,n and \(\textbf {L}^{*}_{a,b}\), \(\textbf {R}^{*}_{a,b}\) for all basic intervals. By Lemmas 6.13 and 6.15, this takes \(O(\sqrt {z\lambda } (\log \log z+\log \lambda ))\) time.
Note that, in order to find out if there is a feasible solution, it suffices to attempt joining a common \(\frac {1}{z}\)-solid prefix from L_{j} with a common \(\frac {1}{z}\)-solid suffix from R_{k} for some indices 1 ≤ j < k ≤ n + 1 by heavy letters of V at positions j + 1,…,k − 1. We use a recursive routine that searches for such a pair of indices j,k ∈ [a,b] within a basic interval [a,b] which has positive length and therefore can be decomposed into two basic subintervals [a,c] and [c + 1,b]. Then either j ≤ c < k, or both indices j, k belong to the same interval [a,c] or [c + 1,b]. To check the first case, we apply the algorithm of Lemma 6.12 to \(L = \textbf {L}^{*}_{a,c}\) and \(R = \textbf {R}^{*}_{c + 1,b}\). The remaining two cases are solved by recursive calls for the subintervals. The recursive routine is called first for the basic interval [1,n + 1].
The computations performed by the routine for the basic intervals at the jth level take at most the time proportional to the total size of lists \(\textbf {L}^{*}_{a,b}\), \(\textbf {R}^{*}_{a,b}\) at the (j − 1)th level. Lemma 6.15 shows that the total size of the lists at all levels is \(O(\sqrt {z\lambda } \log \log z)\). Consequently, the whole recursive procedure works in \(O(\sqrt {z\lambda } \log \log z)\) time. Together with the computation of the lists, this gives \(O(\sqrt {z\lambda } (\log \log z+\log \lambda ))\) time in total. □
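The shape of this recursion can be sketched in Python as follows. The sketch only tracks which triples (a, c, b) the routine visits; in the actual algorithm each triple would trigger one call of the procedure of Lemma 6.12 on \(\textbf {L}^{*}_{a,c}\) and \(\textbf {R}^{*}_{c + 1,b}\). The function name `cross_checks` is hypothetical, and the split point is recomputed from the interval length rather than read from a precomputed layer.

```python
def cross_checks(a, b):
    """Yield (a, c, b) for every split performed by the recursion on [a, b]."""
    if a == b:
        return  # an interval of length zero cannot contain a pair j < k
    # Largest power of two strictly smaller than the number of positions in [a, b].
    size = 1
    while 2 * size < b - a + 1:
        size *= 2
    c = a + size - 1
    yield (a, c, b)                    # handles all pairs with j <= c < k
    yield from cross_checks(a, c)      # pairs with both indices in [a, c]
    yield from cross_checks(c + 1, b)  # pairs with both indices in [c + 1, b]

n = 7
print(list(cross_checks(1, n + 1)))
# [(1, 4, 8), (1, 2, 4), (1, 1, 2), (3, 3, 4), (5, 6, 8), (5, 5, 6), (7, 7, 8)]
```

Every pair 1 ≤ j < k ≤ n + 1 is separated by exactly one of the yielded triples, so no candidate pair is missed or examined twice.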
Lemma 6.16 combined with Lemma 6.5 provides an efficient solution for General Weighted Pattern Matching. It also gives a solution to Weighted Consensus (which is a special case of GWPM with n = m). Note that \(\lambda \log z = O(\sqrt {z\lambda } \log z)\) due to z ≥ λ.
Theorem 1.4
The General Weighted Pattern Matching problem can be solved in \(O(n\sqrt {z \lambda } (\log \log z+\log \lambda ))\) time, and the Weighted Consensus problem can be solved in \(O(R +\sqrt {z \lambda } (\log \log z+\log \lambda ))\) time.
7 Conditional Hardness of GWPM
The following reduction from Multichoice Knapsack to Weighted Consensus immediately yields that any significant improvement in the dependence on z and λ in the running time of our algorithm would lead to breaking longstanding barriers for special cases of Multichoice Knapsack.
Lemma 7.1
Given an instance I of the Multichoice Knapsack problem with n classes C_{1},…,C_{n} of maximum size λ, in linear time one can construct an equivalent instance of the Weighted Consensus problem with \(z=O({\prod }_{i = 1}^{n}|C_{i}|)\) and sequences of length O(n) over an alphabet of size λ.
Proof
We construct a pair of weighted sequences X,Y of length n over alphabet Σ = {1,…,λ}. Let \(C_{i} = \{c_{i,1},\ldots ,c_{i,|C_{i}|}\}\). Intuitively, choosing letter j at position i will correspond to taking c_{i,j} to the solution S.
Without loss of generality, we assume that weights and values are non-negative. Otherwise, we may subtract \(v_{\min }(i)\) from v(c_{i,j}) and \(w_{\min }(i)\) from w(c_{i,j}) for each item c_{i,j}, as well as \(V_{\min }\) from V and \(W_{\min }\) from W.
Claim 3
\({\sum }_{j = 1}^{|C_{i}|} \pi _{i}^{(X)}(j)\le 1\), \({\sum }_{j = 1}^{|C_{i}|} \pi _{i}^{(Y)}(j)\le 1\), and \(\max (z_{X},z_{Y}) \le 4{\prod }_{i = 1}^{n}|C_{i}|\).
Proof
Thus, P is a solution to the constructed instance of the Weighted Consensus problem with two threshold probabilities, \(\frac {1}{z_{X}}\) and \(\frac {1}{z_{Y}}\), if and only if S = {c_{i,j} : P[i] = j} is a solution to the underlying instance of the Multichoice Knapsack problem. To have a single threshold \(z=\max (z_{X},z_{Y})\), we append an additional position n + 1 with symbol 1 only, with \(p_{n + 1}^{(X)}(1)= 0\) and \(p_{n + 1}^{(Y)}(1)=\log z_{Y} - \log z_{X}\) provided that z_{X} ≥ z_{Y}, and symmetrically otherwise.
If one wants to make sure that the probabilities at each position sum up to exactly one, two further letters can be introduced, one of which gathers the remaining probability in X and has probability 0 in Y, and the other gathers the remaining probability in Y, and has probability 0 in X. □
For completeness, let us recall the folklore reductions that show that Subset Sum and 3-Sum are special cases of Multichoice Knapsack. To express an instance of Subset Sum with integers a_{1},…,a_{n} and threshold R as an instance of Multichoice Knapsack, we introduce n classes of two items each, which correspond to taking and omitting the respective elements. The first item has value a_{i} and weight − a_{i}, while for the other these are both 0. The thresholds are V = R and W = −R.
Similarly, given an instance of 3-Sum with classes a_{1,1},…,a_{1,λ}, a_{2,1},…,a_{2,λ}, and a_{3,1},…,a_{3,λ}, we can create an instance of Multichoice Knapsack with the same three classes of items with values a_{i,j} and weights − a_{i,j}. The thresholds are V = W = 0.
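Both folklore reductions can be tried out with the Python sketch below. It assumes the usual formulation of Multichoice Knapsack, in which a choice picks exactly one item from every class and is feasible if its total value is at most V and its total weight is at most W. The exhaustive solver `multichoice_knapsack` and the helper names are hypothetical and serve only to check the reductions on tiny inputs; they are unrelated to the algorithms of this paper.

```python
from itertools import product

def multichoice_knapsack(classes, V, W):
    """Brute force: is there one (value, weight) item per class with sums <= V and <= W?"""
    for choice in product(*classes):
        if sum(v for v, _ in choice) <= V and sum(w for _, w in choice) <= W:
            return True
    return False

def from_subset_sum(numbers, R):
    """Two items per number: take it (a, -a) or omit it (0, 0); thresholds V = R, W = -R."""
    classes = [[(a, -a), (0, 0)] for a in numbers]
    return classes, R, -R

def from_three_sum(class1, class2, class3):
    """Three classes with values a and weights -a; thresholds V = W = 0."""
    classes = [[(a, -a) for a in cls] for cls in (class1, class2, class3)]
    return classes, 0, 0

print(multichoice_knapsack(*from_subset_sum([3, 5, 9], 14)))          # True  (5 + 9)
print(multichoice_knapsack(*from_subset_sum([3, 5, 9], 4)))           # False
print(multichoice_knapsack(*from_three_sum([1, 7], [2, -3], [2, 5]))) # True  (1 - 3 + 2)
```

The two opposite-sign constraints force the chosen values to sum to exactly R (respectively 0), which is why an inequality-based problem can express these equality-based ones.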
Theorem 1.6
The Weighted Consensus problem cannot be solved in:
 1. O^{∗}(z^{ε}) time for every ε > 0, unless the exponential time hypothesis (ETH) fails;
 2. O^{∗}(z^{0.5−ε}) time for some ε > 0, unless there is an O^{∗}(2^{(0.5−ε)n})-time algorithm for the Subset Sum problem;
 3. \(\tilde {O}(R+z^{0.5}\lambda ^{0.5-\varepsilon })\) time for some ε > 0 and for n = O(1), unless the 3-Sum conjecture fails.
Proof
We use Lemma 7.1 to derive algorithms for the Multichoice Knapsack problem based on hypothetical solutions for Weighted Consensus. Subset Sum is a special case of Multichoice Knapsack with λ = 2, i.e., \({\prod }_{i}|C_{i}|= 2^{n}\). Hence, an O^{∗}(z^{o(1)})-time solution for Weighted Consensus would yield an O^{∗}(2^{o(n)})-time algorithm for Subset Sum, which contradicts ETH by the results of Etscheid et al. [9] and Gurari [11]. Similarly, an O^{∗}(z^{0.5−ε})-time solution for Weighted Consensus would yield an O^{∗}(2^{(0.5−ε)n})-time algorithm for Subset Sum. Moreover, 3-Sum is a special case of Multichoice Knapsack with n = 3 and \({\prod }_{i}|C_{i}|=\lambda ^{3}\). Hence, an \(\tilde {O}(R+z^{0.5}\lambda ^{0.5-\varepsilon })\)-time solution for Weighted Consensus with n = O(1) yields an \(\tilde {O}(\lambda + \lambda ^{1.5 + 0.5-\varepsilon })=\tilde {O}(\lambda ^{2-\varepsilon })\)-time algorithm for 3-Sum. □
Nevertheless, it might still be possible to improve the dependence on n in the GWPM problem. For example, one may hope to achieve \(\tilde {O}(nz^{0.5-\varepsilon }+z^{0.5})\) time for λ = O(1).
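The exponent bookkeeping behind the last two implications can be spelled out as follows. This is only a restatement of the substitutions used in the proof, under the bound \(z=O({\prod }_{i}|C_{i}|)\) of Lemma 7.1, with the input size R taken to be O(λ) for a 3-Sum instance and with 0 < ε ≤ 1:

\[
\begin{aligned}
\text{Subset Sum:}\quad & z = O(2^{n}) \;\Longrightarrow\; z^{0.5-\varepsilon} = O\!\left(2^{(0.5-\varepsilon)n}\right),\\
\text{3-Sum:}\quad & z = O(\lambda^{3}) \;\Longrightarrow\; R + z^{0.5}\lambda^{0.5-\varepsilon} = O\!\left(\lambda + \lambda^{1.5}\cdot\lambda^{0.5-\varepsilon}\right) = O\!\left(\lambda^{2-\varepsilon}\right).
\end{aligned}
\]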
8 Multivariate Analysis of Multichoice Knapsack and GWPM
In Section 5, we gave an \(O(N+a^{0.5}\lambda ^{0.5}\log A)\)-time algorithm for the Multichoice Knapsack problem. Improvement of either exponent to 0.5 − ε would result in a breakthrough for the Subset Sum and 3-Sum problems, respectively. Nevertheless, this does not refute the existence of faster algorithms for some particular values (a,λ) other than those emerging from instances of Subset Sum or 3-Sum. Indeed, in this section we show an algorithm that is superior if \(\frac {\log a}{\log \lambda }\) is a constant other than an odd integer. We also argue that it is optimal (up to lower order terms) for every constant \(\frac {\log a}{\log \lambda }\) unless the k-Sum conjecture fails.
We analyse the running times of algorithms for the Multichoice Knapsack problem expressed as O(n^{O(1)} ⋅ T(a,λ)) for some function T monotone with respect to both arguments. The algorithm of Theorem 1.3 proves that achieving \(T(a,\lambda )=\sqrt {a\lambda }\) is possible. On the other hand, if we assume that Subset Sum does not admit an O^{∗}(2^{(0.5−ε)n})-time solution, then we immediately get that we cannot have T(a,2) = O(a^{0.5−ε}) for any ε > 0. Similarly, the 3-Sum conjecture implies that T(λ^{3},λ) = O(λ^{2−ε}) is impossible. While this already refutes the possibility of having T(a,λ) = O(a^{0.5}λ^{0.5−ε}) across all arguments (a,λ), such a bound may still hold for some special cases covering an infinite number of arguments. For example, we may potentially achieve T(a,λ) = O((aλ)^{0.5−ε}) = O(λ^{1.5−ε}) for a = λ^{2}.
Consequently, we have some room between the lower and the upper bound of \(\sqrt {a \lambda }\). In the aforementioned case of a = λ^{2}, the upper bound is \(\lambda ^{\frac {3}{2}}\), compared to the lower bound of \(\lambda ^{\frac {4}{3}-\varepsilon }\). Below, we show that the upper bound can be improved to meet the lower bound. More precisely, we show an algorithm whose running time is \(O(N + (a^{\frac {k + 1}{2k + 1}}+\lambda ^{k})\log \lambda \cdot n^{k})\) for every positive integer k. Note that, for a = λ^{c}, we have \(a^{\frac {k + 1}{2k + 1}}+\lambda ^{k} = \lambda ^{c\frac {k + 1}{2k + 1}}+ \lambda ^{k}\), so for 2k − 1 ≤ c ≤ 2k + 1 the running time indeed matches the lower bounds up to the n^{k} term.
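To get a feeling for these bounds, the following Python sketch compares the exponents of λ in \(\sqrt {a\lambda }\) and in \(a^{\frac {k + 1}{2k + 1}}+\lambda ^{k}\) when a = λ^{c}. Logarithmic factors and the n^{k} term are ignored, and the function names as well as the cap `k_max` are assumptions made for illustration only.

```python
from fractions import Fraction

def general_exponent(c):
    """Exponent of lambda in sqrt(a * lambda) for a = lambda**c."""
    return (Fraction(c) + 1) / 2

def refined_exponent(c, k):
    """Exponent of lambda in a**((k+1)/(2k+1)) + lambda**k for a = lambda**c."""
    return max(Fraction(c) * Fraction(k + 1, 2 * k + 1), Fraction(k))

def best_k(c, k_max=10):
    """Smallest k in 1..k_max that minimises the refined exponent."""
    return min(range(1, k_max + 1), key=lambda k: refined_exponent(c, k))

for c in (2, 3, 4):
    k = best_k(c)
    print(c, k, refined_exponent(c, k), general_exponent(c))
# 2 1 4/3 3/2
# 3 1 2 2
# 4 2 12/5 5/2
```

For c = 2 the refined exponent 4/3 improves on 3/2, whereas for the odd integer c = 3 both exponents coincide, in line with the remark that the new algorithm is superior when c is not an odd integer.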
Due to Lemma 5.11, the extra n^{k} term reduces to \(O((\frac {\log A}{\log \lambda })^{k})\). Finally, we study the complexity of the GWPM problem.
8.1 Algorithm for Multichoice Knapsack
Let us start by discussing the bottleneck of the algorithm of Theorem 1.3 for large λ. The problem is that the size of the classes does not let us partition every choice S into a prefix L and a suffix R with ranks both \(O(\sqrt {A_{V}})\). Lemma 5.3 leaves us with an extra letter c between L and R, and in the algorithm we append it to the prefix (while generating \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j}\)).
We provide a workaround based on reordering of classes. Our goal is to make sure that items with large rank appear only in a few leftmost classes. For this, we guess the classes of the k items with largest rank (in a feasible solution) and move them to the front. Since this depends on the sought feasible solution, we shall actually verify all \(\binom {n}{k}\) possibilities.
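The enumeration of the \(\binom {n}{k}\) guesses can be sketched in Python as below. The function name `reorderings` is hypothetical, and classes are represented only by their indices 1,…,n; the order within the k front classes is irrelevant for the argument, so the sketch keeps them in increasing order.

```python
from itertools import combinations

def reorderings(n, k):
    """Yield every class order that moves some k of the n classes to the front."""
    for front in combinations(range(1, n + 1), k):
        rest = [i for i in range(1, n + 1) if i not in front]
        yield list(front) + rest

for order in reorderings(4, 2):
    print(order)
# [1, 2, 3, 4]
# [1, 3, 2, 4]
# [1, 4, 2, 3]
# [2, 3, 1, 4]
# [2, 4, 1, 3]
# [3, 4, 1, 2]
```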
Now, our solution considers two cases: For j > k, the reordering lets us assume \(\text{rank} _{v}(c)< \ell ^{\frac {1}{k}}\), so we do not need to consider all items from C_{j}. For j ≤ k, on the other hand, we exploit the fact that \(|\textbf {L}_{j-1}^{(\ell )}\odot C_{j}|\le \lambda ^{j}\), which is at most λ^{k}.
The underlying combinatorial foundation is formalised as a variant of Lemma 5.3:
Lemma 8.1
Let ℓ and r be positive integers such that v(L_{j}[ℓ]) + v(R_{j+ 1}[r]) > V for every 0 ≤ j ≤ n. Let k ∈{1,…,n} and suppose that S is a choice with v(S) ≤ V such that rank_{v}(S ∩ C_{i}) ≥ rank_{v}(S ∩ C_{j}) for i ≤ k < j. There is an index j ∈{1,…,n} and a decomposition S = L ∪{c}∪ R such that \(L\in \textbf {L}_{j-1}^{(\ell )}\), \(R\in \textbf {R}_{j + 1}^{(r)}\), c ∈ C_{j}, and either \(\text{rank} _{v}(c) < \ell ^{\frac {1}{k}}\) or j ≤ k.
Proof
We claim that the decomposition constructed in the proof of Lemma 5.3 satisfies the extra condition on rank_{v}(c) if j > k. Let S = {c_{1},…,c_{n}} and S_{i} = {c_{1},…,c_{i}}. Obviously rank_{v}(c_{i}) ≥ 1 for k < i < j and, by the extra assumption, rank_{v}(c_{i}) ≥ rank_{v}(c) for 1 ≤ i ≤ k. Hence, Fact 5.2 yields rank_{v}(S_{j− 1}) ≥ rank_{v}(c)^{k}. Simultaneously, we have v(S_{j− 1}) < v(L_{j− 1}[ℓ]), so rank_{v}(S_{j− 1}) < ℓ. Combining these inequalities, we immediately get the claimed bound. □
Theorem 1.7
For every positive integer k = O(1), the Multichoice Knapsack problem can be solved in \(O(N+ {(a^{\frac {k + 1}{2k + 1}}+\lambda ^{k})}\log A (\frac {\log A}{\log \lambda })^{k})\) time.
Proof
As in the proof of Theorem 1.3, we actually provide an algorithm whose running time depends on A_{V} rather than a. Moreover, Lemmas 5.8 and 5.11 let us assume that \(n=O(\frac {\log A}{\log \lambda })\).
We first guess the k positions where items with largest ranks rank_{v} are present in the solution S and move these positions to the front. This gives \(\binom {n}{k}=O((\frac {\log A}{\log \lambda })^{k})\) possible selections. For each of them, we proceed as follows.
We increment an integer r starting from 1, maintaining \(\ell =\lceil r^{\frac {k}{k + 1}}\rceil \) and all the lists \(\textbf {L}_{j}^{(\ell )}\) and \(\textbf {R}_{j + 1}^{(r)}\) for 0 ≤ j ≤ n, as long as v(L_{j}[ℓ]) + v(R_{j+ 1}[r]) ≤ V for some j. By Fact 5.4, we stop with \(r=O(A_{V}^{\frac {k + 1}{2k + 1}})\) and thus the total time of this phase is \(O(A_{V}^{\frac {k + 1}{2k + 1}}\log A)\) due to the online procedure of Lemma 5.5.
By Lemma 8.1, every feasible solution S for some j admits a decomposition S = L ∪{c}∪ R, where \(L\in \textbf {L}_{j-1}^{(\ell )}\), \(R\in \textbf {R}_{j + 1}^{(r)}\), c ∈ C_{j}, and either \(\text{rank} _{v}(c) < \ell ^{\frac {1}{k}}\) or j ≤ k; we consider all possibilities for j. For each of them, we shall reduce searching for S to an instance of the Multichoice Knapsack problem with \(N^{\prime }=O(A_{V}^{\frac {k + 1}{2k + 1}}+\lambda ^{k})\) and \(n^{\prime }= 2\). By Lemma 5.6, these instances can be solved in \(O((A_{V}^{\frac {k + 1}{2k + 1}}+\lambda ^{k})\frac {\log A}{\log \lambda })\) time in total.
For j ≤ k, the items of the jth instance are going to belong to classes \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j}\) and \(\textbf {R}_{j + 1}^{(r)}\). The set \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j}\) can be sorted by merging |C_{j}| sorted lists of size at most λ^{j− 1} each, i.e., in \(O(\lambda ^{k} \log \lambda )\) time. On the other hand, for j > k, we take \(\{L\cup \{c\} : L\in \textbf {L}_{j-1}^{(\ell )} , c\in C_{j}, \text{rank} _{v}(c)\le \ell ^{\frac {1}{k}}\}\) and \(\textbf {R}_{j + 1}^{(r)}\). The former set can be constructed by merging at most \(\min (\ell ^{\frac {1}{k}},\lambda )=\min (O(r^{\frac {1}{k + 1}}),\lambda )\) sorted lists of size \(\ell =O(r^{\frac {k}{k + 1}})\) each, i.e., in \(O(r\log \lambda )=O(A_{V}^{\frac {k + 1}{2k + 1}}\log \lambda )\) time.
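The merging step can be illustrated with a heap-based multiway merge. The Python sketch below assumes that ⊙ combines every partial choice with every item of the class and that only the total values matter for the ordering; the function name `sorted_extensions` is hypothetical. It relies on `heapq.merge` from the standard library, which merges already sorted inputs using a heap with one entry per input list.

```python
import heapq

def sorted_extensions(prefix_values, class_values):
    """Merge the sorted lists {v + c : v in prefix_values}, one list per item value c.

    prefix_values must be sorted in non-decreasing order.
    """
    runs = ([v + c for v in prefix_values] for c in class_values)
    return list(heapq.merge(*runs))

print(sorted_extensions([1, 4, 6], [0, 2, 5]))
# [1, 3, 4, 6, 6, 6, 8, 9, 11]
```

With m lists of total size s, such a merge takes O(s log m) time, which corresponds to the \(O(\lambda ^{k}\log \lambda )\) and \(O(r\log \lambda )\) bounds stated above.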
Summing up over all indices j, this gives \(O((A_{V}^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log A)\) time for a single selection of the k positions with largest ranks, and \(O((A_{V}^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log A (\frac {\log A}{\log \lambda })^{k})\) in total.
Clearly, each solution of the constructed instances represents a solution of the initial instance, and by Lemma 8.1, every feasible solution of the initial instance has its counterpart in one of the constructed instances.
Before we conclude the proof, we need to note that the optimal k does not need to be known in advance. To deal with this issue, we try consecutive integers k and stop the procedure if Fact 5.4 yields that A_{V} > λ^{2k+ 1}, i.e., if r is incremented beyond λ^{k+ 1}. If the same happens for the other instance of the algorithm (operating on rank_{w} instead of rank_{v}), we conclude that a > λ^{2k+ 1}, and thus we should use a larger k. The running time until this point is \(O(\lambda ^{k + 1}\log \lambda (\frac {\log A}{\log \lambda })^{k})\) due to Lemma 5.5. On the other hand, if r ≤ λ^{k+ 1}, the algorithm behaves as if a ≤ λ^{2k+ 1}, i.e., runs in \(O(\lambda ^{k + 1}\log \lambda (\frac {\log A}{\log \lambda })^{k})\) time. This workaround (considering all smaller values k) adds an extra \(O(\lambda ^{k}\log \lambda (\frac {\log A}{\log \lambda })^{k-1})\) to the time complexity for the optimal value k, which is less than the upper bound on the running time we have for this value k. □
8.2 Algorithm for General Weighted Pattern Matching
Corollary 8.2
Let k = O(1) be a positive integer such that A ≤ λ^{2k+ 1}. The Multichoice Knapsack problem can be solved in \(O(N+ {(A^{\frac {k + 1}{2k + 1}}+\lambda ^{k})}\log \lambda )\) time.
This leads to the following result for General Weighted Pattern Matching:
Theorem 1.8
If λ^{2k− 1} ≤ z ≤ λ^{2k+ 1} for some positive integer k = O(1), then the Weighted Consensus problem can be solved in \(O(R+(z^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log \lambda )\) time, and the GWPM problem can be solved in \(O(n(z^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log \lambda )\) time.
As we noted at the beginning of this section, Lemma 7.1 implies that any improvement of the dependence of the running time on z or λ by z^{ε} (equivalently, by λ^{ε}) would contradict the k-Sum conjecture.
Acknowledgments
This work was supported by the “Algorithms for text processing with errors and uncertainties” project carried out within the HOMING programme of the Foundation for Polish Science co-financed by the European Union under the European Regional Development Fund.
References
 1. Amir, A., Chencinski, E., Iliopoulos, C. S., Kopelowitz, T., Zhang, H.: Property matching and weighted matching. Theor. Comput. Sci. 395(2–3), 298–310 (2008). https://doi.org/10.1016/j.tcs.2008.01.006
 2. Barton, C., Kociumaka, T., Pissis, S. P., Radoszewski, J.: Efficient index for weighted sequences. In: Grossi, R., Lewenstein, M. (eds.) Combinatorial Pattern Matching, CPM 2016, LIPIcs, vol. 54, pp. 4:1–4:13. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany (2016). https://doi.org/10.4230/LIPIcs.CPM.2016.4
 3. Barton, C., Liu, C., Pissis, S. P.: Linear-time computation of prefix table for weighted strings & applications. Theor. Comput. Sci. 656, 160–172 (2016). https://doi.org/10.1016/j.tcs.2016.04.029
 4. Barton, C., Liu, C., Pissis, S. P.: Online pattern matching on uncertain sequences and applications. In: Chan, T.H., Li, M., Wang, L. (eds.) Combinatorial optimization and applications, COCOA 2016, LNCS, vol. 10043, pp. 547–562. Springer, Berlin (2016). https://doi.org/10.1007/9783319487496_40
 5. Barton, C., Liu, C., Pissis, S. P.: Fast average-case pattern matching on weighted sequences. To appear in the International Journal of Foundations of Computer Science (2017)
 6. Biswas, S., Patil, M., Thankachan, S. V., Shah, R.: Probabilistic threshold indexing for uncertain strings. In: E. Pitoura, S. Maabout, G. Koutrika, A. Marian, L. Tanca, I. Manolescu, K. Stefanidis (eds.) 19th International Conference on Extending Database Technology, EDBT 2016, pp. 401–412. OpenProceedings.org (2016). https://doi.org/10.5441/002/edbt.2016.37
 7. Christodoulakis, M., Iliopoulos, C. S., Mouchard, L., Tsichlas, K.: Pattern matching on weighted sequences. In: Algorithms and Computational Methods for Biochemical and Evolutionary Networks, CompBioNets 2004, KCL publications (2004)
 8. Crochemore, M., Hancart, C., Lecroq, T.: Algorithms on strings. Cambridge University Press, Cambridge (2007). https://doi.org/10.1017/cbo9780511546853
 9. Etscheid, M., Kratsch, S., Mnich, M., Röglin, H.: Polynomial kernels for weighted problems. In: G.F. Italiano, G. Pighizzini, D. Sannella (eds.) Mathematical Foundations of Computer Science, MFCS 2015, Part II, LNCS, vol. 9235, pp. 287–298. Springer (2015). https://doi.org/10.1007/9783662480540_24
 10. Gajentaan, A., Overmars, M. H.: On a class of O(n^{2}) problems in computational geometry. Comput. Geom. 5, 165–185 (1995). https://doi.org/10.1016/09257721(95)000222
 11. Gurari, E. M.: Introduction to the theory of computation. Computer Science Press (1989)
 12. Horowitz, E., Sahni, S.: Computing partitions with applications to the knapsack problem. J. ACM 21(2), 277–292 (1974). https://doi.org/10.1145/321812.321823
 13. Iliopoulos, C. S., Makris, C., Panagis, Y., Perdikuri, K., Theodoridis, E., Tsakalidis, A. K.: The weighted suffix tree: An efficient data structure for handling molecular weighted sequences and its applications. Fundamenta Informaticae 71(2–3), 259–277 (2006). http://content.iospress.com/articles/fundamentainformaticae/fi712307
 14. Iliopoulos, C. S., Rahman, M. S.: Faster index for property matching. Inf. Process. Lett. 105(6), 218–223 (2008). https://doi.org/10.1016/j.ipl.2007.09.004
 15. Impagliazzo, R., Paturi, R.: On the complexity of k-SAT. J. Comput. Syst. Sci. 62(2), 367–375 (2001). https://doi.org/10.1006/jcss.2000.1727
 16. Juan, M. T., Liu, J. J., Wang, Y. L.: Errata for “Faster index for property matching”. Inf. Process. Lett. 109(18), 1027–1029 (2009). https://doi.org/10.1016/j.ipl.2009.06.009
 17. Kellerer, H., Pferschy, U., Pisinger, D.: Knapsack problems. Springer (2004). https://doi.org/10.1007/9783540247777
 18. Kociumaka, T., Pissis, S. P., Radoszewski, J.: Pattern matching and consensus problems on weighted sequences and profiles. In: S. Hong (ed.) Algorithms and Computation, ISAAC 2016, LIPIcs, vol. 64, pp. 46:1–46:12. Schloss Dagstuhl–Leibniz-Zentrum für Informatik (2016). https://doi.org/10.4230/LIPIcs.ISAAC.2016.46
 19. Kopelowitz, T.: The property suffix tree with dynamic properties. Theor. Comput. Sci. 638, 44–51 (2016). https://doi.org/10.1016/j.tcs.2016.02.033
 20. Lokshtanov, D., Marx, D., Saurabh, S.: Lower bounds based on the Exponential Time Hypothesis. Bulletin of the EATCS 105, 41–72 (2011). http://bulletin.eatcs.org/index.php/beatcs/article/view/92
 21. Mehlhorn, K.: Nearly optimal binary search trees. Acta Inform. 5, 287–295 (1975). https://doi.org/10.1007/BF00264563
 22. Pizzi, C., Ukkonen, E.: Fast profile matching algorithms - A survey. Theor. Comput. Sci. 395(2–3), 137–157 (2008). https://doi.org/10.1016/j.tcs.2008.01.015
 23. Radoszewski, J., Starikovskaya, T. A.: Streaming k-mismatch with error correcting and applications. In: A. Bilgin, M.W. Marcellin, J. Serra-Sagristà, J.A. Storer (eds.) Data Compression Conference, DCC 2017, pp. 290–299. IEEE (2017). https://doi.org/10.1109/DCC.2017.14
 24. Rajasekaran, S., Jin, X., Spouge, J. L.: The efficient computation of position-specific match scores with the fast Fourier transform. J. Comput. Biol. 9(1), 23–33 (2002). https://doi.org/10.1089/10665270252833172
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.