
Pattern Matching and Consensus Problems on Weighted Sequences and Profiles

  • Tomasz Kociumaka
  • Solon P. Pissis
  • Jakub Radoszewski
Open Access
Article

Abstract

We study pattern matching problems on two major representations of uncertain sequences used in molecular biology: weighted sequences (also known as position weight matrices, PWM) and profiles (scoring matrices). In the simple version, in which only the pattern or only the text is uncertain, we obtain efficient algorithms with theoretically-provable running times using a variation of the lookahead scoring technique. We also consider a general variant of the pattern matching problems in which both the pattern and the text are uncertain. Central to our solution is a special case where the sequences have equal length, called the consensus problem. We propose algorithms for the consensus problem parameterised by the number of strings that match one of the sequences. As our basic approach, a careful adaptation of the classic meet-in-the-middle algorithm for the knapsack problem is used. On the lower bound side, we prove that our dependence on the parameter is optimal up to lower-order terms conditioned on the optimality of the original algorithm for the knapsack problem. Therefore, we make an effort to keep the lower order terms of the complexities of our algorithms as small as possible.

Keywords

Weighted sequence Position weight matrix Profile matching Multichoice Knapsack 

1 Introduction

We study two well-known representations of uncertain texts: weighted sequences and profiles. A weighted sequence (also known as position weight matrix, PWM) for every position and every letter of the alphabet specifies the probability of occurrence of this letter at this position; see Table 1 for an example. A weighted sequence represents many different strings, each with the probability of occurrence equal to the product of probabilities of its letters at subsequent positions of the weighted sequence. Usually a threshold \(\frac {1}{z}\) is specified, and one considers only strings that match the weighted sequence with probability at least \(\frac {1}{z}\). A scoring matrix (or a profile) of length m is a matrix with m columns indexed by positions 1,…,m and σ rows corresponding to the alphabet. The score of a string of length m is the sum of scores in the scoring matrix of the subsequent letters of the string at the respective positions. A string is said to match a scoring matrix if its matching score is above a specified threshold Z.
Table 1

A weighted sequence X of length 4 over the alphabet \({\Sigma } = \{\mathtt {a},\mathtt {b}\}\)

$$\begin{array}{cccc}
X[1] & X[2] & X[3] & X[4] \\ \hline
\pi _{1}^{(X)}(\mathtt {a})= 1/2 & \pi _{2}^{(X)}(\mathtt {a})= 1 & \pi _{3}^{(X)}(\mathtt {a})= 3/4 & \pi _{4}^{(X)}(\mathtt {a})= 0 \\
\pi _{1}^{(X)}(\mathtt {b})= 1/2 & \pi _{2}^{(X)}(\mathtt {b})= 0 & \pi _{3}^{(X)}(\mathtt {b})= 1/4 & \pi _{4}^{(X)}(\mathtt {b})= 1
\end{array}$$
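To make the definition concrete, the following Python sketch computes the match probability of a solid string against the weighted sequence of Table 1 and enumerates the strings that match it with probability at least \(\frac {1}{z}\). The dictionary-based representation and the function names are assumptions made purely for illustration; exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction
from itertools import product

# Table 1, one dict per position (letters with probability 0 are omitted).
X = [
    {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    {"a": Fraction(1)},
    {"a": Fraction(3, 4), "b": Fraction(1, 4)},
    {"b": Fraction(1)},
]

def match_probability(S, X):
    """Product of the probabilities of the letters of S at subsequent positions of X."""
    prob = Fraction(1)
    for letter, column in zip(S, X):
        prob *= column.get(letter, Fraction(0))
    return prob

def matching_strings(X, z, alphabet="ab"):
    """Brute-force list of strings matching X with probability at least 1/z."""
    threshold = Fraction(1, z)
    return sorted(
        "".join(S)
        for S in product(alphabet, repeat=len(X))
        if match_probability(S, X) >= threshold
    )
```

For instance, with z = 4 only the strings aaab and baab (each with probability 3/8) match X.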

1.1 Weighted Pattern Matching and Profile Matching

First of all, we study the standard variants of pattern matching problems on weighted sequences and profiles, in which only the pattern or the text is an uncertain sequence. In the most popular formulation of the Weighted Pattern Matching problem, we are given a weighted sequence of length n, called a text, a solid (standard) string of length m, called a pattern, both over an alphabet of size σ, and a threshold probability \(\frac {1}{z}\). We are asked to find all positions in the text where the fragment of length m represents the pattern with probability at least \(\frac {1}{z}\). Each such position is called an occurrence of the pattern in the text; we also say that the fragment of the text and the pattern match. The Weighted Pattern Matching problem can be solved in \(O(\sigma n \log m)\) time via the Fast Fourier Transform [7]. The average-case complexity of the WPM problem has also been studied and a number of fast algorithms have been presented for certain values of weight ratio \(\frac {z}{m}\) [4, 5]. An indexing variant of the problem has also been considered [1, 2, 13, 14, 16, 19]; here, one is to preprocess a weighted text to efficiently answer pattern matching queries. The most efficient index [2] for a constant-sized alphabet uses O(nz) space, takes O(nz) time to construct and answers queries in optimal O(m + occ) time, where occ is the number of occurrences reported. A more general indexing data structure, which assumes z = O(1), was presented in [6]. A streaming variant of the Weighted Pattern Matching problem was considered very recently in [23].

In the classic Profile Matching problem, the pattern is an m × σ profile, the text is a solid string of length n, and our task is to find all positions in the text where the fragment of length m has score at least Z. A naïve approach to the Profile Matching problem works in O(nm + mσ) time. A broad spectrum of heuristics improving this algorithm in practice is known; for a survey, see [22]. However, all these algorithms have the same worst-case running time. One of the principal heuristic techniques, coming in different flavours, is lookahead scoring that consists in checking if a partial match could possibly be completed by the highest scoring letters in the remaining positions of the scoring matrix and, if not, pruning the naïve search. The Profile Matching problem can also be solved in \(O(\sigma n \log m)\) time via the Fast Fourier Transform [24].
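The lookahead scoring heuristic described above can be sketched as follows. This is a minimal illustration, not one of the surveyed implementations: it assumes that the profile is given as a list `P` of m rows with `P[i][j]` being the score of letter j at position i, that letters are integers 0,…,σ−1, and that positions are 0-based; the function name is hypothetical.

```python
def profile_matching_lookahead(P, T, Z):
    """Naive O(nm) scan, pruned as soon as the best possible completion falls below Z."""
    m = len(P)
    # suffix_max[i] = maximum score obtainable on positions i..m-1 of the profile
    suffix_max = [0] * (m + 1)
    for i in range(m - 1, -1, -1):
        suffix_max[i] = suffix_max[i + 1] + max(P[i])
    occurrences = []
    for p in range(len(T) - m + 1):
        score = 0
        feasible = True
        for i in range(m):
            score += P[i][T[p + i]]
            # Even the highest-scoring letters cannot lift the score to Z: prune.
            if score + suffix_max[i + 1] < Z:
                feasible = False
                break
        if feasible:
            occurrences.append(p)
    return occurrences
```

The pruning does not change the worst-case running time, which is consistent with the remark that all these heuristics share the bound of the naïve algorithm.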

Our results

As the first result, we show how the lookahead scoring technique combined with a data structure for answering longest common extension (LCE) queries in a string can be applied to obtain simple and efficient algorithms for the standard pattern matching problems on uncertain sequences. For a weighted sequence, by R we denote the size of its list representation. In the case that σ = O(1), which often occurs in molecular biology applications, we have R = O(n). In the Profile Matching problem, we set M as the number of strings that match the scoring matrix with score above Z. In general \(M \le \sigma ^{m}\); however, we may assume that for practical data this number is actually much smaller. We obtain the following results:

Theorem 1.1

Profile Matching can be solved in \(O(m\sigma + n \log M)\) time.

Theorem 1.2

Weighted Pattern Matching can be solved in \(O(R+n \log z)\) time.

1.2 Profile Consensus and Multichoice Knapsack

Along the way to our most involved contribution, we study Profile Consensus, a consensus problem on uncertain sequences. Specifically, we are to check for the existence of a string that matches two scoring matrices, each above threshold Z. The Profile Consensus problem is essentially equivalent to the well-known Multichoice Knapsack problem (also known as the Multiple Choice Knapsack problem). In this problem, we are given n classes \(C_{1},\ldots ,C_{n}\) of at most λ items each (N items in total), each item c characterised by a value v(c) and a weight w(c). The goal is to select one item from each class so that the sums of values and of weights of the items do not exceed two specified thresholds, V and W, respectively. (In the more intuitive formulation of the problem, we require the sum of values to be above a specified threshold, but here we consider an equivalent variant in which both parameters are symmetric.) This problem generalises the (binary) Knapsack problem, in which we have λ = 2. The Multichoice Knapsack problem is widely used in practice, but most research concerns approximation or heuristic solutions; see [17] and references therein. As far as exact solutions are concerned, the classic meet-in-the-middle approach by Horowitz and Sahni [12], originally designed for the (binary) Knapsack problem, immediately generalises to an \(O^{*}(\lambda ^{\lceil {\frac {n}{2}\rceil }})\)-time\(^{1}\) solution for Multichoice Knapsack.

Several important problems can be expressed as special cases of the Multichoice Knapsack problem using folklore reductions (see [17]). This includes the Subset Sum problem, which, for a set of n integers, asks whether there is a subset summing up to a given integer Q, and the k-Sum problem which, for k classes of λ integers, asks to choose one element from each class so that the selected integers sum up to zero. These reductions give immediate hardness results for the Multichoice Knapsack problem and thus yield the same consequences for Profile Consensus. For the Subset Sum problem, as shown in [9, 11], the existence of an \(O(2^{\varepsilon n})\)-time solution for every ε > 0 would violate the Exponential Time Hypothesis (ETH) [15, 20]. Moreover, the \(O(2^{n/2})\) running time, achieved in [12], has not been improved yet despite much effort. The 3-Sum conjecture [10] and the more general k-Sum conjecture state that the 3-Sum and k-Sum problems cannot be solved in \(O(\lambda ^{2-\varepsilon })\) time and \(O(\lambda ^{\lceil {\frac {k}{2}} \rceil (1-\varepsilon )})\) time, respectively, for any ε > 0.

Our results

In the complexities of our algorithms, the instance size of Multichoice Knapsack is described by the number of classes n, the total number of items \(N = |C_{1}| + \cdots + |C_{n}|\), and the maximum size of a class \(\lambda =\max \{|C_{1}|,\ldots ,|C_{n}|\}\). We also introduce additional parameters based on the number of solutions with feasible weight or value:
$$A_{V} = \left|\left\{(c_{1},\ldots,c_{n}):c_{i} \in C_{i}\text{ for all } i = 1,\ldots,n,\sum\limits_{i} v(c_{i}) \le V\right\}\right|, $$
that is, the number of choices of one element from each class that satisfy the value threshold,
$$A_{W} = \left|\left\{(c_{1},\ldots,c_{n}):c_{i} \in C_{i}\text{ for all } i = 1,\ldots,n,\sum\limits_{i} w(c_{i}) \le W\right\}\right|, $$
\(A = \max (A_{V},A_{W})\), and \(a=\min (A_{V},A_{W})\). We obtain the following result.
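The parameters defined above can be computed by brute force on small instances, which may help when experimenting with them. The sketch below assumes that an instance is given as a list of classes, each class being a list of (value, weight) pairs; this encoding and the function name are illustrative assumptions only.

```python
from itertools import product

def count_feasible(classes, V, W):
    """Return (A_V, A_W): the numbers of choices of one item per class
    whose total value is at most V, respectively total weight is at most W."""
    a_v = a_w = 0
    for choice in product(*classes):
        if sum(v for v, _ in choice) <= V:
            a_v += 1
        if sum(w for _, w in choice) <= W:
            a_w += 1
    return a_v, a_w

classes = [[(1, 5), (4, 1)], [(2, 2), (3, 1)], [(1, 4), (5, 0)]]
A_V, A_W = count_feasible(classes, V=8, W=6)
A, a = max(A_V, A_W), min(A_V, A_W)
```

This enumeration takes time proportional to the total number of choices and only serves to illustrate the definitions; it is not part of the algorithms developed in the paper.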

Theorem 1.3

Multichoice Knapsack can be solved in \(O(N+\sqrt {a\lambda }\log A)\) time.

Note that \(a \le A \le \lambda ^{n}\) and thus the running time of our algorithm for Multichoice Knapsack is bounded by \(O(N+n\lambda ^{(n + 1)/2}\log \lambda )\). Up to lower order terms (i.e., the factor \(n\log \lambda =(\lambda ^{(n + 1)/2})^{o(1)}\)), this matches the time complexities of the fastest known solutions for both Subset Sum (also binary Knapsack) and 3-Sum. Our parameters identify a new measure of difficulty for the Multichoice Knapsack problem. The main novel part of our algorithm for Multichoice Knapsack is an appropriate (yet intuitive) notion of ranks of partial solutions.

1.3 Weighted Consensus and General Weighted Pattern Matching

Analogously to the Profile Consensus problem, we define the Weighted Consensus problem. In the Weighted Consensus problem, given two weighted sequences of the same length, we are to check if there is a string that matches each of them with probability at least \(\frac {1}{z}\). A routine to compare user-entered weighted sequences with existing weighted sequences in the database is used, e.g., in JASPAR,\(^{2}\) a well-known database of PWMs. Finally, we study a general variant of pattern matching on weighted sequences. In the General Weighted Pattern Matching (GWPM) problem, both the pattern and the text are weighted. In the most common definition of the problem (see [3, 13]), we are to find all fragments of the text that give a positive answer to the Weighted Consensus problem with the pattern. The authors of [3] proposed an algorithm for the GWPM problem based on the weighted prefix table that works in \(O(n z^{2} \log z + n\sigma )\) time. Solutions to these problems can be applied in transcriptional regulation: motif and regulatory module finding; and annotation of regulatory genomic regions.

Our results

For a weighted sequence, by λ let us denote the maximal number of letters with score at least \(\frac {1}{z}\) at a single position (thus \(\lambda \le \min (\sigma ,z)\)). Our algorithm for the Multichoice Knapsack problem (covered in Section 1.2) yields time complexities \(O(R+\sqrt {z\lambda }\log z)\) and \(O(n\sqrt {z\lambda }\log z)\) for Weighted Consensus and GWPM, respectively. Using a tailor-made solution based on the same scheme, we obtain faster procedures as specified below.

Theorem 1.4

The General Weighted Pattern Matching problem can be solved in \(O(n\sqrt {z \lambda } (\log \log z+\log \lambda ))\) time, and the Weighted Consensus problem can be solved in \(O(R +\sqrt {z \lambda } (\log \log z+\log \lambda ))\) time.

In particular, we obtain the following result for the practical case of σ = O(1).

Corollary 1.5

General Weighted Pattern Matching over a constant-sized alphabet can be solved in \(O(n \sqrt {z} \log \log z)\) time.

We also provide a simple reduction from Multichoice Knapsack to Weighted Consensus, which lets us transfer the negative results to the GWPM problem.

Theorem 1.6

Weighted Consensus is NP-hard and cannot be solved in:
  1. \(O(z^{\varepsilon })\) time for every ε > 0, unless the exponential time hypothesis (ETH) fails;
  2. \(O(z^{0.5-\varepsilon })\) time for some ε > 0, unless there is an \(O(2^{(0.5-\varepsilon )n})\)-time algorithm for the Subset Sum problem;
  3. \(\tilde {O}(R+z^{0.5}\lambda ^{0.5-\varepsilon })\) time for some ε > 0 and for n = O(1), unless the 3-Sum conjecture fails.

For the higher-order terms, our complexities match the conditional lower bounds; therefore, in the proofs of Theorems 1.3 and 1.4 we put significant effort into keeping the lower-order terms of the complexities as small as possible.

Finally, we analyse the complexity of the Multichoice Knapsack and General Weighted Pattern Matching problems in case of a large λ. This is a theoretical study that shows a possibility of improvement of the complexity for instances that do not originate from the Subset Sum and k-Sum problems.

Theorem 1.7

For every positive integer k = O(1), the Multichoice Knapsack problem can be solved in \(O(N+ {(a^{\frac {k + 1}{2k + 1}}+\lambda ^{k})}\log A (\frac {\log A}{\log \lambda })^{k})\) time.

Theorem 1.8

If \(\lambda ^{2k-1} \le z \le \lambda ^{2k+1}\) for some positive integer k = O(1), then the Weighted Consensus problem can be solved in \(O(R+(z^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log \lambda )\) time, and the GWPM problem can be solved in \(O(n(z^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log \lambda )\) time.

A preliminary version of this research appeared as [18].

1.4 Structure of the Paper

We start with Preliminaries, where we recall basic notions on classic strings and formalise the model of computation. The following four sections describe our algorithms: in Section 3 for Profile Matching; in Section 4 for Weighted Pattern Matching; in Section 5 for Profile Consensus; and in Section 6 for Weighted Consensus and GWPM. In Section 7 we present conditional lower bounds for the GWPM problem based on the special cases of Multichoice Knapsack. Finally, in Section 8 we perform a multivariate analysis of Profile Consensus and GWPM and present improved solutions in the case that \(\frac {\log a}{\log \lambda }\) is a constant other than an odd integer.

2 Preliminaries

Let Σ = {1,…,σ} be an alphabet. A string S over Σ is a finite sequence of letters from Σ. By \({\Sigma }^{m}\) we denote the set of strings of length m over Σ. We denote the length of S by |S| and, for 1 ≤ i ≤ |S|, the i-th letter of S by S[i]. By S[i..j] we denote the string S[i]⋯S[j] called a factor of S (if i > j, then the factor is an empty string). A factor is called a prefix if i = 1 and a suffix if j = |S|. For two strings S and T, we denote their concatenation by \(S \cdot T\) (ST in short).

For a string S of length n, by LCE(i,j) = lcp(S[i..n],S[j..n]) we denote the length of the longest common prefix of suffixes S[i..n] and S[j..n]. This value lets us easily determine the longest common prefix \(\mathit {lcp}(S[i\mathinner {..} i^{\prime }],S[j\mathinner {..} j^{\prime }])\) of any two factors starting at positions i and j, respectively. The following fact specifies a well-known efficient data structure answering LCE queries; see [8] for details.

Fact 2.1

Let S be a string of length n over an integer alphabet of size \(\sigma = n^{O(1)}\). After O(n)-time preprocessing, in O(1) time one can compute LCE(i,j) for any indices i,j.
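The relation between LCE queries and longest common prefixes of arbitrary factors can be illustrated with a short sketch. The `lce` function below is a naive linear-time stand-in and is not the O(1)-time data structure of Fact 2.1; indices are 0-based with inclusive right ends, and both function names are assumptions made for illustration.

```python
def lce(S, i, j):
    """Length of the longest common prefix of the suffixes S[i:] and S[j:] (naive scan)."""
    d = 0
    while i + d < len(S) and j + d < len(S) and S[i + d] == S[j + d]:
        d += 1
    return d

def lcp_factors(S, i, i_end, j, j_end):
    """Longest common prefix of the factors S[i..i_end] and S[j..j_end]:
    the LCE value truncated to the lengths of both factors."""
    return min(lce(S, i, j), i_end - i + 1, j_end - j + 1)
```

Replacing `lce` with a constant-time structure leaves `lcp_factors` unchanged, which is the sense in which a single LCE value determines the answer for any pair of factors.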

The Hamming distance between two strings X and Y of the same length, denoted by \(d_H(X,Y)\), is the number of positions where the strings differ.

2.1 Model of Computations

For problems on weighted sequences, we assume the word-RAM model with word size \(w = {\Omega }(\log n + \log z)\) and integer alphabet of size \(\sigma = n^{O(1)}\). We consider the log-probability model of representations of weighted sequences, that is, we assume that probabilities in the weighted sequences and the threshold probability \(\frac {1}{z}\) are all of the form \(c^{\frac {p}{2^{dw}}}\), where c and d are constants and p is an integer that fits in a constant number of machine words. Additionally, the probability 0 has a special representation. The only operations on probabilities in our algorithms are multiplications and divisions, which can be performed exactly in O(1) time in this model. Our solutions to the Multichoice Knapsack problem only assume the word-RAM model with word size \(w={\Omega }(\log S+\log a)\), where S is the sum of integers in the input instance; this does not affect the \(O^{*}\) running time.

3 Profile Matching

In the Profile Matching problem, we consider a scoring matrix (a profile) P of size m × σ. For i ∈{1,…,m} and j ∈{1,…,σ}, we denote the integer score of the letter j at the position i by P[i,j]. The matching score of a string S of length m with the matrix P is
$$\text{Score}(S,P) = {\sum}_{i = 1}^{m} P[i,S[i]]. $$
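If Score(S,P) ≥ Z for an integer threshold Z, then we say that the string S matches the matrix P above threshold Z. We denote the family of strings S that match P above threshold Z by \(\textbf {M}_{Z}(P)\).
For a string T and a scoring matrix P, we say that P occurs in T at position i with threshold Z if T[i..i + m − 1] matches P above threshold Z. Then \(\mathit {Occ}_{Z}(P,T)\) is the set of all positions where P occurs in T. These notions let us define the Profile Matching problem: given a string T of length n, an m × σ scoring matrix P, and a threshold Z, compute \(\mathit {Occ}_{Z}(P,T)\). The parameter used in our bounds is \(M = |\textbf {M}_{Z}(P)|\).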

3.1 Solution to Profile Matching

For a scoring matrix P, the heavy string of P, denoted H(P), is constructed by choosing at each position the heaviest letter, that is, the letter with the maximum score (breaking ties arbitrarily). In other words, H(P) is a string that matches P with the maximum score.

Observation 3.1

If Score(S,P) ≥ Z for a string S of length m and an m × σ scoring matrix P, then \(d_H(\textbf {H}(P),S) \le \left \lfloor {\log |\textbf {M}_{Z}(P)|} \right \rfloor \).

Proof

Let \(d = d_H(\textbf {H}(P),S)\). We can construct \(2^{d}\) strings of length |S| that match P with a score above Z by taking either of the letters S[j] or \(\textbf {H}(P)[j]\) at each position j such that \(S[j] \ne \textbf {H}(P)[j]\). Hence, \(2^{d} \le |\textbf {M}_{Z}(P)|\), which concludes the proof. □

Our solution for the Profile Matching problem works as follows. We first construct \(P^{\prime } = \textbf {H}(P)\) and the data structure for lcp-queries in \(P^{\prime }T\). Let the variable s store the matching score of \(P^{\prime }\). In the p-th step, we calculate the matching score of T[p..p + m − 1] by iterating through subsequent mismatches between \(P^{\prime }\) and T[p..p + m − 1] and making adequate updates in the matching score s. The mismatches are found using lcp-queries: If \(P^{\prime }[i]\) is aligned against T[j], we compute \({\Delta } = \mathit {lcp}(P^{\prime }[i\mathinner {..} m],T[j\mathinner {..} n])\). Then, \(P^{\prime }[i\mathinner {..} i+{\Delta }-1]=T[j\mathinner {..} j+{\Delta }-1]\), but \(P^{\prime }[i+{\Delta }]\ne T[j+{\Delta }]\) yields a mismatch (assuming i + Δ ≤ m and j + Δ ≤ n). To locate the next mismatch, we need to repeat the procedure above with i and j increased by Δ + 1. This process terminates when the score drops below Z or when all the mismatches have been found. In the end, we include p in \(\mathit {Occ}_{Z}(P,T)\) if the final matching score is above Z. A pseudocode is given in the ProfileMatching(P, T, Z) procedure.
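The procedure described above can be sketched in Python as follows. This is an illustrative reading of the description rather than the original pseudocode: the lcp-queries are answered by a naive scan instead of the data structure of Fact 2.1 (so the stated running time is not achieved), letters are integers 0,…,σ−1, positions are 0-based, and all function names are assumptions.

```python
def heavy_string(P):
    """At each position pick a letter with the maximum score (ties broken arbitrarily)."""
    return [max(range(len(row)), key=row.__getitem__) for row in P]

def lcp_naive(A, i, B, j):
    """Longest common prefix of A[i:] and B[j:], computed by direct comparison."""
    d = 0
    while i + d < len(A) and j + d < len(B) and A[i + d] == B[j + d]:
        d += 1
    return d

def profile_matching(P, T, Z):
    m, n = len(P), len(T)
    H = heavy_string(P)
    best = sum(P[i][H[i]] for i in range(m))  # matching score of the heavy string
    occurrences = []
    for p in range(n - m + 1):
        s, i = best, 0
        while s >= Z and i < m:
            i += lcp_naive(H, i, T, p + i)  # jump to the next mismatch
            if i >= m:
                break
            # Replace the heavy letter's score by the score of the text letter.
            s += P[i][T[p + i]] - P[i][H[i]]
            i += 1
        if s >= Z:
            occurrences.append(p)
    return occurrences
```

Each mismatch can only lower the score, so the inner loop may stop as soon as the score falls below Z; by Observation 3.1 this happens after at most \(\left \lfloor {\log M} \right \rfloor + 1\) mismatches.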

We obtain the following result.

Theorem 1.1

Profile Matching can be solved in \(O(m\sigma + n \log M)\) time.

Proof

Let us bound the time complexity of the presented algorithm. The heavy string \(P^{\prime }\) can be computed in O(mσ) time. The data structure for lcp-queries in \(P^{\prime }T\) can be constructed in O(n + m) time by Fact 2.1. Finally, for each position p in the text T we will consider at most \(\left \lfloor {\log M} \right \rfloor + 1\) mismatches between \(P^{\prime }\) and T, as afterwards the score \(s^{\prime }\) drops below Z due to Observation 3.1. □

4 Weighted Pattern Matching

A weighted sequence X = X[1]⋯X[n] of length |X| = n over alphabet Σ is a sequence of sets of pairs of the form \(X[i] = \{(j,\ \pi ^{(X)}_{i}(j))\ :\ j \in {\Sigma }\}\). Here, \(\pi _{i}^{(X)}(j)\) is the occurrence probability of the letter j at the position i ∈{1,…,n}. These values are non-negative and sum up to 1 for a given i. For all our algorithms, it is sufficient that the probabilities sum up to at most 1 for each position. Also, the algorithms sometimes produce auxiliary weighted sequences with sum of probabilities being smaller than 1 on some positions.

We denote the maximum number of letters occurring at a single position of the weighted sequence (with non-zero probability) by λ and the total size of the representation of a weighted sequence by R. The standard representation consists of n lists with up to λ elements each, so R = O(nλ). However, the lists can be shorter in general. Also, if the threshold probability \(\frac {1}{z}\) is specified, at each position of a weighted sequence it suffices to store letters with probability at least \(\frac {1}{z}\), and clearly there are at most z such letters for each position. This reduction can be performed in linear time, so we shall always assume that \(\lambda \le z\). Moreover, the assumption that Σ is an integer alphabet of size \(\sigma = n^{O(1)}\) lets us assume without loss of generality that the entries \((j,\pi ^{(X)}_{i}(j))\) in the lists representing X[i] are ordered by increasing j: if this is not the case, we can simultaneously sort these lists in linear time.

The probability of matching of a string S with a weighted sequence X, |S| = |X| = n, is
$$\P(S,X) = {\prod}_{i = 1}^{n} \pi^{(X)}_{i}(S[i]).$$
We say that a string S matches a weighted sequence X with probability at least \(\frac {1}{z}\), denoted by \(S \approx _{\frac {1}{z}} X\), if \(\P (S,X) \!\ge \! \frac {1}{z}\). We also denote \(\textbf {M}_{z}(X)=\{S\in {\Sigma }^{n} : \P (S,X)\ge \frac {1}{z}\}\).
Given a weighted sequence T, by T[i..j] we denote a weighted sequence, called a factor of T, equal to T[i]⋯T[j] (if i > j, then the factor is empty). We say that a string P of length m occurs in T at position i if P matches the factor T[i..i + m − 1]. The set of positions where P occurs in T is denoted by \(\mathit {Occ}_{\frac {1}{z}}(P,T)\).

4.1 Weighted Sequences versus Profiles

As shown below, profiles and weighted sequences are essentially equivalent objects.

Fact 4.1

  1. Given a weighted sequence X of length n over an alphabet of size σ and a probability \(\frac {1}{z}\), one can construct in O(nσ) time an n × σ profile P and a threshold Z such that \(\textbf {M}_{Z}(P) = \textbf {M}_{z}(X)\).
  2. Given an m × σ profile P and a threshold Z, one can construct in O(mσ) time a weighted sequence X and a probability \(\frac {1}{z}\) such that \(\textbf {M}_{z}(X) = \textbf {M}_{Z}(P)\).

Proof

Given a weighted sequence X, one can construct an equivalent profile P setting \(P[i,s]=\log \pi ^{(X)}_{i}(s)\) for each position i and character s. If \(\pi ^{(X)}_{i}(s)= 0\), we set \(P[i,s]=-\infty \) (which can be replaced by a sufficiently small finite value after we fix the threshold Z). Then \(\text{Score}(S,P)=\log \P (S,X)\) for every string S, so the profile P satisfies \(\textbf {M}_{Z}(P) = \textbf {M}_{z}(X)\) for \(Z = -\log z\).

To construct an inverse mapping, we need to normalise the scores first. For this, we construct a normalised profile \(P^{\prime }\) setting \(P^{\prime }[i,s] := P[i,s]-\log ({\sum }_{s^{\prime } \in {\Sigma }} 2^{P[i,s^{\prime }]})\). As a result, we have \(\textbf {M}_{Z}(P)=\textbf {M}_{Z^{\prime }}(P^{\prime })\) for \(Z^{\prime } = Z - {\sum }_{i = 1}^{m} \log ({\sum }_{s \in {\Sigma }} 2^{P[i,s]})\). Now, we can build an equivalent weighted sequence X by setting \(\pi ^{(X)}_{i}(s)= 2^{P^{\prime }[i,s]}\). Note that
$$\sum\limits_{s\in {\Sigma}}\pi^{(X)}_{i}(s) = \sum\limits_{s\in {\Sigma}} 2^{P[i,s]-\log({\sum}_{s^{\prime}\in {\Sigma}} 2^{P[i,s^{\prime}]})}=\left( \sum\limits_{s\in {\Sigma}} 2^{P[i,s]}\right)\cdot 2^{-\log({\sum}_{s\in {\Sigma}} 2^{P[i,s]})}= 1$$
holds as required. Moreover, \(\textbf {M}_{z}(X)=\textbf {M}_{Z^{\prime }}(P^{\prime })=\textbf {M}_{Z}(P)\) for \(z = 2^{-Z^{\prime }}\). □

In the light of Fact 4.1, it may seem that the results for profiles and weighted sequences should coincide. However, we use different parameters to study the complexity of the algorithmic problems in these models: for profiles this is the number \(|\textbf {M}_{Z}(P)|\) of matching strings, while for weighted sequences this is the inverse z of the threshold probability \(\frac {1}{z}\). These parameters are related by the following observation:

Observation 4.2

A weighted sequence X satisfies \(|\textbf {M}_{z}(X)| \le z\) for every threshold \(\frac {1}{z}\).
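The bound follows from a short counting argument, sketched here for completeness under the assumption stated earlier that the probabilities at each position sum up to at most 1. Every string in \(\textbf {M}_{z}(X)\) contributes at least \(\frac {1}{z}\) to the total probability mass, which is at most 1:
$$1 \ge {\prod}_{i = 1}^{n} \sum\limits_{s\in {\Sigma}} \pi^{(X)}_{i}(s) = \sum\limits_{S\in {\Sigma}^{n}} \P(S,X) \ge \sum\limits_{S\in \textbf{M}_{z}(X)} \P(S,X) \ge |\textbf{M}_{z}(X)|\cdot \frac{1}{z}.$$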

However, the bound \(|\textbf {M}_{z}(X)| \le z\) is not tight in general, which gives more power to algorithms parameterised by z. Moreover, z is a part of the input (as opposed to \(|\textbf {M}_{Z}(P)|\) for profiles). Furthermore, it is natural to consider a common threshold probability \(\frac {1}{z}\) for multiple weighted sequences, e.g., factors of a weighted text T as in Weighted Pattern Matching.

A more technical difference lies in the representation of profiles and weighted sequences, which we have chosen consistently with the literature. A profile is stored as a dense m × σ matrix, while in a weighted sequence of the same length we do not explicitly keep entries with \(\pi ^{(X)}_{i}(s)= 0\), so the input size R can be smaller than mσ. This allows for faster algorithms—because reading the input takes less time—but at the same time poses some challenges—because \(\pi ^{(X)}_{i}(s)\) cannot be accessed in constant time, unless σ = O(1) or we allow randomisation. This is illustrated below in case of the Weighted Pattern Matching problem and also in Section 6.

4.2 Solution to Weighted Pattern Matching

The approach from our solution to Profile Matching can be used for Weighted Pattern Matching. In a natural way, we extend the notion of a heavy string to weighted sequences. This lets us restate Observation 3.1 in the language of probabilities instead of scores.

Observation 4.3

If a string P matches a weighted sequence X of the same length with probability at least \(\frac {1}{z}\), then \(d_H(\textbf {H}(X),P) \le \left \lfloor {\log z} \right \rfloor \).

Compared with the solution to Profile Matching, we compute the heavy string of the text instead of the pattern. An auxiliary variable α stores the matching probability between a factor of H(T) and the corresponding factor of T; it is updated when we move to the next position of the text. The rest of the algorithm is basically the same as previously; see the pseudocode of WeightedPatternMatching(P, T, \(\frac {1}{z}\)).
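A possible rendering of this procedure in Python is given below. It is a sketch under several assumptions that are not part of the original pseudocode: the weighted text is a list of non-empty dictionaries mapping letters to positive probabilities (as `Fraction` objects, so that multiplications and divisions are exact), mismatches are located by a naive scan rather than by LCE queries, and positions are 0-based.

```python
from fractions import Fraction

def weighted_pattern_matching(P, T, z):
    m, n = len(P), len(T)
    if m == 0 or m > n:
        return []
    H = [max(col, key=col.get) for col in T]   # heavy string of the text
    hp = [T[j][H[j]] for j in range(n)]        # probabilities of the heavy letters
    threshold = Fraction(1, z)
    # alpha = probability that H[p..p+m-1] matches T[p..p+m-1]
    alpha = Fraction(1)
    for j in range(m):
        alpha *= hp[j]
    occurrences = []
    for p in range(n - m + 1):
        if p > 0:
            alpha = alpha / hp[p - 1] * hp[p + m - 1]  # slide the window by one position
        a, i = alpha, 0
        while a >= threshold and i < m:
            while i < m and P[i] == H[p + i]:          # naive search for the next mismatch
                i += 1
            if i == m:
                break
            # Swap the heavy letter's probability for that of the pattern letter.
            a = a / hp[p + i] * T[p + i].get(P[i], Fraction(0))
            i += 1
        if a >= threshold:
            occurrences.append(p)
    return occurrences
```

On the weighted sequence of Table 1 with the pattern ab and z = 4, the sketch reports the 0-based positions 1 and 2, where the match probabilities are 1/4 and 3/4, respectively.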

Implementation for large alphabets

The algorithm above takes \(O(n\log z)\) time for σ = O(1). In the general case, we need to efficiently implement the following operations on the weighted sequence:
  • finding the letter with the maximum probability at a given position,

  • computing the probability of a given letter at a given position.

For a weighted sequence in the standard list representation, we can compute the maximum-probability letter at each position in O(R) time which lets us perform the former operation in O(1) time. We also explicitly store the probabilities of the heaviest letters so that \(\pi ^{(T)}_{j}(T^{\prime }[j])\) can be retrieved in constant time for any index j.

To implement the latter operation for an arbitrary character, we store each T[j] in a weight-balanced binary tree [21], with the weight of \((s,\pi ^{(T)}_{j}(s))\) equal to \(\pi ^{(T)}_{j}(s)\). As a result, any \(\pi ^{(T)}_{j}(s)\) can be retrieved in \(O(-\log \pi ^{(T)}_{j}(s))=O(\log z)\) time. During the course of the p-th step of the algorithm, \(\alpha ^{\prime }\) is a product of some probabilities including all the retrieved probabilities \(\pi ^{(T)}_{j}(s)\) with \(s \ne T^{\prime }[j]\). The while loop is executed only when \(\alpha ^{\prime }\ge \frac {1}{z}\), so the product of these probabilities (excluding the one retrieved in the final iteration) is at least \(\frac {1}{z}\). Consequently, the overall retrieval time in the p-th step is \(O(\log z)\).

This way, we can implement the algorithm in \(O(R+n \log z)\) time.

Theorem 1.2

Weighted Pattern Matching can be solved in \(O(R+n \log z)\) time.

Remark 4.4

In the same complexity one can solve GWPM with a solid text.

5 Profile Consensus and Multichoice Knapsack

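Let us start with a precise statement of the Multichoice Knapsack problem. An instance consists of a set of N items partitioned into n disjoint classes \(C_{1},\ldots ,C_{n}\), each of size at most λ, where every item c has a value v(c) and a weight w(c), together with two thresholds V and W. The task is to decide whether one item can be selected from each class so that the selected items have total value at most V and total weight at most W.

For a fixed instance of Multichoice Knapsack, we say that S is a partial choice if \(|S \cap C_{i}| \le 1\) for each class. The set \(D = \{i : |S \cap C_{i}| = 1\}\) is called its domain. For a partial choice S, we define \(v(S) = {\sum }_{c \in S} v(c)\) and \(w(S) = {\sum }_{c \in S} w(c)\).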

5.1 Profile Consensus versus Multichoice Knapsack

As shown below, Profile Consensus and Multichoice Knapsack are essentially equivalent problems.

Fact 5.1

  1. Consider an instance of Profile Consensus with two m × σ profiles P, Q and a common threshold Z. In O(mσ) time one can construct an equivalent instance of Multichoice Knapsack with m classes of σ items each, \(A_{V} = |\textbf {M}_{Z}(P)|\), and \(A_{W} = |\textbf {M}_{Z}(Q)|\).
  2. Consider an instance of Multichoice Knapsack with n classes of at most λ items each. In O(nλ) time one can construct an equivalent instance of Profile Consensus with two n × λ profiles P, Q and a common threshold Z such that \(|\textbf {M}_{Z}(P)| = A_{V}\) and \(|\textbf {M}_{Z}(Q)| = A_{W}\).

Proof

Given an instance (P,Q,Z) of the Profile Consensus problem, we construct an equivalent instance of Multichoice Knapsack with m classes of σ items each, denoted \(c_{i,j}\) for \(1 \le i \le m\) and \(1 \le j \le \sigma \), each with value \(v(c_{i,j}) = -P[i,j]\) and weight \(w(c_{i,j}) = -Q[i,j]\). We set both thresholds to V = W = −Z. It is straightforward to verify that the constructed instance satisfies the required conditions.

This construction is easily reversible if V = W and the size of each class is λ. In general, we add dummy items (with infinite or very large weight and value), decrease the weight of each item by \(\frac {1}{n}(W-V)\), and decrease the weight threshold to V. □
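The first construction in the proof above is simple enough to state in code. The following sketch negates the scores to obtain values and weights, and includes two brute-force checkers so that the equivalence can be verified on small inputs; the list-of-rows encoding of profiles, the (value, weight) pairs, and all function names are illustrative assumptions.

```python
from itertools import product

def profile_consensus_to_knapsack(P, Q, Z):
    """Class i contains one item per letter j, with value -P[i][j] and weight -Q[i][j]."""
    classes = [
        [(-p, -q) for p, q in zip(P_row, Q_row)]
        for P_row, Q_row in zip(P, Q)
    ]
    return classes, -Z, -Z  # thresholds V = W = -Z

def consensus_brute_force(P, Q, Z):
    """Is there a string whose score is at least Z in both profiles?"""
    m, sigma = len(P), len(P[0])
    return any(
        sum(P[i][S[i]] for i in range(m)) >= Z
        and sum(Q[i][S[i]] for i in range(m)) >= Z
        for S in product(range(sigma), repeat=m)
    )

def knapsack_brute_force(classes, V, W):
    """Is there a choice of one item per class with total value <= V and weight <= W?"""
    return any(
        sum(v for v, _ in choice) <= V and sum(w for _, w in choice) <= W
        for choice in product(*classes)
    )
```

Since a score of at least Z corresponds to a negated sum of at most −Z, a string matches both profiles exactly when the corresponding choice of items is feasible.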

The only technical difference between Multichoice Knapsack and Profile Consensus is that the profiles are stored as dense m × σ matrices while the classes in Multichoice Knapsack can be of different size so the input size N can be smaller than the number of classes n times the bound λ on the class size.

Below, we formulate our results in the more established language of Multichoice Knapsack.

5.2 Overview of the Solution

The classic \(O(2^{n/2})\)-time solution to the Knapsack problem [12] is based on a meet-in-the-middle approach. The set D = {1,…,n} is partitioned into two domains \(D_{1},D_{2}\) of size roughly n/2, and for each \(D_{i}\), all partial choices S are generated and ordered by v(S). This reduces the problem to an instance of Multichoice Knapsack with two classes, which is solved using a folklore linear-time solution (described for completeness in Section 5.5).

The meet-in-the-middle approach to Knapsack generalises directly to a solution to Multichoice Knapsack. The partition may be chosen so as to balance the number of partial choices in each domain, and so the worst-case time complexity is \(O(\sqrt {Q\lambda })\), where \(Q={\prod }_{i = 1}^{n} |C_{i}|\) is the number of choices.
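The generalised meet-in-the-middle scheme can be sketched as follows. This is a simplified illustration, not the algorithm developed in this section: the classes are split in the middle rather than so as to balance the number of partial choices, the two lists are sorted explicitly (which adds a logarithmic factor), and items are assumed to be (value, weight) pairs.

```python
def partial_choices(classes):
    """All (value, weight) sums obtained by choosing one item from each given class."""
    sums = [(0, 0)]
    for cls in classes:
        sums = [(v + cv, w + cw) for (v, w) in sums for (cv, cw) in cls]
    return sums

def multichoice_knapsack_mitm(classes, V, W):
    half = len(classes) // 2
    left = sorted(partial_choices(classes[:half]))                 # value ascending
    right = sorted(partial_choices(classes[half:]), reverse=True)  # value descending
    best_w = float("inf")  # minimum weight among left choices compatible so far
    i = 0
    for v_r, w_r in right:
        # As v_r decreases, more left choices satisfy the value threshold.
        while i < len(left) and left[i][0] + v_r <= V:
            best_w = min(best_w, left[i][1])
            i += 1
        if best_w + w_r <= W:
            return True
    return False
```

The final sweep plays the role of the two-class instance mentioned above: for every partial choice on the right it keeps the lightest partial choice on the left that respects the value threshold.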

Our aim in this section is to replace Q with the parameter a (which never exceeds Q). The overall running time is going to be \(O(N+\sqrt {a\lambda }\log A)\).

Again, we will partition the set of classes into two groups, for each group we will generate a subset of all partial choices, and then we will check if two partial choices can be joined into a feasible solution. However, several questions arise with this approach in order to obtain the desired complexity:
  (1) How to partition the set of classes?

  (2) In what order should the partial choices be generated?

  (3) How many partial choices should be generated, given that the value of the parameter a is not known in advance?

As for question (1), we consider all partitions of the form D = {1,…,j} ∪ {j + 1,…,n} for 1 ≤ j ≤ n. This results in an extra O(n) factor in the time complexity. However, in Section 5.7 we introduce preprocessing which reduces the general case to the case when \(n=O\left (\frac {\log A}{\log \lambda }\right )\).

A natural idea to deal with question (2) is to consider only partial choices with small values v(S) or w(S). This is close to our actual solution, which is based on a notion of ranks of partial choices that we introduce in Section 5.3.

Finally, to tackle question (3), we generate the partial choices batch-wise until either a solution is found or we can certify that it does not exist. The idea of this step is presented also in Section 5.3, while the generation procedure is detailed in Section 5.4. While dealing with these issues, a careful implementation is required to avoid several further extra factors in the running time.

In the end, we show that the number of partial choices that need to be generated is indeed \(O(\sqrt {a\lambda })\). Our final solution to Multichoice Knapsack is presented in Section 5.6 without the instance size reduction and in Section 5.8 using the reduction.

5.3 Ranks of Partial Choices

For a partial choice S, we define \(\text{rank}_{v}(S)\) as the number of partial choices \(S^{\prime }\) with the same domain for which \(v(S^{\prime })\le v(S)\). We symmetrically define \(\text{rank}_{w}(S)\). For simplicity, if \(c \in C_{i}\), we denote \(\text{rank}_{v}(c) = \text{rank}_{v}(\{c\})\) and \(\text{rank}_{w}(c) = \text{rank}_{w}(\{c\})\). Ranks are introduced as an analogue of match probabilities in weighted sequences. Probabilities are multiplicative, while for ranks we have submultiplicativity:

Fact 5.2

If \(S = S_{1} \cup S_{2}\) is a decomposition of a partial choice S into two disjoint subsets, then \(\text{rank}_{v}(S_{1})\cdot \text{rank}_{v}(S_{2}) \le \text{rank}_{v}(S)\) (and the same holds for \(\text{rank}_{w}\)).

Proof

Let \(D_1\) and \(D_2\) be the domains of \(S_1\) and \(S_2\), respectively. For every pair of partial choices \(S^{\prime }_{1}\) over \(D_1\) and \(S^{\prime }_{2}\) over \(D_2\) such that \(v(S^{\prime }_{1}) \le v(S_{1})\) and \(v(S^{\prime }_{2}) \le v(S_{2})\), we have \(v(S^{\prime }_{1} \cup S^{\prime }_{2})=v(S^{\prime }_{1})+v(S^{\prime }_{2})\le v(S)\). Hence, \(S^{\prime }_{1}\cup S^{\prime }_{2}\) must be counted while determining \(\text{rank}_{v}(S)\). Since distinct pairs yield distinct unions, at least \(\text{rank}_{v}(S_{1})\cdot \text{rank}_{v}(S_{2})\) partial choices are counted. □
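
As a sanity check of Fact 5.2, the ranks can be computed by brute force on a toy instance. The sketch below is illustrative only: the helper `rank_v` and the sample values are assumptions, and a partial choice is represented simply by the tuple of values of its items.

```python
from itertools import product

def rank_v(values, domain, chosen):
    """Number of partial choices over `domain` with total value <= that of `chosen`."""
    target = sum(chosen)
    return sum(1 for other in product(*(values[i] for i in domain))
               if sum(other) <= target)

values = [[1, 4], [2, 3, 7], [0, 5]]       # v(c) for the items of three classes
r1 = rank_v(values, [0], (4,))             # S1 picks value 4 from class 0
r2 = rank_v(values, [1, 2], (3, 0))        # S2 picks values 3 and 0 from classes 1 and 2
r = rank_v(values, [0, 1, 2], (4, 3, 0))   # S is the union of S1 and S2
print(r1, r2, r)                           # 2 2 4
```

Here the product of the two partial ranks equals the rank of the union; in general it is only a lower bound.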

For 0 ≤ j ≤ n, let \(\textbf {L}_{j}\) be the list of partial choices with domain {1,…,j} ordered by value v(S), and for ℓ > 0 let \(\textbf {L}_{j}[\ell ]\) be the ℓ-th element of \(\textbf {L}_{j}\). Analogously, for 1 ≤ j ≤ n + 1, we define \(\textbf {R}_{j}\) as the list of partial choices over {j,…,n} ordered by v(S), and for r > 0, \(\textbf {R}_{j}[r]\) as the r-th element of \(\textbf {R}_{j}\). If any of the partial choices \(\textbf {L}_{j}[\ell ]\), \(\textbf {R}_{j}[r]\) does not exist, we assume that its value is \(\infty \).

The following two observations yield a decomposition of each choice into a single item and two partial solutions of a small rank. Observe that we do not need to know AV in order to check if the ranks are sufficiently large.

Lemma 5.3

Let ℓ and r be positive integers such that \(v(\textbf {L}_{j}[\ell ]) + v(\textbf {R}_{j+1}[r]) > V\) for each 0 ≤ j ≤ n. For every choice S with v(S) ≤ V, there is an index j ∈ {1,…,n} and a decomposition S = L ∪ {c} ∪ R such that \(v(L) < v(\textbf {L}_{j-1}[\ell ])\), \(c \in C_{j}\), and \(v(R) < v(\textbf {R}_{j+1}[r])\).

Proof

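Let \(S = \{c_{1},\ldots ,c_{n}\}\) with \(c_{i} \in C_{i}\) and, for 0 ≤ i ≤ n, let \(S_{i} = \{c_{1},\ldots ,c_{i}\}\). If \(v(S_{n-1}) < v(\textbf {L}_{n-1}[\ell ])\), we set \(L = S_{n-1}\), \(c = c_{n}\), and \(R = \emptyset \), satisfying the claimed conditions.

Otherwise, we define j as the smallest index i such that \(v(S_{i}) \ge v(\textbf {L}_{i}[\ell ])\), and we set \(L = S_{j-1}\), \(c = c_{j}\), and \(R = S \setminus S_{j}\). The definition of j implies \(v(L) < v(\textbf {L}_{j-1}[\ell ])\) and \(v(L \cup \{c\}) \ge v(\textbf {L}_{j}[\ell ])\). Moreover, we have \(v(L \cup \{c\}) + v(R) = v(S) \le V < v(\textbf {L}_{j}[\ell ]) + v(\textbf {R}_{j+1}[r])\), and thus \(v(R) < v(\textbf {R}_{j+1}[r])\). □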

Fact 5.4

Let ℓ, r > 0. If \(v(\textbf {L}_{j}[\ell ]) + v(\textbf {R}_{j+1}[r]) \le V\) for some j ∈ {0,…,n}, then \(\ell r \le A_{V}\).

Proof

Let L and R be the ℓ-th and r-th entry in \(\textbf {L}_{j}\) and \(\textbf {R}_{j+1}\), respectively. Note that v(L ∪ R) ≤ V implies \(\text{rank}_{v}(L \cup R) \le A_{V}\) by definition of \(A_{V}\). Moreover, \(\text{rank}_{v}(L) \ge \ell \) and \(\text{rank}_{v}(R) \ge r\) (the inequalities may be strict due to ties). Now, Fact 5.2 yields the claimed bound. □

5.4 Generating Partial Choices of Small Rank

Note that \(\textbf {L}_{j}\) can be obtained by interleaving \(|C_{j}|\) copies of \(\textbf {L}_{j-1}\), where each copy corresponds to extending the choices from \(\textbf {L}_{j-1}\) with a different item. If we were to construct \(\textbf {L}_{j}\) having access to the whole \(\textbf {L}_{j-1}\), we could apply the following standard procedure. For each \(c \in C_{j}\), we maintain an iterator on \(\textbf {L}_{j-1}\) pointing to the first element S of \(\textbf {L}_{j-1}\) for which S ∪ {c} has not yet been added to \(\textbf {L}_{j}\). The associated value is v(S ∪ {c}). All iterators initially point at the first element of \(\textbf {L}_{j-1}\). Then the next element to append to \(\textbf {L}_{j}\) is always S ∪ {c} corresponding to the iterator with minimum value. Having processed this partial choice, we advance the iterator (or remove it, once it has already scanned the whole \(\textbf {L}_{j-1}\)). This process can be implemented using a binary heap \(H_{j}\) as a priority queue, so that initialisation requires \(O(|C_{j}|)\) time and outputting a single element takes \(O(\log |C_{j}|)\) time. Each partial choice \(S \in \textbf {L}_{j}\) is stored in O(1) space using a pointer to a partial choice \(S^{\prime } \in \textbf {L}_{j-1}\) such that \(S=S^{\prime } \cup \{c\}\) for some \(c \in C_{j}\).
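
The offline procedure just described can be sketched with Python's `heapq` module. This is a minimal illustration rather than the online algorithm of Lemma 5.5: the function name `extend_list` and the tuple layout (value, index of the parent in the previous list, index of the item) are assumptions, and every class is assumed to be non-empty.

```python
import heapq

def extend_list(prev, item_values):
    """Build L_j from L_{j-1}.

    prev: entries (value, parent_index, item_index) sorted by value.
    item_values: v(c) for each item c of class C_j.
    """
    # One iterator per item c, all starting at the first entry of L_{j-1}.
    heap = [(prev[0][0] + v, c, 0) for c, v in enumerate(item_values)]
    heapq.heapify(heap)                      # O(|C_j|) initialisation
    out = []
    while heap:
        value, c, pos = heapq.heappop(heap)  # iterator with minimum value
        out.append((value, pos, c))          # O(1) space: pointer into L_{j-1} plus the item
        if pos + 1 < len(prev):              # advance, or drop a finished iterator
            heapq.heappush(heap, (prev[pos + 1][0] + item_values[c], c, pos + 1))
    return out

L0 = [(0, None, None)]                       # the empty choice, with value 0
L1 = extend_list(L0, [3, 1])
L2 = extend_list(L1, [2, 0])
print([value for value, _, _ in L2])         # [1, 3, 3, 5]
```

Each output element costs one pop and at most one push, matching the \(O(\log |C_{j}|)\) bound per element.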

For i ≥ 0, let \(\textbf {L}^{(i)}_{j}\) be the prefix of \(\textbf {L}_{j}\) of length \(\min (i,|\textbf {L}_{j}|)\) and \(\textbf {R}^{(i)}_{j}\) be the prefix of \(\textbf {R}_{j}\) of length \(\min (i, |\textbf {R}_{j}|)\). A technical transformation of the procedure stated above leads to an online algorithm that constructs the prefixes \(\textbf {L}^{(i)}_{j}\) and \(\textbf {R}^{(i)}_{j}\), as shown in the following lemma. Along with each reported partial choice S, the algorithm also computes w(S).

Lemma 5.5

After O(N)-time initialisation, one can compute L1[i],…,Ln[i] knowing \(\textbf {L}^{(i-1)}_{1},\ldots ,\textbf {L}^{(i-1)}_{n}\) in \(O(n\log \lambda )\) time. Symmetrically, one can construct R1[i],…,Rn[i] from \(\textbf {R}^{(i-1)}_{1},\ldots ,\textbf {R}^{(i-1)}_{n}\) in the same time complexity.

Proof

Our online algorithm is going to use the same approach as the offline computation of lists \(\textbf {L}_{j}^{(i)}\). The order of computations will be different, though.

At each step, for j = 1 to n we shall extend lists \(\textbf {L}^{(i-1)}_{j}\) with a single element (unless the whole Lj has already been generated) from the top of the heap Hj. We keep an invariant that each iterator in Hj always points to an element that is already in \(\textbf {L}^{(i-1)}_{j-1}\) or to Lj− 1[i]: the first element that has not been yet added to Lj− 1, which is represented by the top of the heap Hj− 1.

We initialise the heaps as follows: we introduce \(H_{0}\) which represents the empty choice with \(v(\emptyset ) = 0\). Next, for j = 1,…,n we build the heap \(H_{j}\) representing \(|C_{j}|\) iterators initially pointing to the top of \(H_{j-1}\). The initialisation takes O(N) time in total since a binary heap can be constructed in time linear in its size.

At each step, the lists \(\textbf {L}^{(i-1)}_{j}\) are extended for consecutive values j from 1 to n. Since \(\textbf {L}^{(i-1)}_{j-1}\) is extended before \(\textbf {L}^{(i-1)}_{j}\), by the invariant, all iterators in Hj point to the elements of \(\textbf {L}^{(i)}_{j-1}\) while we compute Lj[i]. We take the top of Hj and move it to \(\textbf {L}^{(i)}_{j}\). Next, we advance the corresponding iterator and update its position in the heap Hj. After this operation, the iterator might point to the top of Hj− 1. If Hj− 1 is empty, this means that the whole list Lj− 1 has already been generated and traversed by the iterator. In this case, we remove the iterator.

This way we indeed simulate the previous offline solution. A single phase makes O(1) operations on each heap Hj. The running time is bounded by \(O({\sum }_{j} \log |C_{j}|)=O(n\log \lambda )\) at each step of the algorithm. □

5.5 Multichoice Knapsack for n = 2 Classes

Let us recall the final processing of the meet-in-the-middle solution to the Knapsack problem [12]. We formulate it in terms of Multichoice Knapsack with two classes.

An item \(c \in C_{j}\) is irrelevant if there is another item \(c^{\prime }\in C_{j}\) that dominates c, i.e., such that \(v(c) \ge v(c^{\prime })\) and \(w(c) \ge w(c^{\prime })\). Observe that removing an irrelevant item leads to an equivalent instance of the Multichoice Knapsack problem, and it may only decrease the parameters \(A_{V}\) and \(A_{W}\).

Lemma 5.6

The Multichoice Knapsack problem can be solved in O(N) time if n = 2 and the elements c of \(C_{1}\) and \(C_{2}\) are sorted by v(c).

Proof

Since the items of \(C_{1}\) and \(C_{2}\) are sorted by v(c), a single scan through these items lets us remove all irrelevant elements. Next, for each \(c_{1} \in C_{1}\) we compute \(c_{2} \in C_{2}\) such that \(v(c_{2}) \le V - v(c_{1})\) but otherwise \(v(c_{2})\) is largest possible. As we have removed irrelevant elements from \(C_{2}\), this item also minimises \(w(c_{2})\) among all elements satisfying \(v(c_{2}) \le V - v(c_{1})\). Hence, if there is a feasible solution containing \(c_{1}\), then \(\{c_{1},c_{2}\}\) is feasible. If we process elements \(c_{1}\) by non-decreasing values \(v(c_{1})\), the values \(v(c_{2})\) do not increase, and thus the items \(c_{2}\) can be computed in O(N) time in total. □
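
The scan in this proof can be sketched as a two-pointer loop. This is a hedged illustration: items are represented as (value, weight) pairs, the function names are hypothetical, and the call to `sorted` is included only to keep the example self-contained (the lemma assumes presorted input, which is what gives the O(N) bound).

```python
def prune_dominated(items):
    """Keep a chain with non-decreasing value and strictly decreasing weight.

    `items` must be sorted by value; an item is dropped when an earlier item
    (with value no larger) already has weight no larger.
    """
    kept, best_w = [], float("inf")
    for v, w in items:
        if w < best_w:
            kept.append((v, w))
            best_w = w
    return kept

def two_class_knapsack(C1, C2, V, W):
    """Return a feasible pair (c1, c2) or None."""
    A = prune_dominated(sorted(C1))
    B = prune_dominated(sorted(C2))
    k = len(B) - 1                      # candidate c2: largest value still allowed
    for v1, w1 in A:                    # v1 is non-decreasing, so k only moves left
        while k >= 0 and B[k][0] > V - v1:
            k -= 1
        if k < 0:
            break
        if w1 + B[k][1] <= W:           # B[k] has minimum weight among allowed items
            return (v1, w1), B[k]
    return None

C1 = [(1, 5), (3, 2), (4, 4)]
C2 = [(2, 6), (5, 1), (6, 3)]
print(two_class_knapsack(C1, C2, V=8, W=4))   # ((3, 2), (5, 1))
print(two_class_knapsack(C1, C2, V=8, W=2))   # None
```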

5.6 Multichoice Knapsack Parameterised by a

Combining the procedures of Lemmas 5.5 and 5.6 with the combinatorial results of Section 5.3, we obtain the first algorithm for Multichoice Knapsack parameterised by a.

Proposition 5.7

Multichoice Knapsack can be solved in \(O(n(\lambda +\sqrt {a\lambda })\log \lambda )\) time.

Proof

Below, we give an algorithm working in \(O(n(\lambda +\sqrt {A_{V}\lambda })\log \lambda )\) time. The final solution runs it in parallel on the original instance and on the instance with v and V swapped with w and W , waiting until at least one of them terminates.

We increment an integer r starting from 1, maintaining \(\ell =\left \lceil {\frac {r}{\lambda }} \right \rceil \) and the lists \(\textbf {L}_{j}^{(\ell )}\) and \(\textbf {R}_{j + 1}^{(r)}\) for 0 ≤ j ≤ n, as long as \(v(\textbf {L}_{j}[\ell ]) + v(\textbf {R}_{j + 1}[r]) \le V\) for some j (or until all the lists have been completely generated). By Fact 5.4, every such r satisfies \(\frac {r^{2}}{\lambda } \le \ell r \le A_{V}\), so we stop at \(r=O(\sqrt {A_{V} \lambda })\) and due to Lemma 5.5, the process takes \(O(n\sqrt {A_{V} \lambda }\log \lambda )\) time.

According to Lemma 5.3, every feasible solution S admits a decomposition S = L ∪ {c} ∪ R with \(L\in \textbf {L}_{j-1}^{(\ell )}\), \(c \in C_{j}\), and \(R\in \textbf {R}_{j + 1}^{(r)}\) for some index j. We consider all possibilities for j. For each of them we will reduce searching for S to an instance of the Multichoice Knapsack problem with 2 classes of \(O(\sqrt {A_{V}\lambda })\) items. By Lemma 5.6, these instances can be solved in \(O(n\sqrt {A_{V}\lambda })\) time in total.

The items of the j-th instance are going to belong to classes \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j}\) and \(\textbf {R}_{j + 1}^{(r)}\), where \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j} = \{L\cup \{c\} : L\in \textbf {L}_{j-1}^{(\ell )} , c\in C_{j}\}\). The set \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j}\) is constructed by merging |Cj|≤ λ sorted lists, each of size \(\ell =O(1+\sqrt {A_{V}/\lambda })\). This takes \(O((\lambda +\sqrt {A_{V}\lambda })\log \lambda )\) time, which results in \(O(n(\lambda +\sqrt {A_{V} \lambda })\log \lambda )\) time over all indices j.

Clearly, each feasible solution of the constructed instances represents a feasible solution of the initial instance, and by Lemma 5.3, every feasible solution of the initial instance has its counterpart in one of the constructed instances. □

5.7 Preprocessing to Reduce Instance Size

In order to improve the running time for Multichoice Knapsack, we develop two reductions and run them as preprocessing to the procedure of Proposition 5.7. First, we observe that items c with \(\text{rank}_{v}(c) > A_{V}\) or \(\text{rank}_{w}(c) > A_{W}\) cannot belong to any feasible solution. Moreover, their removal results in λ ≤ a, which lets us hide the \(O(n\lambda \log \lambda )\) term in the running time. Our second reduction decreases the number of classes n to \(O\left (\frac {\log A}{\log \lambda }\right )\). For this, we repeatedly remove irrelevant items (as defined in Section 5.5) and merge small classes into their Cartesian product (so that the class sizes are more balanced).

For each class Ci, let \(v_{\min }(i) = \min \{v(c) : c\in C_{i}\}\). Also, let \(V_{\min } = {\sum }_{i = 1}^{n} v_{\min }(i)\); note that \(V_{\min }\) is the smallest possible value v(S) of a choice S. We symmetrically define \(w_{\min }(i)\) and \(W_{\min }\).

Lemma 5.8

Given an instance I of the Multichoice Knapsack problem, one can compute in O(N) time an equivalent instance \(I^{\prime }\) with \(N^{\prime } \le N\), \(n^{\prime }=n\), \(A_{V}^{\prime }= A_{V}\), \(A_{W}^{\prime }= A_{W}\), and \(\lambda ^{\prime }\le \min (\lambda ,a)\).

Proof

From each class \(C_{i}\) we remove all items c such that \(V_{\min }+v(c)-v_{\min }(i)>V\) or \(W_{\min }+w(c)-w_{\min }(i)>W\). Afterwards, for each item \(c \in C_{i}\) one can obtain a choice S such that \(c \in S\) and v(S) ≤ V (or w(S) ≤ W) by choosing the elements with the minimal value (minimal weight, respectively) in all the remaining classes. □
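
The removal rule of Lemma 5.8 translates almost literally into code. The sketch below is illustrative only: classes are lists of (value, weight) pairs, the function name is hypothetical, and a class that becomes empty signals a NO-instance.

```python
def prune_infeasible_items(classes, V, W):
    """Drop items that cannot belong to any choice within the thresholds."""
    v_min = [min(v for v, _ in C) for C in classes]
    w_min = [min(w for _, w in C) for C in classes]
    V_min, W_min = sum(v_min), sum(w_min)
    return [[(v, w) for v, w in C
             if V_min + v - v_min[i] <= V and W_min + w - w_min[i] <= W]
            for i, C in enumerate(classes)]

classes = [[(1, 4), (5, 1), (9, 0)], [(2, 2), (3, 1)]]
print(prune_infeasible_items(classes, V=7, W=6))
# [[(1, 4), (5, 1)], [(2, 2), (3, 1)]]
```

In the example, the item (9, 0) is removed because even the cheapest completion has value 9 + 2 = 11 > 7.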

Our second preprocessing consists of several steps. First, we quickly reduce the number of classes to \(n=O(\log A)\).

Lemma 5.9

Given an instance I of the Multichoice Knapsack problem, one can compute in linear time an equivalent instance \(I^{\prime }\) with \(N^{\prime }\le N\), \(A_{V}^{\prime }\le A_{V}\), \(A_{W}^{\prime }\le A_{W}\), \(\lambda ^{\prime }\le \lambda \), and \(n^{\prime } \le 2\log A\).

Proof

Observe that if a class Ci contains an item c for which both \(v(c)=v_{\min }(i)\) and \(w(c)=w_{\min }(i)\), then we can greedily include it in the solution S. Hence, we can remove such a class, setting \(V := V- v_{\min }(i)\) and \(W := W- w_{\min }(i)\). We execute this reduction rule exhaustively, which clearly takes O(N) time in total and may only decrease the parameters AV and AW. After the reduction, the minima \(v_{\min }(i)\) and \(w_{\min }(i)\) must be attained by distinct items of every class Ci.

We shall prove that now we can either find out that \(A \ge 2^{n/2}\) or that we are dealing with a NO-instance. To decide which case holds, let us define \({\Delta }_{V}(i)\) as the difference between the second smallest value in the multiset \(\{v(c) : c \in C_{i}\}\) and \(v_{\min }(i)\). We set \({\Delta }_{V}^{\text {mid}}\) as the sum of the \(\left \lceil {\frac {n}{2}} \right \rceil \) smallest values \({\Delta }_{V}(i)\) for 1 ≤ i ≤ n; we define \({\Delta }_{W}^{\text {mid}}\) analogously.

Claim 1

If \(V_{\min } + {\Delta }_{V}^{\text {mid}} \le V\), then \(A_{V} \ge 2^{n/2}\); if \(W_{\min } + {\Delta }_{W}^{\text {mid}} \le W\), then \(A_{W} \ge 2^{n/2}\); otherwise, we are dealing with a NO-instance.

Proof

First, assume that \(V_{\min } + {\Delta }_{V}^{\text {mid}} \le V\). This means that there is a choice S with v(S) ≤ V containing at least \(\frac {n}{2}\) items c such that \(\text{rank}_{v}(c) \ge 2\). Hence, Fact 5.2 yields \(\text{rank} _{v}(S)\ge 2^{\left \lceil {n/2} \right \rceil }\) and consequently \(A_{V} \ge 2^{n/2}\), as claimed. Symmetrically, if \(W_{\min } + {\Delta }_{W}^{\text {mid}} \le W\), then \(A_{W} \ge 2^{n/2}\).

Now, suppose that there is a feasible solution S. As no class contains a single item minimising both v(c) and w(c), there are at least \(\left \lceil {\frac {n}{2}} \right \rceil \) classes for which S contains an item not minimising v(c), or at least \(\left \lceil {\frac {n}{2}} \right \rceil \) classes for which S contains an item not minimising w(c). Without loss of generality, we assume that the former holds. Let D be the set of at least \(\left \lceil {\frac {n}{2}} \right \rceil \) classes i satisfying the condition. If \(c \in C_{i}\) does not minimise v(c), then \(v(c)\ge v_{\min }(i)+{\Delta }_{V}(i)\). Consequently, \(V\ge v(S) \ge V_{\min } + {\sum }_{i\in D} {\Delta }_{V}(i)\). However, observe that \( {\sum }_{i\in D} {\Delta }_{V}(i) \ge {\Delta }_{V}^{\text {mid}}\), so \(V \ge V_{\min } + {\Delta }_{V}^{\text {mid}}\), as claimed. □

The conditions from the claim can be verified in O(N) time using a linear-time selection algorithm to compute \({\Delta }_{V}^{\text {mid}}\) and \({\Delta }_{W}^{\text {mid}}\). If any of the first two conditions holds, we return the instance obtained using our reduction. Otherwise, we output a dummy NO-instance. □
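
The first reduction rule in the proof of Lemma 5.9 (greedily fixing an item that minimises both value and weight in its class) can be sketched as follows. The representation of classes as lists of (value, weight) pairs and the function name are assumptions made for illustration; the test of Claim 1 is not included.

```python
def remove_trivial_classes(classes, V, W):
    """Remove every class containing an item that attains both minima."""
    kept = []
    for C in classes:
        v_min = min(v for v, _ in C)
        w_min = min(w for _, w in C)
        if (v_min, w_min) in C:      # one item minimises both value and weight
            V -= v_min               # take it greedily and shrink the thresholds
            W -= w_min
        else:
            kept.append(C)
    return kept, V, W

classes = [[(1, 1), (2, 3)], [(1, 4), (3, 2)]]
print(remove_trivial_classes(classes, V=5, W=6))
# ([[(1, 4), (3, 2)]], 4, 5)
```

After this step, the value minimum and the weight minimum of every remaining class are attained by distinct items, which is what Claim 1 relies on.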

In the improved reduction we use two basic steps. The first one is expressed in the following lemma.

Lemma 5.10

Consider a class of items in an instance of the Multichoice Knapsack problem. In linear time, we can remove some irrelevant items from the class so that the resulting class C satisfies \(\max (\text{rank} _{v}(c),\text{rank} _{w}(c)) > \frac {1}{3} |C|\) for each item cC.

Proof

First, note that using a linear-time selection algorithm, we can determine for each item c whether \(\text{rank} _{v}(c)\le \frac {1}{3}|C|\) and whether \(\text{rank} _{w}(c)\le \frac {1}{3}|C|\). If there is no item satisfying both conditions, we keep C unaltered. Otherwise, we have an item which dominates at least \(|C|-\text{rank} _{v}(c)-\text{rank} _{w}(c) \ge \frac {1}{3}|C|\) other items. We scan through all items in C and remove those dominated by c. Next, we repeat the algorithm. The running time of a single phase is clearly linear, and since |C| decreases geometrically, the total running time is also linear. □
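
The phases of this proof can be sketched as follows. For brevity the sketch computes ranks by sorting and binary search, so a phase costs O(|C| log |C|) rather than the linear time obtained with a selection algorithm; the function name and the (value, weight) representation are assumptions.

```python
from bisect import bisect_right

def balance_class(C):
    """Remove dominated items until max(rank_v, rank_w) > |C|/3 for every item."""
    C = list(C)
    while True:
        vs = sorted(v for v, _ in C)
        ws = sorted(w for _, w in C)
        # rank_v(c) = number of items with value <= v(c); similarly for rank_w.
        pivot = next((c for c in C
                      if 3 * bisect_right(vs, c[0]) <= len(C)
                      and 3 * bisect_right(ws, c[1]) <= len(C)), None)
        if pivot is None:
            return C
        # Keep the pivot once and drop every other item it dominates.
        C = [pivot] + [c for c in C
                       if c is not pivot and not (c[0] >= pivot[0] and c[1] >= pivot[1])]

print(balance_class([(0, 0), (1, 5), (2, 4), (3, 3), (4, 2), (5, 1)]))  # [(0, 0)]
print(balance_class([(1, 3), (2, 2), (3, 1)]))  # [(1, 3), (2, 2), (3, 1)]
```

In the first example the item (0, 0) has both ranks equal to 1 and dominates everything else; in the second no item has two small ranks, so the class is returned unchanged.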

The second reduction step decreases the number of classes by replacing two distinct classes Ci, Cj with their Cartesian product Ci × Cj, assuming that the value (weight) of a pair (ci,cj) is the sum of values (weights) of ci and cj. This clearly leads to an equivalent instance of the Multichoice Knapsack problem, does not alter the parameters AV, AW, and decreases n. On the other hand, N and λ may increase; the latter happens only if |Ci|⋅|Cj| > λ.

These two reduction rules let us implement our preprocessing procedure.

Lemma 5.11

Given an instance I of the Multichoice Knapsack problem, one can compute in \(O(N+\lambda \log A)\) time an equivalent instance \(I^{\prime }\) with \(A_{V}^{\prime }\le A_{V}\), \(A_{W}^{\prime }\le A_{W}\),\(\lambda ^{\prime }\le \lambda \), and \(n^{\prime }=O\left (\frac {\log A}{\log \lambda }\right )\).

Proof

First, we apply Lemma 5.9 to make sure that \(n\le 2\log A\) and \(N = O(\lambda \log A)\). We may now assume that \(\lambda \ge 3^{6}\), as otherwise we already have \(n = O\left (\frac {\log A}{\log \lambda }\right )\).

Throughout the algorithm, whenever there are two distinct classes of size at most \(\sqrt {\lambda }\), we shall replace them with their Cartesian product. This may happen only n − 1 times, and a single execution takes O(λ) time, so the total running time needed for this part is \(O(\lambda \log A)\).

Furthermore, for every class that we get in the input instance or obtain as a Cartesian product, we apply Lemma 5.10. The total running time spent on this is also \(O(\lambda \log A)\).

Having exhaustively applied these reduction rules, we are guaranteed that we have \(\max (\text{rank} _{v}(c),\text{rank} _{w}(c))>\frac {1}{3}\sqrt {\lambda }\ge \lambda ^{\frac {1}{3}}\) for items c from all but one class. Without loss of generality, we assume that the classes satisfying this condition are C1,…,Ck.

Recall that \(v_{\min }(i)\) and \(w_{\min }(i)\) are defined as minimum values and weights of items in class \(C_{i}\) and that \(V_{\min }\) and \(W_{\min }\) are their sums over all classes. For 1 ≤ i ≤ k, we define \({\Delta }_{V}(i)\) as the difference between the \(\left \lceil {\lambda ^{\frac {1}{3}}}\right \rceil \)-th smallest value in the multiset \(\{v(c) : c \in C_{i}\}\) and \(v_{\min }(i)\). Next, we define \({\Delta }_{V}^{\text {mid}}\) as the sum of the \(\left \lceil {\frac {k}{2}} \right \rceil \) smallest values \({\Delta }_{V}(i)\). Symmetrically, we define \({\Delta }_{W}(i)\) and \({\Delta }_{W}^{\text {mid}}\). We shall prove a claim analogous to that in the proof of Lemma 5.9.

Claim 2

If \(V_{\min } + {\Delta }_{V}^{\text {mid}}\le V\), then \(A_{V} \ge \lambda ^{\frac {1}{6} k}\); if \(W_{\min } + {\Delta }_{W}^{\text {mid}}\le W\), then \(A_{W} \ge \lambda ^{\frac {1}{6} k}\); otherwise, we are dealing with a NO-instance.

Proof

First, suppose that \(V_{\min } + {\Delta }_{V}^{\text {mid}}\le V\). This means that there is a choice S with v(S) ≤ V which contains at least \(\frac {k}{2}\) items c with \(\text{rank} _{v}(c)\ge \lambda ^{\frac {1}{3}}\). By Fact 5.2, the rank of this choice is at least \(\lambda ^{\frac {1}{6} k}\), so \(A_{V} \ge \lambda ^{\frac {1}{6} k}\), as claimed. The proof of the second case is analogous.

Now, suppose that there is a feasible solution \(S = \{c_{1},\ldots ,c_{n}\}\). For 1 ≤ i ≤ k, we have \(\text{rank} _{v}(c_{i})\ge \lambda ^{\frac {1}{3}}\) or \(\text{rank} _{w}(c_{i}) \ge \lambda ^{\frac {1}{3}}\). Consequently, \(\text{rank} _{v}(c_{i})\ge \lambda ^{\frac {1}{3}}\) holds for at least \(\left \lceil {\frac {k}{2}} \right \rceil \) classes or \(\text{rank} _{w}(c_{i})\ge \lambda ^{\frac {1}{3}}\) holds for at least \(\left \lceil {\frac {k}{2}} \right \rceil \) classes. Without loss of generality, we assume that the former holds. Let D be the set of (at least \(\left \lceil {\frac {k}{2}} \right \rceil \)) classes i satisfying the condition. For each \(i \in D\), we clearly have \(v(c_{i})\ge v_{\min }(i)+{\Delta }_{V}(i)\), while for each \(i \notin D\), we have \(v(c_{i})\ge v_{\min }(i)\). Consequently, \(V\ge v(S) \ge V_{\min } + {\sum }_{i\in D} {\Delta }_{V}(i) \ge V_{\min } + {\Delta }_{V}^{\text {mid}}\). Hence, \(V \ge V_{\min } + {\Delta }_{V}^{\text {mid}}\), which concludes the proof. □

The condition from the claim can be verified using a linear-time selection algorithm: first, we apply it for each class to compute ΔV(i) and ΔW(i), and then, globally, to determine \({\Delta }_{V}^{\text {mid}}\) and \({\Delta }_{W}^{\text {mid}}\). If one of the first two conditions holds, we return the instance obtained through the reduction. It satisfies \(A \ge \lambda ^{\frac {1}{6} k}\), i.e., \(n \le 1+k \le 1 + 6\frac {\log A}{\log \lambda }\). Otherwise, we construct a dummy NO-instance. □

5.8 Main Result

We apply the preprocessing of the previous section to arrive at our final algorithm.

Theorem 1.3

Multichoice Knapsack can be solved in \(O(N+\sqrt {a\lambda }\log A)\) time.

Proof

Before running the algorithm of Proposition 5.7, we apply the reductions of Lemmas 5.8 and 5.11. With this order of reductions, we already have λ ≤ a during the execution of Lemma 5.11, so the \(O(\lambda \log A)\) term is dominated by \(O(\sqrt {a\lambda }\log A)\). □

6 Weighted Consensus and General Weighted Pattern Matching

The Weighted Consensus problem is formally defined as follows.

Due to Facts 4.1 and 5.1, the Weighted Consensus problem is essentially equivalent to Multichoice Knapsack. The only difference is that we study Multichoice Knapsack with respect to unknown parameters a and A, whereas in Weighted Consensus we know the parameter z. By Observation 4.2, these values for equivalent instances satisfy a ≤ A ≤ z, so Theorem 1.3 immediately yields:

Proposition 6.1

Weighted Consensus can be solved in \(O(R+\sqrt {z\lambda }\log z)\) time.

In Sections 6.2 and 6.3 we show that the \(O(\log z)\) term can be reduced to \(O(\log \lambda + \log \log z)\). Such an improvement is possible because the bound a ≤ A ≤ z is not tight in general.

If two weighted sequences admit a consensus, we write \(X \approx _{\frac {1}{z}} Y\) and say that X matches Y with probability at least \(\frac {1}{z}\). With this definition of a match, we extend the notion of an occurrence and the notation \(\mathit {Occ}_{\frac {1}{z}}(P,T)\) to arbitrary weighted sequences.

In the case of the GWPM problem, it is more useful to provide an oracle that finds witness strings that correspond to the respective occurrences of the pattern. Such an oracle, given \(i \in \mathit {Occ}_{\frac {1}{z}}(P,T)\), computes a string that matches both P and T[i..i + m − 1].

6.1 Reduction to Weighted Consensus on Short Sequences

The GWPM problem clearly can be reduced to n − m + 1 instances of Weighted Consensus. This leads to a naïve \(O(nR + n\sqrt {z\lambda }\log z)\)-time algorithm. In this subsection, we remove the first term in this complexity.

Our solution applies the tools developed in Section 4 for Weighted Pattern Matching and uses an observation that is a consequence of Observation 4.3.

Observation 6.2

If X and Y are weighted sequences that match with threshold \(\frac {1}{z}\), then \(d_H(\textbf {H}(X),\textbf {H}(Y)) \le 2\left \lfloor {\log z} \right \rfloor \). Moreover, there exists a consensus string S such that S[i] = H(X)[i] = H(Y )[i] unless H(X)[i]≠H(Y )[i].

Proof

The fact that \(X \approx _{\frac {1}{z}} Y\) means that there exists a string P such that \(P \approx _{\frac {1}{z}} X\) and \(P \approx _{\frac {1}{z}} Y\). Let the set \(A_{1}\) represent the positions of mismatches between H(X) and P and the set \(A_{2}\) represent the positions of mismatches between H(Y) and P. By Observation 4.3, \(|A_{1}|,|A_{2}| \le \left \lfloor {\log z} \right \rfloor \). Let A be the set of mismatches between H(X) and H(Y). We have \(A \subseteq A_{1} \cup A_{2}\) and thus \(|A| \le 2\left \lfloor {\log z} \right \rfloor \). Finally, observe that for each \(i \in (A_{1} \cup A_{2}) \setminus A\) we may replace P[i] with H(X)[i] = H(Y)[i] to obtain a string S such that \(S \approx _{\frac {1}{z}} X\) and \(S \approx _{\frac {1}{z}} Y\) and S[i] = H(X)[i] = H(Y)[i] unless \(i \in A\). □
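
Observation 6.2 gives a cheap necessary condition that can be tested before any consensus computation. The sketch below is only an illustration under stated assumptions: a weighted sequence is a list of dictionaries mapping letters to probabilities, H(X) picks a most probable letter at each position (ties broken by dictionary order), and the logarithm is taken to base 2. The function names are hypothetical, and the check is done naively rather than with lcp-queries.

```python
from math import floor, log2

def heavy_string(X):
    """H(X): a most probable letter at every position of the weighted sequence."""
    return [max(column, key=column.get) for column in X]

def may_match(X, Y, z):
    """Necessary condition: at most 2*floor(log z) mismatches between H(X) and H(Y)."""
    mismatches = sum(a != b for a, b in zip(heavy_string(X), heavy_string(Y)))
    return mismatches <= 2 * floor(log2(z))

X = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}, {"a": 0.6, "b": 0.4}]
Y = [{"a": 0.1, "b": 0.9}, {"a": 0.7, "b": 0.3}, {"a": 0.3, "b": 0.7}]
print(may_match(X, Y, z=2))   # False: 3 mismatches exceed the bound 2
print(may_match(X, Y, z=4))   # True: the bound is 4, so a match is not ruled out
```

A `True` answer does not certify a match; it only means that the pair survives the filter.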

The algorithm starts by computing \(P^{\prime }=\textbf {H}(P)\) and \(T^{\prime }=\textbf {H}(T)\) and the data structure for lcp-queries in \(P^{\prime }T^{\prime }\). We try to match P against every factor T[p..p + m − 1] of the text. Following Observation 6.2, we check if \(d_H(T^{\prime }[p\mathinner {..} p+m-1], P^{\prime }) \le 2\left \lfloor {\log z} \right \rfloor \). If not, then we know that no match is possible. Otherwise, let D be the set of positions of mismatches between \(T^{\prime }[p\mathinner {..} p+m-1]\) and \(P^{\prime }\). Assume that we store \(\alpha = {\prod }_{j = 1}^{m} \pi ^{(T)}_{p+j-1}(T^{\prime }[p+j-1])\) and \(\beta = {\prod }_{j = 1}^{m} \pi ^{(P)}_{j}(P^{\prime }[j])\). Now, we only need to check what happens at the positions in D. If \(D = \emptyset \), it suffices to check if \(\alpha \ge \frac {1}{z}\) and \(\beta \ge \frac {1}{z}\).

Otherwise, we construct two weighted sequences X and Y by selecting only the positions from D in T[p..p + m − 1] and in P. In O(|D|) time we can compute \(\alpha ^{\prime }={\prod }_{j\notin D} \pi ^{(T)}_{p+j-1}(T^{\prime }[p+j-1])\) and \(\beta ^{\prime } = {\prod }_{j \notin D} \pi ^{(P)}_{j}(P^{\prime }[j])\). We multiply the probabilities of all letters at the first position in X by \(\alpha ^{\prime }\) and in Y by \(\beta ^{\prime }\). It is clear that \(X\approx _{\frac {1}{z}} Y\) if and only if \(T[p\mathinner {..} p+m-1]\approx _{\frac {1}{z}} P\).

Thus, we reduced the GWPM problem to at most n − m + 1 instances of the problem of Weighted Consensus for sequences of length \(O(\log z)\). If we memorise the solutions to all those instances together with the underlying sets of mismatches D, we can also implement the oracle for the GWPM problem with O(m)-time queries. We obtain the following reduction.

Lemma 6.3

The GWPM problem and the computation of its oracle can be reduced in \(O(R + (n-m + 1)\log z)\) time to at most n − m + 1 instances of the Weighted Consensus problem for weighted sequences of length \(O(\log z)\).

By Proposition 6.1, each of the resulting instances of Weighted Consensus can be solved in \(O(\lambda \log z + \sqrt {z\lambda }\log z)=O(\sqrt {z\lambda }\log z)\) time (due to z ≥ λ).

Proposition 6.4

The GWPM problem can be solved in \(O(n\sqrt {z\lambda }\log z)\) time. An oracle for the GWPM problem using \(O(n \log z)\) space and supporting queries in O(m) time can be computed within the same time complexity.

In the remainder of this section, we design a tailor-made solution which lets us improve the \(O(\log z)\) factors in Propositions 6.1 and 6.4 to \(O(\log \log z + \log \lambda )\).

6.2 Reduction to Short Dissimilar Weighted Consensus

Let us notice that in the previous section we actually reduced GWPM to instances of Weighted Consensus that satisfy an additional dissimilarity requirement, as stated in the following problem.

In the SDWC problem, we further require an ordering of letters according to their probabilities. This assumption is trivial if σ = O(1); otherwise, we use the preprocessing of Section 5.7 to expedite sorting. The following result refines Lemma 6.3.

Lemma 6.5

The GWPM problem and the computation of its oracle can be reduced in \(O(R + (n-m + 1)\lambda \log z)\) time to at most n − m + 1 instances of SDWC.

Proof

The reduction of Section 6.1 in \(O(R + (n-m + 1)\log z)\) time results in n − m + 1 dissimilar instances of length at most \(2\log z\). However, the characters are not ordered by non-increasing probabilities. Before we sort them, we apply Lemma 5.11 in order to reduce the length to \(O(\frac {\log z}{\log \lambda })\); this takes \(O(\lambda \log z)\) time. Note that both removing irrelevant characters and merging two positions into their Cartesian product preserve the property that the probabilities at each position sum up to at most one, so the resulting instance of Multichoice Knapsack can be interpreted back as an instance of Weighted Consensus. Finally, we sort the probabilities in \(O(\lambda \log \lambda )\) time per position, i.e., in \(O(\lambda \log z)\) time per instance of SDWC. □

6.3 Solving Short Dissimilar Weighted Consensus

6.3.1 Overview

We follow the same general meet-in-the-middle scheme as the algorithm for Multichoice Knapsack presented in Proposition 5.7. The latter relies on Lemma 5.3, whose analogue in terms of weighted sequences and probabilities is much simpler.

Observation 6.6

Consider weighted sequences X and Y of length n and \(z,z_{\ell },z_{r}\in \mathbb {R}_{+}\) such that \(z_{\ell }\cdot z_{r}\ge z\). Any \(S \in M_{z}(X) \cap M_{z}(Y)\) admits a decomposition \(S = L\cdot c\cdot R\), where:
  • \(\P (L, X[1\mathinner {..} |L|])\ge \frac {1}{z_{\ell }}\),

  • c is a single letter,

  • \(\P (R, X[n-|R|+ 1\mathinner {..} n])\ge \frac {1}{z_{r}}\).

Motivated by this formulation, we employ a notion of \(\frac {1}{z}\)-solid prefixes of a weighted sequence X (strings S such that \(S \approx _{\frac {1}{z}} X[1\mathinner {..} |S|]\)) and a symmetric notion of \(\frac {1}{z}\)-solid suffixes. By Observation 4.2, the number of \(\frac {1}{z}\)-solid prefixes of a weighted sequence X of length n is at most \(nz\). A direct application of the approach of Proposition 5.7, using solid prefixes and suffixes as partial choices, would result in generating up to \(nz_{\ell }\) solid prefixes and \(nz_{r}\) solid suffixes of X. Recall that, in case of SDWC, \(n = O(\log z)\).

However, \(\frac {1}{z}\)-solid prefixes have more structure than prefix partial choices of rank at most z. We exploit this structure by introducing a notion of light \(\frac {1}{z}\)-solid prefixes, that is, \(\frac {1}{z}\)-solid prefixes that end with a non-heavy letter in X; these are the key ingredient in our solution. We show that the number of light \(\frac {1}{z}\)-solid prefixes of X is at most z. Our algorithm for SDWC applies this fact to limit the number of generated \(\frac {1}{z_{\ell }}\)-solid prefixes and \(\frac {1}{z_{r}}\)-solid suffixes to \(z_{\ell }\) and \(z_{r}\), respectively.

The following subsections correspond to subsequent subsections of Section 5:
  • In Section 6.3.2 (corresponds to Section 5.3) we show the O(z) bound on the number of light \(\frac {1}{z}\)-solid prefixes (or suffixes) and prove a decomposition property for them that is similar to Observation 6.6 (but more complex).

  • Section 6.3.3 (corresponds to Section 5.4) contains an algorithm for generating light \(\frac {1}{z^{\prime }}\)-solid prefixes of X that are simultaneously \(\frac {1}{z}\)-solid prefixes of Y. Intuitively, light solid prefixes of a given length k ≤ n can be obtained from light solid prefixes of any length smaller than k by extending them with any character. This gives O(nλ) lists of solid prefixes to be merged by probabilities, which multiplies the complexity by \(O(\log (n\lambda )) = O(\log \log z + \log \lambda )\).

  • Section 6.3.4 (corresponds to Section 5.5) shows how to compute a solution based on sorted lists of common solid prefixes and suffixes of lengths summing up to n.

  • Section 6.3.5 (corresponds to Section 5.6) implements the meet-in-the-middle approach. Because of the more complicated decomposition property this part of the algorithm is the most complex. It consists of \(O(\log n)=O(\log \log z)\) phases.

6.3.2 Combinatorics of Light Solid Prefixes (Counterpart of Section 5.3)

We define a light \(\frac {1}{z}\)-solid prefix of a weighted sequence X as a \(\frac {1}{z}\)-solid prefix S of length k such that k = 0 or S[k]≠H(X)[k].

We say that a string P is a maximal \(\frac {1}{z}\)-solid prefix of a weighted sequence X if P is a \(\frac {1}{z}\)-solid prefix of X and no string \(P^{\prime } = Ps\), for s ∈Σ, is a \(\frac {1}{z}\)-solid prefix of X. Maximal solid prefixes have the following simple property, originally due to Amir et al. [1].

Fact 6.7 ([1])

A weighted sequence has at most z maximal \(\frac {1}{z}\)-solid prefixes, that is, \(\frac {1}{z}\)-solid prefixes which cannot be extended to any longer \(\frac {1}{z}\)-solid prefix.

Fact 6.7 lets us bound the number of light solid prefixes.

Fact 6.8

A weighted sequence has at most z different light \(\frac {1}{z}\)-solid prefixes.

Proof

We show a pair of inverse mappings between the set of maximal \(\frac {1}{z}\)-solid prefixes of a weighted sequence X and the set of light \(\frac {1}{z}\)-solid prefixes of X. If P is a maximal \(\frac {1}{z}\)-solid prefix of X, then we obtain a light \(\frac {1}{z}\)-solid prefix by removing all trailing letters of P that are heavy letters at the corresponding positions in X. For the inverse mapping, we extend each light \(\frac {1}{z}\)-solid prefix by heavy letters as long as the prefix is \(\frac {1}{z}\)-solid. □
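The two mappings from the proof can be checked by brute force on a toy instance. The sketch below is only an illustration under assumptions that are not in the text: a weighted sequence is modelled as a list of dictionaries mapping letters to exact `Fraction` probabilities, ties for the heavy letter are broken alphabetically, and all function names are hypothetical.

```python
from fractions import Fraction

def solid_prefixes(X, z):
    """Return the set of all 1/z-solid prefixes of X (including the empty string)."""
    threshold = Fraction(1, z)
    frontier = {"": Fraction(1)}
    found = {""}
    for column in X:
        # Extend every solid prefix of the previous length by one letter.
        frontier = {
            s + letter: p * q
            for s, p in frontier.items()
            for letter, q in column.items()
            if p * q >= threshold
        }
        found.update(frontier)
    return found

def heavy_string(X):
    """H(X): a most probable letter at each position (ties broken alphabetically)."""
    return "".join(min(col, key=lambda c: (-col[c], c)) for col in X)

def strip_heavy(P, H):
    """Maximal -> light: remove trailing letters that are heavy in X."""
    while P and P[-1] == H[len(P) - 1]:
        P = P[:-1]
    return P

def extend_heavy(P, H, solid):
    """Light -> maximal: append heavy letters while the prefix stays solid."""
    while len(P) < len(H) and P + H[len(P)] in solid:
        P += H[len(P)]
    return P

# Toy weighted sequence of length 3 over the alphabet {a, b} (an assumption).
X = [
    {"a": Fraction(3, 4), "b": Fraction(1, 4)},
    {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    {"a": Fraction(2, 3), "b": Fraction(1, 3)},
]
z = 4
solid = solid_prefixes(X, z)
H = heavy_string(X)
light = {P for P in solid if not P or P[-1] != H[len(P) - 1]}
maximal = {P for P in solid
           if len(P) == len(X) or not any(P + s in solid for s in X[len(P)])}
```

On this instance both sets have three elements, which is consistent with the bound of z = 4 from Facts 6.7 and 6.8.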

With this notion and its symmetric counterpart, light \(\frac {1}{z}\)-solid suffixes, we can state a stronger version of Observation 6.6. Note that this is where the dissimilarity is crucial.

Lemma 6.9

Consider an instance \((X,Y,\frac {1}{z})\) of the SDWC problem, and let \(z_{\ell },z_{r} \ge 1\) be real numbers such that \(z_{\ell } z_{r} \ge z\). If \(X \approx _{\frac {1}{z}} Y\), then every consensus string S can be decomposed into S = LcCR such that the following conditions hold for some U,V ∈{X,Y }:
  • L is a light \(\frac {1}{z_{\ell }}\)-solid prefix of U,

  • c is a single letter,

  • all letters of C are heavy in V,

  • R is a light \(\frac {1}{z_{r}}\)-solid suffix of V.

Proof

We set L as the longest proper prefix of S which is a \(\frac {1}{z_{\ell }}\)-solid prefix of both X and Y , and we define k := |L|. Note that L is a light \(\frac {1}{z_{\ell }}\)-solid prefix of X or Y , because H(X) and H(Y ) are dissimilar. If k = n − 1, we conclude the proof by setting c = S[n] and taking C and R to be empty strings.

Otherwise, we have \(\P (S[1\mathinner {..} k + 1],V[1\mathinner {..} k + 1])<\frac {1}{z_{\ell }}\) for V = X or V = Y. Since \(\P (S,V)\ge \frac {1}{z}\) and \(z_{\ell } z_{r} \ge z\), this implies \(\P (S[k + 2\mathinner {..} n],V[k + 2\mathinner {..} n])\ge \frac {1}{z_{r}}\), i.e., that S[k + 2..n] is a \(\frac {1}{z_{r}}\)-solid suffix of V . We set c = S[k + 1], C as the longest prefix of S[k + 2..n] composed of letters heavy in V , and R as the remaining suffix of S[k + 2..n]. Then R is clearly a light \(\frac {1}{z_{r}}\)-solid suffix of V . □

6.3.3 Generating Solid Prefixes (Counterpart of Section 5.4)

We say that a string P is a common \(\frac {1}{z}\)-solid prefix (suffix) of weighted sequences X and Y if it is a \(\frac {1}{z}\)-solid prefix (suffix) of both X and Y. Let \((X,Y,\frac {1}{z})\) be an instance of the SDWC problem. A standard representation of a common \(\frac {1}{z}\)-solid prefix P of length k of X and Y is a triple (P,p1,p2) such that p1 and p2 are the probabilities p1 = P(P,X[1..k]) and p2 = P(P,Y [1..k]).

If σ is constant, the string P can be directly represented using \(O(\log z)\) bits due to \(|P|=O(\log z)\). Otherwise, P is written using variable-length encoding so that a letter that occurs at a given position with probability p in X has a representation that consists of \(O(\log \frac {1}{p})\) bits. For every position i, the encoding can be constructed by assigning subsequent integer identifiers to letters according to the non-increasing order of \(\pi _{i}^{(X)}(c)\). Note that an instance of the SDWC problem provides us with the desired sorted order of letters. This lets us store a \(\frac {1}{z}\)-solid prefix using \(O(\log z)\) bits: we concatenate the variable-length representations of its letters and we store a bit mask of size \(O(\log z)\) that stores the delimiters between the representations of single letters.

In either case, our assumptions on the model of computations imply that the standard representation takes constant space. Moreover, constant time is sufficient to extend a common \(\frac {1}{z}\)-solid prefix by a given letter. An analogous representation can be used also to store common \(\frac {1}{z}\)-solid suffixes.
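One possible realisation of the variable-length encoding is sketched below. It is an assumption-laden illustration rather than the representation used in the paper: bit strings are modelled as Python strings of `0` and `1` instead of machine words, the identifier of a letter is its 0-based rank in the non-increasing order of probabilities, and every identifier is written with at least one bit so that the delimiter mask is well defined. Since the letter of 1-based rank r has probability at most 1/r, its code has at most \(1+\log \frac {1}{p}\) bits.

```python
def rank_tables(X):
    """For each position, the letters in non-increasing order of probability."""
    return [sorted(col, key=lambda c: (-col[c], c)) for col in X]

def encode(S, order):
    """Return (code, mask); mask has a 1 at the last bit of every letter code."""
    code, mask = "", ""
    for i, ch in enumerate(S):
        r = order[i].index(ch)            # 0-based rank of the letter
        width = max(1, r.bit_length())    # at least one bit per letter
        code += format(r, "0{}b".format(width))
        mask += "0" * (width - 1) + "1"
    return code, mask

def decode(code, mask, order):
    """Invert encode using the delimiter mask."""
    letters, start = [], 0
    for end, bit in enumerate(mask):
        if bit == "1":
            r = int(code[start:end + 1], 2)
            letters.append(order[len(letters)][r])
            start = end + 1
    return "".join(letters)

# Hypothetical weighted sequence of length 2 over {a, c, g}.
X = [
    {"a": 0.5, "c": 0.3, "g": 0.2},
    {"a": 0.1, "c": 0.2, "g": 0.7},
]
order = rank_tables(X)
code, mask = encode("gg", order)
```

Here the unlikely letter g at the first position receives a two-bit code, whereas the heavy letter g at the second position receives a single bit.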

The following observation describes longer light solid prefixes in terms of shorter ones.

Observation 6.10

Let P be a non-empty light \(\frac {1}{z}\)-solid prefix of X. If one removes its last letter and then removes all the trailing letters which are heavy at the respective positions in X, then a shorter light \(\frac {1}{z}\)-solid prefix of X is obtained.

We build upon Observation 6.10 to derive an efficient algorithm for generating light solid prefixes.

Lemma 6.11

Let \((X,Y,\frac {1}{z})\) be an instance of the SDWC problem and let \(z^{\prime }\le z\). The standard representations of all common \(\frac {1}{z}\)-solid prefixes of X and Y being light \(\frac {1}{z^{\prime }}\)-solid prefixes of X, sorted first by their length and then by the probabilities in X, can be generated in \(O(z^{\prime } (\log \log z+\log \lambda )+\log ^{2} z)\) time.

Proof

For k ∈{0,…,n}, let Bk be a list of the requested solid prefixes of length k sorted by their probabilities p1 in X. Fact 6.8 guarantees that \({\sum }_{k = 0}^{n} |\textbf {B}_{k}| \le z^{\prime }\).

We compute the lists Bk for subsequent lengths k. We start with B0 containing the empty string with its probabilities p1 = p2 = 1. To compute Bk for k > 0, we use Observation 6.10. For a given i ∈{0,…,k − 1}, we iterate over all elements (P,p1,p2) of Bi ordered by the non-increasing probabilities p1 and try to extend each of them by the heavy letters in X at positions i + 1,…,k − 1 and by the letter s at position k. We process the letters s ordered by \({\pi }_{k}^{(X)}(s)\), ignoring the first one (H(X)[k]) and stopping as soon as we do not get a \(\frac {1}{z^{\prime }}\)-solid prefix of X.

More precisely, with \(X^{\prime }=\textbf {H}(X)\), we compute
$$p^{\prime}_{1}:=p_{1} \cdot \overset{k-1}{\underset{j=i + 1}{\prod}} \pi^{(X)}_{j}(X^{\prime}[j]) \cdot \pi^{(X)}_{k}(s)\quad\text{and}\quad p^{\prime}_{2}:=p_{2} \cdot \overset{k-1}{\underset{j=i + 1}{\prod}} \pi^{(Y)}_{j}(X^{\prime}[j]) \cdot \pi^{(Y)}_{k}(s),$$
check if \(p^{\prime }_{1} \ge \frac {1}{z^{\prime }}\) and \(p^{\prime }_{2} \ge \frac {1}{z}\), and, if so, insert \((P \cdot X^{\prime }[i + 1\mathinner {..} k-1] \cdot s,p^{\prime }_{1},p^{\prime }_{2})\) at the beginning of a new list Li,s, indexed both by the letter s and by the length i of the shorter light \(\frac {1}{z^{\prime }}\)-solid prefix. When we encounter an element (P,p1,p2) of Bi and a letter s for which \(p^{\prime }_{1} < \frac {1}{z^{\prime }}\), we proceed to the next element of Bi. If this happens for the heaviest letter s ≠ H(X)[k], we stop considering the current list Bi and proceed to Bi−1. The final step consists in merging all the kλ lists Li,s in the order of probabilities in X; the result is Bk.

Let us analyse the time complexity of the k-th step of the algorithm. If an element (P,p1,p2) and letter s that we consider satisfy \(p^{\prime }_{1} \ge \frac {1}{z^{\prime }}\), this accounts for a new light \(\frac {1}{z^{\prime }}\)-solid prefix of X. Hence, in total (over all steps) we consider \(O(z^{\prime })\) such elements. Note that some of these elements may be discarded due to the condition on \(p^{\prime }_{2}\).

For each inspected element (P,p1,p2), we also consider at most one letter s for which \(p^{\prime }_{1}\) is not sufficiently large. If this is not the only letter considered for this element, such a candidate can be charged to the previously considered letter. The opposite situation may happen once for each list Bi, which may give O(k) additional operations in the k-th step, \(O(\log ^{2} z)\) in total.

Thanks to the order in which the lists are considered, we can store products of probabilities \({\prod }_{j=i + 1}^{k-1} \pi ^{(X)}_{j}(X^{\prime }[j])\), \({\prod }_{j=i + 1}^{k-1}\pi ^{(Y)}_{j}(X^{\prime }[j])\) and factors \(X^{\prime }[i + 1\mathinner {..} k-1]\) so that the representation of each subsequent light \(\frac {1}{z^{\prime }}\)-solid prefix of length k is computed in O(1) time. Finally, the merging step in the k-th phase takes \(O(|\textbf {B}_{k}|\log (k\lambda )) = O(|\textbf {B}_{k}| (\log \log z+\log \lambda ))\) time if a binary heap of O(kλ) elements is used.

The time complexity of the whole algorithm is
$$O\left( \log^{2} z + {\sum}_{k = 1}^{n}|\textbf{B}_{k}| (\log \log z+\log \lambda)\right).$$
By the already mentioned Fact 6.8, this is \(O(\log ^{2} z+z^{\prime } (\log \log z+\log \lambda ))\). □
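The generation scheme based on Observation 6.10 can be sketched in simplified form as follows. This is not the procedure analysed above: it handles a single weighted sequence X only (the second probability p2 and the test against Y are omitted), it sorts each list Bk directly instead of merging the lists Li,s with a heap, and it does not stop early, so it does not achieve the stated running time. The data model (a list of dictionaries with `Fraction` probabilities) and all names are assumptions.

```python
from fractions import Fraction

def light_solid_prefixes_by_length(X, z):
    """B[k] lists the light 1/z-solid prefixes of X of length k with their
    probabilities, sorted by non-increasing probability."""
    n = len(X)
    threshold = Fraction(1, z)
    # Heavy letter of each position (ties broken alphabetically).
    H = [min(col, key=lambda c: (-col[c], c)) for col in X]
    B = [[("", Fraction(1))]] + [[] for _ in range(n)]
    for k in range(1, n + 1):
        candidates = []
        for i in range(k):
            # Heavy letters at 1-based positions i+1, ..., k-1.
            heavy_part = "".join(H[i:k - 1])
            heavy_prob = Fraction(1)
            for j in range(i, k - 1):
                heavy_prob *= X[j][H[j]]
            for P, p in B[i]:
                # Non-heavy letter s at position k.
                for s, q in X[k - 1].items():
                    if s == H[k - 1]:
                        continue
                    p_new = p * heavy_prob * q
                    if p_new >= threshold:
                        candidates.append((P + heavy_part + s, p_new))
        B[k] = sorted(candidates, key=lambda t: t[1], reverse=True)
    return B

# Toy weighted sequence of length 3 over {a, b} (an assumption).
X = [
    {"a": Fraction(3, 4), "b": Fraction(1, 4)},
    {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    {"a": Fraction(2, 3), "b": Fraction(1, 3)},
]
B = light_solid_prefixes_by_length(X, 4)
```

Every light solid prefix is produced exactly once, because removing its last letter and the preceding run of heavy letters determines the shorter light solid prefix uniquely.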

6.3.4 Merging Solid Prefixes with Suffixes (Counterpart of Section 5.5)

Next, we provide an analogue of Lemma 5.6.

Lemma 6.12

Let L and R be lists containing, for some k ∈{0,…,n}, standard representations of common \(\frac {1}{z}\)-solid prefixes of length k and common \(\frac {1}{z}\)-solid suffixes of length n − k of X and Y, respectively. If the elements of the lists are sorted according to non-decreasing probabilities in X and Y, respectively, one can check in O(|L| + |R|) time whether the concatenation of any \(\frac {1}{z}\)-solid prefix from L and \(\frac {1}{z}\)-solid suffix from R yields a consensus string S for X and Y.

Proof

First, we filter out dominated elements of the lists, i.e., elements (P,p1,p2) such that there exists another element \((P^{\prime },p_{1}^{\prime },p_{2}^{\prime })\) with \(p_{1}^{\prime } \ge p_{1}\) and \(p_{2}^{\prime } \ge p_{2}\). This can be done in linear time. After this operation, the list R is ordered according to non-increasing probabilities in X, so we reverse the list so that now both lists are ordered with respect to the non-decreasing probabilities in X.

For every element (P,p1,p2) of L, we compute the leftmost element \((P^{\prime },p^{\prime }_{1},p^{\prime }_{2})\) of R such that \(p_{1} p^{\prime }_{1} \ge \frac {1}{z}\). This element maximises \(p^{\prime }_{2}\) among all elements satisfying the latter condition. Hence, it suffices to check if \(p_{2} p^{\prime }_{2} \ge \frac {1}{z}\), and if so, report the result \(S=PP^{\prime }\). As the lists are ordered by p1 and \(p^{\prime }_{1}\), respectively, all such elements can be computed in O(|L| + |R|) total time. □
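The filtering and the linear scan can be sketched as follows, assuming that each list element is a triple (string, p1, p2) with exact `Fraction` probabilities in X and Y. The function names are hypothetical, and the filter only removes enough dominated elements to make both lists monotone, which is all that the scan needs.

```python
from fractions import Fraction

def undominated(items, sort_key, other_key):
    """Scan a list sorted by non-decreasing sort_key from its end and keep an
    element only if its other_key exceeds everything seen so far. The result
    is ordered by non-increasing sort_key and increasing other_key."""
    kept, best = [], None
    for item in reversed(items):
        if best is None or item[other_key] > best:
            kept.append(item)
            best = item[other_key]
    return kept

def find_consensus(L, R, z):
    """L is sorted by non-decreasing p1 (index 1), R by non-decreasing p2
    (index 2). Return a concatenation that is 1/z-solid in both X and Y,
    or None if no such pair exists."""
    threshold = Fraction(1, z)
    Lf = undominated(L, 1, 2)   # p1 non-increasing, p2 increasing
    Rf = undominated(R, 2, 1)   # p1 increasing, p2 non-increasing
    j = 0
    for P, p1, p2 in Lf:        # p1 decreases, so j only moves to the right
        while j < len(Rf) and p1 * Rf[j][1] < threshold:
            j += 1
        if j == len(Rf):
            break
        Q, q1, q2 = Rf[j]       # leftmost feasible element: largest q2
        if p2 * q2 >= threshold:
            return P + Q
    return None

h, q = Fraction(1, 2), Fraction(1, 4)
L = [("ab", q, h), ("aa", h, q)]   # sorted by non-decreasing p1
R = [("b", h, q), ("a", q, h)]     # sorted by non-decreasing p2
answer = find_consensus(L, R, 8)
```

The pointer j never moves backwards, so the scan after filtering takes O(|L| + |R|) time, as in the lemma.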

6.3.5 Merge-in-the-Middle Implementation (Counterpart of Section 5.6)

In this section, we solve the SDWC problem based on Lemma 6.9. We generate all candidates for Lc and R using Lemma 6.11, and we apply a divide-and-conquer procedure to fill this with C. Our procedure works for fixed U,V ∈{X,Y }; the algorithm repeats it for all four choices.

Let Li denote a list of all common \(\frac {1}{z}\)-solid prefixes of X and Y obtained by extending a light \(\frac {\sqrt {\lambda }}{\sqrt {z}}\)-solid prefix of U of length i − 1 by a single letter s at position i, and let Ri denote a list of all common \(\frac {1}{z}\)-solid suffixes of X and Y of length n − i + 1 that are light \(\frac {1}{\sqrt {z\lambda }}\)-solid suffixes of V. We assume that the lists Li and Ri are sorted according to the probabilities in U and V, respectively. We assume that Ln+1 = ∅, whereas Rn+1 contains only a representation of an empty string.

The following lemma shows how to compute the lists Li and Ri and bounds their total size. In case of σ = O(1) it is a direct consequence of Lemma 6.11. Otherwise, one needs to exercise caution when computing the lists Li.

Lemma 6.13

The total size of lists Li and Ri for i ∈{1,…,n + 1} is \(O(\sqrt {z \lambda })\); they can be computed in \(O(\sqrt {z\lambda } (\log \log z+\log \lambda ))\) time.

Proof

\(O(\sqrt {z\lambda } (\log \log z+\log \lambda ))\)-time computation of the lists Ri is directly due to Lemma 6.11. As for the lists Li, we first compute in \(O\left (\frac {\sqrt {z}}{\sqrt {\lambda }}(\log \log z+\log \lambda )\right )\) time the lists of all light \(\frac {\sqrt {\lambda }}{\sqrt {z}}\)-solid prefixes of U, sorted by the lengths of strings and then by the probabilities in U, again using Lemma 6.11. Then for each length i − 1 and for each letter s at the i-th position, we extend all these prefixes by a single letter. This way we obtain λ lists for a given i − 1 that can be merged according to the probabilities in U to form the list Li. Generation of the auxiliary lists takes \(O\left (\frac {\sqrt {z}}{\sqrt {\lambda }}\cdot \lambda \right )=O(\sqrt {z\lambda })\) time in total, and merging them using a binary heap takes \(O(\sqrt {z\lambda }\log \lambda )\) time. This way we obtain an \(O(\sqrt {z\lambda } (\log \log z+\log \lambda ))\)-time algorithm. □

Let \(\textbf {L}^{*}_{a,b}\) be a list of common \(\frac {1}{z}\)-solid prefixes of X and Y of length b obtained by taking a common \(\frac {1}{z}\)-solid prefix from Li for some i ∈{a,…,b} and extending it by b − i letters that are heavy at the respective positions in V. Similarly, \(\textbf {R}^{*}_{a,b}\) is a list of common \(\frac {1}{z}\)-solid suffixes of length n − a + 1 obtained by taking a common \(\frac {1}{z}\)-solid suffix from Ri for some i ∈{a,…,b} and prepending it by i − a letters that are heavy in V. Again, we assume that each of the lists \(\textbf {L}^{*}_{a,b}\) and \(\textbf {R}^{*}_{a,b}\) is sorted according to the probabilities in U and V, respectively.

A basic interval is an interval [a,b] represented by its endpoints 1 ≤ a ≤ b ≤ n + 1 such that \(2^{j}\) divides a − 1 and \(b=\min (n + 1,a + 2^{j}-1)\) for some integer j called the layer of the interval. For every \(j = 0,\ldots ,\left \lceil {\log (n + 1)} \right \rceil \), there are \({\Theta }\left (\frac {n}{2^{j}}\right )\) basic intervals in the j-th layer and they are pairwise disjoint.

Example 6.14

For n = 7, the basic intervals are [1,1], …, [8,8], [1,2], [3,4], [5,6], [7,8], [1,4], [5,8], [1,8].
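The list in the example can be reproduced directly from the definition. The short sketch below (with a hypothetical function name) enumerates the basic intervals layer by layer.

```python
def basic_intervals(n):
    """Return the basic intervals grouped by layer j = 0, 1, ..., ceil(log2(n + 1))."""
    m = n + 1
    layers = []
    j = 0
    while True:
        size = 1 << j
        # Left endpoints a with 2^j dividing a - 1; right endpoint clipped to n + 1.
        layers.append([(a, min(m, a + size - 1)) for a in range(1, m + 1, size)])
        if size >= m:
            break
        j += 1
    return layers

layers = basic_intervals(7)
flat = [interval for layer in layers for interval in layer]
```

When n + 1 is not a power of two, an interval clipped at n + 1 may appear in more than one layer, which is consistent with the definition above.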

Lemma 6.15

The total size of the lists \(\textbf {L}^{*}_{a,b}\) and \(\textbf {R}^{*}_{a,b}\) for all basic intervals [a,b] is \(O(\sqrt {z\lambda }\log \log z)\) and they can all be constructed in \(O(\sqrt {z\lambda }(\log \log z+\log \lambda ))\) time.

Proof

We compute all the lists \(\textbf {L}^{*}_{a,b}\) and \(\textbf {R}^{*}_{a,b}\) for basic intervals [a,b] of subsequent layers \(j = 0,\ldots ,\left \lceil {\log (n + 1)} \right \rceil \). For j = 0, we have \(\textbf {L}^{*}_{a,a} = \textbf {L}_{a}\) and \(\textbf {R}^{*}_{a,a} = \textbf {R}_{a}\). All these lists can be computed in \(O(\sqrt {z\lambda }(\log \log z+\log \lambda ))\) time via Lemma 6.13.

Suppose that we wish to compute \(\textbf {L}^{*}_{a,b}\) for a < b at layer j (the computation of \(\textbf {R}^{*}_{a,b}\) is symmetric). Take \(c = a + 2^{j-1} - 1\). Let us iterate through all the elements (P,p1,p2) of the list \(\textbf {L}^{*}_{a,c}\), extend each string P by H(V )[c + 1..b], and multiply the probabilities p1 and p2 by
$$\overset{b}{\underset{i=c + 1}{\prod}} \pi^{(X)}_{i}(\textbf{H}(V)[i]) \quad\text{and}\quad \overset{b}{\underset{i=c + 1}{\prod}} \pi^{(Y)}_{i}(\textbf{H} (V)[i]),$$
respectively. If a common \(\frac {1}{z}\)-solid prefix is obtained, it is inserted at the end of an auxiliary list L. The resulting list L is merged with \(\textbf {L}^{*}_{c + 1,b}\) according to the probabilities in U; the result is \(\textbf {L}^{*}_{a,b}\).

Thus, we can compute \(\textbf {L}^{*}_{a,b}\) in time proportional to the sum of lengths of \(\textbf {L}^{*}_{a,c}\) and \(\textbf {L}^{*}_{c + 1,b}\). (Note that the necessary products of probabilities can be computed in \(O(n) = O(\log z)\) total time.) For every \(j = 1,\ldots ,\left \lceil {\log n} \right \rceil \), the total length of the lists from the j-th layer does not exceed the total length of the lists from the (j − 1)-th layer. By Lemma 6.13, the lists at the 0-th layer have size \(O(\sqrt {z\lambda })\). The conclusion follows from the fact that \(\log n = O(\log \log z)\). □
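A single combination step can be sketched as follows. The sketch assumes that U = X, that list elements are triples (string, p1, p2) sorted by non-decreasing p1, and that the heavy factor H(V)[c + 1..b] together with its two probability products f1 and f2 has been precomputed; all names are hypothetical. Multiplying every p1 by the same factor f1 preserves the order, so the auxiliary list is already sorted and a linear merge suffices.

```python
from fractions import Fraction
from heapq import merge

def combine_layers(left, right, heavy_factor, f1, f2, z):
    """Build the list for [a, b] from the lists for [a, c] (left) and [c+1, b] (right)."""
    threshold = Fraction(1, z)
    extended = []
    for P, p1, p2 in left:
        q1, q2 = p1 * f1, p2 * f2
        # Keep only strings that remain common 1/z-solid prefixes.
        if q1 >= threshold and q2 >= threshold:
            extended.append((P + heavy_factor, q1, q2))
    # Both inputs are sorted by p1, so this is a linear-time two-way merge.
    return list(merge(extended, right, key=lambda t: t[1]))

left = [("ba", Fraction(1, 8), Fraction(1, 4)), ("aa", Fraction(1, 2), Fraction(1, 2))]
right = [("abb", Fraction(1, 8), Fraction(1, 8)), ("bbb", Fraction(1, 3), Fraction(1, 4))]
combined = combine_layers(left, right, "c", Fraction(1, 2), Fraction(1, 2), 8)
```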

Finally, we are ready to apply a divide-and-conquer approach to solve the SDWC problem:

Lemma 6.16

The SDWC problem can be solved in \(O(\sqrt {z\lambda } (\log \log z + \log \lambda ))\) time.

Proof

The algorithm goes along Lemma 6.9, considering all choices of U and V . For each of them, we proceed as follows.

First, we compute the lists Li, Ri for all i = 1,…,n and \(\textbf {L}^{*}_{a,b}\), \(\textbf {R}^{*}_{a,b}\) for all basic intervals. By Lemmas 6.13 and 6.15, this takes \(O(\sqrt {z\lambda } (\log \log z+\log \lambda ))\) time.

Note that, in order to find out if there is a feasible solution, it suffices to attempt joining a common \(\frac {1}{z}\)-solid prefix from Lj with a common \(\frac {1}{z}\)-solid suffix from Rk for some indices 1 ≤ j < k ≤ n + 1 by heavy letters of V at positions j + 1,…,k − 1. We use a recursive routine to find such a pair of indices j,k in a basic interval [a,b] which has positive length and therefore can be decomposed into two basic subintervals [a,c] and [c + 1,b]. Then either j ≤ c < k, or both indices j, k belong to the same interval [a,c] or [c + 1,b]. To check the first case, we apply the algorithm of Lemma 6.12 to \(L = \textbf {L}^{*}_{a,c}\) and \(R = \textbf {R}^{*}_{c + 1,b}\). The remaining two cases are solved by recursive calls for the subintervals. The recursive routine is called first for the basic interval [1,n + 1].

The computations performed by the routine for the basic intervals at the j-th level take at most the time proportional to the total size of lists \(\textbf {L}^{*}_{a,b}\), \(\textbf {R}^{*}_{a,b}\) at the (j − 1)-th level. Lemma 6.15 shows that the total size of the lists at all levels is \(O(\sqrt {z\lambda } \log \log z)\). Consequently, the whole recursive procedure works in \(O(\sqrt {z\lambda } \log \log z)\) time. Together with the computation of the lists, this gives \(O(\sqrt {z\lambda } (\log \log z+\log \lambda ))\) time in total. □
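The shape of the recursion can be checked independently of the weighted sequences. The sketch below only records, for each recursive call, which pairs of indices are examined at the split point; it assumes that an interval is split at \(c = a + 2^{j-1} - 1\) for the smallest layer j that fits it, and the function names are hypothetical.

```python
def crossing_pairs(a, b, handled):
    """Append to `handled` every pair (j, k) with a <= j <= c < k <= b for the
    split point c of [a, b], then recurse into both halves."""
    if a >= b:
        return
    layer = (b - a).bit_length()          # smallest layer with 2^layer >= b - a + 1
    c = a + (1 << (layer - 1)) - 1
    for j in range(a, c + 1):
        for k in range(c + 1, b + 1):
            handled.append((j, k))
    crossing_pairs(a, c, handled)
    crossing_pairs(c + 1, b, handled)

def pairs_for(n):
    handled = []
    crossing_pairs(1, n + 1, handled)
    return handled

pairs = pairs_for(7)
```

Every pair 1 ≤ j < k ≤ n + 1 is examined in exactly one call, which is why a single application of Lemma 6.12 per basic interval suffices.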

Lemma 6.16 combined with Lemma 6.5 provides an efficient solution for General Weighted Pattern Matching. It also gives a solution to Weighted Consensus (which is a special case of GWPM with n = m). Note that \(\lambda \log z = O(\sqrt {z\lambda } \log z)\) due to z ≥ λ.

Theorem 1.4

The General Weighted Pattern Matching problem can be solved in \(O(n\sqrt {z \lambda } (\log \log z+\log \lambda ))\) time, and the Weighted Consensus problem can be solved in \(O(R +\sqrt {z \lambda } (\log \log z+\log \lambda ))\) time.

7 Conditional Hardness of GWPM

The following reduction from Multichoice Knapsack to Weighted Consensus immediately yields that any significant improvement in the dependence on z and λ in the running time of our algorithm would lead to breaking long-standing barriers for special cases of Multichoice Knapsack.

Lemma 7.1

Given an instance I of the Multichoice Knapsack problem with n classes C1,…,Cn of maximum size λ, in linear time one can construct an equivalent instance of the Weighted Consensus problem with \(z=O({\prod }_{i = 1}^{n}|C_{i}|)\) and sequences of length O(n) over an alphabet of size λ.

Proof

We construct a pair of weighted sequences X,Y of length n over alphabet Σ = {1,…,λ}. Let \(C_{i} = \{c_{i,1},\ldots ,c_{i,|C_{i}|}\}\). Intuitively, choosing letter j at position i will correspond to taking ci,j to the solution S.

Without loss of generality, we assume that weights and values are non-negative. Otherwise, we may subtract \(v_{\min }(i)\) from v(ci,j) and \(w_{\min }(i)\) from w(ci,j) for each item ci,j, as well as \(V_{\min }\) from V and \(W_{\min }\) from W .

We set M to the smallest power of two such that \(M\ge \max (n, V, W)\). For j ∈{1,…,|Ci|}, we set:
$$p_{i}^{(X)}(j) = -\frac{\left\lceil {M\log|C_{i}|} \right\rceil + v(c_{i,j})}{M}, \quad p_{i}^{(Y)}(j)=-\frac{\left\lceil {M\log|C_{i}|} \right\rceil +w(c_{i,j})}{M}.$$
We then define \(\log \pi _{i}^{(X)}(j) = p_{i}^{(X)}(j)\) and \(\log \pi _{i}^{(Y)}(j) = p_{i}^{(Y)}(j)\) for j ∈Σ. Moreover, we set
$$\log z_{X} = \frac{1}{M} \left( V + \sum\limits_{i = 1}^{n}\left\lceil {M\log|C_{i}|} \right\rceil\right), \quad \log z_{Y} = \frac{1}{M}\left( W + \sum\limits_{i = 1}^{n}\left\lceil {M\log|C_{i}|} \right\rceil\right).$$
The following claim holds.

Claim 3

\({\sum }_{j = 1}^{|C_{i}|} \pi _{i}^{(X)}(j)\le 1\), \({\sum }_{j = 1}^{|C_{i}|} \pi _{i}^{(Y)}(j)\le 1\), and \(\max (z_{X},z_{Y}) \le 4{\prod }_{i = 1}^{n}|C_{i}|\).

Proof

As for the first inequality, we have:
$$\sum\limits_{j = 1}^{|C_{i}|} \pi_{i}^{(X)}(j)\ =\ \sum\limits_{j = 1}^{|C_{i}|} 2^{-\left\lceil {M\log|C_{i}|} \right\rceil/M} 2^{-v(c_{i,j})/M}\ \le\ \sum\limits_{j = 1}^{|C_{i}|} 2^{-\log |C_{i}|}\ =\ \sum\limits_{j = 1}^{|C_{i}|} \frac{1}{|C_{i}|} \le 1. $$
The second inequality is analogous. Finally, by the choice of M, we have
$$\max(z_{X},z_{Y})\ \le\ 2^{\frac{1}{M}(\max(V,W)+n)}\overset{n}{\underset{i = 1}{\prod}}|C_{i}|\ \le\ 4\overset{n}{\underset{i = 1}{\prod}}|C_{i}|.$$
This way, for a string P of length n, we have
$$\begin{array}{@{}rcl@{}} \log \P(P,X)=-\frac{1}{M}\left( \sum\limits_{i = 1}^{n}\left\lceil {M\log|C_{i}|} \right\rceil+\sum\limits_{i = 1}^{n} v(c_{i,P[i]})\right) \ge -\log z_{X} \\ \Longleftrightarrow \sum\limits_{i = 1}^{n} v(c_{i,P[i]}) \le V, \end{array} $$
$$\begin{array}{@{}rcl@{}} \log \P(P,Y)=-\frac{1}{M}\left( \sum\limits_{i = 1}^{n}\left\lceil {M\log|C_{i}|} \right\rceil+\sum\limits_{i = 1}^{n} w(c_{i,P[i]})\right) \ge -\log z_{Y} \\ \Longleftrightarrow \sum\limits_{i = 1}^{n} w(c_{i,P[i]}) \le W. \end{array} $$

Thus, P is a solution to the constructed instance of the Weighted Consensus problem with two threshold probabilities, \(\frac {1}{z_{X}}\) and \(\frac {1}{z_{Y}}\), if and only if S = {ci,j : P[i] = j} is a solution to the underlying instance of the Multichoice Knapsack problem. To have a single threshold \(z=\max (z_{X},z_{Y})\), we append an additional position n + 1 with symbol 1 only, with \(p_{n + 1}^{(X)}(1)= 0\) and \(p_{n + 1}^{(Y)}(1)=\log z_{Y} - \log z_{X}\) provided that \(z_{X} \ge z_{Y}\), and symmetrically otherwise.

If one wants to make sure that the probabilities at each position sum up to exactly one, two further letters can be introduced, one of which gathers the remaining probability in X and has probability 0 in Y, and the other gathers the remaining probability in Y, and has probability 0 in X. □
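The core of the construction (before the extra position and the extra letters are added) can be sketched with exact arithmetic in the logarithmic domain. The sketch assumes that a Multichoice Knapsack instance is a list of classes of (value, weight) pairs with non-negative integer entries and that a choice is feasible when its total value is at most V and its total weight is at most W, as in the equivalences above; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def ceil_m_log2(c, M):
    """Return ceil(M * log2(c)) exactly: the least t with 2**t >= c**M."""
    return (c ** M - 1).bit_length()

def reduce_to_weighted_consensus(classes, V, W):
    """Return base-2 logarithms of the probabilities in X and Y and of z_X, z_Y."""
    M = 1
    while M < max(len(classes), V, W):   # smallest power of two >= max(n, V, W)
        M *= 2
    logX, logY, total = [], [], 0
    for items in classes:
        t = ceil_m_log2(len(items), M)
        total += t
        logX.append([Fraction(-(t + v), M) for v, w in items])
        logY.append([Fraction(-(t + w), M) for v, w in items])
    return logX, logY, Fraction(V + total, M), Fraction(W + total, M)

def matches(log_cols, choice, log_z):
    """Check whether log P(choice) >= -log z."""
    return sum(col[j] for col, j in zip(log_cols, choice)) >= -log_z

# Hypothetical instance with three classes of two (value, weight) items each.
classes = [[(3, 1), (1, 4)], [(2, 2), (5, 0)], [(0, 3), (4, 1)]]
V, W = 6, 6
logX, logY, log_zX, log_zY = reduce_to_weighted_consensus(classes, V, W)

agreement = all(
    (sum(classes[i][j][0] for i, j in enumerate(choice)) <= V
     and sum(classes[i][j][1] for i, j in enumerate(choice)) <= W)
    == (matches(logX, choice, log_zX) and matches(logY, choice, log_zY))
    for choice in product(range(2), repeat=len(classes))
)
```

On this instance \(\log z_{X} = \log z_{Y} = 3.75\), which is below \(2 + \log {\prod }_{i}|C_{i}| = 5\) and thus consistent with Claim 3.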

For completeness, let us recall the folklore reductions that show that Subset Sum and 3-Sum are special cases of Multichoice Knapsack. To express an instance of Subset Sum with integers a1,…,an and threshold R as an instance of Multichoice Knapsack, we introduce n classes of two items each, which correspond to taking and omitting the respective elements. The first item has value ai and weight − ai, while for the other these are both 0. The thresholds are V = R and W = −R.

Similarly, given an instance of 3-Sum with classes a1,1,…,a1,λ, a2,1,…,a2,λ, and a3,1,…,a3,λ, we can create an instance of Multichoice Knapsack with the same three classes of items with values ai,j and weights − ai,j. The thresholds are V = W = 0.
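The Subset Sum reduction can be illustrated with a brute-force check. The sketch assumes that a Multichoice Knapsack instance is a list of classes of (value, weight) pairs and that a choice is feasible when its total value is at most V and its total weight is at most W; the exhaustive solver is used only for verification and the names are hypothetical.

```python
from itertools import product

def subset_sum_to_mck(a, R):
    """One class per integer: take it (value a_i, weight -a_i) or omit it (0, 0)."""
    classes = [[(x, -x), (0, 0)] for x in a]
    return classes, R, -R

def mck_brute_force(classes, V, W):
    """Return a feasible choice (one item per class) or None."""
    for choice in product(*classes):
        if sum(v for v, _ in choice) <= V and sum(w for _, w in choice) <= W:
            return choice
    return None

classes, V, W = subset_sum_to_mck([3, 5, 8, 13], 16)
solution = mck_brute_force(classes, V, W)
```

The two thresholds force the sum of the chosen integers to be at most R and at least R simultaneously, that is, exactly R.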

Theorem 1.6

Weighted Consensus is NP-hard and cannot be solved in:
  1. \(O(z^{\varepsilon })\) time for every ε > 0, unless the exponential time hypothesis (ETH) fails;

  2. \(O(z^{0.5-\varepsilon })\) time for some ε > 0, unless there is an \(O(2^{(0.5-\varepsilon )n})\)-time algorithm for the Subset Sum problem;

  3. \(\tilde {O}(R+z^{0.5}\lambda ^{0.5-\varepsilon })\) time for some ε > 0 and for n = O(1), unless the 3-Sum conjecture fails.

Proof

We use Lemma 7.1 to derive algorithms for the Multichoice Knapsack problem based on hypothetical solutions for Weighted Consensus. Subset Sum is a special case of Multichoice Knapsack with λ = 2, i.e., \({\prod }_{i}|C_{i}|= 2^{n}\). Hence, an \(O(z^{o(1)})\)-time solution for Weighted Consensus would yield an \(O(2^{o(n)})\)-time algorithm for Subset Sum, which contradicts ETH by the results of Etscheid et al. [9] and Gurari [11]. Similarly, an \(O(z^{0.5-\varepsilon })\)-time solution for Weighted Consensus would yield an \(O(2^{(0.5-\varepsilon )n})\)-time algorithm for Subset Sum. Moreover, 3-Sum is a special case of Multichoice Knapsack with n = 3 and \({\prod }_{i}|C_{i}|=\lambda ^{3}\). Hence, an \(\tilde {O}(R+z^{0.5}\lambda ^{0.5-\varepsilon })\)-time solution for Weighted Consensus with n = O(1) yields an \(\tilde {O}(\lambda + \lambda ^{1.5 + 0.5-\varepsilon })=\tilde {O}(\lambda ^{2-\varepsilon })\)-time algorithm for 3-Sum. □

Nevertheless, it might still be possible to improve the dependence on n in the GWPM problem. For example, one may hope to achieve \(\tilde {O}(nz^{0.5-\varepsilon }+z^{0.5})\) time for λ = O(1).

8 Multivariate Analysis of Multichoice Knapsack and GWPM

In Section 5, we gave an \(O(N+a^{0.5}\lambda ^{0.5}\log A)\)-time algorithm for the Multichoice Knapsack problem. Improvement of either exponent to 0.5 − ε would result in a breakthrough for the Subset Sum and 3-Sum problems, respectively. Nevertheless, this does not refute the existence of faster algorithms for some particular values (a,λ) other than those emerging from instances of Subset Sum or 3-Sum. Indeed, in this section we show an algorithm that is superior if \(\frac {\log a}{\log \lambda }\) is a constant other than an odd integer. We also argue that it is optimal (up to lower order terms) for every constant \(\frac {\log a}{\log \lambda }\) unless the k-Sum conjecture fails.

We analyse the running times of algorithms for the Multichoice Knapsack problem expressed as \(O(n^{O(1)}T(a,\lambda ))\) for some function T monotone with respect to both arguments. The algorithm of Theorem 1.3 proves that achieving \(T(a,\lambda )=\sqrt {a\lambda }\) is possible. On the other hand, if we assume that Subset Sum does not admit an \(O(2^{(0.5-\varepsilon )n})\)-time solution, then we immediately get that we cannot have \(T(a,2) = O(a^{0.5-\varepsilon })\) for any ε > 0. Similarly, the 3-Sum conjecture implies that \(T(\lambda ^{3},\lambda ) = O(\lambda ^{2-\varepsilon })\) is impossible. While this already refutes the possibility of having \(T(a,\lambda ) = O(a^{0.5}\lambda ^{0.5-\varepsilon })\) across all arguments (a,λ), such a bound may still hold for some special cases covering an infinite number of arguments. For example, we may potentially achieve \(T(a,\lambda ) = O((a\lambda )^{0.5-\varepsilon }) = O(\lambda ^{1.5-\varepsilon })\) for \(a = \lambda ^{2}\).

Before we prove that this is indeed possible, let us see the consequences of the conjectured hardness of 3-Sum and, in general, (2k − 1)-Sum. For a positive integer k, the (2k − 1)-Sum conjecture refutes \(T(\lambda ^{2k-1},\lambda ) = O(\lambda ^{k-\varepsilon })\). By monotonicity of T with respect to the first argument, we conclude that \(T(\lambda ^{c},\lambda ) = O(\lambda ^{k-\varepsilon })\) is impossible for c ≥ 2k − 1. On the other hand, monotonicity with respect to the second argument shows that \(T(\lambda ^{c},\lambda )=O(\lambda ^{c\frac {k}{2k-1}-\varepsilon })\) is impossible for c ≤ 2k − 1. The lower bounds following from (2k − 1)-Sum and (2k + 1)-Sum turn out to meet at \(c = 2k-1+\frac {1}{k + 1}\); see Figure 1.
Fig. 1

Illustration of the upper bound (dotted) and lower bound (solid) on \(\log _{\lambda }T(\lambda ^{c},\lambda )\)

Consequently, we have some room between the lower and the upper bound of \(\sqrt {a \lambda }\). In the aforementioned case of \(a = \lambda ^{2}\), the upper bound is \(\lambda ^{\frac {3}{2}}\), compared to the lower bound of \(\lambda ^{\frac {4}{3}-\varepsilon }\). Below, we show that the upper bound can be improved to meet the lower bound. More precisely, we show an algorithm whose running time is \(O(N + (a^{\frac {k + 1}{2k + 1}}+\lambda ^{k})\log \lambda \cdot n^{k})\) for every positive integer k. Note that \(a^{\frac {k + 1}{2k + 1}}+\lambda ^{k} = \lambda ^{c\frac {k + 1}{2k + 1}}+ \lambda ^{k}\), so for 2k − 1 ≤ c ≤ 2k + 1 the running time indeed matches the lower bounds up to the \(n^{k}\) term.

Due to Lemma 5.11, the extra \(n^{k}\) term reduces to \(O((\frac {\log A}{\log \lambda })^{k})\). Finally, we study the complexity of the GWPM problem.
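The exponents discussed above can be compared numerically. The sketch below evaluates, for \(a = \lambda ^{c}\), the best conditional lower bound on \(\log _{\lambda }T(\lambda ^{c},\lambda )\) obtained from the (2k − 1)-Sum conjectures and the best exponent of the form \(\max (c\frac {k + 1}{2k + 1},k)\) over positive integers k, ignoring ε and lower-order factors. The function names and the cut-off `max_k` are assumptions made for illustration.

```python
from fractions import Fraction

def lower_bound_exponent(c, max_k=50):
    """Largest exponent e such that lambda^(e - eps) is ruled out for a = lambda^c."""
    best = Fraction(0)
    for k in range(1, max_k + 1):
        if c >= 2 * k - 1:
            best = max(best, Fraction(k))                      # first-argument monotonicity
        else:
            best = max(best, Fraction(c) * k / (2 * k - 1))    # second-argument monotonicity
    return best

def upper_bound_exponent(c, max_k=50):
    """Smallest value of max(c(k+1)/(2k+1), k) over k = 1, ..., max_k."""
    return min(max(Fraction(c) * (k + 1) / (2 * k + 1), Fraction(k))
               for k in range(1, max_k + 1))

c = Fraction(2)                        # the case a = lambda^2
sqrt_bound = (c + 1) / 2               # exponent of sqrt(a * lambda)
lower = lower_bound_exponent(c)
upper = upper_bound_exponent(c)
```

For c = 2 both functions return 4/3, compared with 3/2 for the bound \(\sqrt {a\lambda }\).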

8.1 Algorithm for Multichoice Knapsack

Let us start by discussing the bottleneck of the algorithm of Theorem 1.3 for large λ. The problem is that the size of the classes does not let us partition every choice S into a prefix L and a suffix R with ranks both \(O(\sqrt {A_{V}})\). Lemma 5.3 leaves us with an extra letter c between L and R, and in the algorithm we append it to the prefix (while generating \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j}\)).

We provide a workaround based on reordering of classes. Our goal is to make sure that items with large rank appear only in a few leftmost classes. For this, we guess the classes of the k items with largest rank (in a feasible solution) and move them to the front. Since this depends on the sought feasible solution, we shall actually verify all \(\binom {n}{k}\) possibilities.

Now, our solution considers two cases: For j > k, the reordering lets us assume \(\text{rank} _{v}(c)< \ell ^{\frac {1}{k}}\), so we do not need to consider all items from Cj. For j ≤ k, on the other hand, we exploit the fact that \(|\textbf {L}_{j-1}^{(\ell )}\odot C_{j}|\le \lambda ^{j}\), which is at most \(\lambda ^{k}\).

The underlying combinatorial foundation is formalised as a variant of Lemma 5.3:

Lemma 8.1

Let ℓ and r be positive integers such that v(Lj[ℓ]) + v(Rj+1[r]) > V for every 0 ≤ j ≤ n. Let k ∈{1,…,n} and suppose that S is a choice with v(S) ≤ V such that rankv(S ∩ Ci) ≥ rankv(S ∩ Cj) for i ≤ k < j. There is an index j ∈{1,…,n} and a decomposition S = L ∪{c}∪ R such that \(L\in \textbf {L}_{j-1}^{(\ell )}\), \(R\in \textbf {R}_{j + 1}^{(r)}\), c ∈ Cj, and either \(\text{rank} _{v}(c) < \ell ^{\frac {1}{k}}\) or j ≤ k.

Proof

We claim that the decomposition constructed in the proof of Lemma 5.3 satisfies the extra condition on rankv(c) if j > k. Let S = {c1,…,cn} and Si = {c1,…,ci}. Obviously rankv(ci) ≥ 1 for k < i < j and, by the extra assumption, rankv(ci) ≥ rankv(c) for 1 ≤ i ≤ k. Hence, Fact 5.2 yields \(\text{rank} _{v}(S_{j-1}) \ge \text{rank} _{v}(c)^{k}\). Simultaneously, we have v(Sj−1) < v(Lj−1[ℓ]), so rankv(Sj−1) < ℓ. Combining these inequalities, we immediately get the claimed bound. □

Theorem 1.7

For every positive integer k = O(1), the Multichoice Knapsack problem can be solved in \(O(N+ {(a^{\frac {k + 1}{2k + 1}}+\lambda ^{k})}\log A (\frac {\log A}{\log \lambda })^{k})\) time.

Proof

As in the proof of Theorem 1.3, we actually provide an algorithm whose running time depends on AV rather than a. Moreover, Lemmas 5.8 and 5.11 let us assume that \(n=O(\frac {\log A}{\log \lambda })\).

We first guess the k positions where items with largest ranks rankv are present in the solution S and move these positions to the front. This gives \(\binom {n}{k}=O((\frac {\log A}{\log \lambda })^{k})\) possible selections. For each of them, we proceed as follows.

We increment an integer r starting from 1, maintaining \(\ell =\lceil r^{\frac {k}{k + 1}}\rceil \) and all the lists \(\textbf {L}_{j}^{(\ell )}\) and \(\textbf {R}_{j + 1}^{(r)}\) for 0 ≤ jn, as long as v(Lj[]) + v(Rj+ 1[r]) ≤ V for some j. By Fact 5.4, we stop with \(r=O(A_{V}^{\frac {k + 1}{2k + 1}})\) and thus the total time of this phase is \(O(A_{V}^{\frac {k + 1}{2k + 1}}\log A)\) due to the online procedure of Lemma 5.5.

By Lemma 8.1, every feasible solution S for some j admits a decomposition S = L ∪{c}∪ R, where \(L\in \textbf {L}_{j-1}^{(\ell )}\), \(R\in \textbf {R}_{j + 1}^{(r)}\), cCj, and either \(\text{rank} _{v}(c) < \ell ^{\frac {1}{k}}\) or jk; we consider all possibilities for j. For each of them, we shall reduce searching for S to an instance of the Multichoice Knapsack problem with \(N^{\prime }=O(A_{V}^{\frac {k + 1}{2k + 1}}+\lambda ^{k})\) and \(n^{\prime }= 2\). By Lemma 5.6, these instances can be solved in \(O((A_{V}^{\frac {k + 1}{2k + 1}}+\lambda ^{k})\frac {\log A}{\log \lambda })\) time in total.

For j ≤ k, the items of the j-th instance are going to belong to classes \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j}\) and \(\textbf {R}_{j + 1}^{(r)}\). The set \(\textbf {L}_{j-1}^{(\ell )}\odot C_{j}\) can be sorted by merging \(|C_{j}|\) sorted lists of size at most \(\lambda ^{j-1}\) each, i.e., in \(O(\lambda ^{k} \log \lambda )\) time. On the other hand, for j > k, we take \(\{L\cup \{c\} : L\in \textbf {L}_{j-1}^{(\ell )} , c\in C_{j}, \text{rank} _{v}(c)\le \ell ^{\frac {1}{k}}\}\) and \(\textbf {R}_{j + 1}^{(r)}\). The former set can be constructed by merging at most \(\min (\ell ^{\frac {1}{k}},\lambda )=\min (O(r^{\frac {1}{k + 1}}),\lambda )\) sorted lists of size \(\ell =O(r^{\frac {k}{k + 1}})\) each, i.e., in \(O(r\log \lambda )=O(A_{V}^{\frac {k + 1}{2k + 1}}\log \lambda )\) time.
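The last bound can be checked in one line, assuming the standard fact that m sorted lists of total size T can be merged in \(O(T\log m)\) time (for instance with a binary heap). Since \(\ell =\lceil r^{\frac {k}{k + 1}}\rceil \le 2r^{\frac {k}{k + 1}}\) for r ≥ 1, the total size of the merged lists is
$$\ell^{\frac{1}{k}}\cdot \ell = \ell^{\frac{k+1}{k}} \le \left(2r^{\frac{k}{k+1}}\right)^{\frac{k+1}{k}} = 2^{\frac{k+1}{k}}\, r \le 4r = O(r),$$
and the number of lists is at most λ, which accounts for the \(\log \lambda \) factor.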

Summing up over all indices j, this gives \(O((A_{V}^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log A)\) time for a single selection of the k positions with largest ranks, and \(O((A_{V}^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log A (\frac {\log A}{\log \lambda })^{k})\) in total.

Clearly, each solution of the constructed instances represents a solution of the initial instance, and by Lemma 8.1, every feasible solution of the initial instance has its counterpart in one of the constructed instances.

Before we conclude the proof, we need to note that the optimal k does not need to be known in advance. To deal with this issue, we try consecutive integers k and stop the procedure if Fact 5.4 yields that \(A_{V} > \lambda ^{2k+1}\), i.e., if r is incremented beyond \(\lambda ^{k+1}\). If the same happens for the other instance of the algorithm (operating on \(\text{rank}_{w}\) instead of \(\text{rank}_{v}\)), we conclude that \(a > \lambda ^{2k+1}\), and thus a larger k should be used. The running time until this point is \(O(\lambda ^{k + 1}\log \lambda (\frac {\log A}{\log \lambda })^{k})\) due to Lemma 5.5. On the other hand, if \(r \le \lambda ^{k+1}\), the algorithm behaves as if \(a \le \lambda ^{2k+1}\), i.e., runs in \(O(\lambda ^{k + 1}\log \lambda (\frac {\log A}{\log \lambda })^{k})\) time. This workaround (considering all smaller values k) adds extra \(O(\lambda ^{k}\log \lambda (\frac {\log A}{\log \lambda })^{k-1})\) to the time complexity for the optimal value k, which is less than the upper bound on the running time we have for this value k. □

8.2 Algorithm for General Weighted Pattern Matching

If we are to bound the complexity in terms of A only, the running time becomes
$${O(N+ {(A^{\frac{k + 1}{2k + 1}}+\lambda^{k})}\log A (\tfrac{\log A}{\log \lambda})^{k})}.$$
Assumptions that \(A \le \lambda ^{2k+1}\) and k = O(1) let us get rid of the \((\frac {\log A}{\log \lambda })^{k}\) term, which can be bounded by \((2k+1)^{k} = O(1)\).
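Spelled out, the bound is a direct consequence of taking logarithms (the last step uses k = O(1)):
$$A \le \lambda^{2k+1} \;\Longrightarrow\; \log A \le (2k+1)\log \lambda \;\Longrightarrow\; \left(\tfrac{\log A}{\log \lambda}\right)^{k} \le (2k+1)^{k} = O(1).$$
The middle inequality also gives \(\log A = O(\log \lambda )\), so the remaining \(\log A\) factor can be replaced by \(\log \lambda \).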

Corollary 8.2

Let k = O(1) be a positive integer such that \(A \le \lambda ^{2k+1}\). The Multichoice Knapsack problem can be solved in \(O(N+ {(A^{\frac {k + 1}{2k + 1}}+\lambda ^{k})}\log \lambda )\) time.

This leads to the following result for General Weighted Pattern Matching:

Theorem 1.8

If \(\lambda ^{2k-1} \le z \le \lambda ^{2k+1}\) for some positive integer k = O(1), then the Weighted Consensus problem can be solved in \(O(R+(z^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log \lambda )\) time, and the GWPM problem can be solved in \(O(n(z^{\frac {k + 1}{2k + 1}} + \lambda ^{k})\log \lambda )\) time.
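As a small illustration of how the parameter k in the theorem relates to z and λ, the sketch below finds the smallest k whose window \([\lambda ^{2k-1}, \lambda ^{2k+1}]\) contains z and reports the resulting exponent of z. The helper names `pick_k` and `z_exponent` are hypothetical, integer inputs are assumed, and the sketch does not check the requirement k = O(1).

```python
from fractions import Fraction
from typing import Optional


def pick_k(z: int, lam: int) -> Optional[int]:
    """Return the smallest positive integer k with
    lam**(2k-1) <= z <= lam**(2k+1), or None if no such k exists
    (that is, if z < lam or lam < 2).

    Hypothetical helper for illustration only.
    """
    if lam < 2 or z < lam:
        return None
    k = 1
    # Leaving the loop at k > 1 means lam**(2k-1) < z, so the lower
    # bound of the window holds; for k == 1 it holds because z >= lam.
    while lam ** (2 * k + 1) < z:
        k += 1
    return k


def z_exponent(k: int) -> Fraction:
    """Exponent (k+1)/(2k+1) of z in the running time for a given k."""
    return Fraction(k + 1, 2 * k + 1)
```

For example, with λ = 4 and z = 1000 the window for k = 2 applies, since 64 ≤ 1000 ≤ 1024, and the exponent of z is 3/5.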

As we noted at the beginning of this section, Lemma 7.1 implies that any improvement of the dependence of the running time on z or λ by \(z^{\varepsilon }\) (equivalently, by \(\lambda ^{\varepsilon }\)) would contradict the k-Sum conjecture.

Footnotes

  1. The \(O^{*}\) notation suppresses factors polynomial with respect to the instance size, whereas the \(\tilde {O}\) notation ignores factors polylogarithmic with respect to the instance size (encoded in binary).

Notes

Acknowledgments

This work was supported by the “Algorithms for text processing with errors and uncertainties” project carried out within the HOMING programme of the Foundation for Polish Science co-financed by the European Union under the European Regional Development Fund.

Copyright information

© The Author(s) 2018

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors and Affiliations

  1. Institute of Informatics, University of Warsaw, Warsaw, Poland
  2. Department of Informatics, King’s College London, London, UK
