# The Complexity of Computations

• Pavel Pudlák
Chapter
Part of the Springer Monographs in Mathematics book series (SMM)

## Abstract

This chapter provides a survey of the main concepts and results of computational complexity theory. We start with basic concepts such as time and space complexities, complexity classes P and NP, and Boolean circuits. We discuss how the use of randomness and interaction enables one to compute more efficiently. In this context we also mention some basic concepts from theoretical cryptography. We then discuss two related ways to make computations more efficient: the use of processors working in parallel and the use of quantum circuits. Among other things, we explain Shor’s quantum algorithm for factoring integers. The topic of the last section is important for the foundations of mathematics, although it is less related to computational complexity; it is about algorithmic complexity of finite strings of bits.

## Keywords

Polynomial Time; Boolean Function; Turing Machine; Unitary Transformation; Polynomial Time Algorithm

Complexity is a notion about which we do not learn in schools, but which is very familiar to us. Our generation has witnessed a tremendous increase of complexity in various parts of our life. It is not only the complexity of industrial products that we use. The world economy is a much more complex system now than it used to be; the same is true about transportation, laws and so on. Computers help us to cope with it, but they also enhance the process of making our lives more complex. The progress in science reveals more and more about the complexity of nature. This concerns not only biology and physics, but also mathematics. In spite of the great role that it plays in our lives, complexity has become an object of mathematical research only recently. More precisely, the word complexity had not been used until about the 1960s, but many parameters introduced long before can be thought of as some sort of complexity measures. Already the words used for these parameters suggest that they are used to classify concepts according to their complexity: degree, rank, dimension, etc. The most important instantiation of the notion of complexity is in computability theory, which is the subject of this chapter.

Originally the motivation for studying computational complexity was to understand which algorithms can be used in practice. It had been known that some problems, although algorithmically solvable, require so large a number of steps that they never can be used. It was, therefore, necessary to develop a theory for classifying problems according to their feasibility. When theoretical studies began, it turned out that there are fundamental problems concerning computational complexity. Moreover, some of these problems appeared to be very difficult. We now appreciate their difficulty because only a few of them have been solved after many years.

These problems concern the relationship of the basic resources used by algorithms: time, space, nondeterminism and randomness. Our inability to make any substantial progress in solving them suggests that there may be fundamental obstacles that prevent us from solving them. It is conceivable that these problems not only need new methods, but may need new axioms. This seems to be a rather bold conjecture, but recall the history of Diophantine equations. The problem appeared to be just a difficult number-theoretical problem and Hilbert even assumed that it was algorithmically solvable. Now we know that this is not the case: there is no theory that would suffice to prove the unsolvability of every unsolvable equation. History may repeat itself in computational complexity and we may need mathematical logic to solve the fundamental problems of computational complexity theory.

In the next chapter, we will see slightly more explicit connections of computational complexity with logic and the foundations of mathematics, mediated by proof complexity.

## 5.1 What Is Complexity?

From our daily experience we know that there are easy tasks and there are difficult ones. Everybody knows that it is more difficult to multiply two numbers than to add them. Those who use computers more extensively also know that they are able to solve certain problems fast, while some other problems require a long time. But we also know that some people are faster than others, that we can solve a task more easily if we know more about it and that some programs are slow for a given problem, but sometimes a sophisticated program can solve the same problem very efficiently. Thus it is not clear whether there is a particular property of problems that prevents us (and computers) from solving some problems quickly, or if it is just the question of knowing how to solve a particular problem fast.

Therefore the first thing to learn is that, indeed, there is a quantity associated with every problem, which we call the complexity of the problem, that determines how efficiently the problem can be solved. This quantity is represented by a natural number. When studying computational complexity, we always consider only algorithmically solvable problems, problems solvable using a finite amount of computational resources. Since algorithms make discrete steps, also the resources can be measured in discrete units. The amount of computational resources needed to solve a particular instance of a problem is this number. In fact there is not only one, but several such quantities corresponding to the type of resources that we study. Furthermore, each one depends on the particular model of computation that we use.

Let us start with the most important type of complexity, which is called time. If we use the classical model of computation, Turing machines, then the time complexity of a problem is the minimal number of steps that a Turing machine needs to solve the problem. However, the time complexity of computations cannot be defined for a single input. Recall that when we considered the concept of decidability, it was important to have an infinite set of instances of a given problem. Typically, we asked if a property of natural numbers was decidable. For a finite set, there always exists an algorithm: a lookup table. The same is true about complexity; it only makes sense if we have an infinite, or at least very large, set of inputs.

Suppose, for example, that the problem is to decide if a given number N has an even number of prime divisors. The problem is, clearly, decidable: we can enumerate all primes less than N and try to divide N by each of them. This is certainly not the best way to solve this problem, but it seems that the problem is difficult if we do not know anything special about N. But suppose that we somehow determine that the number of prime divisors of N is, say, even. Then we cannot say anymore that, for this particular N, it is difficult to decide this property, because we know the answer. Therefore talking about the complexity of such a problem only makes sense if we consider a large set of numbers. To be more specific, consider a Turing machine M that correctly decides the above property for every number. Since M has to work with arbitrarily large numbers, the number of steps that M needs to answer will vary with the input; in general, it will increase. If M is a formalization of an efficient algorithm, then it will not increase very fast. If, however, the problem is hard, it will increase fast for every Turing machine.

The idea of defining the time complexity of a problem itself, not just with respect to a particular machine, is to take the least number of steps that a machine needs for solving the problem. Ideally, we would like to prove that there exists a machine which is the fastest one and define its time complexity to be the time complexity of the problem. This is surely not possible, since we know, for example, that for every fixed number there exists a procedure that solves the problem almost immediately. (If we use Turing machines it will take some nontrivial time for the machine to compare the input number written in binary with the one it has in its lookup table, but it will be fairly short.) Therefore we content ourselves with asymptotic estimates on the possible time complexities of Turing machines solving the problem.

For every nontrivial problem, we naturally expect that the time will increase with the size of the input data. The time that an algorithm needs varies not only with the size of the input: even when the input size is fixed, the algorithm may need a different number of steps for different data. In order to be able to decide if we can use an algorithm for data of a given length, we need an upper bound on the time needed for all such data. Thus we define the time complexity of an algorithm (or of a Turing machine, etc.) to be the function t(n) such that t(n) is the maximal time the algorithm needs on inputs of size n. The size of an input is usually the length of a string which encodes the data using a finite alphabet; we call it the input length. This approach is called worst case complexity, since we classify algorithms by how they behave in the worst case on data of a given input size. (In practice one may prefer to use average case complexity, but I will not deal with this concept here.)
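The definition of t(n) can be made concrete with a small experiment. The following is only a sketch under simplifying assumptions: instead of Turing machine steps it counts the divisions performed by a naive divisor search, and the function names `steps_trial_division` and `worst_case` are hypothetical helpers introduced for this illustration.

```python
def steps_trial_division(n: int) -> int:
    """Count the divisions a naive divisor search performs before it stops on n."""
    steps = 0
    d = 2
    while d * d <= n:
        steps += 1
        if n % d == 0:
            break  # a divisor was found, the search stops early
        d += 1
    return steps


def worst_case(bits: int) -> int:
    """t(bits): the maximal step count over all inputs with exactly `bits` binary digits."""
    lo, hi = 1 << (bits - 1), 1 << bits
    return max(steps_trial_division(n) for n in range(lo, hi))


if __name__ == "__main__":
    for bits in range(3, 13):
        print(bits, worst_case(bits))
```

Inputs of the same length behave very differently here: every even number is settled after a single division, while a prime of the same length forces the search to run to the end. The function t(n) records only the worst of these cases.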

Hence the computational complexity of a problem is not measured by a simple object such as a number, but rather by a function that depends on the input length. In most cases it seems very difficult even to estimate these functions; in fact, the task of determining the complexity of particular problems is so difficult that often we are happy to get any estimates of the time bounds. Therefore we usually content ourselves with asymptotic bounds.

### The Three Types of Numbers

Rather than talking about asymptotic bounds, it seems better to start with a more pedestrian point of view. The three types of numbers mentioned in the title are small, medium and large natural numbers. Such a classification makes sense only if we specify what we want to do with numbers. Let us assume that we want to do elementary computations, more specifically, we want to use the basic arithmetical operations: addition, subtraction, multiplication and division. Then we can ask, how large numbers can be added, multiplied etc. This, of course, depends on whether we want to do the computation with a pencil and paper, or with a computer (and what kind of computer), and how much time we are willing to spend. If needed, we can do such computations without a computer with numbers that have dozens of digits. If we use a computer, we can easily handle numbers that have millions of digits, and with some effort we can, perhaps, go as far as $$10^{20}$$ digits. So these should be the medium size numbers.

On the other hand, we estimate that $$10^{200}$$ digits cannot be physically represented in the visible part of the universe. Hence numbers with so many digits are large. When a number is large, it does not mean that we cannot represent it at all and that we cannot compute with it. For example, $$10^{10^{100}}$$ is such a number and we see immediately that it is divisible by 5, and with a little bit of math we can show that it is not divisible by 7. But notice that we need to apply mathematics to prove these assertions; we cannot simply do the divisions and see what the remainder is.
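The two divisibility claims can be checked mechanically without ever writing the number down. The sketch below relies on Python's built-in three-argument `pow`, which performs modular exponentiation and works only with the exponent and with remainders; this is one possible way to carry out "a little bit of math", not necessarily the argument the text has in mind.

```python
# The exponent 10^100 is a medium size number: it has only 101 decimal digits.
exponent = 10 ** 100

# pow(base, exponent, modulus) reduces modulo `modulus` after every step,
# so the large number 10^(10^100) itself is never constructed.
remainder_mod_5 = pow(10, exponent, 5)
remainder_mod_7 = pow(10, exponent, 7)

print("remainder modulo 5:", remainder_mod_5)
print("remainder modulo 7:", remainder_mod_7)
```

The same result follows by hand: 10 leaves remainder 3 modulo 7, the powers of 3 modulo 7 repeat with period 6, and $$10^{100}$$ leaves remainder 4 modulo 6, so the remainder is that of $$3^4=81$$, which is 4.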

Small numbers are specified by means of algorithms based on the brute-force search. These are algorithms that search for the solution in a very simplistic way: they just check all possible values that can be a solution of the problem. The oldest problem to which people have applied such algorithms is the integer factoring problem. This is the problem, for a given natural number N, to find a proper divisor of N, called a factor. A factor is a number which divides N, and it is not a trivial divisor, which means, it is different from 1 and N. The simplest factoring algorithm is to take numbers from 2 to N−1 one by one and try to divide N by them. We can save a lot of work if we realize that we only need to test numbers less than or equal to $$\sqrt{N}$$. This is because if N has a nontrivial divisor, then it can be factored as ab with a, b different from 1 and N and then either a is less than or equal to $$\sqrt{N}$$ or b is less than or equal to $$\sqrt{N}$$. Though it is an improvement, such an algorithm is still not applicable to typical medium size numbers.
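The improved brute-force search just described can be sketched in a few lines. The function name `find_factor` is a hypothetical choice for this illustration; the only library call used is `math.isqrt`, the integer square root.

```python
from math import isqrt
from typing import Optional


def find_factor(n: int) -> Optional[int]:
    """Return the smallest proper divisor of n, or None if n has none.

    Only candidates up to the integer square root of n are tested,
    since any factorization n = a * b has a or b in that range.
    """
    for d in range(2, isqrt(n) + 1):
        if n % d == 0:
            return d
    return None


if __name__ == "__main__":
    print(find_factor(91))  # a composite number
    print(find_factor(97))  # a prime number
```

The loop runs up to about $$\sqrt{N}$$ times, which is still exponential in the number of digits of N; this is why the method fails for medium size numbers.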

Suppose, for instance, that we want to factor a medium size composite number
$$114381625757888867669235779976146612010218296721242362562561842935706935245733897830597123563958705058989075147599290026879543541$$
with 129 decimal digits. If we systematically tried all numbers starting with 2, then we would use about $$10^{64}$$ divisions until we found the first factor that has 64 digits:
$$3490529510847650949147849619903898133417764638493387843990820577$$
Each division can be computed fairly easily; with enough patience, we can even do it using only paper and pencil. But the number of divisions that we have to do is so huge that we cannot do them all even with a powerful computer.
(The second factor has 65 digits:
$$32769132993266709549961988190834461413177642967992942539798288533$$
These factors are primes, hence they are the only proper divisors.)
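While finding the factors by exhaustive search is hopeless, checking them is a single multiplication. The sketch below copies the three numbers from the text; the constant names `N`, `P`, `Q` and the helper `is_factor` are introduced only for this illustration.

```python
# The 129-digit number from the text, written as two adjacent string pieces.
N = int(
    "11438162575788886766923577997614661201021829672124236256256184293"
    "5706935245733897830597123563958705058989075147599290026879543541"
)
# The 64-digit and 65-digit factors quoted in the text.
P = 3490529510847650949147849619903898133417764638493387843990820577
Q = 32769132993266709549961988190834461413177642967992942539798288533


def is_factor(m: int, n: int) -> bool:
    """Check that m is a proper divisor of n; this takes one division."""
    return 1 < m < n and n % m == 0


if __name__ == "__main__":
    print("digits of N:", len(str(N)))
    print("P * Q == N:", P * Q == N)
    print("P is a factor of N:", is_factor(P, N))
```

The contrast between the cost of this check and the cost of the search is the theme of the P versus NP problem discussed later in this section.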

Small, medium size and large are not mathematical concepts. They are rather vague concepts concerning our present or future ability to perform certain type of algorithms. Nevertheless, there are “mathematical” relations between them. Observe that if n is a small number, then a number with n digits is medium size. In other words, if n is small, then $$10^n$$ is medium size. Also, if N is medium size, but not small, then $$10^N$$ is large. Notice that these relations are based on the exponential function, which plays an important role in complexity.

Often we also compute with data that are not numbers. In such a case we encode them by strings in a finite alphabet, usually strings of zeros and ones. This is not much different from numbers, as we can always imagine numbers as written in binary representation. To get a more general description of the three sizes, we can speak about elementary operations with strings.

### A Field Full of Open Problems

In spite of the huge amount of results produced in mathematics, what we know seems to be still just a small fraction of what we would like to know. That there are more unsolved problems than results can be said about every field of science. Typically, when a big problem is solved, it raises more new questions than it gives answers to. However, there are differences; in some fields we have a lot of fundamental results and we just need to get deeper knowledge, in others the fundamental questions are still open. Complexity theory is an example of the latter kind. It seems that what we can prove now are only basic facts, while the truly interesting facts are still out of our reach. We can make conjectures about the fundamental relations, but we do not have means to prove them.

The problem that is the simplest to explain is:

Is multiplication more difficult than addition?

Everybody “knows” that it takes more time to multiply two large natural numbers than to add them. Therefore children start with addition and learn multiplication later. Circuit designers also “know” that circuits for multiplication are more complex than circuits for addition. But then why are we not able to prove it? The point is that most people only know the usual school algorithm for multiplication in which we have to, among other things, multiply every digit of the first number with every digit of the second one. This, of course, takes more time than only adding the digits on the same places. Thus this particular algorithm for multiplication needs more time than the usual algorithm for addition. However, the question is whether every algorithm for multiplication needs more time than the usual algorithm for addition. The most naive approach to this problem would be to show that the school algorithm is the fastest possible. However, that is not true: we do have algorithms that run much faster than the school algorithm when the numbers are large.

If the answer to this problem is as we expect (that the multiplication of integers is a more complex operation than addition), then it is a typical impossibility problem. It would mean that it is impossible to find an algorithm for multiplication that is as fast as the algorithm for addition. This is essentially the form of all big problems in complexity theory.

What is not quite clear is why we are not able to solve these problems. As we already know, impossibility problems are usually hard, which is one explanation. Another reason may be that since the field is young, the theory and the proof methods are not sufficiently developed. But there may be more fundamental reasons. We will see in Chap.  that the problems in complexity theory are connected with problems in foundations of mathematics. Thus it is conceivable that we may need new mathematical axioms to solve them.

Let us look at the most important of these problems.

### The P Versus NP Problem

The P versus NP problem is the main open problem in complexity theory. It is not by accident that it was the first of the deep problems asked in this field; it is because it is a really fundamental question, and at the same time it is of practical interest. In plain words it can be very roughly stated as the question:

Can we always replace the brute-force search by an essentially more efficient algorithm?

Suppose we are looking for a solution of a problem P and we have an efficient way to determine what is a solution and what isn’t. Then we can find a solution, if there is any, by searching the entire space of possible solutions. The question is then whether we are able to find a better algorithm which does not need so much time.

It is important to realize that here the brute-force algorithms are used only to define a certain class of problems. It does not mean that we are interested in such algorithms; on the contrary, we would like to avoid them. There are other ways to define the same class. One of them is based on the concept of guessing. If, for example, we are searching for a factor of a number, we may simply try to guess it. If the number is of medium size, the probability that we succeed is usually extremely small and we cannot use it in practice, but in theoretical research we can use this property as the definition of a certain type of problems. Thus we can restate the P vs. NP problem, again very roughly, as follows.

Suppose we know that we can find a solution by guessing. Can we then find a solution by a fast algorithm?

It should be stressed that the two descriptions above are only attempts to describe the problem succinctly and in plain words. I have also assumed that most readers have some experience with programming, that’s why I presented it in the form of replacing one type of algorithm by another, but people with different backgrounds may prefer different descriptions. A logician, for instance, would see the essence of the problem rather in the classical question of replacing existence by construction in the context of efficient computations. I will shortly give a more precise definition, which will eliminate possible ambiguities in the interpretation.

Intuitively the answer to the problem seems to be clear: there is no reason why such an algorithm should always be possible. This intuition is based on our everyday experience: if I absent-mindedly put some paper in a random place, then next time I need the paper I have to search all drawers to find it because the information about where it is has been lost. However, this argument is wrong. In the P vs. NP problem the crucial point is that we ask about mathematical properties. Hence the information about the solution is present; there is no uncertainty about where the solution is.

Consider the problem of factoring integers, which is one of the situations to which the P vs. NP problem refers. For given numbers N and M, we can easily determine if M is a factor of N, but we do not know how to find M without trying a lot of numbers. If I forget where I put the paper, then from my point of view it can be anywhere and I cannot use reasoning to determine its place. On the other hand, factors of a given number N are uniquely determined by N and I can use mathematics to find one. There are several ingenious algorithms for integer factoring based on non-trivial mathematical results which perform much better than the trivial brute force search (but they are still not fast enough).

From the point of view of foundations, the most important search problem is the proof search. It is the problem to find a proof for a given formula assuming that we know that there is a medium size proof.

Here is an example of a problem coming from practice. Suppose an agent needs to visit certain towns. There are airline connections between some pairs of towns but not between every pair. Can the agent travel so that he lands in every town exactly once and returns to the town where he started? Assuming that every flight costs the same, this would minimize the cost of his task. Formally, it is a problem about graphs. It is the question whether a given graph has a Hamiltonian cycle, where a Hamiltonian cycle is a cycle that passes through every vertex of the graph exactly once.

### Example

The graph of the cube is Hamiltonian, as apparent in the second drawing in Fig.  on page 6.

This problem is simply called Hamiltonian Cycle. Again the trivial algorithm for this problem is to try all possible ways to go along the edges of the graph and return to the same vertex. The number of these attempts to find a Hamiltonian cycle can be extremely large even for fairly small graphs. The question is: can we do it essentially better?
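A brute-force search for a Hamiltonian cycle can be sketched as follows. This is one simple variant, assumed for illustration: rather than following edges, it tries every ordering of the vertices and checks whether consecutive vertices are joined by an edge. The helper names `cube_graph` and `find_hamiltonian_cycle` are hypothetical; the cube is encoded by labelling its vertices 0 to 7 and joining labels that differ in exactly one binary digit.

```python
from itertools import permutations


def cube_graph():
    """Adjacency sets of the cube: vertices 0..7, edges join labels differing in one bit."""
    return {v: {v ^ (1 << b) for b in range(3)} for v in range(8)}


def find_hamiltonian_cycle(adj):
    """Try all orderings of the vertices; return one Hamiltonian cycle or None."""
    vertices = sorted(adj)
    if len(vertices) < 3:
        return None
    start, rest = vertices[0], vertices[1:]
    # Fixing the first vertex avoids testing rotations of the same cycle.
    for order in permutations(rest):
        cycle = (start, *order)
        consecutive_ok = all(
            cycle[i + 1] in adj[cycle[i]] for i in range(len(cycle) - 1)
        )
        if consecutive_ok and start in adj[cycle[-1]]:
            return list(cycle)
    return None


if __name__ == "__main__":
    print(find_hamiltonian_cycle(cube_graph()))
```

For a graph with n vertices the loop may run through (n−1)! orderings. For the cube this is 5040, but already for a few dozen vertices the search becomes infeasible.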

A lot of problems of this type can be presented as a problem of finding a solution of one or a system of equations in some limited range. Integer factoring is the problem of finding a solution to the following simple equation
$$x\cdot y=N$$
for a given natural number N with the constraint $$x,y>1$$. Solving polynomial equations in finite fields is another important class of such problems. Yet another very important problem is called Integer Linear Programming. It is the problem of solving systems of linear inequalities in the domain of integers. One can find such problems in every branch of mathematics, as soon as one starts looking for algorithms.

### Polynomial Time and Nondeterministic Polynomial Time

I will now state the P versus NP problem more precisely in order to show that it is a concrete mathematical problem.

First we have to replace the vague concepts ‘small’ and ‘large’ by a precise one. To this end we have to return to considering all infinitely many inputs and the dependence of the time on the length of inputs. Then we will specify a class of problems according to the asymptotic behavior of the functions that bound the time and say that these problems are easy. This will be the class P, Polynomial Time.

Let us recall the concept of a decision problem. We usually think of a decision problem as a condition that specifies certain numbers, strings, or other finite structures. But since in mathematics we use the set-theoretical approach that identifies the set with the problem, a decision problem is simply a set of some finite structures. If we assume the standard computational model, the Turing machine, the inputs will be finite strings of symbols from a finite alphabet. So a decision problem is formally a set of strings and when we talk about sets of numbers, or a finite structure of some type, we are assuming that they are encoded by strings.

The class of functions that we use to define P are polynomials. Thus P is the class of decision problems that can be computed using at most polynomially many steps. Here is a formal definition.

### Definition 9

P is the class of sets of strings such that A is in P if and only if there exists a Turing machine M and a polynomial p such that

1. M stops on every input, and for inputs of length n, it uses at most p(n) steps before it stops;
2. M accepts the set A (which means that it prints 0) if and only if the input belongs to A.

Computations that use only a polynomial number of steps play a central role in complexity theory. So we will use this concept also in the context of functions. We say that a function f is computable in polynomial time if there exists a Turing machine that computes the function f within such a polynomial bound.

In general we can consider all polynomial functions (for example, $$3x^2-2x+9$$) but what really matters is their asymptotic growth. The asymptotic growth of a polynomial is fully determined by the leading term (which is $$3x^2$$ in the example). Hence we could use a simpler class of functions instead of polynomials, say, the class of functions of the form $$ax^b$$, where a and b are positive constants.

We call sets such as P complexity classes because they define sets of decision problems of certain complexity. P is a mathematical approximation of decision problems that can be practically solved, the “easy” problems. In computability theory we say that a problem is decidable, or recursive, if there is a Turing machine that decides the problem in a finite number of steps. In complexity theory we have a better approximation: a problem is in P if it is decidable in a polynomial number of steps.

How good is this approximation? One can immediately give examples showing that it has little to do with practical solvability. The first example is when the constant at the leading term is large. Say, we have the polynomial function $$10^{1000}x$$. An algorithm requiring so much time is practically useless. If the exponent is large, say we have the polynomial function $$x^{1000}$$, it does not work either. We can solve some small instances with such an algorithm, but the large growth of this function prevents us from using it for just a little larger data.

Yet, in general, P works pretty well. For one thing, if we show that a decision problem is in P by finding a polynomial algorithm, then the constants in the polynomials are, as a rule, very small. For another, this definition has a very desirable property: if we combine several polynomial time algorithms, we obtain again a polynomial algorithm. Naturally, when combining a few simple things, we expect the result to be simple too. In fact, the class of functions bounded by polynomials is the smallest natural class that ensures this property.
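The closure property can be illustrated by a short calculation. The following is a sketch under simplifying assumptions that are not part of the text: the first algorithm uses at most $$n^a$$ steps on inputs of length n, the second uses at most $$m^b$$ steps on inputs of length m, with constants $$a,b\geq 1$$, and the second algorithm is run on the output of the first. Since an algorithm can write at most one symbol per step, that output has length at most $$n^a$$, and the total time is bounded by
$$n^a+\bigl(n^a\bigr)^b = n^a+n^{ab}\leq 2n^{ab}.$$
The bound is again a polynomial in n. No smaller natural class behaves this way: composing two algorithms with quadratic bounds already gives a bound of degree four, so a fixed maximal degree would not be preserved.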

Next we need to define the complexity class NP. In plain words, a set A is in NP if the membership in A can be characterized as follows. An element x is in A if there exists a witness of bounded size y that testifies that x is in A. To make this more precise, we require that

1. the size of y is at most polynomially larger than x, and
2. given an x and y, one can decide in polynomial time whether or not y is a witness.

Here is a formal definition.

### Definition 10

NP is the class of sets A that can be defined as follows. For some binary relation R ∈ P and a polynomial p,
$$x\in A\quad \mbox{if and only if\quad there exists a } y \mbox{ such that }|y|\leq p\bigl(|x|\bigr)\mbox{ and }(x,y)\in R. \tag{5.1}$$

Here |x|, |y| denote the lengths of the strings x and y. A binary relation R on strings is in P if its natural encoding R′ by a set of strings is in P. One natural encoding is obtained by taking an extra separating symbol # and defining R′={x#y;(x,y) ∈ R}.

Let A be in NP and let p be a polynomial that limits the size of witnesses. Given an input string x of length n, we can decide whether x is in A by trying all y such that |y|≤p(|x|). We can test each y in polynomial time; this is the condition that the relation R is in P. However, the number of potential witnesses that we need to test is huge: $$2^{p(n)}$$. In concrete examples, for every input string x, there is a natural set of potential witnesses $$B_x$$. Although the lengths of the strings in a typical $$B_x$$ are slightly smaller than the length of x, the size of $$B_x$$ is still too large and we cannot apply brute force search. If one wants to prove that a problem in NP is actually in P, then trying to reduce the search space is not a good strategy.
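The exhaustive procedure described in this paragraph can be written down generically. The sketch below makes several assumptions for the sake of illustration: strings are over the alphabet {0, 1}, the relation R is passed in as a Python function, and the names `decide_by_search` and `divides_properly` are hypothetical. The sample relation anticipates the set Composites, with numbers written in binary.

```python
from itertools import product
from typing import Callable


def decide_by_search(
    x: str,
    relation: Callable[[str, str], bool],
    bound: Callable[[int], int],
) -> bool:
    """Decide membership of x by testing every 0-1 string y with |y| <= bound(|x|)."""
    limit = bound(len(x))
    for length in range(limit + 1):
        for bits in product("01", repeat=length):
            if relation(x, "".join(bits)):
                return True  # a witness was found
    return False  # no string of admissible length is a witness


def divides_properly(x: str, y: str) -> bool:
    """Sample relation R: y encodes a proper divisor of the number encoded by x."""
    if not x or not y:
        return False
    n, m = int(x, 2), int(y, 2)
    return 1 < m < n and n % m == 0


if __name__ == "__main__":
    for number in (13, 15, 91, 97):
        x = format(number, "b")
        print(number, decide_by_search(x, divides_properly, lambda n: n))
```

Each call to `relation` is cheap, but the two nested loops run through all strings of length up to p(n), and their number grows like $$2^{p(n)}$$. The search is therefore usable only for very short inputs.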

Let us consider some examples.

### Examples

1. The first example is the problem to determine if a natural number is prime. This problem is mathematically represented by the set Primes, the set of all prime numbers. More precisely, Primes is the set of 0–1 strings that represent prime numbers in binary notation. It is easy to show that the complement of this set, the set Composites, is in NP. Given a number N, a witness of N being composite is a number M such that 1<M<N and M divides N. Since the division can be computed in polynomial time, the latter condition defines a binary relation in P. The bounding polynomial p(x) is simply x.

Note that the input length is $$n\approx\log_2 N$$. Hence the number of potential witnesses, the numbers between 1 and N, is exponentially large. Nevertheless, there is an ingenious algorithm that decides primality and runs in polynomial time. This is a result of M. Agrawal, N. Kayal, N. Saxena [2]. So this is an example of a problem that is in NP by definition, but in fact it is even in P.

2. The problem Hamiltonian Cycle is represented by the set of all graphs that have Hamiltonian cycles, the Hamiltonian graphs. In this example, for a graph G, a natural set of potential witnesses is the set of all cycles C on the set of vertices of G. The relation R is: C is a Hamiltonian cycle in G. For given G and C, it is very easy to check whether C is a Hamiltonian cycle in G; in particular, one can do it in polynomial time.
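The relation R of the second example is indeed easy to test. The following sketch assumes that a graph is given as a dictionary of adjacency sets and that a potential witness C is a list of vertices; the names `is_hamiltonian_cycle` and `CUBE` are hypothetical and serve only this illustration.

```python
def is_hamiltonian_cycle(adj, cycle) -> bool:
    """Check that `cycle` visits every vertex of the graph exactly once
    and that consecutive vertices, including last and first, are adjacent."""
    n = len(adj)
    if n < 3 or len(cycle) != n or set(cycle) != set(adj):
        return False
    return all(cycle[(i + 1) % n] in adj[cycle[i]] for i in range(n))


# The cube: vertices 0..7, edges join labels that differ in exactly one bit.
CUBE = {v: {v ^ (1 << b) for b in range(3)} for v in range(8)}

if __name__ == "__main__":
    print(is_hamiltonian_cycle(CUBE, [0, 1, 3, 2, 6, 7, 5, 4]))
    print(is_hamiltonian_cycle(CUBE, [0, 1, 2, 3, 4, 5, 6, 7]))
```

The check looks at each vertex of C once and does one adjacency lookup per step, so its running time is polynomial in the size of the graph. The difficulty of the problem lies entirely in finding C, not in verifying it.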

The P vs. NP problem is the question whether or not these two classes are the same, which can be written as the simple equation:
$$\mathbf {P}=\mathbf {NP}?$$
P is clearly a subclass of NP (use a dummy witness to prove it). Thus the open problem is whether NP is larger than P.

As stated, the P vs. NP problem is to determine, for every set in NP, whether it is in P. But this pair of complexity classes has a remarkable property that reduces this question to a single NP set. There is a set A in NP such that P=NP if and only if A is in P. In fact, hundreds of such sets have been found; one of them is Hamiltonian Cycle. Such results are proved using polynomial reductions between sets. A polynomial reduction of a set X to a set Y is a polynomial time algorithm that reduces the decision problem for X to the decision problem for Y. A set X is called NP-complete, if it is in NP and every set in NP can be reduced to it. NP-complete sets have the property mentioned above.

Proving NP-completeness is also important from a practical point of view. If we know that a set A is NP-complete, then it still may have a polynomial time algorithm, but since we know that the problem P vs. NP is open, we know that nobody knows a polynomial algorithm for the set A. This has become a standard way of showing that a problem is hard, though formally it is hard only if P ≠ NP. Thus in practice we use P ≠ NP as an axiom.

When talking about algorithms all the time, one may get the impression that such things are not relevant for classical parts of mathematics, which is not true. Imagine, for example, that you are a mathematician working in finite combinatorics and your favorite topic is Hamiltonian graphs. Then most likely the theorem that you would like to prove is a characterization of the graphs that are Hamiltonian. The theorem should say that a graph is Hamiltonian if and only if some condition (simpler than the one by which they are defined) is satisfied by the graph. To give an example of such a condition, consider Euler graphs. These are like Hamiltonian graphs, but instead of having a cycle that goes exactly once through every vertex, they contain a cycle that goes exactly once through every edge (and it may visit vertices repeatedly). You surely recall puzzles in which you should draw a given picture without lifting your pen and ending on the same point on which you started. The well-known characterization of these graphs is that in such a graph every vertex is incident with an even number of edges. (This condition is, clearly, necessary; the nontrivial part, though not very difficult, is to prove the converse.) There is no such theorem for Hamiltonian graphs. If P were not equal to NP, we would have an explanation for the absence of such a theorem. If, moreover, we specified precisely what type of characterization we wanted, we might be able to prove that there is no such characterization. Our experience confirms that there is a relation between computational complexity and characterizations: NP-complete problems typically do not have nice characterizations.
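The characterization of Euler graphs translates directly into a fast test, which is what makes it a "nice" characterization. The sketch below adds one assumption that the paragraph leaves implicit: all edges must lie in a single connected piece of the graph. The function name `has_euler_cycle` and the dictionary representation of graphs are hypothetical choices, and a graph without edges is treated here as having no such cycle.

```python
def has_euler_cycle(adj) -> bool:
    """Test for a closed walk using every edge exactly once.

    `adj` maps each vertex to the set of its neighbours (simple undirected graph).
    Criterion: every vertex has even degree and all edges are connected.
    """
    if any(len(neighbours) % 2 for neighbours in adj.values()):
        return False
    active = [v for v, neighbours in adj.items() if neighbours]
    if not active:
        return False  # convention chosen for this sketch: no edges, no cycle
    # Depth-first search from one vertex that has an edge.
    seen = {active[0]}
    stack = [active[0]]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(active)


if __name__ == "__main__":
    triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
    cube = {v: {v ^ (1 << b) for b in range(3)} for v in range(8)}
    print(has_euler_cycle(triangle), has_euler_cycle(cube))
```

The test inspects every vertex and every edge a bounded number of times, so it runs in polynomial time. No comparably simple test is known for Hamiltonian graphs.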

The problem P vs. NP was first explicitly stated by Stephen A. Cook in a paper published in 1971 [48]. The title of the paper is The Complexity of Theorem-proving Procedures, which hints that the paper concerns logic. The main result of the paper was the theorem that the problem of satisfiability of propositional formulas is an NP-complete set. Independently and approximately at the same time the same result was proved by Leonid Levin [182]. Though nobody stated the problem explicitly earlier, some researchers considered related questions before Cook and Levin. Often quoted is Gödel’s letter to von Neumann sent in 1956 in which he asked how difficult it is to decide, for a given first order formula ϕ and a number n, whether there exists a proof of ϕ of length at most n [100]. Specifically, he asked if the time complexity of this problem can be bounded by the function cx or $$cx^{2}$$ for a constant c. This is the proof-search problem mentioned above and we know that this is an NP-complete set. Another problem that he mentioned in his letter was the complexity of Primes.

The reason for Cook’s choosing P is obvious: it stands for polynomial. The NP comes from nondeterministic polynomial. A nondeterministic Turing machine is a modification of Turing machines in which the machine in some states can do one of several actions. Hence, for a given input string, the way the machine computes is not uniquely determined, which means that several computations are possible. We say that a nondeterministic Turing machine accepts an input string if at least one of the possible computations ends in the accepting state. Using nondeterministic Turing machines we define NP as the class of sets that are accepted by nondeterministic Turing machines in polynomial time. Notice that the P vs. NP problem is a question about the possibility of eliminating an existential quantifier whose range is bounded. The concept of nondeterministic machines is just another way of expressing such an existential quantification.
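The reading of nondeterminism as a bounded existential quantifier can be sketched in Python. The choice of problem (subset sum) and all names below are illustrative assumptions; the text does not single out this example. Each string of choices plays the role of one possible computation of the nondeterministic machine, and the input is accepted if at least one of them ends in acceptance.

```python
from itertools import product

def verify_subset_sum(instance, choices):
    """One computation path: a fast deterministic check of the given choices."""
    numbers, target = instance
    return sum(x for x, take in zip(numbers, choices) if take) == target

def nondeterministic_accepts(instance):
    """Accept iff SOME choice string (of length bounded by the input) is accepted."""
    numbers, _ = instance
    return any(verify_subset_sum(instance, choices)
               for choices in product((0, 1), repeat=len(numbers)))
```

The deterministic simulation above eliminates the existential quantifier only by trying all exponentially many choice strings; whether this can always be avoided is exactly the P vs. NP question.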

The P vs. NP problem immediately drew the attention of researchers in computer science. Many attempted to solve it, but soon it became clear that there were no mathematical means that one could use. A number of results related to this problem have been proved, but the general feeling is that we have not progressed very much in more than 40 years. Gradually the P vs. NP problem became well-known in the whole mathematical community. Bets about when it will be solved and prizes for the solution have been proposed. In 2000 the scientific board of the Clay Mathematics Institute included this problem on the list of seven most important problems in mathematics. The institute offers the prize of one million US dollars for the solution of each of these problems. (In Chap.  I mentioned another problem that is on the list, the Riemann Hypothesis.)

### Complements of NP Sets

The existential quantifier plays a crucial role in the definition of the class NP. What happens if we replace it by the universal quantifier? We know that negation inverts the quantifiers; specifically, if we negate an existentially quantified formula, it becomes equivalent to the universally quantified negated formula. In symbols it is:
$$\neg\exists x\ \phi \equiv \forall x\ \neg\phi.$$
If we translate this relation into the set-theoretical language, we will see that replacing the existential quantifier by the universal quantifier in the definition of NP results in replacing the sets in the class by their complements. So we define coNP, co-Nondeterministic Polynomial Time, as the class of complements of sets X∈NP.

This duality can be extended to other concepts. For example, we can define coNP-complete sets. The basic NP-complete set is SAT, the set of satisfiable Boolean formulas. By the duality, we obtain that TAUT, the set of propositional tautologies, is coNP-complete.
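The duality between SAT and TAUT is easy to see in a brute-force Python sketch. Representing a formula as a Python function of n truth values is an assumption made only for illustration. Satisfiability asks whether some assignment makes the formula true, tautology asks whether all do, and a formula is a tautology exactly when its negation is not satisfiable.

```python
from itertools import product

def assignments(n):
    """All truth assignments to n propositional variables."""
    return product((False, True), repeat=n)

def is_satisfiable(formula, n):
    return any(formula(*a) for a in assignments(n))   # existential quantifier

def is_tautology(formula, n):
    return all(formula(*a) for a in assignments(n))   # universal quantifier

def negate(formula):
    return lambda *a: not formula(*a)
```

For instance, `is_tautology(f, n)` and `not is_satisfiable(negate(f), n)` always agree, which is the quantifier identity displayed above applied to truth assignments.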

Since P is closed under complements, P=coNP if and only if P=NP. The question whether NP=coNP is different. One can easily see that P=NP implies NP=coNP, but we do not know if the opposite implication holds true. So we cannot exclude that NP=coNP but P≠NP. The problem NP versus coNP plays the central role in proof complexity.

### Time Versus Space

When we define Polynomial Time, the complexity class P, it does not matter which computational model we use. Time measured by all reasonable models differs at most by a polynomial. When we want to have more precise estimates on time we must specify the model. In complexity theory we use Turing machines, which may seem not quite adequate. Indeed, the original Turing machines which have a single tape are not very fast. For instance, if such a machine has to determine if two strings on the tape are the same, it has to go back and forth between the two strings many times. Therefore we rather use multitape Turing machines, Turing machines with several tapes. In such machines every tape is equipped with one read-write head. The task in the above example is easy for a two-tape machine: the machine first copies the first string from the input tape to the second tape by moving its heads simultaneously; then it puts the input head on the first letter of the second string and the head on the second tape on the first letter of the string on the second tape. Then it can compare the two strings by moving both heads simultaneously.

Computers do not work like a multitape machine. A theoretical model that describes computers better is the random access machine. By the random access we mean the possibility to access any place in the memory directly. This means that when a machine reads (or writes to) a memory unit i and next it needs to read (or write to) the memory unit j, it does not have to pass through the intermediate positions between i and j and can jump directly to j. It is also much easier to program a random access machine than a multitape Turing machine, but surprisingly, they are not essentially faster. Thus in theoretical research we prefer the simpler concept, the multitape Turing machine.

Once we have specified the computation model we can measure the time complexity very precisely. One of the early results in complexity theory (from the 1960s) was the Time Hierarchy Theorem proved by J. Hartmanis and R.E. Stearns [114]. It states that for every reasonable function bounding the time of Turing machines, there exists a set whose complexity is very close to this function. Thus we have a whole range of possible time complexities.

In order to state this result more precisely, one should describe the set of functions that we can use as time bounds in this theorem, the time constructible functions. Since this is a rather technical point I will postpone it to the Notes. Let me only mention that the class includes all polynomials and other functions defined by the commonly used functions, such as the exponential function.

### Theorem 36

(Time Hierarchy)

Let f(x) and g(x) be time constructible functions. Assume that f(x) grows faster than g(x) in the following sense:
$$\lim_{x\to\infty}\frac{g(x)\log g(x)}{f(x)}=0.$$
Then there exists a set that can be computed in time f(x), but cannot be computed in time g(x).
The proof of this theorem is, essentially, an adaptation of the undecidability of the halting problem (see page 301). As it is an important application of the diagonalization method, it is worthwhile to describe the proof in more detail. Let Time(g) denote the class of sets that are computable in time g(x). Our goal is to construct a set A such that
1. A is outside Time(g), and

2. A is computable in slightly larger time (the better upper bound we get on the time complexity of A, the finer hierarchy we obtain).

To construct a set A that is not in Time(g) by diagonalization is easy. We pick an input string $$w_{B}$$ for every set B in Time(g). We put $$w_{B}$$ in A if and only if $$w_{B}$$ is not in B. This guarantees that A is not in Time(g). Since we need A to be computable in limited time, we have to be more careful; we cannot simply enumerate all sets in Time(g). We also cannot enumerate all Turing machines that run in time g(x), since this property is (for most g) undecidable. So what we do is to enumerate all Turing machines. For every M, we pick an input string $$w_{M}$$, say the code of M. Then we decide whether or not to put $$w_{M}$$ in A as follows. We simulate the computation of M on $$w_{M}$$ for g(n) steps, where n is the length of $$w_{M}$$. If the machine does not stop within this time bound, we do not care, since it does not define a set in Time(g). If it stops, we put $$w_{M}$$ in A if and only if M does not accept this string.

Thus A is not in Time(g) and it remains to determine the time complexity of the set A. It turns out that we need just a little more than g(n) log g(n) to simulate g(n) steps of a given machine. The reason is that we have to do it on a machine with k tapes, where k is a fixed constant, whereas the machines that we need to simulate may have an arbitrarily large (finite) number of tapes. Machines with more tapes are faster.
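The clocked simulation at the heart of this proof can be imitated in a toy Python model. Everything here is an assumption made for illustration: "machines" are Python generators that yield once per step and return their verdict, a finite dictionary from codes to machines stands in for the enumeration of all Turing machines, and the overhead of the simulation is ignored.

```python
def run_with_clock(machine, w, budget):
    """Simulate machine on w for at most `budget` steps (one step per yield).
    Returns True/False if it halts in time, None if the clock runs out."""
    computation = machine(w)
    for _ in range(budget + 1):
        try:
            next(computation)
        except StopIteration as halt:
            return bool(halt.value)
    return None

def in_diagonal_set(w, machines, g):
    """w is in A iff w is the code of a machine that halts on w within
    g(len(w)) steps and rejects it."""
    if w not in machines:
        return False
    return run_with_clock(machines[w], w, g(len(w))) is False

# Three toy machines with hypothetical codes "0", "1" and "10".
def accept_all(w):
    yield
    return True

def reject_all(w):
    yield
    return False

def slow_accept(w):
    for _ in range(len(w) ** 3):
        yield
    return True

MACHINES = {"0": accept_all, "1": reject_all, "10": slow_accept}
```

With the bound g(n) = n², the set A disagrees with `accept_all` on the string "0" and with `reject_all` on the string "1", while `slow_accept` exceeds the clock on its own code and is simply ignored, as in the argument above.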

Let us consider a couple of applications of the Time Hierarchy Theorem. According to this theorem there are sets computable in quadratic time (with time bound $$cx^{2}$$, c constant), but not in linear time (with time bound cx). Let EXP (also denoted EXPTIME) be the class of sets computable in time $$\exp(x^{c})$$ for c constant. Clearly, P⊆EXP. It is a simple corollary of the theorem that EXP contains more sets than P; thus P≠EXP.
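As a sanity check on the hypothesis of the theorem, the limits behind these two applications can be worked out directly. This is only a sketch: logarithms are taken to base 2, the particular choices of f and g are just one convenient option, and their time constructibility is assumed rather than verified. For quadratic versus linear time, take g(x)=cx and $$f(x)=x^{2}$$:
$$\lim_{x\to\infty}\frac{cx\log(cx)}{x^{2}}=\lim_{x\to\infty}\frac{c\log(cx)}{x}=0.$$
For P versus EXP, take $$g(x)=2^{x/2}$$ and $$f(x)=2^{x}$$. Every polynomial is eventually below $$2^{x/2}$$, so every set in P lies in Time(g) (the finitely many short inputs can be handled by a finite table), while
$$\lim_{x\to\infty}\frac{2^{x/2}\cdot x/2}{2^{x}}=\lim_{x\to\infty}\frac{x}{2^{x/2+1}}=0.$$
Hence the theorem yields a set computable in time $$2^{x}$$ that is outside Time(g), and therefore outside P.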

The space complexity is informally defined as the amount of memory that is needed for computation. Again, we measure it as a function of the length of the input data. Formally, we say that a Turing machine uses space s on a given input, if its heads visit s squares on the tapes during the computation. In analogy with time one defines Polynomial Space, denoted by PSPACE, as the class of sets that can be computed with space bounded by a polynomial.

For space complexity, we have a similar hierarchy theorem. The Space Hierarchy Theorem can be proved in a form which is stronger than we have for time. It suffices to assume $$\lim_{x\to\infty}g(x)/f(x)=0$$ in order to prove that there exists a set computable in space f(x) and not in space g(x).

The hierarchy theorems give us almost optimal information about time classes and space classes when time and space are considered separately. Problems start when we want to compare time classes with space classes. All we can do is to prove a couple of simple relations. The first one is that if a set is computable in time f(x), then it is also computable in space cf(x)+c, for some constant c. This is easy: if a Turing machine with c tapes makes only f(n) moves, then it can visit at most cf(n)+c squares. Thus, in particular, we have P⊆PSPACE. It is also not difficult to show that NP⊆PSPACE. Indeed, brute-force search needs only polynomial space. Thus P≠PSPACE would follow from P≠NP, but we also conjecture that NP≠PSPACE. The class PSPACE also possesses complete problems; typical PSPACE-complete problems are to decide who has a winning strategy in a combinatorial two-player game of bounded length. This is an explanation of why we do not have winning strategies for games such as chess and go. However, it is only supporting evidence for the hardness of these concrete problems; we are not able to prove any nontrivial lower bounds on the complexity of these games.

### Examples

1. We may never find out whether or not white has a winning strategy in chess. Because of the 50-move rule, one could, in principle, analyze the complete tree of plays using 50 chessboards and determine who has a winning strategy or whether there is a forced draw, but the number of plays is enormous. So the problem is not space, but time.

2. The game of Hex is simpler and thus more amenable to analysis. A simple argument shows that the first player has a winning strategy, but winning strategies are known only for boards smaller than the standard one. The general problem to determine, for a position on an n×n board, who has a winning strategy is PSPACE-complete, which means that the existence of an efficient algorithm for this problem is very unlikely [244].

To bound space by time we need to know how many steps a machine can make if it uses space f(x). One can easily estimate this number by exp(cf(x)), for some constant c. This is the number of all possible configurations of the machine and the tapes that can appear if space is restricted to f(x). This is also an upper bound on how many steps the machine can make, since if the machine ran longer, it would go into an infinite loop, but we assume that it always stops and produces the correct answer. This, in particular, yields PSPACE⊆EXP.
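The count of configurations can be spelled out as follows. The symbols q, k and a are introduced here only for this sketch: suppose the machine has q states, k tapes and a tape alphabet of a symbols, and uses space s=f(n)≥1. A configuration consists of a state, the positions of the k heads and the contents of the s squares used, so the number of configurations is at most
$$q\cdot s^{k}\cdot a^{s}=2^{\log q+k\log s+s\log a}\le 2^{cs}\quad\text{with}\quad c=\log q+k+\log a,$$
where logarithms are to base 2 and the inequality uses $$\log s\le s$$ and $$s\ge 1$$. Up to the choice of the constant, this is the bound exp(cf(x)) above.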

There is a small improvement of the simulation of time by space, but except for that nothing else is known.⁴ It is possible that these simulations are the best possible, but we are not able to prove it. The best way to state these open problems is in terms of complexity classes. We have
$$\mathbf {P}\subseteq \mathbf {PSPACE}\subseteq \mathbf {EXP}.$$
Thus PSPACE is somewhere between P and EXP, and this is essentially all we know about it! We believe that both inclusions are strict, but we are neither able to prove that P≠PSPACE, nor to prove that PSPACE≠EXP. It is interesting, however, that we know that at least one of them is strict. This follows from what we have observed above that P≠EXP.

For researchers working in computational complexity theory, it is important to know that at least one of the above inclusions is not equality. People from outside of this field sometimes say: “What if somebody finds a polynomial algorithm for an NP-complete problem and we thus get P=NP? Then all the talk about deep problems will turn out to be only humbug!” It is true that finding such an algorithm would probably be the simplest way to solve the P vs. NP problem and it may not require developing a deep theory. This can happen essentially with any of the open problems about complexity classes, but we do have reasons to believe that it cannot happen for all. In the pair of problems P=PSPACE? and PSPACE=EXP? at least one has the negative solution. We may be able to prove, for example, P≠PSPACE by proving PSPACE=EXP, but there are more complexity classes between P and EXP and it is not likely that each collapses to P or to EXP. So, very likely, some inequalities between complexity classes cannot be proved by proving equalities.

Some basic complexity classes and relations between them are shown in Fig. 5.1.

### Circuits

In computational complexity we use another important model of computation: circuits. In circuits functions are decomposed into a combination of elementary functions. Circuits are mainly used to compute Boolean functions $$f:\{0,1\}^{n}\to\{0,1\}^{m}$$. This means that the task is, for an input string of zeros and ones of length n, to compute the string of length m assigned to it by the function. In Boolean circuits the elementary functions are some simple Boolean functions, usually unary and binary Boolean functions. Such elementary Boolean functions are also called logical connectives; therefore some authors use the term logical circuits. We are now interested in complexity, so we do not care about the interpretation of the Boolean functions in logic; the main thing is that they define simple operations. Also the fact that we use two values is not important; it is just because this is the smallest number of values that we can use. In electronic circuits two values are also the standard, but the reason for that is the suitability for production, reliability etc. In theory we also use circuits computing with various other sets of values, in particular, algebraic circuits which compute with numbers.

Formally, a Boolean circuit is an acyclic graph in which some nodes are labeled by the input variables and the other nodes are labeled by elementary Boolean functions. Furthermore, some nodes are also labeled by the output variables. This is very much like in real circuits, thus we also often call the edges of the graphs wires and the nodes logic gates. Given an input string the computation proceeds as follows. First we assign the values to the input nodes and then we gradually compute the values on the nodes labeled by elementary Boolean functions. Eventually we read the output bits on the nodes labeled by the output variables. An example of a Boolean circuit is in Fig. 5.2.
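The evaluation procedure just described can be sketched in a few lines of Python. The list-of-gates representation, the gate names and the half-adder example are illustrative assumptions; the circuit of Fig. 5.2 is not reproduced here. The gates are listed in an order in which every gate comes after the nodes it reads, which is always possible because the graph is acyclic.

```python
import operator

# A small basis of elementary Boolean functions on the values 0 and 1.
GATES = {
    "AND": operator.and_,
    "OR": operator.or_,
    "XOR": operator.xor,
    "NOT": lambda a: 1 - a,
}

def evaluate(circuit, outputs, inputs):
    """circuit: list of (node name, gate name, names of the nodes it reads),
    listed so that each gate appears after its arguments.
    inputs: dict mapping input variables to 0 or 1.
    Returns the bits on the nodes named in `outputs`."""
    value = dict(inputs)                 # values on the input nodes
    for name, gate, args in circuit:     # gradually compute the other nodes
        value[name] = GATES[gate](*(value[a] for a in args))
    return [value[name] for name in outputs]

# Example: a half adder with inputs x, y and outputs (sum bit, carry bit).
HALF_ADDER = [("s", "XOR", ("x", "y")), ("c", "AND", ("x", "y"))]
```

For example, `evaluate(HALF_ADDER, ["s", "c"], {"x": 1, "y": 1})` returns the sum bit 0 and the carry bit 1.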

The complexity of the circuit is the number of nodes. The circuit complexity of a Boolean function is the minimal complexity of a circuit that computes the Boolean function. Having a single number as the complexity of a Boolean function corresponds better to our intuition about what complexity is. However, soon I will talk about infinite sequences of Boolean functions and the asymptotic growth of the complexity, which is not much different from the time and space complexities defined by means of Turing machines.

The set of elementary functions is assumed to be finite and complete in the sense that it must be possible to express any Boolean function using the elementary ones. We call such a set a complete set of connectives, or simply a basis. Circuit complexity depends on the basis, but for different bases the value is the same up to a constant multiplicative factor. Hence, if we are interested in the asymptotic growth, the basis does not matter. The same is true if we replace the two values by some finite set of values; again the difference is only a constant factor.

In order to justify this complexity measure, we must show that there are functions of various complexities, small and large. Again, it is natural to call a function whose circuit complexity is polynomial ‘small’ and those whose circuit complexity is exponential ‘large’. To give examples of functions with small complexity is not a problem. The circuit complexity theory was founded by Claude Shannon in the 1940s [264]. One of the first things he observed was that there exist functions of complexity almost $$2^{n}$$; in fact, the majority of all functions of n variables are this complex. The proof is based on a simple idea. Count the number of Boolean functions of n variables (this is $$2^{2^{n}}$$) and count the number of circuits of size at most s. If the number of circuits is smaller, then there must be a function which cannot be computed by such circuits. One can compute that the number of circuits of size smaller than $$2^{n}/(cn)$$, for some constant c, is less than $$2^{2^{n}}$$, hence there are functions whose complexity is larger than $$2^{n}/(cn)$$.
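One way to carry out the count is the following rough sketch. It rests on assumptions that the text does not fix: gates are arbitrary binary Boolean functions (16 possibilities), size is measured by the number s of gates, the last gate is the output, and smaller circuits are padded with dummy gates so that only circuits with exactly s gates need to be counted. Each gate is then determined by its function and by its two arguments among the n inputs and s gates, so the number of circuits is at most
$$\bigl(16\,(n+s)^{2}\bigr)^{s}.$$
For $$s=2^{n}/(cn)$$ with c>6 and n≥1 we have $$n+s\le 2^{n}$$, hence
$$\log_{2}\bigl(16\,(n+s)^{2}\bigr)^{s}\le s\,(4+2n)\le 6ns=\frac{6}{c}\,2^{n}<2^{n},$$
so there are fewer than $$2^{2^{n}}$$ such circuits, and some function of n variables is computed by none of them.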

Note that this is a nonconstructive proof, a proof which does not provide any explicit example of such a function. Seventy years later we are still unable to give a constructive proof. In fact, we are unable to prove a nonlinear lower bound for an explicitly defined function!

Such a statement may give the impression that nobody has worked on circuit complexity since Shannon. This is not true; there are numerous results in circuit complexity. Very efficient circuits have been found for many functions, lower bounds have been proved for some restricted classes of circuits and circuits play an important role in other parts of complexity theory. Lower bounds for general circuits are the only area where no progress has been achieved.

One may suggest that Turing machine complexity is related to the complexity of software and circuit complexity is related to the complexity of computer hardware. The truth is rather that these two concepts are just two facets of the same thing. Let me first give a brief intuitive explanation, based on real computers, why Turing machine and circuit complexities are closely related. Suppose we want to have a device for efficiently computing a function F. We can either construct a processor for F, or program a computer to compute F. Given an electronic circuit, we can program a computer to simulate the circuit; thus the algorithmic complexity is not larger than the hardware complexity. Vice versa, given a program for computing F, we can design a circuit by assigning a gate to each elementary operation needed to execute the program; thus we obtain a circuit computing the function F whose size is bounded by the time complexity of the program. So these two complexities are the same up to some factor.

I will now explain it in more detail using the matrix model of computation introduced in Chap. , page 137. On page 144 I described a transformation of a computation of a Turing machine T into a matrix M for inputs of a given length n. Recall that the rows of the matrix M correspond to the steps of the computation of the machine T and the columns of M correspond to the squares on the tape of T. (For the sake of simplicity, I will assume that T has only one tape.) The entry in the matrix should furthermore encode a bit of information about whether the head of T is present on this square or not, and about the state of the machine. Thus the entries in the matrix M are from some finite alphabet A which is big enough to encode a symbol on the tape, a bit marking the position of the head and a state of T. The symbols in the first row are determined by the input string. For i>1, the symbol in the row i and column j is uniquely determined by the symbols on positions (i−1,j−1),(i−1,j),(i−1,j+1), the three adjacent symbols in the row above. More precisely, if j is the first or the last column, then we consider only (i−1,j),(i−1,j+1), or (i−1,j−1),(i−1,j), the two adjacent symbols in the row above. In other words, the entry on (i,j) is a function of the previous two or three entries.

In this way the matrix can be viewed as a circuit C. The circuit computes with values in the set A and uses binary and ternary functions defined on A as the elementary functions. It has a rectangular form, in which one dimension corresponds to the time of the machine T and the other to space. We take only as many rows as the maximal number of steps that T makes on inputs of length n, and the number of columns is the maximal number of squares that T visits in some computation on inputs of length n.

Now suppose that T operates only on bits. Then for the fixed input size n, T computes a Boolean function. So, instead of the circuit C operating with symbols in the finite alphabet A, we may want to have a Boolean circuit which uses only binary Boolean functions as the basis. To obtain such a circuit we encode the elements of A by binary strings. Thus the binary and ternary functions defined on A become some Boolean functions. For each such function g, we choose a Boolean circuit and replace every occurrence of the function g in C by this Boolean circuit. By this transformation, we increase the size of the circuit only by a constant factor. This shows that if the time complexity of T is t(x) and the space complexity is s(x), we get a circuit of size at most ct(n)s(n), for some constant c. If we are only interested in time, we can upper-bound the size of the circuit by $$ct(n)^{2}$$, since we know that always s(x)≤t(x). In particular, if the time of T is bounded by a polynomial, then the size of the Boolean circuit is also bounded by a polynomial.
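The locality of the matrix, which is what turns it into a circuit, can be illustrated by a small Python sketch. The encoding is an assumption made for this example: an entry is a pair (symbol, state), with state `None` when the head is elsewhere; the transition table `delta` maps (state, symbol) to (new state, written symbol, move) with moves +1 or −1; a halted machine repeats its last row; and the border columns are treated by padding with a blank entry instead of using separate binary functions.

```python
BLANK = ("_", None)   # padding beyond the border: blank symbol, no head

def local_rule(left, mid, right, delta):
    """Entry (i, j) as a function of entries (i-1, j-1), (i-1, j), (i-1, j+1)."""
    symbol, state = mid
    if state is not None:                      # the head was on this square
        if (state, symbol) not in delta:       # machine has halted: copy entry
            return mid
        _, new_symbol, _ = delta[(state, symbol)]
        return (new_symbol, None)              # symbol rewritten, head leaves
    for neighbour, direction in ((left, +1), (right, -1)):
        n_symbol, n_state = neighbour
        if n_state is not None and (n_state, n_symbol) in delta:
            new_state, _, move = delta[(n_state, n_symbol)]
            if move == direction:              # the head arrives on this square
                return (symbol, new_state)
    return mid                                 # nothing happens here

def next_row(row, delta):
    padded = [BLANK] + row + [BLANK]
    return [local_rule(padded[j], padded[j + 1], padded[j + 2], delta)
            for j in range(len(row))]

def tableau(first_row, delta, steps):
    """The matrix: row 0 is the initial configuration, one more row per step."""
    rows = [first_row]
    for _ in range(steps):
        rows.append(next_row(rows[-1], delta))
    return rows
```

The same function `local_rule` is applied at every position, and the matrix for t steps and s squares has about t·s entries, which matches the size bound ct(n)s(n) above.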

It is clear that the above transformation produces only circuits of a special form. In order to understand the relation of Turing machines to circuits, we should examine what is the difference between a general circuit and a circuit obtained from a Turing machine. To simplify this problem, let us only consider the circuits that were obtained in the first part of the transformation. In the notation used, it is the circuit C which uses values from the set A. Let us simplify it further and compare C only with circuits that have the same rectangular form with the same connections between the gates⁵ and which use the same set of values A. Let D be such a general rectangular circuit. As the two circuits have the same underlying graph, the difference between them is only in what functions at which nodes of the graph are used. One can show that in C there are only three different functions: one binary function assigned to all nodes that correspond to the first column, one ternary function assigned to all nodes that are neither in the first column nor in the last one, and one binary function assigned to all nodes corresponding to the last column. In D, however, various functions can be assigned to the nodes in an arbitrary manner.

Furthermore, for the Turing machine T, the three functions are not only assigned to a given input size n, but the same functions are also used in the circuits for all input sizes. Hence the circuits obtained from Turing machines have a very regular form; we say that they are uniform, in contrast with general circuits that are nonuniform (see Fig. 5.3). This naturally leads to the concept of nonuniform complexity classes and to the question of what properties they share with their uniform counterparts. I will define only nonuniform-P, the nonuniform version of P.⁶ This is the class of all sets of 0–1 strings with the following property. A set A is in nonuniform-P if and only if for some polynomial p(x), for every input length n, there exists a circuit $$C_{n}$$ of size at most p(n) which accepts exactly those strings of length n that are in A. In other words, we have a sequence of circuits of polynomial size that compute the sections of the set A.

The transformation of a Turing machine computation into a circuit shows that P is contained in nonuniform-P. To determine how much nonuniform-P is different from P is another fundamental problem of complexity theory. We know for sure that the two classes are different. This follows from the fact that nonuniform-P contains sets that are not computable. The reason is that since we are free to choose a circuit for each n in an arbitrary way, we can take some trivial circuits that accept all inputs of length n for some numbers n, and take some circuits that reject all inputs of length n for the other numbers. In this manner we can encode an arbitrary subset of natural numbers into a set in nonuniform-P.
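The encoding trick can be written down as a short Python sketch. All names are illustrative assumptions, and one caveat applies: in the argument the subset of natural numbers may be uncomputable, whereas any predicate that can be passed to runnable code is of course computable, so the sketch shows only the shape of the construction.

```python
def trivial_circuit_family(in_subset):
    """in_subset(n) tells whether the number n belongs to the chosen subset S.
    Returns a map n -> C_n, where C_n is a constant circuit of size 1."""
    def circuit_for_length(n):
        bit = 1 if in_subset(n) else 0
        return lambda x: bit          # accepts all inputs of length n, or none
    return circuit_for_length

def accepts(family, x):
    """The set defined by the family: x is in it iff C_n accepts x, n = len(x)."""
    return family(len(x))(x) == 1
```

Membership of a string in the resulting set depends only on its length, so deciding the set is exactly as hard as deciding the subset S, although every circuit in the family has constant size.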

We seemingly deviate from our original goal to restrict the class of computable sets to a subclass which is closer to what is practically computable, but our intuition tells us that it is all right. In practice we consider only some small range of the input sizes, the ‘medium size’ inputs, and we do not care what happens for inputs whose size is ‘large’. Thus we may expend a lot of effort to design a special purpose circuit for a given task and then use it for efficient computation. From the theoretical point of view, there is no distinction between circuits and Turing machines if we have a fixed input size. We can equivalently say that we may need a lot of effort to produce a fairly complex program, but once we have it, it runs fast. For instance, such a program may use precomputed tables of numbers, tables that need a lot of time to be produced. To express it shortly, P is a class of problems efficiently computable by small programs, whereas nonuniform-P is a class of problems efficiently computable using programs that may be very hard to produce.

The main problem concerning nonuniform complexity classes is how large the uniform classes contained in nonuniform-P can be. We believe that none of the reasonable uniform extensions of P is contained in it. In particular we conjecture that NP is not a subset of nonuniform-P. That said, we are even unable to prove that EXP is not contained in nonuniform-P. We will see later that this is an important problem.

Another important application, actually the first one, of the transformation of Turing machine computations to circuits is in the NP-completeness theory. I will say more about it in the Notes.

### How to Prove that P≠NP?

The first idea that comes to mind is to use the same method as is used in proving that some problems are algorithmically unsolvable. This is the method that we call diagonalization, whose origin goes back to Cantor’s proof that the set of real numbers (or equivalently, the set of all subsets of natural numbers) has larger cardinality than the set of natural numbers. This method works well, as we have seen, when one needs to separate two classes defined by restricting the same computational resource. Thus one can separate pairs of time complexity classes and pairs of space complexity classes. But when we need to separate two classes of a different type, it does not work. We even have an argument that shows that diagonalization cannot solve problems such as P vs. NP, which I am going to sketch now. (But one can never exclude that some very unusual application of the method will work anyway.)

If you look at a typical proof based on diagonalization, such as the proof of the Time Hierarchy Theorem, you will notice that it uses very little information about how Turing machines compute. In particular the proof does not use the fact that the steps of the computation are very simple operations (reading the input symbol, rewriting it and moving the head to an adjacent square). If instead of these simple operations, more precisely, if, on top of these, we also allow some complex ones, the proof goes through without a problem. So let us consider an alternative world in which Turing machines can use a certain complex operation. You can imagine it as if some company would succeed in producing a special purpose processor that computed some hard function. Then on computers equipped with this processor one could compute many things that otherwise would be too hard. Everything would be the same, except that when programming one would be allowed to use this special function like the usual functions present in all programming languages.

Let us now consider the classes P and NP in such a world. We call classes modified in this way relativized. We are not able to solve the P vs. NP problem for the original, unrelativized classes, but it is possible to decide the corresponding question for some relativized ones. T. Baker, J. Gill and R. Solovay found (already in 1975) relativizations for which P=NP and others for which P≠NP [17]. This proves that:

It is not possible to solve the P vs. NP problem without using the fact that the basic operations are the simple ones.

In particular it demonstrates that:

No direct application of diagonalization can prove P≠NP.

So what else can we use? When looking for a method that essentially uses the fact that a single step of computation may only use an elementary function, we are naturally led to circuits. We know that in order to prove P≠NP, it suffices to find a set in NP whose sections cannot be computed by polynomial size circuits. Thus the computation-theoretical problem is reduced to a combinatorial problem. Given an explicit Boolean function, such as the function that computes whether the input string of length n is an encoding of a Hamiltonian graph, we need to prove that every circuit that computes the function must have a size bigger than f(n) for some function f which grows faster than any polynomial. Hamiltonian graphs are typical combinatorial concepts and the nature of circuits is also combinatorial. So the first reaction is that all this is just finite combinatorics, maybe a bit more difficult, but nothing more. Yet many problems, including very difficult ones, can be stated as problems in finite combinatorics; one should not judge the problem only by its appearance.

Our negative experience suggests that proving superpolynomial lower bounds on explicitly presented Boolean functions is such a difficult problem. We also have some mathematical results that support this presumption. The form of these results is that a certain method of proving such lower bounds on the circuit size, a method that we consider to be a natural way of solving such a task, in fact cannot work. These results are mainly due to A.A. Razborov. In the 1980s Razborov proved several important lower bounds on the size of circuits of a special type [238, 239]. This created great excitement; we hoped that the new methods that he introduced could be used to prove P≠NP. However, nobody was able to apply these methods to general circuits. Later Razborov proved that his methods, unfortunately, cannot be applied to general circuits [240]. (Again it does not exclude the possibility that some particular generalizations of the method can still work.)

Let us fix some terminology, before explaining the first type of negative result. In a circuit an elementary Boolean function, a gate, is assigned to every node of the underlying graph. We will assume that these functions are arbitrary binary Boolean functions (put otherwise, the basis is fixed to be the set of all binary Boolean functions). Furthermore, we associate every node of the circuit with a Boolean function computed at this node. This is the function that expresses the dependence of the output bit of the node on the input variables of the circuit. If the node is the output node, then the associated function is the function that the circuit computes. In the sequel we will assume that we want to prove a lower bound on the size of circuits computing some Boolean function f.

The first result concerns lower-bound methods based on the idea of progress in computing f. This is a very natural idea: if f has large complexity and every step of computation makes only small progress (which means that it can increase the complexity of the computed functions only a little bit), then there must be many steps in every computation of f. The computations that we have in mind are circuits. As each node of a circuit computes some function, this schema makes sense. Indeed, at each step the circuit complexity of the computed functions can increase only very little.

In this form the method is just a reformulation of the task of proving a lower bound; we have to replace the circuit complexity by something more specific. The most natural specification of this method, which is also what Razborov used in his lower bounds on monotone circuits, is to consider the distance from f as the measure of progress. Whereas the circuit complexity of a function is a concept that we are not able to handle, the distance is a very simple property. It is just the number of inputs on which a given function differs from f.

However natural this approach seems, the method with this specification does not work (in the case of general circuits). The essence of the argument showing that it does not work is in the following equation.
$$f=g\oplus (f\oplus g).$$
Here ⊕ denotes the Boolean function exclusive or, also called XOR, which can be interpreted as addition modulo two. To evaluate such an expression it is better to use the latter interpretation. As we count modulo two, we have g⊕g=2g=0, which proves the equation. (The order of summands and parentheses are irrelevant for the computation, but they are important for the argument that I am going to describe.)

Suppose we have a circuit C that has two parts, one computing some function g and the other computing some function f⊕g, and the output node computes the exclusive or of these two circuits. Hence C computes f according to the above equality. Thus if our lower bound method worked, one of the functions g or f⊕g must be close to f. But g can be an arbitrary function, hence should the method work for f, we must have, for every g: either g is close to f, or f⊕g is close to f. Since the pairs {g, f⊕g} are disjoint for distinct functions g, it follows that half of all Boolean functions must be close to f, which is impossible.
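The pairing argument can be checked by brute force for very small n. The following is a minimal sketch, not part of the original text: it assumes n=3, encodes each Boolean function as an 8-bit truth table packed into an integer, takes f to be the parity function, and (as an illustrative assumption) reads "close" as "differing on at most `radius` inputs". All names are hypothetical.

```python
N_VARS = 3                      # illustrative choice: functions of 3 variables
N_INPUTS = 2 ** N_VARS          # rows of a truth table
ALL_FUNCTIONS = range(2 ** N_INPUTS)  # a function = truth table packed into an int
PARITY3 = 0b10010110            # bit x is 1 exactly when x has an odd number of ones


def distance(g: int, h: int) -> int:
    """Number of inputs on which the functions g and h differ."""
    return bin(g ^ h).count("1")


def xor_pairs(f: int) -> set:
    """All unordered pairs {g, f XOR g}; XOR of functions is bitwise XOR of tables."""
    return {frozenset((g, f ^ g)) for g in ALL_FUNCTIONS}


def count_close(f: int, radius: int) -> int:
    """How many functions lie within the given distance of f."""
    return sum(1 for g in ALL_FUNCTIONS if distance(g, f) <= radius)


def pairs_with_close_member(f: int, radius: int) -> int:
    """How many pairs {g, f XOR g} contain a function close to f."""
    return sum(
        1
        for pair in xor_pairs(f)
        if any(distance(g, f) <= radius for g in pair)
    )


if __name__ == "__main__":
    print(len(xor_pairs(PARITY3)), "pairs,",
          pairs_with_close_member(PARITY3, 1), "of them contain a close function")
```

For the method to work, every one of the 128 pairs would need a member close to f, yet only a handful of functions are close to f at all.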

I have sketched only one basic idea of Razborov’s results. It shows that it is difficult to base a lower-bound method on studying what functions are computed locally at single nodes. One can use the progress in computing a function f for lower bounds, but it is necessary to treat it as a global property. Every single node, except for the output node, can compute a function that is completely unrelated to f.

The next result is, in a sense, more general, but it is based on an unproved assumption from complexity theory. As it may seem rather strange to use such an assumption to prove unprovability of another assumption, I will first explain why it is reasonable to use such an assumption. Let us assume the hypothetical situation that we can prove that a certain method $$\mathcal{M}$$ cannot be applied to prove P≠NP. Furthermore assume that for this very proof, we need to use the assumption P≠NP. In other words, we can prove the following statement:

If P≠NP, then it is not possible to prove P≠NP using method $$\mathcal{M}$$.

But if P=NP, then no method can give a proof of the opposite. Thus in such a case we can eliminate the assumption P≠NP and we can conclude:

It is not possible to prove P≠NP using method $$\mathcal{M}$$.

Now suppose that instead of P≠NP, we use a stronger assumption $$\mathcal{A}$$. Then we cannot remove it, but such a result is still interesting; for example, we can conclude that $$\mathcal{A}$$ itself is not provable using method $$\mathcal{M}$$, without using any unproved statement.

We will now consider a method of proving lower bounds based on the following idea. First characterize the Boolean functions that have large circuit complexity using a property P and then prove that a particular Boolean function, if possible a function from NP, satisfies P. This kind of proof is often used in mathematics; many important theorems are either explicitly stated as a characterization of some concept, or can be interpreted in such a way. What we consider to be a nice characterization depends on the particular field of mathematics, and on personal taste; there is no precise definition of it. Having an efficient algorithm to decide property P is not exactly what we mean by a nice characterization, but it is often a consequence of it, and conversely, if there is no such algorithm, it strongly suggests that a nice characterization is impossible.

Thus for this method, the crucial question is: can we decide in polynomial time if a given Boolean function f has complexity larger than S? When one speaks about computing properties of Boolean functions, one should make clear how the functions are given. Here we assume that a Boolean function is given by its truth table, represented by a string of zeros and ones of length $$2^{n}$$, where n is the number of variables of the function. The above question is more precisely:

### Problem 2

Is it possible to decide in polynomial time which strings of length $$2^{n}$$ encode Boolean functions of complexity larger than S?7
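To make the input format of Problem 2 concrete, here is a small sketch of the truth-table encoding. The helper names and the choice to list inputs in lexicographic order are assumptions made for illustration; the text only fixes the length of the string.

```python
from itertools import product


def truth_table(func, n: int) -> str:
    """Encode an n-variable Boolean function as a 0/1 string of length 2**n.

    Inputs are listed in lexicographic order (000, 001, 010, ...); this
    ordering is an assumption of the sketch.
    """
    return "".join(str(int(bool(func(*bits)))) for bits in product((0, 1), repeat=n))


def parity(*bits: int) -> int:
    """Example function: 1 if the number of ones is odd."""
    return sum(bits) % 2


if __name__ == "__main__":
    print(truth_table(parity, 3))
```

Note that "polynomial time" in Problem 2 is measured in the length of this string, that is, in $$2^{n}$$ rather than in n.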

Now I can state a simplified version of a result concerning the above method. It uses an unproved conjecture about the existence of pseudorandom generators, which I will abbreviate as the PRG-Conjecture. I will discuss pseudorandom generators in the next section; for now it is not very important what exactly the conjecture says (for the exact statement see page 435). It suffices to say that although it is essentially stronger than P≠NP, it is considered very plausible.

### Theorem 37

The PRG-Conjecture implies that the answer to the above question is no, i.e., it cannot be decided in polynomial time whether a string encodes a Boolean function of complexity larger than S.

Thus assuming the PRG-Conjecture, it is unlikely that a suitable characterization of hard Boolean functions can be found. Therefore it is unlikely that we could first characterize hard Boolean functions by a polynomial time property and then prove that a given function has the property. Such a method of proving that a Boolean function has large complexity is unlikely to work.

This is a weaker form of a result of A. Razborov and S. Rudich [243]. I will now describe their result in more detail. Observe that for proving a lower bound we do not need a precise characterization of functions whose complexity is larger than a given bound S. What we only need is to characterize a subset of hard Boolean functions and to show that our function belongs to this subset. The essence of the result can be briefly stated as follows.

Assuming the PRG-Conjecture, only small subsets of binary strings encoding hard Boolean functions can be computed by polynomial time algorithms.

Hence we cannot prove circuit lower bounds by first characterizing a large subset of hard functions and then showing that our function belongs to this set. This still leaves open the possibility of characterizing a small subset of hard functions. Why cannot we first characterize a small subset of hard functions and then prove that our function is in this set? Theoretically this is, of course, possible; theoretically our function may be the unique function with such a property, but then we can hardly speak about a method. The idea here is that a method must be based on a general combinatorial property, general meaning that many functions have it.

Razborov and Rudich proved more than just the above result. They analyzed all important lower bounds in circuit complexity (there is a number of such results for restricted classes of circuits) and they showed that all these proofs are based on characterizing a subset of functions that are hard for the class of circuits involved and the subset is both computable in polynomial time and large. (For restricted classes of circuits this does not contradict the PRG-Conjecture.) Thus the assumptions about the form of lower-bound proofs, including the largeness, are natural; we do not have other proofs. Therefore Razborov and Rudich proposed to call lower-bound proofs of this form natural proofs.

There is another argument that justifies the largeness condition. As we saw before, if we try to use the idea of the progress in computing a function f, then half of the functions are “close” to f. Hence this type of method also leads to a characterization of large sets of hard functions.

To recapitulate, Razborov and Rudich defined a precise notion of a natural proof for a given class of circuits, and they proved:

1. Essentially all known lower bound proofs are natural.

2. Assuming a plausible conjecture, there is no natural proof of an exponential lower bound for general Boolean circuits.

We therefore have to look for a proof that is essentially different from the proofs that have been found so far.

The last remark about natural proofs concerns the possibility of removing the unproved assumption in the way I mentioned before. At least in one case this is indeed possible. A. Wigderson showed that one can prove without any unproven assumption that there is no natural proof of the hardness of the discrete logarithm. This function is conjectured to be hard; in particular, it should not have polynomial size Boolean circuits.

### The Problem of Proving Lower Bounds

To show that P≠NP we need to prove that for some set A in NP, the question “is a given x in A?” cannot be decided using a polynomial time algorithm. In other words, we need to prove that any algorithm for this problem must run in time t(n) that cannot be bounded by a polynomial in n. Thus we need to prove a lower bound on t(n). Essentially all problems about separating complexity classes can be stated as problems of proving certain lower bounds.

Proving lower bounds on the complexity of computations is also important if one needs to precisely determine the complexity of particular problems. When we have an algorithm A for some problem P, it is usually not very difficult to estimate its running time t(n) (or the space s(n) it requires). Thus an algorithm gives us an upper bound on the complexity of P. Then we wonder whether the problem can be solved by a better algorithm. To show that there is no better algorithm requires proving a lower bound. The ideal situation would be when, for every problem used in practice, complexity theory were able to exactly determine the computational complexity of the problem. Presently this is only a wish, as one can prove precise bounds only in a few very special cases.

As usual, when a problem cannot be solved at once, people try to prove at least partial results. Let us focus on the problem of proving lower bounds on the size of Boolean circuits. There are two ways how one can make this problem simpler. First, one can try to prove just some nontrivial lower bounds. Second, one can restrict the class of circuits.

The first approach is the most frustrating area in computational complexity. One can say that almost nothing has been achieved here. At the time of writing these lines, the largest lower bound on circuits with n variables computing an explicitly defined Boolean function is only $$3n-o(n)$$,8 see [27]. Furthermore, these proofs use only elementary arguments, so they do not provide any insight into what the essence of computational complexity is.

In contrast to this, the area of lower bounds on circuits from restricted classes is one of the most interesting parts of theoretical computer science. I will describe one basic result of this kind. Consider the Boolean function $$\mathit{PARITY}(x_1,x_2,\dots,x_n)$$ which computes the parity of the number of ones in the input. So it outputs 0 if the number of ones among $$x_1,\dots,x_n$$ is even, otherwise it will output 1. Now suppose we want to compute it as a disjunction of conjunctions of variables and negated variables. Formally, we want to express it as
$$\mathit{PARITY}(x_1,x_2,\dots ,x_n)=\bigvee _i\bigwedge_j z_{ij}$$
where each $$z_{ij}$$ is a variable or negated variable ($$x_k$$ or $$\neg x_k$$ for some k=1,…,n). If some conjunction $$\bigwedge_j z_{ij}$$ contained fewer than n variables, then the right hand side would accept inputs of both parities, so this is not possible. If a conjunction contains all variables, then it accepts exactly one input (assuming every variable occurs in it only once). Hence the number of conjunctions must be $$2^{n-1}$$, the number of odd inputs. This is a very easy lower bound. Now consider circuits of depth 3, which means expressing PARITY as
$$\mathit{PARITY}(x_1,x_2,\dots ,x_n)=\bigvee _i\bigwedge_j \bigvee_k z_{ijk}$$
where each $$z_{ijk}$$ is a variable or negated variable. Then one can still prove an exponential lower bound $$2^{\sqrt{n}}$$, but the proof is not trivial. More generally, one can prove exponential lower bounds for any fixed depth. The proof is based on the random restriction method, which is briefly described in Notes.
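The easy depth-2 bound discussed before the depth-3 case can be verified mechanically for small n. The sketch below is illustrative only: it represents a conjunction as a tuple of hypothetical `(variable index, required value)` literals, builds the disjunction with one full conjunction per odd input, and checks that a conjunction missing a variable accepts inputs of both parities.

```python
from itertools import product


def parity(bits) -> int:
    """1 if the number of ones among the bits is odd, else 0."""
    return sum(bits) % 2


def parity_dnf(n: int) -> list:
    """One full conjunction per odd-parity input; a literal is (index, value)."""
    return [
        tuple(enumerate(bits))
        for bits in product((0, 1), repeat=n)
        if parity(bits) == 1
    ]


def eval_dnf(dnf, bits) -> int:
    """Evaluate a disjunction of conjunctions of literals on one input."""
    return int(any(all(bits[k] == v for k, v in term) for term in dnf))


def accepts_both_parities(term, n: int) -> bool:
    """True if the single conjunction `term` accepts an even and an odd input."""
    accepted = {
        parity(bits)
        for bits in product((0, 1), repeat=n)
        if all(bits[k] == v for k, v in term)
    }
    return accepted == {0, 1}


if __name__ == "__main__":
    n = 4
    dnf = parity_dnf(n)
    print(len(dnf), "conjunctions for n =", n)
```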

It would take us too far afield to discuss the importance of this result and its relation to others. Let me only draw your attention to the striking simplicity of the function used in this lower bound. Naively, one would expect that it should be easier to prove such lower bounds for functions that are hard, rather than such a simple one. The point is that the lower bound proof uses the simple property that PARITY changes its value if we flip any of the input bits. This function and its negation are the unique functions that have this property.
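The uniqueness claim is easy to confirm exhaustively for small n. In this sketch (names and encoding are assumptions, not from the text) a function of n variables is a truth table packed into an integer, with bit x holding the value on input x, and we search for all functions whose value changes whenever any single input bit is flipped.

```python
def fully_sensitive_functions(n: int) -> list:
    """All n-variable functions that change value when any one input bit flips.

    A function is an integer whose bit number x is its value on input x.
    Brute force over all 2**(2**n) tables, so only sensible for n <= 4.
    """
    size = 2 ** n
    found = []
    for table in range(2 ** size):
        if all(
            ((table >> x) & 1) != ((table >> (x ^ (1 << i))) & 1)
            for x in range(size)
            for i in range(n)
        ):
            found.append(table)
    return found


def parity_table(n: int) -> int:
    """Truth table of PARITY in the same packed encoding."""
    return sum(1 << x for x in range(2 ** n) if bin(x).count("1") % 2 == 1)


if __name__ == "__main__":
    print(fully_sensitive_functions(3), parity_table(3))
```

For n=3 the search returns exactly two tables: the one for PARITY and its bitwise complement.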

This is one of the first nontrivial lower bounds. In more recent results a bit more complex properties are used, but they are still fairly simple. Our inability to use complex properties of hard functions shows that the field is still not mature enough. The state of affairs is comparable with number theory at the time when only the irrationality of $$\sqrt{2}$$ was known.

### Existence and Construction

When we prove that there exists a mathematical entity (a number, a mathematical structure, etc.) with some property, we usually give an explicit description of it. Sometimes, however, one can only prove that it exists without being able to exhibit a specific example of such a thing. The first type of proof is called constructive, the second type nonconstructive or purely existential. Purely existential proofs appeared in mathematics not so long ago. At first they were looked upon with great suspicion and some mathematicians even rejected them. Some mathematicians thought that they were the source of paradoxes and therefore in intuitionistic mathematics such proofs are not allowed. In 1889, Hilbert proved a fundamental result that the number of invariants of an n-form is finite [123], which generalized an earlier proof of Paul Gordan for binary forms. Hilbert used an indirect argument, and since it was one of the first purely existential proofs, Gordan’s reaction was: “this is not mathematics, it is theology”. Another well known nonconstructive result is Roth’s Theorem about approximations of irrational algebraic numbers by rational numbers. It states that for a certain type of precision, there are only finitely many rational numbers that approximate a given irrational algebraic number with that precision. (Weaker versions of this theorem were proved by Liouville, Thue and Siegel.)

Gradually mathematicians got used to them and in some fields there are many purely existential proofs. The reason is not that we like them; on the contrary, we always prefer a constructive proof, but sometimes this is the only way to prove a theorem. Quite often, an existential proof is easy and a constructive proof of the theorem is difficult and therefore found much later. But even nowadays we usually do not consider a purely existential proof to be a complete solution of a problem. Typically, when looking for a solution of equations, proving the existence of a solution and solving the equations are always treated as two separate things; solving means to describe a solution explicitly.

In the chapter about set theory we noticed that the Axiom of Choice has a very nonconstructive nature. Among its consequences, there are results stating the existence of very counterintuitive objects, such as nonmeasurable sets of reals, the paradoxical transformation of a ball into two balls, etc. But existential proofs are also used in finite mathematics where we do not need the Axiom of Choice. They were introduced to finite graph theory by Erdős in the 1940s. Probably the first paper in which his probabilistic method was used concerned estimates of the Ramsey numbers [67].

Let us briefly recall the Finite Ramsey theorem and the Ramsey numbers (defined on page 16, see also page 328). The theorem states that for every natural number n, there exists a number R such that, for every graph on R vertices, there exists a set of n vertices such that either every pair of them is connected by an edge, or no pair is. We call such sets of vertices monochromatic. The minimal R such that this holds is the Ramsey number R(n). We do not have a precise formula for the numerical function R(n), but already back then Erdős established that it grows exponentially.

To prove such a result one has to show an exponential upper bound and an exponential lower bound. The upper bound R(n)≤U means that every graph with U vertices contains a monochromatic set of size n. It was proved constructively (for $$U={{2n-2}\choose{n-1}}$$), in the sense that, for every graph of size U, a monochromatic subset of size n was constructed by an efficient algorithm. To prove the lower bound L<R(n) one has to prove that there exists a graph on L vertices which does not contain a monochromatic set of size n. This is where Erdős used a purely existential proof (for $$L=2^{(n-1)/2}\,n/e$$).

His argument can be explained in two equivalent ways. (1) He counted the number of all graphs on L vertices that contain a monochromatic subset of size n and showed that it is smaller than the number of all graphs on L vertices. (2) He counted the probability that a random graph on L vertices contains a monochromatic subset of size n and showed that the probability is less than 1. Since the probability that a random graph has a property P is the number of graphs having property P divided by the number of all graphs, (1) and (2) are the same, except that they use different language. In both cases we immediately conclude that there exists a graph on L vertices that does not contain a monochromatic subset of size n, without being able to exhibit a single example of such a graph. More than sixty years after Erdős asked this question, we are still unable to construct explicitly a Ramsey graph, a graph that would show an exponential lower bound on R(n).9
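The counting argument can be turned into a small calculation. The sketch below uses the standard union-bound form of the argument, which may differ in detail from Erdős's original computation: a fixed set of n vertices is monochromatic with probability $$2\cdot 2^{-\binom{n}{2}}$$, so if $$\binom{L}{n}\cdot 2\cdot 2^{-\binom{n}{2}}<1$$ then some graph on L vertices has no monochromatic set of size n. Function names are hypothetical, and the comparison is done in exact integers.

```python
from math import comb


def union_bound_holds(L: int, n: int) -> bool:
    """True if the expected number of monochromatic n-sets is below 1.

    Equivalent to comb(L, n) * 2 * 2**(-comb(n, 2)) < 1, compared exactly.
    """
    return 2 * comb(L, n) < 2 ** comb(n, 2)


def counting_lower_bound(n: int) -> int:
    """Largest L for which the union bound certifies L < R(n)."""
    if n < 3:
        raise ValueError("the sketch assumes n >= 3")
    L = n
    while union_bound_holds(L + 1, n):
        L += 1
    return L


def constructive_upper_bound(n: int) -> int:
    """The bound U = binom(2n-2, n-1) quoted in the text."""
    return comb(2 * n - 2, n - 1)


if __name__ == "__main__":
    for n in range(3, 11):
        print(n, counting_lower_bound(n), constructive_upper_bound(n))
```

The calculation certifies that a suitable graph exists for each listed L, but it never produces one, which is exactly the nonconstructive character described above.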

Problems of this type are clearly related to the P vs. NP problem. Since the number of graphs on L vertices is finite and the condition that we are interested in is algorithmically decidable, we can search all such graphs and find one. The problem is that for medium size L the number of such graphs is too large, hence this algorithm is too slow. If P were equal to NP we could construct such a graph in time bounded by a polynomial in L. What is more important is that the brute-force algorithm above does not yield any additional knowledge about such graphs. Why is an explicit construction better? Consider the problem of finding more precise estimates on R(n). Erdős’s lower bound is asymptotically $$2^{n/2}$$ and his upper bound is $$2^{2n}$$. These bounds have been improved only in lower order terms, so the situation now is essentially the same as sixty years ago. A likely explanation of our failure to close the gap is the following: the actual value of R(n) is close to the upper bound $$2^{2n}$$, whereas the value for random graphs is close to $$2^{n/2}$$ and the probabilistic method is only able to determine the value for random graphs. Apparently we need to find explicit constructions in order to be able to determine the Ramsey numbers. This is not sheer speculation; in Notes I will mention two combinatorial problems that were solved by means of explicit constructions.

Probabilistic proofs show the existence of graphs with typical values of parameters, which is bad for problems such as the problem of computing the Ramsey numbers. On the other hand, this approach opened a completely new area of research, the study of properties of “random graphs”, or more generally any “random structures”. Using the word ‘random’ is only a façon de parler about properties that are shared by most structures of a given type. Saying that ‘a random graph has property P’ is just a convenient abbreviation of the longer ‘if we randomly choose a graph with N vertices, then the probability that it has property P goes to 1 as N goes to infinity’. That said, I should stress that a random graph is not abstract nonsense that we can never observe in reality. On the contrary, we can very easily produce such a graph: for each pair of vertices decide whether to draw or not to draw an edge between them by tossing a coin. Such a graph will be for all practical purposes random. At the end of this chapter I will mention a possible way to define formally a random graph as a specific graph.
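The coin-tossing recipe is short enough to write down. This is a minimal sketch with hypothetical names: a graph is a dictionary from vertex pairs to True (edge) or False (no edge), the coin is Python's pseudorandom generator, and monochromatic sets are found by exhaustive search, which is only feasible for small graphs.

```python
import random
from itertools import combinations


def random_graph(num_vertices: int, rng: random.Random) -> dict:
    """Toss a fair coin for every pair of vertices; heads means an edge."""
    return {
        pair: rng.random() < 0.5
        for pair in combinations(range(num_vertices), 2)
    }


def has_monochromatic_set(graph: dict, num_vertices: int, size: int) -> bool:
    """True if some `size` vertices are pairwise all joined or all not joined."""
    for subset in combinations(range(num_vertices), size):
        values = {graph[pair] for pair in combinations(subset, 2)}
        if len(values) == 1:
            return True
    return False


# The 5-cycle: a classical graph with no monochromatic set of size 3.
PENTAGON = {
    pair: (pair[1] - pair[0]) in (1, 4)
    for pair in combinations(range(5), 2)
}

if __name__ == "__main__":
    g = random_graph(12, random.Random(2024))
    print([k for k in range(2, 7) if has_monochromatic_set(g, 12, k)])
```

Because R(3)=6, every graph on 6 vertices, random or not, contains a monochromatic set of size 3, while the 5-cycle shows that 5 vertices are not enough.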

Shannon’s proof that there exist Boolean functions whose circuit complexity is exponential appeared only two years after Erdős’s paper. I do not know if Shannon was aware of Erdős’s result, but his proof is based on the same idea: comparing the number of all n variable Boolean functions with the number of circuits of some size S, which bounds the number of Boolean functions of circuit complexity S. Thus we know that “random Boolean functions” have complexity of the order $$2^{n}/n$$, but we are not even able to construct functions that have nonlinear complexity. The words ‘construct’ and ‘explicitly define’ are not precisely defined. In complexity theory we, of course, would interpret them as giving an algorithm of certain low complexity. If we succeed in constructing a function with, say, nonlinear circuit complexity, then the strength of the separation result we get depends on how strong a constructibility condition we can ensure. One of the best would be to find a sequence of functions that compute sections of an NP problem A and which have superpolynomial circuit complexity. Superpolynomial circuit complexity means that A is not in nonuniform-P. So this would give us $$\mathbf {NP}\not \subseteq \mathbf {nonuniform}\mbox {-}\mathbf {P}$$, from which P≠NP follows. Another natural interpretation of being explicit is to require that explicit Boolean functions should be constructed in polynomial time in $$2^{n}$$ (which is the size of the truth tables). This is much weaker but it would still give interesting consequences, for example, the separation $$\mathbf {EXP}\not \subseteq \mathbf {nonuniform}\mbox {-}\mathbf {P}$$.
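The counting comparison can be sketched numerically. The bound used here is a deliberately crude assumption of this sketch, not Shannon's own estimate: a circuit with s gates over the basis of all 16 binary functions is described by choosing, for each gate, a function and two predecessors among at most n+s nodes, giving at most $$(16(n+s)^2)^s$$ circuits. If all circuits of size at most S together are fewer than the $$2^{2^n}$$ functions, some function has complexity larger than S.

```python
def circuits_up_to(n: int, size: int) -> int:
    """Crude upper bound on the number of circuits with at most `size` gates.

    Size 0 counts the n bare input variables; a circuit with s >= 1 gates
    is bounded by (16 * (n + s)**2)**s, as explained in the lead-in.
    """
    return n + sum((16 * (n + s) ** 2) ** s for s in range(1, size + 1))


def counting_complexity_bound(n: int) -> int:
    """Largest S such that circuits of size <= S cannot cover all functions.

    Some n-variable function therefore has circuit complexity above S.
    """
    num_functions = 2 ** (2 ** n)
    S = 0
    while circuits_up_to(n, S + 1) < num_functions:
        S += 1
    return S


if __name__ == "__main__":
    for n in range(3, 13):
        print(n, counting_complexity_bound(n), round(2 ** n / n, 1))
```

The printed values grow roughly like $$2^{n}/n$$ up to a constant factor, although this crude count does not reproduce the sharp constant.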

Another example of a structure whose existence can be proved easily, but which is nontrivial to construct explicitly, is the expander graph (see page 431). Expander graphs are graphs that, roughly speaking, have few edges, but where every set of vertices has many neighbors (unless the set is too large). Explicitly defined expander graphs are very useful components of many constructions and derandomization techniques.

Let us pause and compare the problem of constructive proofs in finite combinatorics and computational complexity with similar problems in other fields of mathematics. There are numerous examples of problems for which we have only existential proofs. Typically one has a criterion for solvability of equations of some type, but the result says nothing about how to find an explicit solution.

I have already mentioned that integer factoring is a problem of this type. A particular instance is the famous problem to determine which Mersenne numbers are primes. A Mersenne number is a number of the form $$2^{p}-1$$, where p is a prime. The well-known Lucas-Lehmer test can be used to prove primality and non-primality of very large Mersenne numbers. If such a number is not a prime, it means that $$xy=2^{p}-1$$ has a solution with x,y>1, but finding the factors is much harder.
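The Lucas-Lehmer test itself is only a few lines long, which makes the contrast with factoring quite vivid. The sketch below follows the usual textbook formulation of the test (start with 4, repeatedly square and subtract 2 modulo $$2^{p}-1$$, and check for 0 after p−2 steps); the function name is hypothetical, and p is assumed to be a prime.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True iff the Mersenne number 2**p - 1 is prime (p prime).

    The answer is a bare yes or no: when the number is composite,
    the test gives no information about its factors.
    """
    if p == 2:
        return True  # 2**2 - 1 = 3 is prime; the recurrence needs p >= 3
    mersenne = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % mersenne
    return s == 0


if __name__ == "__main__":
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31]
    print([p for p in primes if lucas_lehmer(p)])
```

For p=11 the test reports that $$2^{11}-1=2047$$ is composite without ever exhibiting the factorization 23·89.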

Likewise, theorems about the existence of solutions of differential equations can be applied to many types of equations, but only in some special cases it is possible to describe a solution explicitly. In such problems a solution is an infinite object, thus we do not have a precise definition of being explicit. If the solution is unique, we can often compute it with arbitrary precision, but this is not enough.

It is also interesting to compare the methods used to prove such results. Cantor proved that the cardinality of the set of all real numbers is bigger than the cardinality of the set of all algebraic numbers. As we have observed in Chap. , a simple corollary of this result is that there exist non-algebraic, i.e., transcendental numbers. It had been proved before that particular numbers are transcendental, but Cantor’s proof is much simpler. In circuit complexity we use finite cardinalities to prove the existence of Boolean functions with large complexity, but unfortunately, we still do not have methods to give specific examples of such functions. The method used in these two cases is the same; both proofs are fairly easy and neither provides specific examples.10

The lack of progress in solving fundamental problems of the complexity theory suggests that we must first learn how to use deeper mathematical methods. Recent development shows that explicit constructions of finite structures with certain useful properties are often the key to making progress where we had been stuck for a long time. Finding explicit constructions where there are only existential proofs is a very interesting topic. However, we should not expect that every existential proof can be replaced by explicit constructions. For example, it is not excluded that there are no polynomial time constructions of Ramsey graphs.

### The Complexity of Algebraic Computations

Computing with digital devices requires encoding numbers by strings of symbols, say zeros and ones. What if we ignore this problem and assume arithmetic operations to be elementary? Given an algebraic function, which is a function defined by a polynomial, it is natural to ask how many basic algebraic operations we need to compute the function. This is algebraic complexity.

It should be stressed that if we determine the algebraic complexity of an algebraic function, we still do not know how difficult it is to compute it using a general computational model such as the Turing machine, or Boolean circuits. In the algebraic setting the cost of computing the product of two numbers is one because it requires only one step, but when the numbers are encoded in binary, it is a nontrivial task (and we do not know the precise cost). What is more important is the opposite: for some algebraic functions an algebraic program may require much more time than a Turing machine. The point is that a Turing machine can compute basic arithmetic operations fairly fast, but on top of that it can also compute a lot of operations that are not algebraic. For example, it can flip the binary string that represents the number. However, algebraic programs have the big advantage of being more versatile. We can often use the same algorithm for various fields, sometimes even for rings.

One of the central problems in algebraic complexity is the complexity of matrix multiplication. Let A=(a ij ) and B=(b ij ) be two n×n matrices. Their product is defined by
$$AB = \biggl(\sum_k a_{ik} b_{kj} \biggr).$$
If we use this formula to compute the product we need $$n^{3}$$ multiplications and $$n^{2}(n-1)$$ additions. It is tempting to conjecture that this is the best possible, but it isn’t.
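The operation counts for the defining formula can be confirmed by instrumenting a direct implementation. This is an illustrative sketch with hypothetical names; matrices are plain lists of lists and each entry is computed as one product followed by n−1 multiply-and-add steps.

```python
def multiply_by_definition(a, b):
    """Multiply two n x n matrices by the defining formula.

    Returns (product, multiplications, additions) so that the counts
    n**3 and n**2 * (n - 1) can be checked directly.
    """
    n = len(a)
    mults = adds = 0
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = a[i][0] * b[0][j]
            mults += 1
            for k in range(1, n):
                total += a[i][k] * b[k][j]
                mults += 1
                adds += 1
            c[i][j] = total
    return c, mults, adds


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(multiply_by_definition(a, b))
```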

The fact that it suffices to use essentially fewer operations was discovered by Volker Strassen. Sometime around 1969 he decided to determine the complexity of matrix multiplication in the simplest possible case. It was the case of two by two matrices in the two element field. Two by two is the smallest nontrivial dimension of matrices and the two element field $$\mathbb{F}_{2}$$ is the simplest field since it consists only of 0 and 1. Thus in this field the number of possible terms is very limited. Strassen found a very remarkable way of computing such a product [286].

The product of two such matrices is
$$\left (\begin{array}{c@{\quad }c} a_{11} & a_{12}\\ a_{21} & a_{22} \end{array} \right ) \left (\begin{array}{c@{\quad }c} b_{11} & b_{12}\\ b_{21} & b_{22} \end{array} \right ) = \left (\begin{array}{c@{\quad }c} c_{11} & c_{12}\\ c_{21} & c_{22} \end{array} \right )$$
where the numbers c ij are, according to the definition,
$$\begin{array}{l} c_{11}=a_{11}b_{11}+a_{12}b_{21}\\[3pt] c_{12}=a_{11}b_{12}+a_{12}b_{22}\\[3pt] c_{21}=a_{21}b_{11}+a_{22}b_{21}\\[3pt] c_{22}=a_{21}b_{12}+a_{22}b_{22}. \end{array}$$
Thus the task is to compute these four bilinear forms in $$a_{ij}$$, $$b_{kl}$$. Using this definition we can compute them with 8 multiplications and 4 additions. Strassen found the following algorithm.
$$\begin{array}{l} d_1=(a_{11}+a_{22})(b_{11}+b_{22})\\[3pt] d_2=(a_{21}+a_{22})b_{11}\\[3pt] d_3=a_{11}(b_{12}-b_{22})\\[3pt] d_4=a_{22}(-b_{11}+b_{21})\\[3pt] d_5=(a_{11}+a_{12})b_{22}\\[3pt] d_6=(-a_{11}+a_{21})(b_{11}+b_{12})\\[3pt] d_7=(a_{12}-a_{22})(b_{21}+b_{22})\\[3pt] c_{11}=d_1+d_4-d_5+d_7\\[3pt] c_{21}=d_2+d_4\\[3pt] c_{12}=d_3+d_5\\[3pt] c_{22}=d_1+d_3-d_2+d_6. \end{array} \tag{5.2}$$
This enables one to compute the product with 7 multiplications and 18 additions. As it is often the case, it turned out that what he learned in the special case could be widely generalized. Namely:
1. Although Strassen’s algorithm uses 25 operations whereas the definition gives only 12, the important thing is that the number of multiplications has been reduced. This is crucial in the recursive application of this formula, which enabled him to show a better asymptotic upper bound on the number of operations when the dimension of matrices goes to infinity.

2. The algorithm works not only in the two element field; it works generally in every field (in fact, it also works in rings, which is needed for recursive applications to higher-dimensional matrices).
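Strassen's seven products can be checked against the definition directly. The sketch below transcribes the formulas for $$d_1,\dots,d_7$$ and $$c_{11},\dots,c_{22}$$ into a hypothetical Python function over the integers; since the identities hold in the ring of integers, reducing modulo 2 gives the two element field case as well.

```python
from itertools import product


def strassen_2x2(a11, a12, a21, a22, b11, b12, b21, b22):
    """Product of two 2x2 matrices using 7 multiplications and 18 additions."""
    d1 = (a11 + a22) * (b11 + b22)
    d2 = (a21 + a22) * b11
    d3 = a11 * (b12 - b22)
    d4 = a22 * (-b11 + b21)
    d5 = (a11 + a12) * b22
    d6 = (-a11 + a21) * (b11 + b12)
    d7 = (a12 - a22) * (b21 + b22)
    c11 = d1 + d4 - d5 + d7
    c12 = d3 + d5
    c21 = d2 + d4
    c22 = d1 + d3 - d2 + d6
    return c11, c12, c21, c22


def definition_2x2(a11, a12, a21, a22, b11, b12, b21, b22):
    """Product of two 2x2 matrices using 8 multiplications and 4 additions."""
    return (
        a11 * b11 + a12 * b21,
        a11 * b12 + a12 * b22,
        a21 * b11 + a22 * b21,
        a21 * b12 + a22 * b22,
    )


if __name__ == "__main__":
    # Exhaustive check over all 0/1 entries, compared modulo 2 (the field F_2).
    ok = all(
        [v % 2 for v in strassen_2x2(*e)] == [v % 2 for v in definition_2x2(*e)]
        for e in product((0, 1), repeat=8)
    )
    print("agrees over F_2:", ok)
```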

It is not difficult to explain the generalization to matrices of any dimension. An important ring is the ring of n×n matrices. One can apply the above formula to this ring as follows. Take two 2n×2n matrices A and B and divide them into n×n blocks $$A_{ij}$$, $$B_{ij}$$, i,j=1,2. It is not difficult to see that the product AB can be computed using the decomposition into blocks, as if we had two 2×2 matrices with entries $$A_{ij}$$, $$B_{ij}$$, i,j=1,2. In this computation the operations of addition and multiplication are the addition and multiplication in the ring of n×n matrices. Hence using formulas (5.2) we can reduce the computation of the product AB to 7 products and 18 additions of matrices of half the dimension. Thus if we can multiply n×n matrices using μ(n) multiplications and α(n) additions, we can multiply 2n×2n matrices using 7μ(n) multiplications and $$7\alpha(n)+18n^{2}$$ additions, since each of the 18 additions of n×n blocks costs $$n^{2}$$ additions of entries.

If n is a power of 2, $$n=2^{k}$$, then the recursive application of this reduction yields an algorithm with at most $$c\cdot 7^{k}$$ multiplications and additions, for some constant c. This is asymptotically $$n^{\log_{2} 7}=n^{2.8073\dots}$$, which is better than the asymptotic complexity of the defining formula, which is $$n^{3}$$. The current best algorithm gives $$n^{2.3727}$$.
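The recursion can be sketched in Python as follows. This is only an illustration under simplifying assumptions: matrices are lists of lists, the dimension is a power of 2, and the helper names (`add`, `sub`, `split`, `naive`) and the `counter` argument that tallies scalar multiplications are introduced here, not taken from the text.

```python
def add(X, Y):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def split(X):
    """Return the four blocks X11, X12, X21, X22 of a matrix of even dimension."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])

def strassen(A, B, counter):
    """Multiply square matrices of dimension 2^k using formulas (5.2) recursively."""
    if len(A) == 1:
        counter[0] += 1  # one multiplication in the underlying ring
        return [[A[0][0] * B[0][0]]]
    a11, a12, a21, a22 = split(A)
    b11, b12, b21, b22 = split(B)
    d1 = strassen(add(a11, a22), add(b11, b22), counter)
    d2 = strassen(add(a21, a22), b11, counter)
    d3 = strassen(a11, sub(b12, b22), counter)
    d4 = strassen(a22, sub(b21, b11), counter)
    d5 = strassen(add(a11, a12), b22, counter)
    d6 = strassen(sub(a21, a11), add(b11, b12), counter)
    d7 = strassen(sub(a12, a22), add(b21, b22), counter)
    c11 = add(sub(add(d1, d4), d5), d7)
    c12 = add(d3, d5)
    c21 = add(d2, d4)
    c22 = add(add(sub(d1, d2), d3), d6)
    # glue the four blocks back together
    return [r + s for r, s in zip(c11, c12)] + [r + s for r, s in zip(c21, c22)]

def naive(A, B):
    """The defining formula, used here only as a reference."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[2, 0, 1, 3], [1, 1, 0, 2], [0, 4, 1, 1], [3, 2, 2, 0]]
counter = [0]
print(strassen(A, B, counter) == naive(A, B), counter[0])  # True 49
```

For 4×4 matrices the recursion has depth 2, so it performs 7·7=49 multiplications of entries, compared with 64 for the defining formula.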

The lesson from these results is that our initial intuition about complexity may be very poor. The defining formulas of the matrix multiplication are esthetically much more pleasing than algorithms such as Strassen’s, so we tend to conjecture that they should be the optimal ones. But the “natural” way of computing a certain problem may be far from the optimal one.

The experts on matrix multiplication conjecture that it should be possible to reduce the exponent in the upper bound arbitrarily close to 2. Still, it seems unlikely that the number of operations needed for matrix multiplication is linear in $$n^{2}$$; more likely it is $$cn^{2}\log n$$ or higher, but we are not able to prove it (notice that here $$n^{2}$$ is the input size). As in Boolean circuit complexity, in algebraic circuit complexity we are also not able to prove nonlinear lower bounds on any explicitly defined algebraic function.

### Notes

1. Time constructible functions. Time constructible functions are a special concept that is needed for the Time Hierarchy Theorem. A function f defined on natural numbers is time constructible if there exists a Turing machine M which for every n, stops after exactly f(n) steps on every input word of length n.

Here is why we need such functions. When we diagonalize the sets computable in time f(x), we have to consider all Turing machines, since it is undecidable if a machine stops within such a limit. Therefore we have to truncate the simulated computation after f(n) steps. If f(x) were not time constructible, we could not do it.

All the functions that naturally appear as time bounds are time constructible. However, for some artificially defined functions the Time Hierarchy Theorem fails.

Everything can be very easily adapted to space, so I will not discuss space constructible functions.

2. The complexity of multiplication. In order to add two n-bit numbers, we need, for every position i, to add the ith bit and a carry, which requires a constant number of bit operations. Hence the asymptotic time complexity of addition is linear: cn for some constant c.

The school algorithm for multiplication of two n-bit numbers is based on computing the table of the products of the ith bit of the first number and the jth bit of the second one, for all i,j=1,…,n; then one adds the diagonals of the table. Thus we get a quadratic bound, $$dn^{2}$$ for some constant d. In 1971 A. Schönhage and V. Strassen found an algorithm that needs time only $$d'n\log n\log\log n$$, for some constant d′, [257]. This algorithm is not used in computer hardware, as d′ is fairly large and $$d'n\log n\log\log n$$ beats $$dn^{2}$$ only for fairly large numbers, but it is useful in experimental mathematics and cryptography.
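The school algorithm can be sketched as follows. The representation (lists of bits, least significant bit first) and the names `to_bits`, `from_bits` and `school_multiply` are choices made for this illustration; the function also reports how many bit products it forms, which makes the quadratic count visible.

```python
def to_bits(x, n):
    """The n lowest bits of x, least significant first."""
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

def school_multiply(x_bits, y_bits):
    """Return (bits of the product, number of bit products computed)."""
    n, m = len(x_bits), len(y_bits)
    diagonals = [0] * (n + m)
    products = 0
    for i in range(n):
        for j in range(m):
            # entry (i, j) of the table lies on diagonal i + j
            diagonals[i + j] += x_bits[i] & y_bits[j]
            products += 1
    # add up the diagonals, propagating carries to higher positions
    result, carry = [], 0
    for s in diagonals:
        total = s + carry
        result.append(total & 1)
        carry = total >> 1
    return result, products

bits, products = school_multiply(to_bits(13, 4), to_bits(11, 4))
print(from_bits(bits), products)  # 143 16
```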

3. Factoring numbers and testing primality. Deciding if a given number is prime and finding its nontrivial factors are closely related problems, but it seems that they have different complexities. We know that it is possible to decide in polynomial time if a number is prime, which is formally expressed by Primes∈P, but most researchers believe that finding factors is much harder.

For testing primality, several probabilistic polynomial time algorithms were found in the 1970s. More recently, in 2002, M. Agrawal, N. Kayal and N. Saxena found a deterministic polynomial time algorithm [2].

The fastest factoring algorithms are probabilistic. The number field sieve algorithm seems to run in time bounded by
$$\mathrm{e}^{cn^{1/3}(\log n)^{2/3}},$$
where n is the number of digits of the number to be factored and c is a constant. Thus it runs approximately in time exponential in the cube root of the length of the number, which is much more than polynomial. This bound has not been proved formally; it is only based on a heuristic argument. The best bound proved formally is only exponential in the square root
$$\mathrm{e}^{(1+o(1))(n \log n)^{1/2}},$$
where o(1) denotes a function such that o(1)→0 as n→∞. See [156] for a presentation of these algorithms.

4. Search problems. We started the informal description of the P vs. NP problem with search problems, whereas the definition speaks only about decision problems. It is not difficult to prove that if P=NP, then these problems also have polynomial time algorithms. I will show it with the example of the integer factoring problem. Consider the following decision problem derived from integer factoring: for a given number N and a string of zeros and ones w, decide if there exists a proper divisor M of N whose binary representation starts with the string w. This is clearly a decision problem and it is in NP, since we can guess such a divisor and easily verify the correctness. If P=NP, then we would have a polynomial time algorithm A for this problem and we could apply it to factor a composite number N as follows. First use A to determine the first bit of a proper divisor of N. When we already know that there exists a proper divisor that starts with w, we can determine if there exists a proper divisor starting with w0 or w1 by a single application of A. Hence we can find a divisor by repeating this at most k times, where k denotes the number of binary digits of N.
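The bit-by-bit search can be sketched as follows. No polynomial time algorithm A is known, so `has_divisor_with_prefix` below is a brute-force stand-in for it; this function, the name `find_divisor`, and the extra check whether the current prefix is already a divisor are details introduced for this illustration.

```python
def has_divisor_with_prefix(N, w):
    """Brute-force stand-in for the hypothetical algorithm A: is there a proper
    divisor of N whose binary representation starts with the string w?"""
    return any(N % M == 0 and format(M, "b").startswith(w) for M in range(2, N))

def find_divisor(N, oracle=has_divisor_with_prefix):
    """Find a proper divisor of N using only yes/no answers of the oracle."""
    w = "1"  # a binary representation always starts with 1
    if not oracle(N, w):
        return None  # N has no proper divisor
    while True:
        M = int(w, 2)
        if 1 < M < N and N % M == 0:
            return M  # the prefix itself is a proper divisor
        # otherwise some proper divisor extends w by at least one more bit
        w += "0" if oracle(N, w + "0") else "1"

print(find_divisor(91), find_divisor(15), find_divisor(13))  # 13 5 None
```

Each round makes one oracle call and lengthens the prefix by one bit, so the number of calls is at most the number of binary digits of N.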

Say that a binary relation B(x,y) defines an NP search problem, if B∈NP and for some b, for every x and y, if B(x,y), then $$|y|\le|x|^{b}$$. It is easy to generalize the example above and prove that P=NP if and only if NP search problems possess polynomial time algorithms.

5. Sets, languages and Boolean functions. We often speak about the complexity of sets, but to be quite precise we should consider only sets of strings in a finite alphabet. Recall that in computer science we call such sets languages. So for example, when I spoke about the complexity of the set of all Hamiltonian graphs, I implicitly assumed that we take a natural encoding of all graphs by strings in a finite alphabet. It is difficult to define what natural encodings are. Intuitively an encoding should be compact, i.e., should not use excessively more bits than needed and should not contain information about the encoded entity that is not easily computable. For example, an encoding of graphs in which the first bit determines whether the graph is Hamiltonian is not natural. Apparently the only way to do it formally is to define one particular encoding and say that a natural encoding is an encoding that is equivalent to a fixed one in the sense that one can be computed from the other in polynomial time. In particular, we encode numbers by their binary representation and graphs by their incidence matrices.

If we want to study the relation between languages and Boolean functions, it is best to focus on languages in the two-element alphabet {0,1}. For a language L, we consider its sections $$L\cap\{0,1\}^{n}$$ for n=0,1,2,… . Each such section is associated with the Boolean function $$f_{n}:\{0,1\}^{n}\to\{0,1\}$$ defined by
$$f_n(x)=1\quad \mbox{if and only if}\quad x\in L,$$
for $$x\in\{0,1\}^{n}$$. Thus L defines uniquely the sequence of Boolean functions $$\{f_{n}\}$$, and vice versa any sequence of Boolean functions $$\{f_{n}\}$$ such that $$f_{n}:\{0,1\}^{n}\to\{0,1\}$$ defines uniquely a language L in the alphabet {0,1}.

In the subsection about circuits I mentioned that Turing machine computations can be simulated by circuits whose size is not much larger than the time complexity of the machine. A consequence of that can be stated formally as follows.

Theorem 38 [48, 182] If L∈P, then for the corresponding sequence of Boolean functions $$\{f_{n}\}$$ there exists a sequence of circuits $$\{C_{n}\}$$ such that $$C_{n}$$ computes $$f_{n}$$ and the sizes of $$C_{n}$$ are bounded by a polynomial in n.

One can prove a quantitatively more precise theorem saying that if the time complexity of L is bounded by a function t(x), then the size of the circuits can be bounded by $$c\,t(n)\log t(n)$$ for some constant c depending on L.

6. NP-completeness. A set A is polynomially reducible to a set B if there exists a function f computable in polynomial time such that for every x,
$$x\in A\quad \mbox{if and only if}\quad f(x)\in B.$$

A set B is NP-hard (intuitively, at least as hard as every set in NP) if every set A∈NP is polynomially reducible to B.

A set B is NP-complete if it is in NP and it is NP-hard.

Call a circuit C satisfiable if there exists an assignment a to the input variables such that C(a)=1.

Theorem 39 The set of all satisfiable circuits is an NP-complete set.

Proof-sketch The fact that the set is in NP is easy: we can guess the satisfying assignment and check it in polynomial time by simulating the computation of C.

Now let A be an NP set. Suppose A is defined by the formula (5.1) on page 373. Consider inputs x of length n. The lengths of the witnesses are bounded by p(n) for some polynomial p. Without loss of generality, we can assume that all witnesses have the same length m≤p(n). Let M be a Turing machine computing the relation R in polynomial time. By transforming M into a sequence of circuits we get a circuit C(x,y) with x representing n input bits and y representing m input bits, with the following property. If u has length n, then u∈A if and only if there exists w such that C(u,w)=1. Given u, let $$C_{u}(y)$$ be the circuit C with the inputs x fixed to u. Then we have u∈A if and only if there exists w such that $$C_{u}(w)=1$$, which is by definition if and only if $$C_{u}$$ is satisfiable. Thus the mapping $$u\mapsto C_{u}$$ reduces the question whether u∈A to the question whether $$C_{u}$$ is satisfiable. One can check that the construction of $$C_{u}$$ can be done in polynomial time, hence the function is a polynomial reduction. This finishes the proof. □
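The step from C to the restricted circuit can be illustrated on a toy example. The circuit encoding below (a list of gates over numbered wires), the names `evaluate`, `fix_inputs` and `is_satisfiable`, and the small circuit `C` are assumptions made for this sketch; the satisfiability test is plain brute force over all witnesses, so it only shows what the reduction asks, not how to answer it efficiently.

```python
from itertools import product

def evaluate(circuit, assignment):
    """Wires 0..k-1 carry the input bits; each gate appends one new wire.
    A gate is ("AND", i, j), ("OR", i, j) or ("NOT", i); the last wire is the output."""
    wires = list(assignment)
    for gate in circuit:
        op, args = gate[0], [wires[i] for i in gate[1:]]
        if op == "AND":
            wires.append(args[0] & args[1])
        elif op == "OR":
            wires.append(args[0] | args[1])
        elif op == "NOT":
            wires.append(1 - args[0])
        else:
            raise ValueError("unknown gate " + op)
    return wires[-1]

def fix_inputs(circuit, u):
    """The restricted circuit: the first len(u) inputs are fixed to u."""
    return lambda w: evaluate(circuit, tuple(u) + tuple(w))

def is_satisfiable(c_u, m):
    """Brute force over all 2^m assignments to the remaining inputs."""
    return any(c_u(w) == 1 for w in product((0, 1), repeat=m))

# Toy circuit C(x0, x1, y0) = (x0 AND y0) OR (x1 AND NOT y0)
C = [("AND", 0, 2), ("NOT", 2), ("AND", 1, 4), ("OR", 3, 5)]
for u in product((0, 1), repeat=2):
    print(u, is_satisfiable(fix_inputs(C, u), 1))
```

In this toy example the restricted circuit is satisfiable exactly for the inputs u other than (0, 0).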

This is the basic NP-complete set from which the NP-completeness of all other known sets has been derived. When we have one NP-complete set, we can prove NP-hardness of other sets by defining polynomial time reductions from this set. This is conceptually much simpler, since we only need to find an algorithm. Nevertheless, many polynomial reductions are quite difficult to define and to prove correct.

In the same way one can define completeness for other classes. We know, for example, that there are PSPACE-complete problems and EXP-complete problems.

7. The Polynomial Hierarchy. We have already met the Arithmetical Hierarchy (see page 141). Let us recall that it is a hierarchy of arithmetically defined subsets of natural numbers defined as follows. The level $$\varSigma_{n}$$ consists of the sets definable by formulas with n alternations between existential and universal quantifiers, where the formula starts with an existential quantifier. The level $$\varPi_{n}$$ is defined in the same way, except that we require the defining formulas to start with a universal quantifier.

The Polynomial Hierarchy is a feasible version of the arithmetical hierarchy where we only allow quantifications limited to finite domains. The finite domains consist of numbers, or of strings of polynomial length. The levels of the Polynomial Hierarchy are denoted by $$\varSigma_{n}^{p}$$ and $$\varPi_{n}^{p}$$ (the superscript p standing for ‘polynomial’).

Let us consider a couple of examples.

1. The class $$\varSigma_{1}^{p}$$ is the familiar class NP. We know that a set A∈NP can be defined by a formula of the following form.

x∈A if and only if ∃y(|y|≤p(|x|)∧B(x,y)).

In this formula, |⋯| denotes the length and p denotes some polynomial, so the first part says that the length of y is polynomially bounded by the length of x. The binary relation B(x,y) is assumed to be computable in polynomial time.
2. The class $$\varPi_{1}^{p}$$ is defined by formulas of the form

x∈A if and only if ∀y(|y|≤p(|x|)→B(x,y)).

This class is dual to the class NP, in the sense that it is the class of complements of sets in NP. Therefore, it is also denoted by coNP.
3. The class $$\varSigma_{2}^{p}$$ is the class of all sets A that can be defined as follows.

x∈A if and only if $$(\exists y_{1},|y_{1}|\le p_{1}(|x|))(\forall y_{2},|y_{2}|\le p_{2}(|x|))\,C(x,y_{1},y_{2}),$$

where $$p_{1}$$ and $$p_{2}$$ are polynomials, and C is a ternary relation computable in polynomial time. Here I am using a shorter notation for bounded quantifiers.

An example of a set in $$\varSigma_{2}^{p}$$ is the set of all pairs (G,k) such that the clique number of G equals k. To define such pairs we must say that there exists a clique of size k and for every subset X of vertices of size k+1, X is not a clique in G. (Since we can switch the quantifiers in this example, the set is also in $$\varPi_{2}^{p}$$.)
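The quantifier structure of this example can be made explicit by a brute-force evaluation. The graph representation (a set of two-element frozensets) and the names `is_clique` and `clique_number_is` are assumptions of this sketch; both quantifiers range over exponentially many sets in general, so this shows the shape of the definition, not an efficient algorithm.

```python
from itertools import combinations

def is_clique(edges, X):
    """The polynomial time checkable part: all pairs in X are joined by an edge."""
    return all(frozenset(pair) in edges for pair in combinations(X, 2))

def clique_number_is(vertices, edges, k):
    # existential quantifier: some set of k vertices is a clique
    exists = any(is_clique(edges, X) for X in combinations(vertices, k))
    # universal quantifier: no set of k+1 vertices is a clique
    forall = all(not is_clique(edges, X) for X in combinations(vertices, k + 1))
    return exists and forall

# a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2
V = [0, 1, 2, 3]
E = {frozenset(e) for e in [(0, 1), (1, 2), (0, 2), (2, 3)]}
print([clique_number_is(V, E, k) for k in (2, 3, 4)])  # [False, True, False]
```

The two quantified conditions do not depend on each other, which is why they can be written in either order.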

If we view NP as an extension of P by adding one existential quantifier, the Polynomial Hierarchy is the hierarchy obtained by further extensions of these classes by adding more quantifiers. It has been conjectured that all the classes $$\mathbf {P}, \varSigma_{n}^{p}$$ and $$\varPi_{n}^{p}$$, for n=1,2,…, are distinct, that is, the Polynomial Hierarchy is a proper hierarchy. One can show that if $$\varSigma^{p}_{n}=\varPi^{p}_{n}$$, then all the classes with higher indices collapse to $$\varSigma^{p}_{n}$$.

One reason for believing that the Polynomial Hierarchy does not collapse is that the analogous relations do in fact hold for the class of recursive sets Rec and the levels $$\varSigma_{n}$$ and $$\varPi_{n}$$ of the Arithmetical Hierarchy. This is a rather weak argument, because some analogies apparently fail. While $$\mathbf{Rec}=\varSigma_{1}\cap\varPi_{1}$$, we conjecture that P is a proper subclass of NP∩coNP. If P=NP∩coNP were true, one would be able to invert any polynomial time computable length preserving bijection.

Several conjectures can be reduced to the conjecture that the Polynomial Hierarchy does not collapse. In particular, if NP⊆nonuniform-P, then $$\varSigma_{2}^{p}=\varPi_{2}^{p}$$. Hence, if the Polynomial Hierarchy does not collapse, then $$\mathbf {NP}\not \subseteq \mathbf {nonuniform}\mbox {-}\mathbf {P}$$.

8. Nondeterministic space. Using nondeterministic Turing machines we can define nondeterministic space classes. The relations between deterministic and nondeterministic space classes are different from the corresponding relations between time classes. One can prove that if a set is computable by a nondeterministic Turing machine in space s(x), then it is computable by a deterministic Turing machine in space $$s(x)^{2}$$. Thus if we denote the class of sets computable in nondeterministic polynomial space by NPSPACE, then we have
$$\mathbf {PSPACE}=\mathbf{NPSPACE}.$$
But this does not mean that we know all about the relation of nondeterministic and deterministic space classes. In fact the general feeling is that the above equation is not the one that corresponds to P vs. NP. The essential question, which is open like all essential questions in complexity theory, is whether more than linear increase of space is needed to eliminate nondeterminism.

9. Proving disjunctions of conjectures. It is an interesting phenomenon that in complexity theory we are able to prove several disjunctions of statements that we conjecture to be true. I have already shown examples based on sequences of inclusions where we know that the extreme terms are distinct. But that is not the only way to prove such disjunctions. Here is one based on a different argument.
$$\mathbf {P}\neq \mathbf {NP}\quad \mbox{or} \quad \mathbf {EXP}\not \subseteq \mathbf {nonuniform}\mbox {-}\mathbf {P}.$$
We believe that both are true, but we are not able to prove either of the two (though the second one seems much easier than the first one).

Here is the idea of the proof. We know that there are Boolean functions whose circuit complexity is exponential. Suppose we could, for every n, construct a truth table of such a function $$f_{n}$$ of n variables in time $$2^{cn}$$ for some constant c. (As $$2^{cn}=(2^{n})^{c}$$, this is polynomial time in the size of the truth table.) Then we could define a language in EXP which is not in nonuniform-P as follows. For an input word w of length n, first compute $$f_{n}$$ and then accept w if and only if $$f_{n}(w)=1$$.

So we only need to show that one can get such functions if P=NP. Let $$S(n)=2^{n/2}$$. This function grows faster than all polynomials and one can show that there are Boolean functions whose circuit complexity is larger. Given a truth table of a function f, to decide if f has circuit complexity less than S(n) is a problem in NP. Indeed, the input length is $$2^{n}$$, a circuit of size ≤S(n) can be encoded by a string of length $$\le 2^{n}$$, and to check that such a circuit computes f we only need to evaluate it on $$2^{n}$$ inputs and compare it with the truth table. Now consider the problem of finding such a function. This is an NP search problem. Thus if P=NP we can find f in polynomial time, which is $$2^{cn}$$. This proves the disjunction. (In fact, the proof shows more than promised: either P≠NP or there is a language computable in time $$2^{cn}$$ whose circuit complexity is as large as $$2^{n/2}$$.)

10. Some lower bounds methods for restricted classes of circuits. In the quest for proving P≠NP and showing other separations of complexity classes many interesting circuit lower bounds have been proved. But all the methods introduced to date have only a restricted range of applications. They are incapable of proving nonlinear lower bounds on general Boolean circuits. So is it worth taking time to survey them? In set theory all the major problems had been widely open until the forcing method was discovered. Then everything dramatically changed and since then proving independence became a matter of routine. This may well happen in complexity theory too and then the current weak results will be forgotten as they were in set theory. Yet there are some basic ideas that may play an important role in future lower bound techniques and, after all, the main purpose of this paragraph is to show you what a lower bound proof can look like.

(i) The method of random restrictions was introduced independently by M. Ajtai [1] and by M. Furst, J. Saxe and M. Sipser [83] in the early 1980s. It is based on the following idea. Suppose a Boolean function f is computed by a small circuit C. Pick randomly a small subset of the input variables and assign randomly zeros and ones to the rest. What we want to achieve is that after this substitution we can reduce C to a substantially simpler circuit C′. At the same time we want the restricted function f′ to be still hard. If C′ is very simple, we can see that it cannot compute f′, thus we get a contradiction with our assumption that f can be computed by a small circuit. One possible realization of this idea is to show that C′ computes a constant function (0 or 1) whereas f′ is not constant.

The reason why some circuits tend to shrink when we apply random restriction is that if we have a conjunction, then it suffices to have one of the inputs to be fixed to 0 and it becomes 0. The same is true about disjunction and the value 1. The hope is that fixing some inputs creates a chain reaction resulting in trivializing a lot of gates in the circuit.

This indeed works very well if the circuit uses conjunctions, disjunctions and negations and the number of alternations between different operations is small. But if the basis also contains the parity, this method completely fails. It is because fixing one input in x⊕y, say x, does not fix the gate to a constant; the bit y remains intact or is flipped. This method also does not work if the depth of the circuits is not bounded.
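The contrast between conjunctions and parities under a restriction can be seen on single gates. This is a minimal sketch, not the actual lower bound argument: a restriction is represented as a list whose entries are 0, 1 or `None` (variable left free), and the function names are introduced here for illustration.

```python
import random

def random_restriction(n, keep, rng):
    """Leave `keep` randomly chosen variables free (None); fix the rest to random bits."""
    free = set(rng.sample(range(n), keep))
    return [None if i in free else rng.randint(0, 1) for i in range(n)]

def restrict_and(variables, rho):
    """Conjunction of the given variables under rho:
    the constant 0 or 1, or the list of variables that are still free."""
    if any(rho[i] == 0 for i in variables):
        return 0  # a single input fixed to 0 kills the whole gate
    remaining = [i for i in variables if rho[i] is None]
    return remaining if remaining else 1

def restrict_parity(variables, rho):
    """Parity of the given variables under rho: the free variables that remain,
    together with a bit saying whether their parity is flipped."""
    flip = sum(rho[i] for i in variables if rho[i] is not None) % 2
    remaining = [i for i in variables if rho[i] is None]
    return remaining, flip

rho = [None, 0, 1, None]
print(restrict_and([0, 1, 2], rho))        # 0
print(restrict_parity([0, 1, 2, 3], rho))  # ([0, 3], 1)
```

The conjunction becomes constant as soon as one of its inputs is fixed to 0, whereas the parity gate still depends on every variable left free.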

(ii) The most useful is the approximation method introduced by Razborov in the mid-1980s. Let $$\mathcal{F}_{n}$$ denote the set of all n variable Boolean functions. Let $$\mathcal{S}$$ be a proper subset of $$\mathcal{F}_{n}$$; I will call $$\mathcal{S}$$ simple functions. The idea of the approximation method is to choose $$\mathcal{S}$$ so that
(a) it contains the initial Boolean functions;

(b) the Boolean operations used by circuits can be “well approximated” by some operations on the set $$\mathcal{S}$$;

(c) the function f for which we are proving a lower bound has only “poor approximations” in $$\mathcal{S}$$.

Then one can prove a lower bound as follows. Let C be a circuit for f. We can use the same circuit to compute in the domain $$\mathcal{S}$$ using the approximate operations. The initial functions are the same and at each step of computation the error of the approximation changes very little. Then there must be many steps in the computation because the output function of the computation in $$\mathcal{S}$$ approximates f poorly.
To describe the method in more detail, we need some more notation. The initial functions are the functions computed at initial nodes of circuits. If an initial node is labeled by a variable $$x_{i}$$, it is the function that outputs the ith bit of the input string. We can use other simple functions as the initial ones; one often uses $$\neg x_{i}$$. For every operation o from the basis of operations K that the circuit uses, we have an operation $$\bar{o}$$ defined on $$\mathcal{S}$$. For the sake of simplicity we will assume that K contains only binary operations. A subtle point is how to measure how good an approximation is. For $$f\in\mathcal{F}_{n}$$ and $$g\in\mathcal{S}$$, we define the error
$$\delta(f,g)=\bigl\{ x; f(x)\neq g(x)\bigr\}.$$
So the error is the set of input assignments for which the two functions disagree. For every pair of functions $$g_{1},g_{2}\in\mathcal{S}$$ and every operation o∈K, we define
$$\delta_o(g_1,g_2)=\delta(g_1 o g_2,g_1\bar{o} g_2),$$
the error produced when we use $$\bar{o}$$ instead of the operation o. Let
$$\varDelta = \bigl\{\delta_o(g_1,g_2); o\in K,g_1,g_2\in\mathcal{S}\bigr\},$$
the set of all errors that can occur on the operations o∈K and elements of $$\mathcal{S}$$. Finally define
$$\rho(f)=\min\Biggl\{t; \delta(f,g)\subseteq \bigcup_{i=1}^t \delta_i \ \mbox{for some }\delta_1,\dots ,\delta_t\in \varDelta \Biggr\},$$
the distance of f from $$\mathcal{S}$$. The following simple proposition is a formalization of the argument sketched above:

Proposition 10 The circuit complexity of f is at least ρ(f).
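
The induction behind this proposition can be written out as follows. The notation is introduced only for this sketch: for a node v of a circuit C computing f, let $$f_{v}$$ be the function computed at v and let $$g_{v}\in\mathcal{S}$$ be the function obtained at v when every operation o is replaced by $$\bar{o}$$. At an initial node $$f_{v}=g_{v}$$, so $$\delta(f_{v},g_{v})=\emptyset$$. If v applies the operation o to the nodes a and b, then

$$\delta(f_v,g_v)\subseteq\delta(f_a,g_a)\cup\delta(f_b,g_b)\cup\delta_o(g_a,g_b),$$

because for an input x outside the first two sets we have $$f_{a}(x)=g_{a}(x)$$ and $$f_{b}(x)=g_{b}(x)$$, and for x outside the third set $$g_{a}\,o\,g_{b}$$ and $$g_{a}\,\bar{o}\,g_{b}$$ agree on x. Unfolding this inclusion over all t non-initial nodes of C gives, for the function g obtained at the output node,

$$\delta(f,g)\subseteq\bigcup_{i=1}^{t}\delta_i \quad\mbox{with } \delta_1,\dots,\delta_t\in\varDelta,$$

hence t≥ρ(f).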

The general framework is simple, but applications of this method require ingenious choices of the components. Also I did not present it in the most general form. Some applications require to generalize it further, but it is not difficult to imagine such extensions.

The method was first applied to circuits with the basis consisting of ∧,∨. Such circuits compute only some Boolean functions, namely, they compute all monotone Boolean functions. A function is monotone if $$x_{1}\le y_{1},\dots,x_{n}\le y_{n}$$ implies $$f(x_{1},\dots,x_{n})\le f(y_{1},\dots,y_{n})$$. Razborov first proved a superpolynomial lower bound on an NP-complete function. It was very exciting, but soon he published another paper in which he proved such a bound also for a function in P. This proved that monotone circuits (circuits with only ∧,∨) are weaker than general circuits. So a different proof is needed for general circuits.

The status of this method is that although theoretically it is a universal method by which it is possible to prove exponential lower bounds on the size of general circuits, such proofs must be rather unnatural. The natural way to apply the method is the following. Let e be the minimum of the cardinalities of δ(f,g) over all $$g\in\mathcal{S}$$, and let d be the maximum of the cardinalities of δ over all δ∈Δ. Then e/d is clearly a lower bound on ρ(f), hence also a lower bound on the circuit complexity. But Razborov showed that, unlike in the monotone case, d is always large if we have a complete basis. Thus applying the method in this way we cannot even get a nonlinear lower bound.

(iii) If we only want to prove more than linear lower bounds, the variety of methods is even larger. One of the methods that was used to prove nonlinear lower bounds for a class of circuits uses the following idea. Since we know that there are hard functions, we can find one such function with a small number of variables by brute force. Then we will define our hard function by taking several independent copies of the small hard function. In typical implementation of this idea the first n/2 input bits are interpreted as a truth table of a Boolean function of log(n/2) variables. The remaining n/2 bits are split into blocks of size log(n/2) and we compute the function defined by the first n/2 on every block. Among the assignments to the first n/2 there are surely those that are the truth tables of the hardest log(n/2) variable functions.
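The typical implementation described above can be sketched as follows. Several details are assumptions of this illustration: n/2 is taken to be a power of two, each block is read as a binary number with the most significant bit first to index the truth table, leftover bits that do not fill a whole block are ignored, and the name `copies_function` is hypothetical.

```python
def copies_function(bits):
    """bits is a list of n zeros and ones, with n/2 a power of two.
    The first n/2 bits are the truth table of a function h of m = log2(n/2)
    variables; the rest is cut into blocks of m bits and h is applied to each."""
    n = len(bits)
    half = n // 2
    m = half.bit_length() - 1
    if n != 2 * half or half != 1 << m or m < 1:
        raise ValueError("n/2 must be a power of two and at least 2")
    table, rest = bits[:half], bits[half:]
    outputs = []
    for start in range(0, half - m + 1, m):
        block = rest[start:start + m]
        index = int("".join(map(str, block)), 2)  # position in the truth table
        outputs.append(table[index])
    return outputs

# n = 16: the truth table describes the parity of 3 bits; blocks are 101 and 111
parity_table = [0, 1, 1, 0, 1, 0, 0, 1]
print(copies_function(parity_table + [1, 0, 1, 1, 1, 1, 0, 0]))  # [0, 1]
```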

The reason why this does not work for general circuits is rather counterintuitive (I mentioned this “paradox” on page 42). For this method to work, we need to show that when computing independent copies of the same function the complexity adds up, but one can show that this is not always true. There are Boolean functions f(x) such that computing simultaneously f(x) and f(y) for two independent inputs x and y needs only a little bit more than computing f(x) for a single input (see [295]).

(iv) The communication complexity method uses the following idea. Given a circuit C for a Boolean function f, cut the circuit into two parts each containing one half of the input bits and see how much information must be exchanged between these two parts. If we can prove that a lot of communication must be done in order to compute the function, we get a lower bound on the number of edges of the circuit that we cut into halves.

Formally the communication complexity of a function for a given division of the input bits is defined by means of a two player game. In this game the two players cooperate in order to compute the function. Each player knows only his half of the input bits. The communication complexity is the number of bits that they exchange in the worst case when they use the best possible strategy. Notice that we completely ignore the complexity of computing the exchanged bits; we are only interested in the amount of information exchanged.

The advantage of communication complexity is that it eliminates circuits and reduces lower bounds to a combinatorial property. Unfortunately the combinatorial problems that one needs to solve in order to prove interesting separations of complexity classes are still too difficult. Before we solve the problem of explicit construction of Ramsey graphs we have little chance to solve these problems, which have the same nature, but are more difficult.

Furthermore, also in communication complexity counterintuitive things happen, as exemplified on page 42.

11. Probabilistic proofs in finite combinatorics and explicit constructions. We will start with Erdős’s lower bound on the Ramsey numbers [67], which is a prototype of all probabilistic proofs in finite combinatorics.

Consider graphs on R vertices. The number of all graphs is $$2^{R\choose 2}$$ because there are $${R\choose 2}$$ pairs of vertices and for every pair, we have two possibilities. Let X be a subset of vertices of size n. Then the number of graphs in which X is monochromatic is $$2^{{R\choose 2}-{n\choose 2}+1}$$ because on X we have only two possibilities, either to take all edges, or none. Since the number of subsets of size n is $${R\choose n}$$, we can estimate from above the number of graphs that contain a monochromatic subset of size n by
$${R\choose n}2^{{R\choose 2}-{n\choose 2}+1}.$$
Hence if
$$2^{R\choose 2}>{R\choose n}2^{{R\choose 2}-{n\choose 2}+1},$$
then there exists a graph on R vertices with no monochromatic subset of size n. Using the Stirling formula, one can show that the last inequality is implied by
$$R<2^{(n-1)/2}n/e.$$
Hence if this inequality is satisfied, such a graph exists.
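The counting inequality is easy to check numerically for small parameters. In the sketch below the function names are introduced for illustration; dividing both sides of the inequality by $$2^{{R\choose 2}-{n\choose 2}+1}$$ turns it into the equivalent condition $$2^{{n\choose 2}-1}>{R\choose n}$$, which is what the code tests. The last line checks that the inequality holds for R equal to the integer part of $$2^{(n-1)/2}n/e$$.

```python
from math import comb, e, floor

def counting_bound_holds(R, n):
    """True if 2^C(R,2) > C(R,n) * 2^(C(R,2) - C(n,2) + 1),
    written in the equivalent form 2^(C(n,2) - 1) > C(R,n)."""
    return 2 ** (comb(n, 2) - 1) > comb(R, n)

def largest_R(n):
    """Largest R >= n for which the counting argument applies (assumes n >= 3)."""
    R = n
    while counting_bound_holds(R + 1, n):
        R += 1
    return R

print(largest_R(4), largest_R(5))  # 6 11
print(all(counting_bound_holds(floor(2 ** ((n - 1) / 2) * n / e), n)
          for n in range(3, 30)))  # True
```

For n=4, for instance, the argument guarantees a graph on 6 vertices with no monochromatic subset of size 4.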

I will now describe two results in which explicit constructions improved the previous bounds obtained by nonconstructive means. The first one is closely related to the Ramsey number problem. The problem is: given numbers n, s and t, how many edges can a graph on n vertices have without containing two disjoint sets S and T of sizes s and t with all the edges between S and T. This is a typical problem from the extremal graph theory. In this branch of graph theory we often ask how many vertices or edges can a graph have without containing certain prohibited configurations. In the Ramsey number problem the prohibited configurations are complete and empty graphs of a given size.

It was proved that for constants s≤t such graphs cannot have more than $$cn^{2-1/s}$$ edges, for some constant c depending on s and t. Using a probabilistic argument, it was also shown that there exist such graphs with $$c'n^{2-2/s}$$ edges, for some constant c′>0. This is the best one can get by the probabilistic method, since for random graphs the true value is around $$n^{2-2/s}$$. Much later Kollár, Rónyai and Szabó found an explicit construction for t=s!+1 of graphs which have asymptotically $$n^{2-1/s}$$ edges [159]. So they were not only able to give an explicit construction for these values of s and t, but their construction also matches the upper bound up to a multiplicative constant. Their proof is highly nontrivial and uses results from algebraic geometry. (The extremal problem for t=s is still open.)

The second example is from a different field, the theory of error correcting codes. A code C is, by definition, a subset of strings of length n with elements in a finite set A; n is the length of the code and A is the alphabet. The elements of C are called codewords. If M is the number of codewords of C, then we can use C to code at most M different messages. The purpose of encoding is to introduce some redundancy which can be used to recover corrupted messages. It would take us too far afield to explain how this works, so I will introduce only the concepts necessary to describe the mathematical result.

The main two parameters of a code are the rate and the minimal distance. The rate is $$\rho(C)=\frac{1}{n}\log_{q} M$$, where q denotes the size of the alphabet. The distance of two codewords is the number of positions in the strings in which they differ. The relative minimal distance δ(C) is the minimal distance d of pairs of distinct codewords divided by n. The basic observation is that if an arbitrary word w has distance less than nδ(C)/2 from some codeword u, then u is the unique codeword with this property. This enables us to uniquely decode messages in which fewer than nδ(C)/2 letters are wrong.
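These parameters are easy to compute for a small code. In the sketch below a code is a list of equal-length strings, the rate is taken as $$\log_{q}M$$ divided by n (the usual normalization, under which it lies between 0 and 1), and the function names are introduced for this illustration; decoding is done by brute-force search over all codewords.

```python
from itertools import combinations
from math import log

def hamming(u, v):
    """Number of positions in which two words of equal length differ."""
    return sum(a != b for a, b in zip(u, v))

def rate(code, q):
    n = len(code[0])
    return log(len(code), q) / n

def relative_min_distance(code):
    n = len(code[0])
    return min(hamming(u, v) for u, v in combinations(code, 2)) / n

def decode(code, w):
    """The unique codeword at distance less than d/2 from w, or None if there is none."""
    d = relative_min_distance(code) * len(w)
    close = [u for u in code if hamming(u, w) < d / 2]
    return close[0] if close else None

repetition = ["00000", "11111"]  # binary repetition code of length 5
print(rate(repetition, 2), relative_min_distance(repetition))  # 0.2 1.0
print(decode(repetition, "01001"), decode(repetition, "11011"))  # 00000 11111
```

The repetition code sits at one extreme of the trade-off: its relative minimal distance is as large as possible, but its rate tends to 0 as the length grows.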

It is desirable to have both the rate and the relative minimal distance as large as possible because a large rate enables one to send more messages and a large relative minimal distance is good for correcting large errors. But there is a trade-off between the two parameters: if one is large the other cannot be. A central problem of this field is to determine this dependence precisely. More specifically, we are interested in the maximal rate that can be achieved for large code lengths as a function of the relative minimal distance.

To prove a lower bound on this function means to prove the existence of codes with such parameters. A classical result, the Gilbert-Varshamov lower bound, is proved nonconstructively and, again, it is a fairly easy counting argument. We can interpret it also as giving an estimate on random codes. Several upper bounds have been proved, but they do not match the lower bounds in any part of the range of parameters (except for the extreme points 0 and 1) for any alphabet size. In 1981 V.D. Goppa introduced algebraic geometry codes and a little later M.A. Tsfasman, S.G. Vlăduţ and T. Zink constructed certain algebraic geometric codes that beat the Gilbert-Varshamov bound for certain alphabet sizes [292]. They used an advanced part of algebraic geometry, the theory of modular curves. Different constructions were found later, but they also use a fair amount of theory.

So this is another example of proving the existence of structures with parameters better than the parameters of random ones. In both cases this was achieved by explicit constructions. In both cases, however, the problems are not completely solved. The most tantalizing question about codes is whether it is possible to break the Gilbert-Varshamov bound for the alphabet size two (the smallest size for which this is known is 49).

12. Algebraic complexity classes. Algorithms such as Strassen’s matrix multiplication can be presented as algebraic circuits. They are a natural model of nonuniform algebraic complexity. It is interesting that there is also a natural model of uniform algebraic complexity. This approach was pioneered by L. Blum, M. Shub and S. Smale [25]. In fact, they define a computation model not only for algebraic structures (such as fields and rings), but for any (first order) structure. Let $$A=(A;F_1,\dots,F_n,R_1,\dots,R_m)$$ be a structure, where A is the universe, $$F_1,\dots,F_n$$ are some functions and $$R_1,\dots,R_m$$ are some relations defined on the set A. We can associate a class of machines with A as follows. The machines are like Turing machines that work with elements of A instead of a finite set of symbols. A machine has a finite number of registers that can store elements of A and a tape for writing and reading elements of A. The program that controls the machine makes decisions using the relations of A and computes new elements using the functions of A. When the universe of the structure A is infinite, some nontrivial algorithms can be performed even without the tape.

For example, the machines that one obtains from the structure $$(\mathbb {R};+,\cdot,<)$$ can be used to formalize several basic algorithms working with real numbers. The difference between this model and Turing machines is that a Turing machine uses only a finite number of symbols, hence we can only approximate real numbers. In the Blum-Shub-Smale model, one can add, multiply and compare real numbers with infinite precision. The input and output data for such a machine are finite lists of real numbers. Thus the machine computes a function from a finite Cartesian product of real numbers into another finite Cartesian product of real numbers (provided that the machine always halts).

The Blum-Shub-Smale model is also a proper generalization of the original Turing machine. We get the original Turing machine if we take a sufficiently complex finite structure, e.g., the two-element field.

Furthermore, for every structure A, one can define the complexity classes $$\mathbf{P}_A$$ and $$\mathbf{NP}_A$$ that correspond to the classical P and NP. The question whether $$\mathbf{P}_A=\mathbf{NP}_A$$ has been resolved only for some simple structures. For the most interesting structures, $$\mathbb {R}$$ and $$\mathbb {C}$$, it is still open. It is interesting that these versions of the P vs. NP problem seem to be very much related to some classical problems in number theory. Blum, Cucker, Shub and Smale stated the following conjecture [26].

Conjecture 1 Let f(x) be a nonzero polynomial in one variable x with integral coefficients and suppose that it can be computed by an algebraic circuit of size t. Then the number of integral zeros of f(x) is at most $$(t+1)^c$$, where c is a universal constant.

In other words, the conjecture says that the number of integral zeros is polynomially bounded by the circuit complexity of the polynomial function. They proved that the conjecture implies $$\mathbf {P}_{\mathbb {C}}\neq \mathbf {NP}_{\mathbb {C}}$$.

Mentioning circuits in the conjecture gives the impression that it is more related to computational complexity than to classical problems in number theory. But there are indications that the relation to number theory is much tighter than it may seem at first glance. One such result is due to Qi Cheng [40]. Cheng considered a related conjecture and proved that it implies a classical deep result, the Strong Torsion Theorem for elliptic curves. Presently we only have circumstantial evidence that such versions of the P vs. NP problem must be difficult. We have implications, but we do not have a proof of an equivalence with a difficult number-theoretical problem. Thus it may still happen that the problem will turn out to be easy. Most researchers believe, however, that both the original P vs. NP problem and the algebraic versions are difficult.

13. Geometric complexity theory. In the late 1970s, L. Valiant proposed another algebraic version of the P vs. NP problem. He showed that his version of the problem can be reduced to a conjecture that can be stated in purely algebraic terms [297]. If the conjecture is true, then Valiant’s versions of P and NP are different.

Let $$\mathit{Det}_n$$ and $$\mathit{Per}_n$$ denote the determinant and the permanent, respectively, of n×n matrices, where we view $$\mathit{Det}_n$$ and $$\mathit{Per}_n$$ as homogeneous polynomials in $$n^2$$ variables.
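For small n both polynomials can be evaluated directly from their defining sums over permutations; the only difference is the sign attached to each term of the determinant. The following sketch is meant purely as an illustration of the two definitions (the function names are chosen here, and the exhaustive sum takes time proportional to n!).

```python
from itertools import permutations

def sign(perm):
    """Sign of a permutation of 0..n-1, computed from its number of inversions."""
    n = len(perm)
    inversions = sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])
    return -1 if inversions % 2 else 1

def term(matrix, perm):
    """Product of the entries matrix[i][perm[i]]."""
    result = 1
    for i, j in enumerate(perm):
        result *= matrix[i][j]
    return result

def det(matrix):
    """Determinant: signed sum over all permutations."""
    return sum(sign(p) * term(matrix, p) for p in permutations(range(len(matrix))))

def per(matrix):
    """Permanent: the same sum with all signs replaced by +1."""
    return sum(term(matrix, p) for p in permutations(range(len(matrix))))
```

For the matrix `[[1, 2], [3, 4]]` the determinant is 1·4 − 2·3 = −2 and the permanent is 1·4 + 2·3 = 10.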

Conjecture 2 (Valiant) There exists no polynomial p such that for all m,
$$\mathit{Per}_m(y_{ij})=\mathit{Det}_{p(m)}\bigl( \sigma(x_{ij})\bigr),$$
where σ is a projection.

A projection is a substitution of variables and constants. This problem is still open; the best lower bound on p(m) is only quadratic.

More recently, K. Mulmuley proposed an approach to Valiant’s conjecture which in principle could also lead to the proof of $$\mathbf{P}\neq\mathbf{NP}$$ in the original formulation [203]. He calls his approach geometric complexity theory. The basic idea is to replace projections by invertible linear mappings. These mappings form an important group, the general linear group $$GL_n$$. Thus substitutions are replaced by actions of the elements of this group on the polynomial $$\mathit{Det}_n$$ and one can use group representation theory.

In order to transform the problem in this way, one has to make two modifications. First we need polynomials of the same dimension. To this end, it is sufficient to replace $$\mathit{Per}_m$$ by $$z^{n-m}\mathit{Per}_m$$, where n=p(m). The second obstacle is that, although projections are linear, they are not invertible in general. This is solved, roughly speaking, by computing the permanent only approximately. In terms of representation theory it means that we ask whether $$z^{n-m}\mathit{Per}_m$$ is in the closure of the orbit (with respect to $$GL_n$$) of $$\mathit{Det}_n$$. Thus the following conjecture implies Valiant’s conjecture.

Conjecture 3 (Mulmuley) There exists no polynomial p such that for all m and n=p(m), $$z^{n-m}\mathit{Per}_m$$ is in the closure of the $$GL_n$$ orbit of $$\mathit{Det}_n$$.

This conjecture has attracted a number of mathematicians working in group representation theory and related fields. Mulmuley and other mathematicians working on this project have proved theorems that apparently go in the right direction. Unfortunately, none of the results proved so far can be interpreted as a theorem about algebraic complexity classes.

14. The “simplest” problem in algebraic complexity theory. One can easily prove the following simple fact.

Proposition 11 For every n≥1, there exists a real n×n matrix A such that for every factorization A=XY into a product of two n×n real matrices X and Y, the total number of nonzero elements in X and Y is at least $$n^2$$.

The argument used in the proof (based on elementary facts from algebraic geometry) is that one cannot parameterize the variety of n×n matrices by fewer than $$n^2$$ numbers.

The open problem is to define such matrices explicitly. The problem is simple to state, but it seems very difficult to solve. To see how little we are able to prove, let me mention that the best lower bound that one can prove for explicitly defined matrices is only of the form $$cn(\log n/\log\log n)^2$$, for some constant c>0 [87].

15. A message encoded in π. What if a picture of a circle is indeed encoded by the digits of π (as in Carl Sagan’s novel Contact)? What if some other message (of a super-civilization that can alter laws of mathematics) is encoded there? Can we find it?

The answer to this hypothetical question depends on where the message is supposed to be. If it is right at the beginning, encoded by a segment of ‘small’ digits, we will certainly notice it. We can also compute all digits of any short segment in the medium range, since the digits of π can be computed very efficiently, but we must know where the segment with the message is; we cannot search the whole medium range, so without that knowledge we would have to be extremely lucky. Thus such an experimental approach will almost certainly fail. Since the digits of π are not random numbers, it is not excluded that we can determine such a segment indirectly, using mathematical reasoning. This may be very hard, especially if we do not know the content of the message. If the message is encoded by digits at positions given by large numbers, we can only find it using theory; no experimental search can help.

## 5.2 Randomness, Interaction and Cryptography

When writing about time and space of computations I presented them as limitations: the computation is constrained by the fact that it has to finish in a certain time and must use only the given space. But one can also present them positively as computational resources to be used. When talking about randomness, it is quite natural to think about it as a resource. A randomized computation uses random numbers, or random bits, which we can visualize as coin-flips. While in the case of space, we can reuse the same memory locations several times because we can erase the information that we do not need anymore, random bits cannot be used repeatedly. Once a random bit is used, it is no longer random because the data that we process may depend on this bit.

Randomness is one computational resource that I am going to deal with in this section; another is interaction. It is much more difficult to classify and quantify interaction, as it has many forms. Interaction is present whenever a person performs computations on a powerful computer. The person gives the data to the computer and, after the computer processes it, receives the output. This is a completely trivial example; nevertheless, already in this simple case we can ask: How does the person know that they received the right answer? Can they verify the answer, and if so, why did they have to use the computer to get the answer?

This example also shows the typical assumption that one of the interacting entities has small computational power and the other has a big one. We think of these two entities as players who cooperatively perform computation in order to determine whether a given input belongs to a given set, or to determine the value of a function on a given input. The one with little computation power is usually called Verifier; the one with strong computation power, sometimes even unlimited, is called Prover. The computation is done according to some rules called the protocol. The best way to understand it is to imagine it as a game in which the goal of Prover is to persuade Verifier about a correct answer. We assume that Prover may cheat and the protocol must be designed so that Verifier can detect the lie. (Therefore Prover must cooperate.)

The complexity class NP can be presented as an extension of P by a basic form of interaction. The idea is to interpret what I called ‘guessing’ by an action of Prover. Recall that a set A is in NP, if there is an associated binary relation R such that x is in A if and only if there is a y such that R(x,y). Thus we can compute A using the following simple protocol:
1. Verifier and Prover get the input string x;
2. Prover produces a string y;
3. Verifier checks if R(x,y);
4. if so Verifier declares x to belong to A.

To get NP, one has also to specify that the string y can be only polynomially longer than x and that Verifier is, in fact, a polynomial time algorithm. On the other hand, we do not limit Prover in any way.
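The four steps of the protocol can be sketched in code. The relation R chosen here (subset sum: y lists positions whose entries add up to a target) is only an example picked for this illustration, and the exhaustive-search Prover merely stands in for an entity of unlimited power.

```python
from itertools import combinations

def relation(x, y):
    """Polynomial-time check of R(x, y): y lists distinct positions summing to the target."""
    numbers, target = x
    return (len(set(y)) == len(y)
            and all(0 <= i < len(numbers) for i in y)
            and sum(numbers[i] for i in y) == target)

def prover(x):
    """Unlimited Prover: exhaustive search for a witness (exponential time)."""
    numbers, target = x
    for size in range(len(numbers) + 1):
        for y in combinations(range(len(numbers)), size):
            if sum(numbers[i] for i in y) == target:
                return list(y)
    return []  # no witness exists; whatever Prover sends will be rejected

def protocol(x, prover=prover):
    """Steps 1-4: both get x, Prover sends y, Verifier checks R(x, y)."""
    y = prover(x)
    return relation(x, y)
```

Verifier never has to trust Prover: a wrong string y, for instance `[0, 1]` on the input `([3, 34, 4, 12, 5, 2], 9)`, simply fails the check.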

This is, of course, a trivial reformulation of the definition of NP, but viewing it from this perspective, generalizations of this concept come immediately to mind. For instance, what happens if we allow more rounds of interaction? Well, if we just allow more rounds in the protocol above, nothing happens; we get NP again. The reason is simple: since Verifier is a deterministic algorithm, Prover can compute all communication before it starts. Thus instead of communicating in rounds it can send all its answers at once. If, however, Verifier can toss coins, Prover cannot predict the communication anymore and then things become much more interesting.

This is just one case that shows that randomness is important in interactive computations. Another area of communication protocols where randomness is essential is cryptography—secure transmission of data. One of the basic tasks in cryptography is to choose keys that cannot be predicted. Without randomness it is impossible.

Cryptography is the field of applied science that is most tightly connected with complexity theory. Users of an ordinary program are happy with the information that the program is based on the fastest known algorithm for the given task. They do not care whether or not it has been proved that no better algorithm exists. In contrast to this, the information that so far nobody has broken the protocol is not satisfactory for users of a cryptographic protocol. They would very much like to have a mathematical proof that the protocol is secure. This amounts to proving that the task of breaking the protocol is computationally infeasible, which means that it requires so much time that nobody can solve it. Unfortunately, this is what theory is still not able to provide them with. Still, theory is very useful for cryptography: we cannot prove security, but we can determine reasonable assumptions which imply it. The most common of these assumptions is the conjecture that integer factoring is a computationally hard problem.

Theoretical cryptography is an extremely interesting field, but this is not the main reason for including it in this book. It turns out that the concepts and results of cryptography are also very important for complexity theory itself. In particular they are relevant for the question whether randomness in computations is helpful.

### How Can Randomness Be Helpful?

We view computers as instruments that are able to achieve formidable results due to their high speed and extreme precision. Computers are indeed very precise; if programmed correctly, they almost never err. Randomness seems to be in conflict with precision. We associate randomness with disorganized behavior which is apparently not good for solving hard problems. So, how can one make any good use of it?

One explanation is that randomness enables us to easily construct complex structures. Consider for example a finite random graph. It is the result of a random process, where for each pair of vertices, we decide randomly (we flip a coin) whether or not they are connected by an edge. The resulting graph has no regular structure; it is complex. Such a structure may have properties that we are not able to ensure using a deterministic algorithm. The structure may be a solution to our problem, or it may be a tool that we can use to solve the problem.

If we want to obtain a structure with a particular property by a random process, such structures must be abundant, otherwise the probability of finding one would be small and we can find it only by many trials. Fortunately, quite often the structure that we need to find occurs with high probability.

An example of this are quadratic nonresidues. Let p>2 be a prime. An integer r, 0<r<p, is called a quadratic residue modulo p if it is the remainder of a square (a number of the form $$x^2$$) divided by p. Formally, it means that the equation
$$x^2\equiv r\ \mathbin {\mathrm {mod}}p$$
has a solution. An integer n, 0<n<p, that does not satisfy this condition is called a quadratic nonresidue modulo p. It is an easy exercise to prove that half of the numbers between 0 and p are quadratic residues and half are quadratic nonresidues. It is very easy to find residues: clearly, $$1^2\equiv 1\ \mathbin{\mathrm{mod}}\ p$$, so 1 is a residue, and we can generate others by taking an arbitrary number x and computing the remainder of $$x^2$$ divided by p. To find a quadratic nonresidue is more difficult. There is an algorithm that given a number y, 0<y<p, decides in polynomial time whether or not y is a quadratic residue. The algorithm is based on computing the Legendre-Jacobi symbol
$$\biggl(\frac{~y~}{p} \biggr),$$
which, for p prime, is equal to 1 if y is a quadratic residue and −1 if it is a quadratic nonresidue. Thus we have a simple probabilistic algorithm for finding a quadratic nonresidue.
1. choose randomly y, 1<y<p;
2. compute $$(\frac{~y~}{p})$$;
3. if $$(\frac{~y~}{p})=-1$$ output y, otherwise go to 1.

This algorithm finds a quadratic nonresidue in the first round with probability 1/2; in general, the probability that it will need more than m rounds is $$1/2^m$$. So the algorithm finds a quadratic nonresidue after a few rounds with very high probability.
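The three steps above can be sketched as follows. One assumption should be stated plainly: instead of the reciprocity-based computation of the Legendre-Jacobi symbol, this sketch evaluates the symbol for a prime p by Euler's criterion, that is, as the power y^((p−1)/2) modulo p, which also takes polynomial time. The function names are chosen for this illustration.

```python
import random

def legendre(y, p):
    """Legendre symbol (y/p) for an odd prime p, via Euler's criterion."""
    value = pow(y, (p - 1) // 2, p)   # equals 1, 0 or p - 1
    return -1 if value == p - 1 else value

def random_nonresidue(p, rng=random):
    """Repeat steps 1-3 until a nonresidue is found; also report the number of rounds."""
    rounds = 0
    while True:
        rounds += 1
        y = rng.randrange(2, p)       # step 1: random y with 1 < y < p
        if legendre(y, p) == -1:      # steps 2 and 3
            return y, rounds
```

A call such as `random_nonresidue(10007)` typically returns after one or two rounds; the argument must be an odd prime, which the sketch does not check.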

What about finding a quadratic nonresidue by a deterministic algorithm? (I am now using the word ‘deterministic’ to stress that the algorithm does not use randomness.) A straightforward adaptation of the above algorithm is to replace the random choice of y by the systematic search y=2,3,4,… . In this way we certainly find a quadratic nonresidue; the question is only when. If p is of medium size, we cannot search all numbers between 1 and p; there are too many of them. If we use the concept of polynomial time, instead of small, medium and large numbers, then an algorithm that searches all numbers up to p is not a polynomial time algorithm. As the length of the input is $$\log p$$, a polynomial time algorithm must run in time polynomial in $$\log p$$. So for this algorithm to be polynomial time, we would need to prove that there is always a quadratic nonresidue below $$f(\log p)$$, where f(x) is some polynomial. It is quite likely that this is true, but there is no proof of it. We only know that a certain conjecture, the Extended Riemann Hypothesis, implies the stronger statement that there exists a constant c such that for every prime p, there exists a quadratic nonresidue below $$c(\log p)^2$$. Thus if the Extended Riemann Hypothesis is true, we can find a quadratic nonresidue in polynomial time, but the hypothesis may be false. (The Extended Riemann Hypothesis is, as the name suggests, a certain generalization of the Riemann Hypothesis.)
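The systematic search y=2,3,4,… is easy to write down; what is missing is a proof that it always stops early. A minimal sketch, again testing each candidate by Euler's criterion (an assumption of this illustration, as is the function name):

```python
def least_nonresidue(p):
    """Systematic search y = 2, 3, 4, ... for the least quadratic nonresidue
    modulo an odd prime p."""
    for y in range(2, p):
        # Euler's criterion: y is a nonresidue iff y^((p-1)/2) = -1 (mod p).
        if pow(y, (p - 1) // 2, p) == p - 1:
            return y
    raise ValueError("p must be an odd prime")
```

For small primes the search stops almost immediately, for example at 3 for p=7, at 5 for p=23 and at 7 for p=71. Nothing in the code guarantees such behaviour for every prime; that is exactly the unproved part.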

So in spite of the fact that half of the numbers are quadratic nonresidues, to find them by a deterministic polynomial time algorithm is an open problem.

In the next example of a problem solvable by a probabilistic algorithm in polynomial time we do not even have a conjecture about how to solve it deterministically. Let $$s(x_1,\dots,x_n)$$ and $$t(x_1,\dots,x_n)$$ be two algebraic formulas and suppose that we want to decide whether
$$s(x_1,\dots ,x_n)=t(x_1,\dots ,x_n)$$
(5.3)
is a true identity in integers. The standard algorithm that mathematicians use is to rewrite the expressions into a sum of monomials. Then it is an identity if and only if we get the same expressions on both sides. However, this procedure may result in an exponential blow up, so it is not a polynomial time algorithm. We can also test the identity by choosing some numbers $$a_1,\dots,a_n$$, evaluating both sides and checking if they give the same value. If we do not get the same value, the identity is refuted, but to prove that it is an identity we would need to use, in general, exponentially many samples. If, however, we are satisfied with showing that it is an identity with high probability, we can use an efficient randomized algorithm.

The idea of the randomized algorithm for the identity testing is based on the observation that if the equation (5.3) is not an identity, then it has a bounded number of solutions [259, 320]. The bound is exponential, but it is good enough for our purpose. It is possible to take M that is not much larger than the size of the formulas such that if we choose $$a_1,\dots,a_n$$ randomly from the interval [0,M], the probability that the two formulas evaluate to the same number will be less than 1/2. If we repeat the test several times and it always passes, we know with high probability that it is an identity.
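A minimal sketch of the randomized test follows. The formulas are represented simply as Python functions, and the parameter values (M=100, 20 trials) are arbitrary choices for this illustration rather than the bounds derived in the cited papers.

```python
import random

def probably_identical(s, t, n, M, trials=20, rng=random):
    """Evaluate both formulas at random points of [0, M]^n.

    Returns False as soon as a point refutes the identity; returns True if
    every trial passes, which means 'an identity with high probability'.
    """
    for _ in range(trials):
        a = [rng.randint(0, M) for _ in range(n)]
        if s(*a) != t(*a):
            return False
    return True

# Hypothetical example formulas in two variables.
square_of_sum = lambda x, y: (x + y) ** 2
expanded = lambda x, y: x * x + 2 * x * y + y * y
wrong = lambda x, y: x * x + y * y
```

Here `probably_identical(square_of_sum, expanded, 2, 100)` always returns `True`, while the comparison with `wrong` can pass a single trial only if one of the sampled values is 0, so twenty independent trials refute it except with negligible probability.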

### Holographic Proofs

Holographic proofs are a nice example of how randomness can help in other situations, not just for computing. It concerns the problem of verifying the correctness of documents. Signing a contract without reading it through is very dangerous—a single sentence that you miss may have disastrous consequences for you. The same concerns mathematical proofs. A referee of a mathematical paper is supposed to check the proofs. If only one step in the proof is wrong the proof is invalid. Verifying documents and proofs is time consuming and often boring, so we wish there were a way to do it more efficiently without compromising too much precision.

There exists, indeed, a method that at least in theory can achieve this goal. The essence is to use encoding of the documents in a certain way which spreads any possible error in the document over a large part of it. Then it is possible to test the correctness by looking at a constant number of randomly chosen bits of the document. The correctness is verified, of course, with some error, but the error decreases exponentially with the number of bits tested; hence one can get very high certainty with a relatively small number of bits.

Obviously, more details are needed to explain this vague idea. First, it should be stressed that the task is not only to check that the format of the document is correct, but we also need to check that it is a correct document for a given purpose. It is best to think of it as a proof P of a theorem ϕ. Then the theorem ϕ is given and we need to check that P is a proof of ϕ (by reading a constant number of bits of P), not only that P is some proof. Another important thing one should realize is that there may be many proofs of ϕ and the constant number of bits that we get does not tell us which of them we have checked. Thus it is more correct to say that we are testing the existence of a proof of ϕ.

Holography is a method of making images of 3-dimensional objects. It uses usual light-sensitive films, but it needs coherent monochromatic light, the best source of which are lasers. The image is encoded in the two dimensions available on the film and needs a similar arrangement of coherent light in order to create a 3-dimensional image of the object visible to the human eye. The relation with proofs has nothing to do with dimensions. In calling the proofs holographic people refer to the fact that in a holographic photograph the details of the objects are not encoded in localized regions. So one can take a part of the photo and it will still produce the same 3-dimensional image of the whole object, only the quality will decrease. It should be pointed out that the concept of a holographic proof is more complex; it is not just a robust way of encoding. If we only needed to be able to reconstruct the information from each sufficiently large part, we could use a much simpler means, the error correcting codes.

The existence of holographic proofs is one of the most remarkable results in complexity theory. It has important applications in proving that some NP-complete problems not only cannot be solved in polynomial time, but it is impossible even to solve them approximately (provided that $$\mathbf{P}\neq\mathbf{NP}$$). The precise statement of the result is rather technical, so I defer it to the section Notes.

### Interactive Proofs

Interactive protocols are used to define new complexity classes and give an alternative definition of some standard classes. In the two examples below we will see how one can compute more by combining polynomial time computations with interaction than one can do without it.

The first example concerns the Graph Isomorphism Problem. It is the problem to decide whether two given graphs are isomorphic. As isomorphic graphs must have the same number of vertices, we can assume that the two given graphs have the same set of vertices. The problem then is whether there is a permutation of vertices that transforms one graph onto the other (see Fig. 5.4). This is, clearly, a problem in NP for we can guess the permutation and check if it works. Note that we can ask the same problem about isomorphism for other kinds of structures. We take graphs as the canonical example because they are simple and other structures can be coded by them.

Our aim is to study not the proofs that two graphs are isomorphic, but proofs that they are not isomorphic. Proving nonexistence is usually harder, as we know, and here we want, moreover, to find a method that works for all nonisomorphic pairs. It is possible that there exists a small number of invariants that completely determine the isomorphism type of a finite graph and that can be computed, or at least guessed and checked, in polynomial time. Then such sets of invariants can be considered proofs of non-isomorphism. Such sets, however, are known only for restricted classes of graphs, so we have to use something else, which is interaction.

The two players that appear in the interactive protocol introduced by L. Babai [11] are called Arthur and Merlin (the king and his wizard). Arthur is a man without any supernatural abilities, thus he represents an entity that can only compute in polynomial time, but he does have an advantage: he can toss coins. Merlin, in contrast, is omnipotent, which means that his computational power is unlimited. In particular, he can easily recognize whether or not two graphs are isomorphic. Unfortunately, he cannot always be trusted (after all, he is a son of a devil) and that is the problem. Thus Arthur can ask Merlin any question, but he can learn something only if he can somehow verify Merlin’s answer.

One day Merlin comes to the king with two graphs G and H and wants to persuade him that they are not isomorphic. As Arthur does not trust him, they need a scheme for doing it. Here is how it works. After receiving the two graphs, Arthur sends Merlin away. He tosses a coin to choose one of them and then tosses the coin again several times to randomly permute the vertices of the chosen graph. Let the resulting graph be called F. Then Arthur calls Merlin back, presents F, and asks Merlin to tell which of the two graphs, G or H, he used to construct F. His reasoning is: if G is not isomorphic to H, then, as Merlin can recognize isomorphic graphs, he can easily answer this question correctly; on the other hand, if Merlin is trying to cheat him and the graphs are isomorphic, then F is a permutation of both graphs G and H, hence the probability that he answers correctly will be 1/2. The last probability can be reduced to $$(1/2)^k$$ if Arthur presents k such graphs. Thus Arthur can easily and with very high probability verify that the graphs are not isomorphic.
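The protocol can be simulated on small graphs. In this sketch a graph is a set of two-element `frozenset` edges on the vertices 0,…,n−1, and Merlin's unlimited power is imitated by trying all permutations; both the representation and the function names are assumptions of the illustration.

```python
import random
from itertools import permutations

def relabel(graph, perm):
    """Apply a vertex permutation to a graph given as a set of 2-element frozensets."""
    return frozenset(frozenset(perm[v] for v in edge) for edge in graph)

def isomorphic(g, h, n):
    """Brute-force isomorphism test; it stands in for Merlin's unlimited power."""
    return any(relabel(g, p) == h for p in permutations(range(n)))

def honest_merlin(g, h, f, n):
    """Answer 0 if F was made from G and 1 if it was made from H."""
    return 0 if isomorphic(g, f, n) else 1

def arthur_accepts(g, h, n, merlin=honest_merlin, k=20, rng=random):
    """Run k rounds; accept 'not isomorphic' only if Merlin is right every time."""
    for _ in range(k):
        choice = rng.randrange(2)          # coin toss: G or H
        perm = list(range(n))
        rng.shuffle(perm)                  # random permutation of the vertices
        f = relabel((g, h)[choice], perm)
        if merlin(g, h, f, n) != choice:
            return False
    return True

# Hypothetical test graphs on 4 vertices: a path and a star.
PATH = frozenset(frozenset(e) for e in [(0, 1), (1, 2), (2, 3)])
STAR = frozenset(frozenset(e) for e in [(0, 1), (0, 2), (0, 3)])
```

For the path and the star Merlin answers correctly in every round, so Arthur accepts. If H is just a relabelled copy of G, Merlin has no way to tell where F came from, and whatever strategy he uses survives k rounds only with probability (1/2)^k.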

The Graph Isomorphism Problem is a rare example of a set in NP which is neither known to be in P nor to be NP-complete. Although a general theorem says that if $$\mathbf{P}\neq\mathbf{NP}$$, then there are many such sets [176], it seems hard to find such a set that would be defined in a natural way. The protocol for non-isomorphism presented above is evidence that the Graph Isomorphism Problem is not NP-complete. Now we only need evidence that it is not in P.

### Proofs that Convey no Information

A proof conveys the fact that the theorem is true. Can this be the only knowledge that it reveals? Consider a very familiar situation: you went to a meeting to discuss a problem, you talked with people, you spent an hour or more, but after the meeting you had the feeling it was completely useless—you did not learn anything that you hadn’t known before. From your point of view it was “zero-knowledge” communication. But was it really useless? Maybe you do not realize it, but your opinions about the subjects discussed at the meeting have changed and maybe this was exactly the purpose of the meeting.

What we are actually interested in is a more concrete situation. Suppose you succeed in finding a solution to an important practical problem and you want to sell your solution. To this end you must persuade a potential buyer that you really have such a solution. But if you show the solution to somebody, you risk that they steal the idea without paying anything. Therefore, you need to be able to present some evidence that you have a solution, but the evidence should not reveal any details about the solution. Are such zero-knowledge proofs possible?

Let us start with a more basic question: what does ‘zero knowledge’ exactly mean? Surprisingly, this can be defined quite easily. Suppose A and B communicate by sending some strings of bits. When they finish, we say that it was zero-knowledge communication for A if A could have computed all the strings that A obtained without the help of B. This is fine, but why should A bother at all to talk with B in such a case? The point is that in spite of being able to compute all that A gets from B, A does learn something: A learns that B is able to compute these strings. So it is better to interpret such communication as A testing B. Think of B as a student taking an oral exam with professor A. Then A, of course, knows the answers to the questions asked; what A does not know is whether B knows the answers.

The protocol for proving non-isomorphism of graphs described above is an example of a zero-knowledge protocol. In that protocol Arthur is only testing Merlin whether he indeed knows that the graphs are not isomorphic. This gives such a protocol for the complement of one set in NP. There is a general theorem saying that every set in NP has such a protocol. This theorem is proved by presenting a zero-knowledge protocol for the NP-complete problem of coloring graphs by 3 colors and then referring to the fact that other problems in NP are polynomially reducible to it. I will sketch the protocol.

Recall that a graph G is 3-colorable if one can assign one of three colors to every vertex so that no two vertices connected by an edge have the same color. The idea of the zero-knowledge protocol is that Merlin encodes the colors of a 3-coloring so that they can be determined only using secret keys; every vertex has a different key. After receiving this code, Arthur selects randomly two vertices that are connected by an edge and asks Merlin for the keys. Thus Arthur can verify that the two vertices have different colors. They repeat this protocol many times, but each time Merlin permutes randomly the colors and uses new keys. Thus the colors that Arthur learns in different rounds have nothing in common. Hence the only information he obtains is what he expects: that Merlin knows a proper 3-coloring.
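One round of this protocol can be sketched as follows. The text leaves the encoding abstract; as a loudly labeled assumption, this sketch locks each color by hashing it together with a fresh random key using SHA-256, which is a common stand-in for a secure encoding and not something whose security is proved here. A cheating Merlin is modeled simply by an improper coloring.

```python
import hashlib
import random
import secrets

def commit(color, key):
    """Assumed encoding: SHA-256 of the secret key followed by the color."""
    return hashlib.sha256(key + bytes([color])).hexdigest()

def merlin_commit(coloring, rng):
    """Permute the three colors, then lock every vertex's color under a fresh key."""
    shuffled = [0, 1, 2]
    rng.shuffle(shuffled)
    colors = {v: shuffled[c] for v, c in coloring.items()}
    keys = {v: secrets.token_bytes(16) for v in coloring}
    locked = {v: commit(colors[v], keys[v]) for v in coloring}
    return colors, keys, locked

def one_round(edges, coloring, rng):
    colors, keys, locked = merlin_commit(coloring, rng)
    u, v = rng.choice(edges)                              # Arthur's random edge
    opened = {w: (colors[w], keys[w]) for w in (u, v)}    # Merlin reveals two keys
    consistent = all(commit(c, k) == locked[w] for w, (c, k) in opened.items())
    return consistent and opened[u][0] != opened[v][0]

def arthur_accepts(edges, coloring, rounds=50, rng=random):
    """Accept only if every round shows two different colors on the chosen edge."""
    return all(one_round(edges, coloring, rng) for _ in range(rounds))

# Hypothetical example: a 4-cycle with a proper and an improper coloring.
CYCLE = [(0, 1), (1, 2), (2, 3), (3, 0)]
PROPER = {0: 0, 1: 1, 2: 0, 3: 1}
IMPROPER = {0: 0, 1: 0, 2: 1, 3: 2}
```

In each round Arthur sees only two distinct colors whose names were freshly permuted, so the opened values from different rounds cannot be combined. With the improper coloring the bad edge is picked with probability 1/4 per round, so the cheat is caught quickly.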

In order to prove this theorem, one needs to have the means to produce secure encoding. The existence of such encodings has not been proved formally, but we believe that they do exist. It is the basic assumption used in cryptography.

### Cryptography

Cryptography is the science of secure information transmission. It studies means to efficiently transmit information to authorized subjects while preventing an unauthorized subject from learning even small parts of the transmitted messages. This field of science studies many practical problems as well as theoretical ones. The theory of cryptography is closely related with computational complexity and can be considered part of it.

Preventing an unauthorized person from reading a message can be done by arranging physically that the message can only reach the authorized people. The essence of cryptography is, however, to arrange it so that even if an unauthorized person gets hold of the message, they should not be able to get the information that the message contains. There are basically two ways how to do it. One is to prevent an eavesdropper from decoding the message by hiding the information needed to decode it. This means that without some specific knowledge, a key, it is impossible to decode the message, or it is possible to do it only with extremely small probability. Such protocols are studied using information theory. The second way is to make it impossible to decode the message using the available computers and limited time. Thus the barrier is the computational complexity of the task that an eavesdropper has to solve. In practice one has to combine both approaches.

Thus cryptography bears a special relation to computational complexity. In all other applications, large computational complexity of a problem is viewed negatively, whereas cryptography without computationally hard problems is impossible. Furthermore, for cryptography it is very desirable to actually prove the hardness of particular functions. Ideally, a cryptographic protocol should always be accompanied by a proof of security, a proof that the task of breaking the code is computationally infeasible. We know that proving computational hardness of specific Boolean functions is still beyond the reach of our mathematical means; so we are very far from this ideal. The special nature of functions needed for secure encoding makes the task even harder. We are not able to reduce these problems to the standard conjectures of computational complexity such as $$\mathbf{P}\neq\mathbf{NP}$$. The security of all known protocols is based on (apparently) much stronger assumptions.

As it happens so often in science, the concepts introduced in cryptography for practical reasons turned out also to be important for theory. They play an important role in the study of fundamental problems in complexity theory, for example, in the study of randomized computations. These concepts turned out to be very useful also in the more remote area of proof complexity, which is the subject of the next chapter. Let us have a look at the most important ones.

### One-Way Functions

We know very well that there are things that are very easy to do but very difficult, or impossible to undo. In physics there is a law that explains this phenomenon. It is the Second Law of Thermodynamics that asserts that the entropy in an isolated system can only increase with time. In plain words, if we need to restore order in some system, we must pay for it by taking away order from another system.

In mathematics a one-way function is, roughly speaking, a function f which can be efficiently computed, but its inverse cannot. The connection with thermodynamics is not superficial. Should the computed function be one-way, we must erase information during the computation of the function and that is only possible by dissipating heat. If we erase some information, then it looks very likely that it should be difficult to reverse the computation, but in fact it may be very difficult to prove it formally. Let us try a bit more precise definition using polynomial time computability.

A possible definition is that f is a one-way function if x↦f(x) is computable in polynomial time, but there is no polynomial time algorithm which from a given y computes some x such that f(x)=y, provided such an x exists. In order to avoid trivialities, we will assume that the lengths of x and y are polynomially related: one can bound the length of x (respectively of y) by a polynomial depending on the length of y (respectively of x).

The existence of such functions follows easily from the (unproved) assumption P≠NP. Indeed, if P≠NP, then there is a search problem R(x,y) such that finding y for a given x is not possible in polynomial time. (I am tacitly assuming the same conditions that were used to define the class NP: R is computable in polynomial time and the length of y is bounded by a polynomial in the length of x.) Define f(x,y)=x if R(x,y) and f(x,y)=0 otherwise. Clearly, if we were able to recover x, y from the value of the function f(x,y) in polynomial time, then we also would be able to solve the search problem in polynomial time.

The definition above is not very useful. Therefore the term ‘one-way function’ has been reserved for a more useful, but also more difficult concept. To understand the motivation behind this concept, we have to look at the applications of one-way functions in cryptography. In the basic model one has an encoding function F which, given an encoding key e and a message x called a plaintext, produces the encoded message y=F(e,x) called the ciphertext. There is also a decoding function G which from the ciphertext y and the decoding key d associated to e produces the original plaintext x=G(d,y). The secret-key systems are systems where both keys have to be secret. In such systems often the decoding and the encoding keys are the same. In public-key systems, the encoding key is public because it is assumed that the decoding key cannot be computed from it.

If one could use keys of the same length as the messages and each time use a new pair of keys, everything would be simple. In such a case there are simple coding and decoding functions that are provably secure. Their security is based only on information-theoretical arguments, which is good because we do not have to use any unproven assumptions, but such systems are highly impractical. Therefore the main problem in cryptography is to design encoding and decoding functions with short keys that can be reused many times.
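
The text does not name such a system. A standard example, assumed here for illustration, is the one-time pad, in which the key is a fresh random string as long as the message and encoding is a bitwise XOR. The function names `otp_encode` and `otp_decode` below are hypothetical; this is a minimal sketch, not a recommendation for practical use.

```python
import secrets

def otp_encode(key: bytes, plaintext: bytes) -> bytes:
    """XOR each plaintext byte with the matching key byte."""
    if len(key) != len(plaintext):
        raise ValueError("key must be exactly as long as the message")
    return bytes(k ^ x for k, x in zip(key, plaintext))

# Decoding is the same operation, because (x XOR k) XOR k == x.
otp_decode = otp_encode

message = b"attack at dawn"
key = secrets.token_bytes(len(message))   # fresh random key, used only once
ciphertext = otp_encode(key, message)
recovered = otp_decode(key, ciphertext)
```

The impracticality mentioned above is visible in the sketch: every message needs a new key of its own length, which must itself be transmitted securely.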

When researching the security of a code, one has to take into account various possible scenarios. For us, it suffices to consider the most basic type of attack. This is an attempt to break the code using pairs of several plaintexts and the corresponding ciphertexts, say, $$(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)$$. One can show that if these pairs are randomly chosen, then already for a small n both keys of the system are uniquely determined.11 In practice no messages are random; they are messages in some natural language about things currently happening. Nevertheless, what is not random for us may still look random from the point of view of the coding system. In any case, we cannot exclude that the keys are determined from a small sequence of samples. Now, since the information about the keys is present in the sequence $$(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)$$, the only way that the system can be resistant against this attack is if it is computationally infeasible to extract this information.

The task of computing the keys from samples of plaintext and ciphertext is a typical search problem. Thus one may naively propose to justify the security of a system by the commonly accepted conjecture P≠NP. Why is it naive? Firstly, we cannot use any asymptotic statements, such as P≠NP, to make claims about algorithms that only work with inputs up to some concrete finite size. Secondly, and this is the gist of this subsection, P≠NP is a conjecture about the worst-case complexity, which does not suffice for cryptographic security. P≠NP means that some X in NP is not in P. But if $$X\not\in \mathbf {P}$$, we only know that for every polynomial algorithm A, A is wrong on at least one input. Indeed, for many NP-complete sets, which are the hardest sets in NP, the decision problem is simple for random inputs. Thus such sets are easy on average, while they still may be hard in the worst case. What is needed for good cryptographic functions is that they are hard almost always; this means that for every algorithm A running in short time, the probability that it solves the search problem for a random input is negligible.

Another important thing to realize is that enemies may try to use arbitrary means to break the code. Here we are only interested in the computational aspects, so the question is what kind of computations they can use. Presently we only know about one additional means that may possibly extend the class of computations in limited time—randomness. It is possible that some problems may be solvable using randomized computations, whereas every deterministic algorithm for them would run for too long. Therefore we have to assume that the attacks on the code will be done using randomized algorithms.

After this digression to cryptography the definition of one-way functions makes much more sense. It is a theoretical concept, so one uses asymptotic bounds: polynomial upper bounds on the time of the machines and bounds on the probability of the form one over a polynomial. It is worth stating the definition quite formally, since the concept is rather subtle.

Let $$p_1,p_2,\dots$$ be an infinite sequence of positive reals. We will say that they are negligible, if there exists a function γ(n) that grows faster than any polynomial such that $$0\le p_n\le 1/\gamma(n)$$ for every n.

### Definition 11

Let f be a function mapping binary strings into binary strings such that the lengths of x and f(x) are polynomially related (as explained above). Then we say that f is a one-way function, if
1. f is computable in polynomial time (without using randomness);

2. for every randomized Turing machine M running in polynomial time, the probability, for a random x of length n, of the following event is negligible: given y=f(x) as the input, M finds a pre-image of y (an x′ such that f(x′)=y).

The complicated condition 2. formally expresses that any polynomial time randomized algorithm inverts function f with negligible probability.
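
In symbols, condition 2 can be sketched as follows, under the assumption that ρ (a symbol not used in the text) denotes the random bits of M and that M(y,ρ) denotes its output on input y. For every such M there is a function γ growing faster than any polynomial such that, for every n,
$$\Pr_{x\in\{0,1\}^n,\ \rho}\Bigl[f\bigl(M(f(x),\rho)\bigr)=f(x)\Bigr]\le \frac{1}{\gamma(n)}.$$
Note that M is only required to find some pre-image of f(x), not x itself, which is why the event compares f(M(f(x),ρ)) with f(x).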

It should be noted that although the motivation presented above was from cryptography, the concept is important also in other parts of complexity theory.

### A Complexity Class Defined Using Randomness

When studying the relation of randomized computations to non-randomized ones (deterministic computations) it is good to have a benchmark to which one could refer. As usual in complexity theory, it should be a complexity class. Probabilistic complexity classes are almost exclusively defined using the worst-case complexity, like the non-probabilistic ones, which means that we require that a probabilistic Turing machine must use limited time or space on each input. Further, for each input we specify with which probability it should be accepted.

A probabilistic Turing machine is simply a nondeterministic Turing machine (as defined on page 375); what is different is only the way we interpret it. Whereas when defining that a nondeterministic Turing machine accepts an input we only ask if there is an accepting computation, in the case of probabilistic machines we count the probability that they accept a given input.

The best is to give an example. It is the most important probabilistic class, the Bounded error Probabilistic Polynomial time, BPP.

### Definition 12

A set X is in BPP if there exists a probabilistic Turing machine M that stops on every input after polynomially many steps and such that for every input a,
1. if a∈X, then M accepts a with probability at least 2/3;

2. if $$a\not\in X$$, then M accepts a with probability at most 1/3.

If we have a set X and a machine M satisfying the definition, then M can make errors on both kinds of inputs, those that are in X and those that are not in X. The machine accepts elements of X with higher probability than non-elements. It is important that there is a gap between the two probabilities. This gap enables us to show that BPP is very close to P.

The argument is very simple. Given an input a, run M on a several times, say m times. Then accept a using “the majority vote”, meaning that we accept a if M accepted a in at least m/2 cases. Then the claim is that we will accept every a∈X with probability at least 1−ε and reject every $$a\not\in X$$ with probability at least 1−ε where ε>0 is exponentially small (with the exponential function only depending on m). Indeed, suppose first a∈X. Let $$p_a$$ be the probability that M accepts a. By definition, $$p_a\ge 2/3$$. The expected number of times that M accepts a is $$p_a m$$. By the law of large numbers, the distribution of this random variable is sharply concentrated around the expected value (most likely it will be $$p_{a} m\pm\sqrt{m}$$ and the probability drops exponentially when we go outside this region). Thus, in particular, the probability that M will accept fewer than m/2 times is exponentially small. Whence the majority vote will be correct with probability exponentially close to 1. The proof for $$a\not\in X$$ is completely symmetric.
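
The decay of the error can be checked numerically. The sketch below computes the exact binomial probability that the majority vote rejects an input that each single run accepts with probability 2/3; the function name `majority_error` and the chosen values of m are illustrative assumptions.

```python
from math import comb

def majority_error(p_accept: float, m: int) -> float:
    """Probability that fewer than m/2 of m independent runs accept,
    when each run accepts with probability p_accept."""
    return sum(
        comb(m, k) * p_accept**k * (1 - p_accept) ** (m - k)
        for k in range(m + 1)
        if 2 * k < m          # strictly fewer than m/2 acceptances
    )

# Error of the majority vote for a few (odd) numbers of repetitions.
errors = {m: majority_error(2 / 3, m) for m in (1, 11, 51, 101, 201)}
```

With a single run the error is 1/3, as in the definition of BPP; as m grows, the values in `errors` shrink roughly geometrically, which is the exponential decrease claimed above.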

The bottom line is that for all practical purposes BPP is as good as P because by using only a few repetitions we can make ε so small that an error will never occur in practice. Hence the class of feasibly computable sets is not P, but BPP! That said, most likely the following is true:

### Conjecture

BPP=P.

However, if the conjecture is true, it does not mean that randomness is completely useless. The conjecture only says that we can eliminate randomness by at most polynomially increasing the time. It is likely that some problems need, say, quadratic time if computed deterministically, but only linear time when computed probabilistically. In large scale computations done in practice this can make a big difference.

To understand the reasons why most researchers believe in this conjecture, we must learn about the ways to eliminate randomness from computations.

### Pseudorandomness—Imitation of Randomness

Consider a problem for which we only have a probabilistic polynomial time algorithm. For such a problem, there is a good chance that one can find a deterministic polynomial time algorithm because this has happened in many cases. But the deterministic algorithms are as a rule based on different ideas than the original probabilistic algorithms. For example, the recently discovered polynomial time algorithm for testing primality uses an idea different from all previously found probabilistic algorithms. So our experience does not suggest any general method of eliminating randomness. There is, however, a simple idea that potentially may work in general. It is based on the assumption that it is possible to replace random bits by bits generated deterministically in a suitable way, by pseudorandom bits.12

In fact, if you program your computer to run a probabilistic algorithm, it will do almost exactly that: it will use the current time as a random seed, but then it will compute the required random numbers deterministically. For most applications, it works well, but it cannot be used if security is at stake. The random number generators used in most implementations are very simple; they pass simple statistical tests, but a bit more sophisticated test can easily discover that the sequence of numbers is not random.
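
This behaviour is easy to observe with Python's standard `random` module, used here only as an example of a typical library generator. Two generators seeded with the same value produce identical sequences, so the only unpredictable ingredient is the seed.

```python
import random
import time

seed = int(time.time())          # the current time serves as the random seed
first = random.Random(seed)
second = random.Random(seed)     # same seed, separate generator object

run_a = [first.randrange(100) for _ in range(10)]
run_b = [second.randrange(100) for _ in range(10)]
# run_a and run_b are identical: after seeding, everything is deterministic.
```

Python's documentation itself warns that this module should not be used for security purposes and points to the `secrets` module instead.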

The basic question is whether pseudorandomness is a mathematical concept, i.e., whether there is a precise mathematical definition of this concept, or it is only a vague intuitive concept. If we consider sequences of n zeros and ones, then probability theory does not distinguish between them—the sequence of n zeros is as random as any other. When we talk about one random sequence in probability theory, we in fact refer to all sequences. Thus the first important thing to learn is that using complexity theory one can define pseudorandomness.

Before stating the definition, I will present some arguments that more or less uniquely lead to the definition. Let us focus on sequences of n bits. First we observe that we need at least a little bit of true randomness. If we produced a sequence of bits without any random seed, it would always be the same sequence, thus one would easily recognize that it is not a random sequence. Therefore we do not talk about pseudorandom sequences of bits, but rather pseudorandom generators. A pseudorandom generator produces a sequence of n bits from a sequence of m bits for some n>m. We say that a generator stretches m random bits to n pseudorandom bits. I will explain later how much a generator should stretch a given sequence of random bits.

The main idea of the definition of pseudorandom generators is to replace simple statistical tests by all tests that can be computed in polynomial time. Thus a test is any polynomial time algorithm that outputs zero or one. When a test is simple, we know what we should obtain when testing a pseudorandom generator. For example, if we just test how often a given subsequence of k bits occurs in the sequence produced by the generator, we know that the frequency should be $$1/2^k$$. But what should we do with a test that is based on an algorithm that we do not know? The solution is simple: we do not need to be able to compute the probability that the test accepts a random sequence of length n, we just test pseudorandom sequences and random sequences, and then we compare the frequencies that we have obtained. If we have a good pseudorandom generator, the frequencies should be very close. It is quite natural to use polynomial time algorithms as tests, but for certain technical reasons it is better to define pseudorandom generators using polynomial size Boolean circuits.

Having defined tests we can now define that P is a pseudorandom generator, if it is computable in polynomial time and it passes all tests. To pass a test defined by a circuit C means that there is a negligible difference between the probabilities with which C accepts the outputs of P and with which it accepts random sequences. Here is a slightly more formal definition.

### Definition 13

A pseudorandom generator is a function P computable by a deterministic polynomial time Turing machine such that for every m,
1. P maps the set of zero-one strings of length m into the set of zero-one strings of length n, for some m<n;

2. for every polynomial size circuit C, the difference between the probability that C(P(x))=1 for a random string x of length m and the probability that C(y)=1 for a random string y of length n is negligible.

(The meaning of ‘negligible’ is the same as in Definition 11.)
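
To see what failing a test looks like, the following sketch compares the two acceptance probabilities of condition 2 by exhaustive enumeration. The generator `toy_generator` and the test `parity_test` are hypothetical and deliberately weak; they are not taken from the text.

```python
from itertools import product

def toy_generator(x):
    """Stretch m bits to m+1 bits by appending the parity of x.
    (Hypothetical and deliberately weak; not a pseudorandom generator.)"""
    return x + (sum(x) % 2,)

def parity_test(y):
    """Test C: output 1 iff the last bit equals the parity of the others."""
    return 1 if y[-1] == sum(y[:-1]) % 2 else 0

def acceptance_gap(generator, test, m, n):
    """|Pr[test(generator(x)) = 1] - Pr[test(y) = 1]| for uniform x of
    length m and uniform y of length n, computed by full enumeration."""
    on_generator = sum(test(generator(x)) for x in product((0, 1), repeat=m)) / 2**m
    on_random = sum(test(y) for y in product((0, 1), repeat=n)) / 2**n
    return abs(on_generator - on_random)

gap = acceptance_gap(toy_generator, parity_test, m=8, n=9)
```

The test accepts every output of the toy generator but only half of all random strings, so the gap is 1/2 for every m. That is far from negligible, and the toy generator therefore violates the definition.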

Let us play with “our new toy”. Let X∈BPP and let M be a probabilistic Turing machine that accepts X as required by the definition of BPP. In general, M gets random bits (‘tosses a coin’) during the computations, but it is not difficult to see that equivalently, M can get all random bits at the beginning, store them on the tape and then use them, one by one, as needed during the computation. Thus M is in fact a Turing machine that gets two inputs, a and the random bits. Suppose a∈X and M needs n random bits for the computation on a. By definition, the probability that M accepts a is p≥2/3.

Now suppose that we have a pseudorandom generator P that stretches m bits to n bits. Then we can try to run M using the pseudorandom bits produced by P. If we think of a being fixed, then we can think of M as a test, a test that checks pseudorandomness of the bits. If p′ is the probability that M accepts a with pseudorandom bits from P, then p′ must be very close to p, when n goes to infinity. Thus p′≥2/3−ε, where ε→0 when n→∞. The argument is completely symmetric for an input $$b\not\in X$$. So, if we denote by q′ the probability that M accepts b using pseudorandom bits, then q′≤1/3+ε, where ε→0 when n→∞. For n sufficiently large, we get q′≤1/3+ε<2/3−ε≤p′; so we have a constant size gap between q′ and p′. Hence pseudorandom bits are as good as random ones!

But is it of any practical interest? According to the definition, we only know m<n, so it can be just n=m+1. In such a case we have saved only one random bit, which is not a big deal! If we want to get rid of random bits, the crucial question is how much pseudorandom generators can stretch the input bits. Fortunately, there is a very simple way to get larger stretching: to compose the generators. If we compose k pseudorandom generators, each stretching only by one bit, we get a generator that stretches the input by k bits. This works as long as k is bounded by a polynomial in m. In this way we can save a substantial part of random bits.
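
The composition itself is mechanical, as the following sketch shows. The function `stretch_one_bit` is a hypothetical stand-in for a generator that maps m bits to m+1 bits for every m; it derives the extra bit from SHA-256 purely for illustration, and nothing about its pseudorandomness is claimed or proved.

```python
import hashlib

def stretch_one_bit(bits: tuple) -> tuple:
    """Hypothetical stand-in for a generator mapping m bits to m+1 bits.
    The extra bit comes from SHA-256; nothing here is proved secure."""
    digest = hashlib.sha256(bytes(bits)).digest()
    return bits + (digest[0] & 1,)

def compose(seed: tuple, k: int) -> tuple:
    """Apply the one-bit stretch k times: m bits in, m + k bits out."""
    out = seed
    for _ in range(k):
        out = stretch_one_bit(out)
    return out

seed = (1, 0, 1, 1, 0, 0, 1, 0)      # m = 8 seed bits
stretched = compose(seed, k=24)       # 8 + 24 = 32 output bits
```

Only the shape of the construction is meant to carry over: each application adds one bit, so k applications stretch the input by k bits, and the running time stays polynomial as long as k is polynomial in m.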

If we want to show that randomness can be eliminated, we have to get rid of all random bits. This means that eventually we have to get rid also of the random seed of the generators. Here the idea is also simple: run the algorithm for all random seeds. To see how it works, consider again the same situation as above, where a∈X and we used a pseudorandom generator P stretching m bits to the number of random bits that the machine M needed. We concluded that M accepts a with probability p′≥2/3−ε. This means that of all the $$2^m$$ strings produced by P, at least $$(2/3-\varepsilon)2^m$$ will let M accept. Similarly, if $$b\not\in X$$, the number of strings that will let M accept b will be at most $$(1/3+\varepsilon)2^m$$. Hence the simple majority vote rule will decide whether we should accept the input.
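
The enumeration over all seeds can be sketched as follows. The machine `toy_machine` and the generator `toy_generator` are hypothetical stand-ins chosen so that the example runs; in particular the toy generator merely repeats its seed and is not pseudorandom in the sense of Definition 13.

```python
from itertools import product

def derandomize(machine, generator, seed_length, a):
    """Run machine(a, bits) on generator(seed) for every seed of the given
    length and accept iff at least half of the runs accept."""
    accepts = 0
    for seed in product((0, 1), repeat=seed_length):
        if machine(a, generator(seed)):
            accepts += 1
    return 2 * accepts >= 2**seed_length

def toy_machine(a, bits):
    """Hypothetical bounded-error decider for 'a is even': it answers
    wrongly exactly when its first two random bits are both 1."""
    truth = (a % 2 == 0)
    return (not truth) if bits[0] == 1 and bits[1] == 1 else truth

def toy_generator(seed):
    """Hypothetical stretch from m to 2m bits (it simply repeats the seed)."""
    return seed + seed

results = [derandomize(toy_machine, toy_generator, 4, a) for a in range(6)]
```

The toy machine errs on a quarter of the seeds, so the majority over all 16 seeds is always correct. The loop makes the cost explicit: the deterministic procedure runs the machine once per seed, that is, a number of times exponential in the seed length.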

In order to obtain a polynomial time algorithm, we would need $$2^m$$ to be bounded by a polynomial in the input length. Let n denote the input length; then it means that m should be bounded by c log n, for a constant c. A probabilistic polynomial time algorithm on such inputs may need polynomially many random bits. Thus we need a generator that stretches the input exponentially, from c log n to p(n), for some polynomial p. Such generators do not satisfy our definition of pseudorandom generators. The problem is that we measure the time as the function of the input length. Then no function that stretches the length exponentially can be polynomial time computable. Nevertheless, this approach still makes sense; only the definition is too restrictive. For the purpose of derandomization, it suffices to bound the time by a polynomial in the output length. After we change the definition in this way, we have to make another change: we must bound the size of circuits used as a test by a specific polynomial.

The technical details of the more general definition of pseudorandomness are not important. It suffices that we know that it can be done and that such generators can derandomize BPP. Thus, in particular, the problem of proving P=BPP can be reduced to a construction of pseudorandom generators of a certain type.

### Derandomization and Proving Lower Bounds

We do not know if pseudorandom generators exist. A necessary condition for their existence is that P≠NP, but we also do not know if the existence of pseudorandom generators can be proved from this conjecture. It seems that the conjecture that they exist is stronger than the conjecture that P≠NP. But we do know some interesting facts about them.

The first remarkable fact is that the existence of pseudorandom generators is equivalent to the existence of one-way functions (in the sense of Definitions 11 and 13). One of the implications is very simple: every pseudorandom generator that stretches m bits to 2m bits is a one-way function. Here is a sketch of a proof. Suppose the contrary, that a pseudorandom generator P is not a one-way function. It means that we can invert it with non-negligible probability. Stated formally, one can compute in polynomial time a function g such that
$$P\bigl(g\bigl(P(x)\bigr)\bigr)=P(x)$$
is true with non-negligible probability. But if we take y a random string of 2m bits, then the probability that it is in the range of P is at most $$2^{-m}$$, whence the probability that
$$P\bigl(g(y)\bigr)=y \tag{5.4}$$
is also at most $$2^{-m}$$. Thus (5.4) defines a test that can distinguish between random strings y and P(x); so P is not a pseudorandom generator.
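
The bound on the probability that a random y lies in the range of P is a counting argument. Written out, with range(P) as shorthand introduced here for the set of outputs of P on inputs of length m:
$$\Pr_{y\in\{0,1\}^{2m}}\bigl[y\in\operatorname{range}(P)\bigr]=\frac{|\operatorname{range}(P)|}{2^{2m}}\le\frac{2^{m}}{2^{2m}}=2^{-m},$$
since P has only $$2^m$$ inputs of length m and hence at most $$2^m$$ distinct outputs.
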
To construct a pseudorandom generator P from a one-way function f is much more difficult, so I will consider only a special case, where the one-way function satisfies a couple of additional conditions. The first condition is that f is a one-to-one mapping from the set of n bit strings into itself. In other words, f is a permutation of the set $$\{0,1\}^n$$. The second condition is that for $$x\in\{0,1\}^n$$, the first bit of x, denoted by $$x_1$$, is not predictable from f(x). (This can be defined precisely, but I will omit this definition in the informal presentation here.) If these two conditions are satisfied, we can define a pseudorandom generator by
$$x_1 x_2\dots x_n \mapsto x_1 y_1 y_2\dots y_n,$$
where $$y_1 y_2\dots y_n=f(x_1 x_2\dots x_n)$$. So it is a pseudorandom generator that stretches n bits to n+1 bits. The assumption about unpredictability of the first bit is quite natural. A cryptographically good one-way function should hide as much as possible about the input, in particular, all input bits should be unpredictable from the output.

The fact that the existence of pseudorandom generators is equivalent to the existence of one-way functions shows that these concepts are good and that it is natural to conjecture that they exist. Yet, we would prefer to find a relation to a conjecture about basic complexity classes. A lot of progress in this direction has been achieved in the last two decades. I will describe what I think is the most interesting one of these results. This result gives a statement that seems very plausible and which implies that BPP=P. Pseudorandom generators are not explicitly mentioned, but they play the crucial role in the proof.

Recall that we view circuit complexity as the nonuniform version of Turing machine time complexity. We know that sets in P can be computed by polynomial size circuits. We have also observed that in general, if A can be computed in time t(n), then it can be computed using circuits of size $$c\,t(n)^2$$, where c is a constant. It seems that this relation cannot be substantially improved. The result says that if this is indeed the case, then BPP=P. The precise statement is in the following theorem of R. Impagliazzo and A. Wigderson [136].

### Theorem 40

If there exists a set A of 0-1 strings such that
1. A is computable in time $$2^{cn}$$, where c is a constant, and,

2. for every n, the circuit complexity of the set $$A\cap\{0,1\}^n$$ is at least $$2^{\delta n}$$, where δ>0 is a constant,

then BPP=P.

We can view this theorem also from the perspective of proving lower bounds. Most research on lower bounds focuses on circuit complexity of Boolean functions. Since we know that there are Boolean functions that have exponential circuit complexity, the problem is to prove it for explicitly defined Boolean functions. Explicitly defined is an intuitive concept that we can interpret in various ways, but it always means some restriction on the complexity. If, for instance, we interpret it as NP, then proving superpolynomial lower bounds would imply P≠NP. If we interpret ‘explicitly defined’ as ‘computable in time $$2^{cn}$$’, then proving exponential lower bounds on an explicitly defined function would give BPP=P.

Thus this result shows what most researchers agree on: proving lower bounds on the circuit complexity of Boolean functions is the central problem in complexity theory. Some recent results confirm it even more. Although the converse to the implication in the theorem above is not known, there are some partial results that show that BPP=P implies certain lower bounds on circuit complexity [145]. Thus it is impossible to derandomize BPP without proving some nontrivial lower bounds.

### Two Important Functions

The methods developed for derandomization enable us to construct one-way functions and pseudorandom generators from any sufficiently hard functions. Although we cannot prove their hardness formally, to find suitable candidates is not a problem. For example, a natural candidate for a set A that satisfies the conditions of Theorem 40 is the well-known NP-complete set SAT, the set of satisfiable Boolean formulas. That said, this is completely useless for practical applications, especially for cryptography. When a construction is based on a series of complicated reductions, its desirable properties will manifest only for very large input lengths, which occur rarely in practice. What is needed are functions that exhibit these properties for inputs of fairly small lengths. There are two functions that are prominent among all proposed candidates for one-way functions. They are multiplication of integers and exponentiation modulo a prime.

The inverse problem to multiplication is factoring, which I have already mentioned several times. For multiplication to be a one-way function, we need not only to know that the factoring problem is hard, but also that it is hard for almost all inputs. This is clearly not true: every other number is even, hence the product of two random numbers of a given length n is even with probability 3/4; to find a factor of an even number is trivial. Therefore, it has been proposed to restrict the domain to pairs of prime numbers. Let an input length n be given, then one considers the function:

input: p, q primes of length n

output: pq

In all known algorithms the most difficult numbers to factor are those that are products of a small number of large primes whose differences are also large. In particular, the most difficult seems to be to factor the product of two large primes p and q, with |p−q| large. It is possible that multiplication restricted to pairs of primes is indeed a one-way function.
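
The asymmetry between the two directions can be felt even with a naive inversion method. The sketch below multiplies two primes and then recovers them by trial division; trial division is used only because it is the simplest method to write down, not because it is the best known, and the function names and the two primes are choices made for this illustration.

```python
def multiply(p: int, q: int) -> int:
    """The easy direction: a single multiplication."""
    return p * q

def factor_by_trial_division(n: int):
    """The hard direction, done naively: try divisors up to sqrt(n).
    The number of steps grows exponentially in the bit length of n."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return None  # no proper divisor found: n is prime (or n < 4)

p, q = 104729, 1299709        # two primes chosen for illustration
n = multiply(p, q)
factors = factor_by_trial_division(n)
```

For these small primes the loop finishes after about a hundred thousand steps. Doubling the number of digits of p and q roughly squares that count, while the cost of the multiplication barely changes.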

The second function also comes from number theory. If p is a prime number, then there exists a number g such that every number 1≤y≤p−1 can be uniquely expressed as the remainder of $$g^x$$ when divided by p for some 0≤x≤p−2; we write $$y=g^x \bmod p$$. Such a g is called a generator of the multiplicative group modulo p. Since this is a one-to-one mapping from {0,1,…,p−2} onto {1,2,…,p−1}, the inverse function is defined, and since it is the inverse to exponentiation, it is called the discrete logarithm; we write $$x=\log_g y$$.

It is not quite obvious that exponentiation modulo a prime is computable in polynomial time. One cannot do it by first computing $$g^x$$ and then computing the remainder. It is necessary to use modular arithmetic, namely, to take the remainder after each step of computation because otherwise one would have to use exponentially long numbers. Also we cannot perform x multiplications; one has to use the trick of ‘repeated squaring’ (see page 431). Finding a generator g is another interesting problem that one has to solve in practical applications; unfortunately, it would take us too far afield to discuss it.
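
A minimal sketch of repeated squaring with reduction after every step follows. The function name `mod_exp` is hypothetical; Python's built-in three-argument `pow(g, x, p)` computes the same value and is what one would use in practice.

```python
def mod_exp(g: int, x: int, p: int) -> int:
    """Compute g**x mod p by repeated squaring, reducing modulo p after
    every multiplication so intermediate values stay below p*p."""
    result = 1 % p
    base = g % p
    while x > 0:
        if x & 1:                     # current lowest bit of the exponent
            result = (result * base) % p
        base = (base * base) % p      # square for the next bit
        x >>= 1
    return result
```

The loop runs once per bit of x, so the number of multiplications is proportional to the length of x rather than to x itself, and no intermediate number is ever longer than about twice the length of p.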

What is more important is the complexity of the inverse problem, the discrete logarithm. Again, some algorithms are known, but none of them runs in polynomial time. Since in number theory it is a well-known problem, studied for a long time, and we still do not have efficient algorithms, it is reasonable to conjecture that there is no polynomial time algorithm.

The most interesting property of the discrete logarithm function is that when it is hard in the worst case, then it is hard also on average. This is an extremely desirable property and in all other proposed candidates it is difficult to find a good justification for the hardness on average. In the case of the discrete logarithm we can actually prove it, provided we have hardness in the worst case, moreover the proof is very easy. Here is a sketch of the proof.

Suppose M is an algorithm that solves the discrete logarithm for random inputs. Thus for a randomly chosen y, 1≤y≤p−1, it computes $$\log_g y$$ with probability at least $$n^{-c}$$, where n is the input length (the number of bits of p) and c is a constant. We define M′ which solves with the same probability the problem for every given y′, 1≤y′≤p−1. M′ first takes a random r, 0≤r≤p−1, and computes $$y=y'g^r$$. Then it applies M to y; let x be the number computed by M. Then M′ outputs x−r.

To show that M′ computes $$\log_g y'$$ with probability at least $$n^{-c}$$, suppose that in the subroutine M succeeds in finding $$x=\log_g y$$. Then $$x=\log_g (y'g^r)=\log_g y'+\log_g g^r=\log_g y'+r$$. Thus, indeed, M′ finds $$\log_g y'$$. To get a probability arbitrarily close to 1, we only need to repeat the subroutine M polynomially many times.
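
The reduction can be tried out on a small example. In the sketch below the prime 101, the generator 2, and the solver `weak_solver` (which succeeds only on inputs y≤25 and is simulated with a lookup table) are assumptions made for illustration. The random r is drawn from 0,…,p−2 so that the shifted instance is exactly uniform, and the result x−r is reduced modulo p−1.

```python
import random

P, G = 101, 2   # a small prime and a generator modulo it (illustrative values)

# Table of discrete logarithms, used only to simulate the solver M.
LOG = {pow(G, x, P): x for x in range(P - 1)}

def weak_solver(y):
    """Stand-in for M: succeeds only on about a quarter of the inputs."""
    return LOG[y] if y <= 25 else None

def randomized_solver(y_prime, rng, attempts=200):
    """M': shift the instance by a random exponent r, ask M, undo the shift."""
    for _ in range(attempts):
        r = rng.randrange(P - 1)                 # random r with 0 <= r <= p-2
        y = (y_prime * pow(G, r, P)) % P         # y = y' * g^r mod p
        x = weak_solver(y)
        if x is not None:
            return (x - r) % (P - 1)             # log y' = x - r mod (p-1)
    return None

rng = random.Random(0)
answers = {y: randomized_solver(y, rng) for y in range(1, P)}
```

Although the simulated M fails on three quarters of all inputs, M′ answers correctly on every input, because each shifted instance y is uniformly distributed regardless of y′. The `attempts` loop is the repetition mentioned at the end of the proof.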

Our only reason to believe that factoring and the discrete logarithm are hard functions is the lack of efficient algorithms in spite of the deep insight of number theorists into these problems. Unlike in the case of NP-complete problems, where we know that they are the most difficult ones in the class NP, we do not have any result of this kind for these two functions. Some classes, in particular complexity classes of randomized computations, do not seem to have complete problems; the same is probably true for one-way functions and pseudorandom generators.

### Notes

1. Probabilistic primality tests. As the polynomial bound in the Agrawal-Kayal-Saxena test is still quite large, in practice it is better to use faster probabilistic tests discovered before. One of these tests is due to Solovay and Strassen [282]. Given an N which is to be tested, we randomly choose an a, 1≤a<N and test

a. (a,N)=1 (are a and N coprime)?

b. $$a^{(N-1)/2}\equiv \left(\frac{a}{N}\right) \pmod N$$?

If N passes both tests then it is a prime with probability at least 1/2. The precise meaning of this statement is: if N is not a prime, then at most 1/2 of all numbers 1≤a<N satisfy both conditions, and if N is a prime, then all a satisfy the two conditions. In this test the only random part is the choice of a; the expressions in the two conditions can be computed in polynomial time. (For computing the Legendre-Jacobi symbol it suffices to recursively use the well-known relations, one of which is the Quadratic Reciprocity Law.)
If we repeat the test twice we get:

a. if N is a prime then it is accepted with probability 1;

b. if N is not a prime, then it is accepted with probability at most 1/4.

This formally proves that the set of primes is in the class BPP (we know that it is, in fact, in P, which is contained in BPP). Notice that the first condition is much stronger than is required in the definition of BPP because in the first case there is zero error. Many concrete problems that are in BPP satisfy this stronger condition.

As in the case of finding quadratic nonresidues, one can turn some probabilistic primality tests into deterministic polynomial time algorithms if certain number-theoretical conjectures are true. Such a test based on the Extended Riemann Hypothesis was proposed by G.L. Miller [198]. A proof of the Extended Riemann Hypothesis, if ever found, will surely be much more difficult than the proof of the correctness of the Agrawal-Kayal-Saxena algorithm.

2. Probabilistic Turing machines and hardness on average. An alternative way of defining probabilistic Turing machines is to equip deterministic machines with an additional tape for random bits. Then we count the probability of acceptance of a given input when the tape contains random bits.

Sometimes people confuse this model with a different problem. Suppose a set A and a deterministic machine M are given. Determine, for a random input, the probability that the machine correctly decides if the input is in A. The same problem is more often studied for Boolean circuits and Boolean functions. We say that a function $$f:\{0,1\}^n\to\{0,1\}$$ is hard on average if for every circuit C of subexponential complexity, the probability that C accepts or rejects a random input correctly is less than 1/2+ε, where ε→0 as n→∞.

3. Randomized circuits. It is also possible to study randomized circuits. Such a circuit C has input variables of two types. One type serves as the ordinary input variables $$x_1,x_2,\dots,x_n$$, the others are for random bits $$r_1,r_2,\dots,r_m$$. For a given assignment of values to $$x_1,x_2,\dots,x_n$$, we ask what the probability is that C accepts (= computes 1) for a random string $$r_1,r_2,\dots,r_m$$. If C computes a Boolean function f with bounded error (as in the definition of BPP: if f(x)=1 then C(x,r)=1 with probability ≥2/3 and if f(x)=0 then C(x,r)=1 with probability ≤1/3), then we can decrease the probability of error exponentially by taking several copies of the circuit with independent random bits and computing the majority vote. In this way one can construct a circuit whose probability of error is less than $$2^{-n}$$. When the error is so small, the fraction of random strings that give a wrong answer on at least one of the $$2^n$$ inputs is less than $$2^n\cdot 2^{-n}=1$$; hence there exists one string $$r_1,r_2,\dots,r_m$$ with which C computes correctly for all $$2^n$$ inputs. Thus we can eliminate randomness by only polynomially increasing the size of the circuit.

Therefore, randomness is not interesting in the context of non-uniform complexity.
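
To see how quickly the majority vote drives the error down, one can compute the error probability exactly. This is a small sketch assuming that the individual runs err independently with probability exactly 1/3 (the worst case allowed by the bounded-error condition); the helper names are illustrative.

```python
from fractions import Fraction
from math import comb


def majority_error(k, p=Fraction(1, 3)):
    """Exact probability that more than half of k independent runs are wrong,
    when each run errs with probability p (k is assumed to be odd)."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range((k + 1) // 2, k + 1))


def runs_needed(n, p=Fraction(1, 3)):
    """Smallest odd k for which the majority vote errs with probability < 2**-n."""
    k = 1
    while majority_error(k, p) >= Fraction(1, 2**n):
        k += 2
    return k
```

Once the error is below 2⁻ⁿ, summing it over all 2ⁿ inputs gives a total below 1, which is the counting step that lets one fix a single good string of random bits.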

4. More on holographic proofs. Here is the definition of holographic, or polynomially checkable, proofs. I continue with the example of proofs and formulas. A system of holographic proofs consists of a suitable definition of proofs and an algorithm A. The algorithm is probabilistic and has two parameters: a constant 0<α<1 and an integer c≥1. Given a formula ϕ and a text P, A computes c bit positions in P and reads these bits, say $$b_1,b_2,\dots,b_c$$. Using only these bits and the formula ϕ, it either accepts or rejects. Furthermore,

(a) if P is a correct proof, then A always accepts P;

(b) if there exists no proof of ϕ, then A rejects P with probability at least α.

Such proofs can be constructed not only for the particular case of formulas and their proofs, but for any problem in NP. Recall that a set A is in NP if it is defined using a search problem, namely, x∈A if the associated search problem has a solution. Testing the existence of a certain object is the heart of the matter. So one first proves that an NP-complete problem has polynomially checkable proofs and then uses reductions to prove that all NP problems have such proofs. The fact that every set in NP has polynomially checkable proofs is the famous PCP Theorem [7, 9].

5. Public key cryptography. The most interesting concept in cryptography is a public key system. These systems are not only very interesting from the point of view of theory, but they are also the most useful applications of cryptography.

Let us recall the standard setting. We have an encoding function F(e,x) and a decoding function G(d,y), where x is the plaintext, y is the ciphertext, e is an encoding key and d is the decoding key corresponding to e. It is assumed that the functions are publicly known. In the case of a public key system, Alice knows both e and d and publicly announces the encoding key e. Then everybody can send messages to Alice, but only she can decode them. Thus for a fixed e, the function x↦F(e,x) is not one-way, because it has an easy inverse G(d,y); but it is hard to find the inverse unless one knows d. Such functions are called trapdoor functions.

The most common public key system is RSA, invented by R.L. Rivest, A. Shamir and L. Adleman [246]. In RSA Alice needs a sufficiently large composite number N and its factorization. It seems best to take two large primes p and q whose difference is also large and let N=pq. The number N is public, but p and q must be kept secret. An encoding key is a randomly chosen e, 1<e<ϕ(N), coprime to ϕ(N), where ϕ(N)=(p−1)(q−1) (the Euler function of N). Knowing ϕ(N), Alice can compute the decoding key 1<d<ϕ(N) such that ed≡1 mod ϕ(N) (this is done using the well-known Euclidean algorithm).¹³ The encoding function is
$$F(e,x)=x^e\ \mathbin {\mathrm {mod}}N;$$
the decoding function is, in fact, the same
$$G(d,y)=y^d\ \mathbin {\mathrm {mod}}N.$$
In this way we can encrypt any number from 1 to N−1 that is coprime to N, that is, any number except the multiples of p and q. If N is sufficiently large and messages are more or less random, the probability that a message will be a multiple of p or q is negligible. Using the well-known Euler’s Theorem, which says that $$z^{\phi(N)}\equiv 1\ \mathbin {\mathrm {mod}}N$$ for every z coprime to N, we see immediately that d is the decoding key for e:
$$\bigl(x^e\bigr)^d=x^{ed}=x^{1+c\phi(N)}=x \cdot \bigl(x^{\phi(N)}\bigr)^c\equiv x\cdot 1^c=x\ \mathbin {\mathrm {mod}}N.$$

If one can factor N, then one is able to compute the decoding key and thus break the system. It seems that this is the only way one can successfully attack it, but there is no proof of it. There is, however, a public key system for which one can prove that it is secure if and only if it is hard to factor a randomly chosen product of two primes. Such a system was designed by M.O. Rabin [234]. The encoding function in his system is simply $$x\mapsto x^2\ \mathbin {\mathrm {mod}}N$$, where N is as in RSA.

However, RSA has some additional good properties which make it more attractive than other systems. In particular, the keys commute: if we need to apply two keys then it does not matter in which order we apply them. This has a number of useful applications.
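
The RSA key setup and the encoding and decoding maps described in this note can be tried out on toy numbers. The sketch below assumes the small primes 61 and 53 and the key e=17, chosen here only for illustration (real keys use primes with hundreds of digits), and relies on Python 3.8 or later for the modular inverse computed by `pow`.

```python
from math import gcd

# Toy parameters, assumed for illustration only.
p, q = 61, 53
N = p * q                      # public modulus
phi = (p - 1) * (q - 1)        # Euler function of N, kept secret

e = 17                         # public encoding key; must be coprime to phi
assert gcd(e, phi) == 1
d = pow(e, -1, phi)            # decoding key: e*d = 1 mod phi


def apply_key(key, x):
    """x -> x**key mod N; this one map serves as F with e and as G with d."""
    return pow(x, key, N)


x = 1234                       # plaintext, coprime to N
y = apply_key(e, x)            # ciphertext
```

Applying `apply_key(d, y)` recovers `x`, and since both keys act by exponentiation modulo the same N, applying them in either order gives the same result.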

6. Modular exponentiation is needed in many cryptographic protocols. Suppose we want to compute $$A^B$$ modulo N. The input size is the sum of the lengths of the numbers A, B, N. Since B can be exponential in the input size, we cannot simply multiply A by itself B times. Therefore, we use the trick of repeated squaring. It is best seen in an example. Let B=1011001 in binary. First express B as follows
$$1011001 = 1+2\bigl(0+2\bigl(0+2\bigl(1+2\bigl(1+2(0+2\cdot 1)\bigr)\bigr)\bigr) \bigr).$$
(Notice that 0’s and 1’s occur in the reverse order on the right-hand side.) Of course, we can delete all 0’s and the last 1,
$$1011001 = 1+2\cdot 2\cdot 2\bigl(1+2(1+2\cdot 2)\bigr).$$
Now we can compute $$A^B$$ efficiently:
$$A^{1011001} = \bigl(\bigl(\bigl(\bigl(\bigl(A^2 \bigr)^2\cdot A\bigr)^2\cdot A\bigr)^2 \bigr)^2\bigr)^2\cdot A.$$
Furthermore, we must not do the operations in $$\mathbb {Z}$$ because the sizes of numbers would be exponentially large. Instead, we compute in $$\mathbb {Z}_{N}$$; technically, it means that after each operation we replace the result by its remainder modulo N.
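
The repeated squaring trick, with reduction modulo N after every operation, can be sketched in a few lines. The function name `mod_exp` is an assumption made for this example; Python's built-in `pow(A, B, N)` computes the same value.

```python
def mod_exp(A, B, N):
    """Compute A**B mod N by scanning the binary digits of B from the left."""
    result = 1 % N
    for bit in bin(B)[2:]:              # most significant bit first
        result = (result * result) % N  # squaring doubles the exponent built so far
        if bit == "1":
            result = (result * A) % N   # a 1 bit adds one to the exponent
    return result
```

For B=1011001 the loop performs exactly the squarings and multiplications of the formula above, and no intermediate value ever exceeds N².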

7. Expander graphs. Explicitly constructed expander graphs are probably the most useful structures in complexity theory. Expander graphs are graphs satisfying a certain property which is typical for random graphs. Since we have an explicit construction, which means that we can construct them efficiently without using randomness, we can use them to simulate randomness in some specific situations.

The graphs that are of interest for us are graphs with bounded degree. This means that the degree of every vertex (the number of edges incident with the vertex) is bounded by a constant d. Often we can assume the stronger condition that the degree of every vertex is equal to the constant d. Such graphs are called d-regular.

The property by which expander graphs are defined is, roughly speaking, that every subset of vertices X is connected with many vertices outside of X. Obviously, if X is the set of all vertices or it contains almost all vertices, then there are not many vertices left. Therefore one has to state the condition so that this case is avoided. Here is the definition.

Definition 14 A finite d-regular graph G on a set of vertices V is called an ε-expander, for ε>0, if for every subset of vertices X that contains at most one half of all vertices, the number of vertices outside of X connected with vertices in X is at least ε times the size of X.
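
For very small graphs the condition of Definition 14 can be checked by brute force. The following sketch computes the largest ε for which a given graph is an ε-expander; it takes time exponential in the number of vertices and is meant only to make the definition concrete. The helper names and the two example graphs are assumptions of this illustration.

```python
from fractions import Fraction
from itertools import combinations


def expansion(adj):
    """Largest eps such that the graph is an eps-expander in the sense of
    Definition 14, found by checking every nonempty X with |X| <= |V|/2.
    `adj` maps each vertex to the set of its neighbours."""
    vertices = list(adj)
    best = None
    for size in range(1, len(vertices) // 2 + 1):
        for X in combinations(vertices, size):
            inside = set(X)
            outside_neighbours = set().union(*(adj[v] for v in X)) - inside
            ratio = Fraction(len(outside_neighbours), size)
            best = ratio if best is None else min(best, ratio)
    return best


def cycle(n):
    """The 2-regular cycle on n vertices."""
    return {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}


def complete(n):
    """The (n-1)-regular complete graph on n vertices."""
    return {v: set(range(n)) - {v} for v in range(n)}
```

A cycle is a poor expander: a set of n/2 consecutive vertices has only two neighbours outside, so the value returned for `cycle(n)` tends to 0 as n grows.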

The name expander is used here because the set of all vertices that are in X or connected to a vertex in X has size at least (1+ε) times the size of X. Thus every X which is not too large “expands” at least by factor 1+ε. In applications we need infinite families of graphs that are d-regular ε-expanders for some constants d and constant ε>0. To prove that such graphs exist by non-constructive means is easy—a random d-regular graph is an ε-expander for some ε>0 with high probability. To construct a family of explicit expanders is, however, a non-trivial task. The graphs may be quite simple to describe, but to prove that they are expanders is difficult. Although the condition by which expanders are defined is purely combinatorial, the best way to prove nontrivial bounds on the expansion rate is to apply algebra (to show a gap between the largest and the second largest absolute values of eigenvalues of the adjacency matrix of the graph).

Example The first explicit construction of a family of expanders was given by Margulis [191]. For every n>0, he defined a graph on the set of vertices $$\mathbb {Z}_{n}\times \mathbb {Z}_{n}$$ in which every vertex (x,y) is connected by an edge with (x+y,y), (x−y,y), (x,x+y), (x,x−y), (x+y+1,y), (x−y+1,y), (x,x+y+1), (x,x−y+1), where addition is modulo n. Some edges may actually be loops. This is in order to formally satisfy the condition that the graph is 8-regular.

Let us now consider a simple application of expanders. Let G be a d-regular expander on a set of vertices V of size N. A random walk on G means that starting in a random vertex $$v_0$$, we randomly choose $$v_1$$ to be one of the d neighbors of $$v_0$$ and go to $$v_1$$; then we choose $$v_2$$ to be a random neighbor of $$v_1$$, and so on. One can show that, for some purposes, the sequence of vertices chosen in this way behaves like a genuinely random sequence of vertices. Namely, suppose that we need to find a vertex in some subset U of vertices whose size is 1/2 of the size of V. For example, let V be the numbers 1,2,…,p−1 for some prime p, and let our task be to find a quadratic nonresidue. Then the probability that the walk never hits a nonresidue decreases exponentially with the length of the walk.

What is the advantage of using a random walk instead of random vertices? If we pick a random vertex, we have N possibilities, so we need $$\log_2 N$$ random bits. In a random walk on a d-regular graph we need only $$\log_2 d$$, i.e., a constant number of random bits, to get the next vertex. Thus random walks on expanders enable us to save many random bits when we need to amplify the probability in problems such as finding a quadratic nonresidue.
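
The saving in random bits is easy to quantify. The two helper functions below are an illustrative sketch (their names are not from the text); they round the logarithms up to whole bits.

```python
from math import ceil, log2


def bits_independent(samples, N):
    """Random bits needed for `samples` independent uniform vertices out of N."""
    return samples * ceil(log2(N))


def bits_walk(samples, N, d):
    """Random bits needed for a walk visiting `samples` vertices of a d-regular
    graph: one uniform start vertex, then one of d neighbours per step."""
    return ceil(log2(N)) + (samples - 1) * ceil(log2(d))
```

With N=2⁶⁴ vertices, degree d=8 and 100 samples, independent sampling uses 6400 random bits, whereas the walk uses 64+99·3=361.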

There are numerous other applications of expander graphs. To mention at least one, expanders can be used to construct error correcting codes with very good parameters.

8. Derandomization by means of hard functions. I will sketch two main ideas of the proof of Theorem 40. The first one is the Nisan-Wigderson generator of N. Nisan and A. Wigderson [209]. This is a construction that is used in many proofs in this field of research.

Let 0<α,β,γ,δ<1 be such that β+γ<δ<α. Let n (an input size) be given. In order not to overload notation with additional symbols I will assume that αn and γn are integers. To construct a Nisan-Wigderson generator we need furthermore a suitable set system $$\mathcal{S}$$ and a hard function f.

The set system $$\mathcal{S}$$ contains $$2^{\gamma n}$$ subsets of the set of variables $$\{x_1,x_2,\dots,x_n\}$$; each subset is of size αn and the intersection of every pair of different elements of $$\mathcal{S}$$ has size at most βn.

The Boolean function f maps $$\{0,1\}^{\alpha n}\to\{0,1\}$$ and we will assume that the circuit complexity of f is at least $$2^{\delta n}$$.

The set system $$\mathcal{S}$$ and the function f determine a Nisan-Wigderson generator, namely the function $$G(x_1,x_2,\dots,x_n)$$ defined by
$$G(x_1,x_2,\dots ,x_n)=_{\mathit{def}} \bigl\{ f(Y)\bigr\}_{Y\in\mathcal{S}}.$$
On the right hand side, f(Y) denotes f computed on a subset of variables $$Y\subseteq\{x_1,x_2,\dots,x_n\}$$. Thus G maps bit strings of length n to bit strings of length $$m=2^{\gamma n}$$, which gives exponential stretching, as needed for derandomization.

In order to get intuition about how it works, assume that the bound βn on the sizes of the intersections of subsets is much smaller than their size αn. Then we can say that they are “almost disjoint”. If the sets in the set system $$\mathcal{S}$$ were indeed disjoint, the output bits would be independent and we would get truly random bits, but no stretching. We hope that if they are “almost disjoint”, then the output bits will be “almost independent”. Furthermore, there are families of almost disjoint sets that have exponentially many members.
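
The shape of the construction can be illustrated on a toy scale. The sketch below uses one well-known family of almost disjoint sets, the graphs of low-degree polynomials over a small prime field; this particular set system, the parameters `Q` and `DEGREE`, and the function `stand_in_f` are assumptions made for this illustration. In particular, `stand_in_f` is only a placeholder and is not a hard function, so the output is not pseudorandom in any sense.

```python
from itertools import combinations, product

Q = 5          # a small prime; the seed has n = Q*Q bits, indexed by pairs (a, b)
DEGREE = 2     # two different polynomials of degree <= DEGREE agree on <= DEGREE points


def design():
    """One set per polynomial g over Z_Q: the positions of the points (a, g(a)).
    Each set has Q elements; two different sets share at most DEGREE of them."""
    sets = []
    for coeffs in product(range(Q), repeat=DEGREE + 1):
        sets.append([a * Q + sum(c * a**i for i, c in enumerate(coeffs)) % Q
                     for a in range(Q)])
    return sets


def max_intersection():
    """Largest intersection of two different sets of the design."""
    return max(len(set(A) & set(B)) for A, B in combinations(design(), 2))


def stand_in_f(bits):
    """Placeholder for the hard function f (NOT hard: it is a small formula)."""
    return (bits[0] & bits[1]) ^ (bits[2] | bits[3]) ^ bits[4]


def nw_generator(seed, f=stand_in_f):
    """G(x) = (f(x restricted to Y)) for Y in the design: one output bit per set."""
    return [f([seed[i] for i in Y]) for Y in design()]
```

Here a seed of 25 bits is stretched to 125 output bits, each computed from 5 seed bits, and any two outputs share at most 2 seed bits.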

Recall that we need to show that the generator produces bits that look random to circuits of small size. From the statistical point of view the output bits are very dependent, so any formal proof has to focus on the complexity of computing some properties of the output bits. One can show that everything boils down to proving that none of the output bits can be computed from the remaining ones by a small circuit. More precisely, we need to show that no small circuit can predict the bit with probability non-negligibly greater than 1/2.

Let $$Y_0$$ be an arbitrary element of the set system $$\mathcal{S}$$. Let us simplify the task and only prove that $$f(Y_0)$$ cannot be computed by a small circuit from the values f(Y), where $$Y\in \mathcal{S}$$ and $$Y\neq Y_0$$. So we will assume that C is a circuit that computes $$f(Y_0)$$ from these values and prove a lower bound on its size. We can express our assumption by the following equality:
$$f(Y_0)=C\bigl(\bigl\{ f(Y)\bigr\}_{Y\in\mathcal{S},Y\neq Y_0}\bigr).$$
Let a be an arbitrary assignment to the variables outside of $$Y_0$$. For $$Y\in\mathcal{S}$$, we will denote by $$Y|_a$$ the string in which the variables of $$Y\setminus Y_0$$ are set according to a, and the variables in $$Y\cap Y_0$$ remain unchanged. Since the equation above holds generally, it must also hold true after this substitution. We thus get
$$f(Y_0)=C\bigl(\bigl\{ f(Y|_a)\bigr\}_{Y\in\mathcal{S},Y\neq Y_0} \bigr).$$
Since $$Y\cap Y_0$$ has size at most βn, the restricted function $$f(Y|_a)$$ depends on at most βn variables. Therefore it can be computed by a circuit of size at most $$2^{\beta n}$$. Combining all these circuits with C we get a circuit of size at most
$$2^{\beta n}2^{\gamma n}+|C|,$$
(where |C| denotes the size of C) that computes $$f(Y_0)$$. Since the circuit complexity of $$f(Y_0)$$ is $$\geq 2^{\delta n}$$ and the term $$2^{\beta n}2^{\gamma n}=2^{(\beta+\gamma)n}$$ is asymptotically smaller, the size of C satisfies the inequality
$$|C|\geq (1-\varepsilon )2^{\delta n}=(1-\varepsilon )m^{\delta/\gamma},$$
where ε→0 as n→∞.

We do not get a superpolynomial lower bound on the size of C in terms of m, but one can show that it is possible to choose the parameters and construct the set system so that the exponent δ/γ is an arbitrarily large constant. In other words, we can beat any polynomial bound.

On the other hand, we can also compute the Nisan-Wigderson generator G in time that is polynomial in m. To this end we need the assumption that f is computable in exponential time; more precisely, f is a restriction to $$\{0,1\}^{\alpha n}$$ of a function computable in time $$2^{c\alpha n}$$ for some constant c. In terms of m the bound is $$m^{c\alpha/\gamma}$$, i.e., polynomial in m.

The above analysis demonstrates the basic idea of Nisan-Wigderson generators, but the property that we have shown does not suffice for derandomization of BPP. To get the stronger property of the generator (that the bits cannot be predicted with non-negligible probability from the remaining ones) we also need a stronger property of the function f used there. Instead of only assuming the worst-case complexity of f to be large, we need the average-case complexity to be large. We need that for some ε>0, every circuit of size at most $$2^{\delta n}$$ disagrees with f on at least a fraction ε of all inputs. Since there is a way to construct a function hard on average from a function that is only hard in the worst case, Theorem 40 only assumes the existence of functions that are hard in the worst case. So the second idea, which I am going to explain now, is how to get average-case complexity from worst-case complexity.

I described error correcting codes a few pages back (page 407). Let Γ be a binary code with the minimal distance d. Recall that if u is a codeword and we change fewer than d/2 bits, then we can still uniquely decode the resulting word. This suggests the following idea.

Think of a Boolean function f of n variables as a binary string w of length $$N=2^n$$ where the bits of the string are the values of f for all possible inputs. Using a suitable error-correcting code Γ, encode w by a longer string u∈Γ of length $$M=2^m$$. Now we can interpret u as a Boolean function g of m variables. Let εM be the minimal distance of Γ, for some ε>0. Let D be a Boolean circuit with m variables which approximately computes g. If D outputs a wrong value on fewer than $$\frac{\varepsilon }{2} M$$ inputs, then it defines a string that is at distance less than $$\frac{\varepsilon }{2} M$$ from u. Thus in such a case we can completely recover g and f, in spite of the many possible errors that D makes. Put differently, given such a circuit D we can compute f(x) for every input x, in particular also for the “hard inputs”.
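
The idea of recovering f from an imperfect circuit for g can be tried out on a toy example. The sketch below assumes the Hadamard code as the code Γ, chosen here only because it is easy to write down; its codewords are exponentially longer than the messages, so it is not a code one would use in the actual proof. The decoder is a brute-force search for the nearest codeword, and the corrupted positions are arbitrary.

```python
from itertools import product


def hadamard_encode(w):
    """Hadamard code: u[y] = <w, y> mod 2 for every y in {0,1}^len(w).
    Two different codewords differ in exactly half of the positions."""
    return [sum(a & b for a, b in zip(w, y)) % 2
            for y in product((0, 1), repeat=len(w))]


def nearest_codeword_decode(received, length):
    """Brute-force unique decoding: the message whose codeword is closest."""
    def distance(w):
        return sum(a != b for a, b in zip(hadamard_encode(w), received))
    return min(product((0, 1), repeat=length), key=distance)


# Truth table w of f(x1, x2) = x1 AND x2, listed for the inputs 00, 01, 10, 11.
w = (0, 0, 0, 1)
u = hadamard_encode(w)            # truth table of g, a function of 4 variables

# A "circuit D" that is wrong on 3 of the 16 inputs (fewer than d/2 = 4).
corrupted = list(u)
for position in (1, 6, 11):
    corrupted[position] ^= 1
```

Calling `nearest_codeword_decode(corrupted, 4)` returns `w`, so the whole truth table of f is recovered even though D errs on some inputs.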

It is natural to expect g to be hard on average, but in general this is not the case. It can happen that f is hard whereas g is very simple. To make the idea work, we must use coding that preserves complexity. More specifically, we need a code that has the following properties:

a. It is possible to decode locally and efficiently. This means that to compute a value f(x), we only need a few values $$g(y_1),g(y_2),\dots,g(y_k)$$ and we need a small circuit for computing f(x) from these values; moreover, this circuit should give the correct value of f(x) even if some of the values are wrong.

b. It should be possible to choose the inputs $$y_1,y_2,\dots,y_k$$ randomly so that if we use $$D(y_1),D(y_2),\dots,D(y_k)$$ instead of $$g(y_1),g(y_2),\dots,g(y_k)$$, most of the values are correct.

Having such a circuit C for decoding Γ, we can combine it with the circuit D and get a circuit computing f precisely. If D were small, then the resulting circuit would also be small. But we assume that such a circuit does not exist. Hence it is not possible to compute g by a small circuit even on average.

Constructions of locally decodable codes are very interesting, but describing them would take us too far afield. Let me only say that they are based on interpolation of low degree polynomials.

9. Pseudorandom generators and natural proofs. The PRG-Conjecture mentioned on page 388 says that there exist pseudorandom generators that satisfy a stronger requirement than the one stated in Definition 13.

The PRG-Conjecture There exist an ε>0 and a polynomial time computable function P such that for every n,

a. P maps the set of bit strings of length n into the set of bit strings of length 2n;

b. for every circuit C of size at most $$2^{n^{\varepsilon }}$$,
$$\big|\mathit{Prob}\bigl(C\bigl(P(x)\bigr)=1\bigr)-\mathit{Prob}\bigl(C(y)=1\bigr)\big|< 2^{-n^{\varepsilon }},$$
where the probabilities are taken with respect to the uniform distribution of x on bit strings of length n and of y on bit strings of length 2n.

This means that P is a very strong pseudorandom generator—it cannot be distinguished from a truly random source by exponentially large circuits even with exponentially small probability.

Such a pseudorandom generator P is then used to construct a pseudorandom generator Q that stretches n bits to $$2^m$$ bits, where $$m=n^{\delta}$$ for some positive constant δ. This is similar to what one needs for derandomization of BPP, but now we need a stronger condition on the computability of Q, namely, we need that for a given M, $$1\leq M\leq 2^m$$, it is possible to compute the Mth bit of the output of $$Q(x_1,x_2,\dots,x_n)$$ in time polynomial in n. The generator Q is constructed by suitably composing P with itself.

Let us denote the function that computes the Mth bit of the generator Q by $$F(x_1,x_2,\dots,x_n,y_1,y_2,\dots,y_m)$$, where $$x_1,x_2,\dots,x_n$$ are the seed of the generator Q and $$y_1,y_2,\dots,y_m$$ are the bits encoding the number M. So this function is computable in polynomial time. Such a function is called a pseudorandom function generator F because it can be used to imitate random functions. If we randomly choose an assignment of zeros and ones $$a_1,a_2,\dots,a_n$$ to the variables $$x_1,x_2,\dots,x_n$$, then $$F(a_1,a_2,\dots,a_n,y_1,y_2,\dots,y_m)$$ is computationally indistinguishable from a truly random function of m variables.
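
The text says only that Q is obtained by suitably composing P with itself. One standard way to do this is the binary tree construction of Goldreich, Goldwasser and Micali; that this is the composition meant here is an assumption of the following sketch. SHA-256 is used purely as a stand-in for a length-doubling generator P, with no claim that it satisfies the PRG-Conjecture, and the 16-byte seed length is likewise chosen for illustration.

```python
import hashlib


def P(seed: bytes) -> bytes:
    """Stand-in for a length-doubling generator: 16 bytes in, 32 bytes out.
    SHA-256 is only a placeholder, not a proven pseudorandom generator."""
    return hashlib.sha256(seed).digest()


def F(seed: bytes, y: str) -> int:
    """Output bit with index y, where y is a string of binary digits.
    Each digit of y selects the left or the right half of P applied to the
    current value, so the work is proportional to len(y), not to 2**len(y)."""
    value = seed
    for bit in y:
        doubled = P(value)
        value = doubled[:16] if bit == "0" else doubled[16:]
    return value[0] & 1
```

With an index of 64 binary digits the function addresses 2⁶⁴ output bits, yet each of them is computed with only 64 applications of P.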

Recall that the nonexistence of natural proofs is essentially the statement that one cannot define a large set of hard Boolean functions by a polynomial time computable condition. (For explaining the argument it is not necessary to specify what hard and large mean; you can think of it as superpolynomial circuit complexity and a positive fraction of all functions, respectively.) Assuming we have the pseudorandom generator described above, this is easy to prove. Let A be a polynomial time algorithm that defines a large subset of hard Boolean functions of m variables. Think of this algorithm as a test of the pseudorandom generator Q. As Q is pseudorandom, A should also accept a large subset of all possible outputs of Q (what we actually need is that it accepts at least one). Since $$F(a_1,a_2,\dots,a_n,y_1,y_2,\dots,y_m)$$ is computable in polynomial time, every string produced by Q codes a Boolean function that has a polynomial size circuit. This proves that A also accepts a string computable by a polynomial size circuit, which is a contradiction. In this way the Razborov-Rudich result is proved.

10. The game morra. If you think that generating random numbers is easy, try morra. Morra is an ancient game that is still played in Italy and Spain. Two players use their voices and one hand each. They simultaneously show numbers, 1–5, on their hands and shout a number that is a guess of the sum of the shown numbers. If one player guesses the sum shown on the hands correctly, he scores a point; if both guess wrong numbers or both guess correctly, neither gets a point.

The game seems trivial from the point of view of classical game theory: the best strategy is to show a random number n uniformly distributed between 1 and 5 and shout a random number m uniformly distributed between n+1 and n+5. But when the game is actually played, because of the fast rhythm, players have very little time to think up a number that the opponent cannot predict. It requires skill and practice to be good at morra. One has to be able to produce unpredictable numbers and recognize regularities in the opponent's play. Furthermore, one has to do all that in the short interval between the rounds, including adding one's own number to the predicted number of the opponent.

11. Which graphs are isomorphic in Fig. 5.4? B is isomorphic to C; they are drawings of the Petersen graph.

## 5.3 Parallel Computations

When computing difficult problems we can often speed up the computation by using several processors which compute certain parts of the computational task simultaneously. We call this parallel computation, in contrast to sequential computation, in which we perform only one action at a time. There are problems whose computation can be sped up by using parallelism and there are others for which parallelism does not help. This phenomenon is well known in manufacturing. If a product consists of many components, one can produce the components independently of each other and then quickly assemble them into the final product. But if we are to build, say, a house, we must first build the foundations, then the first floor, and so on, the roof coming last.

A prime example of a mathematical problem that is amenable to parallelization is the problem of factoring a natural number N. The most trivial algorithm, based on trying to divide N by all numbers up to $$\sqrt{N}$$, can easily be split into independent tasks. For example, we can divide the interval $$[1,\sqrt{N}]$$ into as many segments as we have processors, and let each processor try the numbers in its segment. Other, more sophisticated and faster algorithms have this property too. One of the first spectacular applications of the Internet for solving a large scale problem was the project of factoring the RSA-129 number. This 129 decimal digit composite number was proposed in 1977 by Rivest, Shamir and Adleman, the authors of the most popular cryptographic protocol. They challenged everyone to factor this number. Their aim was to test the security of their system for keys of this length. In summer 1993 Arjen Lenstra launched a project to factor this number using computers connected to the Internet. The following April they announced the solution. In the course of the computation more than 600 people took part, solving on their computers some of the tasks into which the problem was split. (The RSA-129 number and its factors are on page 368.)

Since then not only have computers become much faster, but also new algorithms have been discovered and the old ones have been improved. In 2005 a 200 decimal digit number, RSA-200, was factored using many computers working in parallel. The computation ran for one and a half years. The authors estimated that it would take about 55 years if it were done on a single machine. More recently, in 2009, a number with 232 decimal digits was factored. (Check the Internet for the present integer factorization record.)
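
The splitting of trial division into independent segments, mentioned at the beginning of this section, can be sketched as follows. The function names and the default of four workers are assumptions of this example. Threads are used only to show the structure; in the standard Python interpreter they will not actually speed up this arithmetic, and a real computation would use separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor
from math import isqrt


def search_segment(N, lo, hi):
    """Smallest divisor of N in [lo, hi), or None; independent of other segments."""
    for k in range(max(lo, 2), hi):
        if N % k == 0:
            return k
    return None


def parallel_factor(N, workers=4):
    """Split the candidates up to sqrt(N) into `workers` segments and search them
    concurrently. Returns the smallest nontrivial divisor of N, or None if there
    is none (N is prime or N < 4)."""
    limit = isqrt(N) + 1
    step = -(-limit // workers)                       # ceiling division
    bounds = [(i * step, min((i + 1) * step, limit)) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        found = pool.map(lambda b: search_segment(N, *b), bounds)
        hits = [k for k in found if k is not None]
    return min(hits) if hits else None
```

No segment needs any information from another one, which is what makes the task easy to distribute.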

### The Ideal Parallel Computer

In many situations even reducing the time by a factor of two may be important. For example, in weather forecasting the computations are so time consuming that sometimes they would be finished only after the time for which we needed the prediction. In theoretical investigations we rather study the distinction between polynomial and exponential. To this end we need a mathematical model of an idealized parallel computer. Such a computer looks formally very much like an ordinary personal computer running a modern operating system such as Linux. It can run several processes in parallel and every process can start another new process. In a personal computer, however, all this is done by a single processor which alternates between all the processes.¹⁴ Thus it only seems that the processes run in parallel, but in fact at each moment the computation progresses only in one of the started processes. Truly parallel machines have been built, but the number of processors in them is always limited. In the ideal parallel machine we imagine having an unlimited number of independent processors that can work simultaneously on the processes assigned to them.

Having an infinite number of processors does not give us infinite computational power. Recall that a Turing machine also has an infinite tape, but still it can only compute some functions (the computable functions) because at each step of computation it only uses a finite part of the tape. Similarly, a parallel machine can only use a finite number of processors at each step. The advantage of the parallel machine is that the number of processors can increase exponentially. For instance, if at each step every process starts another new process and no processes terminate, we will have $$2^t$$ running processes at time t.

With exponentially many processors at our disposal, we should be able to compute more in polynomial time than we can on sequential machines. Let us see what we can compute on a parallel machine in polynomial time. One can easily check that we can compute all NP problems: every NP problem is associated with a search problem and to solve the search problem on a parallel machine in polynomial time, we can activate an exponential number of processors, each processor working on a single item of the search space. But it is possible to do more than NP. One can prove that the sets accepted by parallel machines in polynomial time are exactly the sets computable by Turing machines in polynomial space. We can express it as an equality between complexity classes by
$$\mathbf{parallelP}=\mathbf {PSPACE}.$$
But, alas, whether or not P=PSPACE is one of the big open problems in complexity theory! Hence it is also an open problem whether parallel machines are essentially more powerful than sequential machines. Since P≠NP implies that P≠PSPACE, it seems very likely that they really are, but we are not able to prove it.
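
The argument that a parallel machine solves NP search problems in polynomial time, with one processor per item of the search space, can be made concrete with a sequential simulation. The subset sum problem is used here only as a familiar example of an NP search problem; the function names are illustrative.

```python
from itertools import product


def check(numbers, target, choice):
    """The polynomial-time work of one processor: verify a single candidate."""
    return sum(x for x, bit in zip(numbers, choice) if bit) == target


def parallel_search(numbers, target):
    """Sequential simulation of 2**n processors, one per subset of `numbers`.
    On the ideal parallel machine all calls to `check` would run at the same
    time, after about n steps in which the number of processes doubles."""
    candidates = product((0, 1), repeat=len(numbers))
    return any(check(numbers, target, choice) for choice in candidates)
```

The simulation takes exponential time because it runs the candidates one after another; on the ideal machine the total time would be the number of doubling steps plus the time of a single check.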

It should be stressed that parallel machines could compute more problems in polynomial time than sequential machines only because we allow them to use more than a polynomial number of processors. A parallel machine with a polynomial number of processors can be simulated sequentially with only a polynomial increase in time. From the practical point of view it is not realistic to assume more than a polynomial number of processors. Even if we could build a parallel machine with infinitely many processors, we would still be able to use only polynomially many of them in polynomial time because the universe has three dimensions. The number of processors that such a machine could use in time n would be of the order of $$n^3$$ and thus we could simulate such a machine in polynomial time. Thus parallelP is not a realistic model of what is efficiently computable.

This is the theoretical point of view; in practice parallel machines are useful because even a relatively small speed-up may play a significant role.

### Interlude 1—Parallel Computations in the Brain

I have made a lot of digressions from the main topic, so there is no need for giving an apology for the next one. Indeed, the main reason for including the next two subsections is to show some applications of the concept of parallel computing in life sciences. On the other hand, when talking about the foundations of mathematics we have to take into account human abilities. I strongly believe that mathematics describes some fundamental principles of nature that are independent of human beings. But as we are limited beings, only some of these principles are within our reach and we have to represent them in a way comprehensible to us. The limitations are of various natures; for instance, we have a built in framework for using languages which may limit the way we use logical deduction. The most important restriction is, however, the limited computational power and limited memory. Complexity theory is the theory that enables us to quantify and compare this type of limitations.

In this subsection I want to make two claims. The first is:

In order to perform complex tasks, our brains have to work like parallel machines.¹⁵

This is a generally accepted fact; nevertheless, it is worthwhile to consider arguments that support this claim.

It seems to us that at each instant we are thinking only about a single idea, but in fact a lot of things are going on at the same time in our brains. When we are speaking, we are pronouncing a sentence, but we are already thinking about the next one. Not only are we thinking about the next one, we are also contemplating how to go on afterwards, we are looking for suitable arguments to make our point and so on. At the same time we watch our audience to see their reactions, or drive a car; some people are even able to talk about one thing and type an unrelated message on the computer.

That people can do several things at the same time is a well-known fact. What is a more intricate question is whether our brain uses parallelism when solving problems. There are several observations that suggest that the brain is working sequentially when solving problems. One of these is the fact that we present the solution as a sequence of words, either spoken or written. In particular mathematical proofs and computer programs are given as text. But often one can present solutions of problems in the form of diagrams more efficiently, so this is not a very strong argument. Another argument is based on watching eye movements. When we let people solve a problem that involves a picture or a geometric configuration, such as a chess problem, we can get some information about the way they solve the problem by observing their eyes. The rapid movement of eyes from one place on the picture to another suggests that there is a long sequence of deductions made one after another. If the problem is simple enough, one can explain the problem solving process by a series of steps. But if it is difficult, it seems that there is not enough time to do it in this way. For example, in a typical situation chess players have to consider such a large number of possible moves that it is not possible for them to do it in the short time that is available by contemplating one alternative after another.

With open eyes our brain processes a huge amount of information at every moment. So let us close our eyes and observe what is going on in our mind. What we would like to find out is what happens in a single moment, namely, whether we are thinking about one idea, or there are more. I cannot speak for others, but it is natural to expect that what I observe is the same as others do. When I try to recall my state of mind, I find out that there is one leading idea, but there are also several ideas around that are less stressed and less clear. I feel that those ideas around are somehow connected with more ideas, but I cannot tell if the latter are really present. Thus I surmise that our consciousness has a hierarchical structure: there is one leading idea and a hierarchy of ideas which we realize less and less. Probably, we do not realize at all the ideas in the lowest part of the hierarchy, or at least we are not able to report about them later. Of course, more elaborate experiments are needed to find out what is really going on. There is also the obvious problem that the brain commits only certain things to memory; thus we are never sure that we report everything that we experienced. But at least one thing we can confirm: when thinking about a certain subject, we are able to recall related things faster than if we were asked to recall them without preparation. Our brain, in a sense, always has the related data at hand.

When comparing brains with computers the most striking fact is how slow the components of brains are compared to the electronic components of computers. A neuron can fire at most about one thousand times per second, which is 1 kHz in terms of frequency. Present computers run on frequencies of several GHz, thus they are more than several million times faster. Yet people, animals and birds can outperform computers in many tasks. Imagine the computations done by the brains of a table tennis player or a swallow catching insects in flight. In order for the computations to be done on time, the information has to go through a very small number of layers of neurons. The study of how visual information is processed shows that some abstract concepts already appear after a few layers of neurons. The slowness of neurons implies that when working on tasks that require complex and fast reactions the brain must use massive parallelism; most likely it uses a lot of parallelism all the time.

My second claim is that parallel processing is also sufficient for an explanation of the astounding abilities of human brains. More specifically:

It is consistent with our present knowledge that the information processing in human brains can be approximated by the parallel computational model of threshold circuits.

A threshold circuit is a network of elements that compute threshold functions. Threshold functions are Boolean functions of several input variables and one output variable. A threshold function $$t$$ of $$n$$ variables is determined by $$n$$ real numbers $$a_1,a_2,\dots,a_n$$, called the weights, and one real number $$b$$, called the threshold. The function is defined as follows. Given input bits $$x_1,x_2,\dots,x_n$$,
$$t(x_1,x_2,\dots ,x_n)=1\quad \mbox{if and only if}\quad \sum_{i=1}^n a_ix_i \geq b.$$
In the simplest case all the weights are the same and equal to 1 and the threshold is a natural number $$b$$. Then $$t$$ outputs 1 if and only if at least $$b$$ input bits are ones. In particular $$\mathit{AND}_n$$, the and-function of $$n$$ variables, is the threshold function with unit weights and the threshold $$n$$, and similarly, $$\mathit{OR}_n$$, the or-function, is the threshold function with unit weights and the threshold 1. Another important threshold function is $$\mathit{MAJ}_n$$, the majority function, defined by unit weights and $$b=n/2$$. In general threshold functions use arbitrary real weights and the weights can also be negative.
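The definition can be tried out directly. The following is only a minimal sketch in Python; the function names `threshold`, `and_n`, `or_n` and `maj_n` are chosen here for illustration and do not come from the text.

```python
from typing import Sequence


def threshold(weights: Sequence[float], b: float, x: Sequence[int]) -> int:
    """Return 1 if sum_i weights[i] * x[i] >= b, and 0 otherwise."""
    if len(weights) != len(x):
        raise ValueError("weights and inputs must have the same length")
    return 1 if sum(a * xi for a, xi in zip(weights, x)) >= b else 0


def and_n(x: Sequence[int]) -> int:
    # Unit weights, threshold n: all inputs must be 1.
    return threshold([1] * len(x), len(x), x)


def or_n(x: Sequence[int]) -> int:
    # Unit weights, threshold 1: at least one input must be 1.
    return threshold([1] * len(x), 1, x)


def maj_n(x: Sequence[int]) -> int:
    # Unit weights, threshold n/2: at least half of the inputs must be 1.
    return threshold([1] * len(x), len(x) / 2, x)
```

A negative weight lets one gate express, for instance, "the first input is on and the second is off": `threshold([1, -1], 1, x)` outputs 1 only on the input `[1, 0]`.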

A threshold function is a (very simplified) model of the electric activity of a neuron. We distinguish two states of a neuron: the inactive state, represented by 0, and firing of the neuron, represented by 1. A neuron receives signals from other neurons (via dendrites) and according to the weighted sum of these signals it decides to fire or not to fire. When it fires, this signal is sent to other neurons (via the axon). Negative weights are possible because some connections between neurons (synapses) use inhibitory neurotransmitters.

Formally, a threshold circuit is a Boolean circuit in which each gate computes some threshold function. Such circuits are rather different from those that we considered before. The circuits that I dealt with before used only a few very simple gates. Now I am allowing more complicated gates and they may have a large number of inputs. In threshold circuits we usually do not bound the number of inputs of a gate (a real neuron may be connected to several tens of thousands of other neurons by synaptic connections). Instead we usually impose another restriction: we bound the depth of the circuits. The depth is the length of the longest path from inputs to the outputs of the circuit. We speak about bounded depth circuits when the depth is bounded by a constant. This, of course, makes sense only when considering infinite families of circuits. When we have a single circuit, such as the brain, we should think of the depth bound as a very small number. Imagine a circuit consisting of a few layers of threshold gates between the inputs and outputs.

The bounded depth restriction corresponds to what we know about the anatomy of the brain. We also have indirect evidence that the signal processing takes place only on a small number of layers of neurons: it is the fact that people are able to react very fast in spite of the relative slowness of neurons. It is also interesting to compare the volume occupied by somas, the central parts of neurons, with the volume occupied by axons, the threads that transmit electric signals. Somas are in the thin layer of the gray matter on the surface of the brain, whereas most volume is occupied by the inner part, the white matter, through which neurons are connected by axons. This shows that the large number of connections between neurons must play a very important role. These connections make it possible to spread information to a large number of neurons in a small number of steps.

Bounded depth threshold circuits have been studied extensively, so we have some idea how strong they are, which may help us to assess whether or not this model of the brain is oversimplified. Bounded depth and polynomial size threshold circuits have been constructed for several nontrivial functions. We have such circuits in particular for arithmetic operations and various analytical functions. There are some functions computable in polynomial time for which we do not have such circuits, but in general bounded depth polynomial size threshold circuits seem fairly strong. One piece of indirect evidence of the strength of bounded depth circuits is that we are not able to prove for any explicitly defined function that it cannot be computed by such circuits. In other words, our lower bound methods fail for such circuits even though they seem much more restricted than general Boolean circuits. We cannot exclude that threshold circuits of depth 3 and polynomial size can compute the same class as general polynomial size Boolean circuits (the class nonuniform-P).

What consequences for the brain can we draw from what we know about threshold circuits? Let us look at how much faster threshold circuits can be than the ordinary circuits that use gates with only two inputs. One can show that a single threshold function can be computed by a polynomial size16 logarithmic depth Boolean circuit. It is also obvious that for a nontrivial threshold function the depth of such a circuit must be at least log2 n, where n is the number of input variables. Hence, if a threshold gate needed a unit time for its computation, then we could compute some functions in constant time on threshold circuits, while Boolean circuits would require at least logarithmic time. It is, of course, unrealistic to assume that a device with a large number of inputs would work as fast as a simple device with only two inputs. However, for adding many electric potentials we do not need more time than for adding two. Thus the delays would be caused rather by the larger distances from which we need to get the electric charges. It seems likely that a living cell cannot produce signals with very high frequency and there must be some intrinsic limitation. When it was not possible to produce components of higher speed, nature decided to use components that have more inputs. Having a component with more inputs certainly helps to compute some functions faster, but the speed-up factor is not very big; in the case of the brain it could be in the order of tens, which is far from several millions, the factor by which current transistors are faster than neurons. Thus the likely reason why our brains still surpass computers in some tasks is rather in the ability of the brain to get large numbers of neurons involved into computations, in other words, to use massive parallelism.

Another indication that this kind of circuit is quite powerful is that it is as powerful as its probabilistic version. From the previous section we know that randomness often helps to get faster and simpler algorithms. Given a randomized bounded depth threshold circuit, one can construct a deterministic circuit that computes the same function with no error, has size only polynomially larger and depth only larger by one.

Apart from being a fairly strong computational device, bounded depth threshold circuits have an important practical advantage: it is possible to design fairly reliable circuits from somewhat unreliable elements. This may be another reason why the brain uses elements similar to threshold functions.

Some researchers suggested that the great power of the human brain cannot be explained by such simple models as threshold circuits. They say that the extraordinary abilities of the brain can only be explained if we assume that single neurons are able to perform complex operations. I agree that a cell is a very complex system and more research should be done concerning the complexity of the tasks that single neurons can do. But I think that so far we do not have a reason to doubt that threshold circuits of the size and the speed of a brain can actually compute what the brain does.

Finally, one should not forget that the brain evolved during hundreds of millions of years. Thus the complexity of the brain is not determined only by the components and their number, but also by its design, “the computer architecture”. When neuroscience progresses to the stage that we will be able to decode some algorithms used by the brain, we may be quite surprised by their intricacy. From the point of view of complexity theory, we should view the brain as a nonuniform device with the extra power that non-uniformity gives.

So let us now look at the ways in which evolution could have produced such complex structures.

### Interlude 2—Computation and Life

It is not a coincidence that computers and brains use electric charges to encode bits of information. If we need components that are fast and simple to produce, it is probably the best solution to use components that process and send electric charges. But we should not conclude that computations are possible only in systems composed of small devices sending electric impulses. When high speed is not needed, it is possible to use other means. A human body uses a lot of chemical signals to control organs. It is much slower than using nerves, but in many cases it suffices and it is an efficient way to send the signal to all parts of the body.17 Biochemical processes in a living organism are so complex and interwoven that we can also consider them as a kind of computation.

This kind of complexity is present already on the level of a single cell. A cell controls its chemical processes by producing proteins that are either directly involved in the processes or function as enzymes. Proteins are produced by transforming the information about a protein encoded on DNA into an actual protein. The pieces of the DNA string that code a protein are called genes. Usually, only some genes are active, those that the cell actually needs. The control of a gene may be simple—a lack of a certain molecule triggers production of an enzyme that enables the production of this molecule. In some cases, however, the mechanism is very complex. In particular, when a cell of a developing multicellular organism differentiates, some genes must be switched off and some must be turned on, and this has to be done without much external influence. Such a complex control is achieved by means of regulatory genes. The role of a regulatory gene is not to produce an enzyme that will be used directly, but to enhance or inhibit the activity of another gene. One gene can be controlled by several other genes. In such a case the activation of the gene is a Boolean function of the activation of the controlling genes.

Several different Boolean functions were found in such regulatory systems. This includes negation (one gene suppresses the activation of another), conjunction (a gene is activated if two other genes are active) and disjunction. Conjunction and negation (or disjunction and negation) form a basis of all Boolean functions. Thus we have components to build, in principle, an arbitrary Boolean circuit. A number of such genetic regulatory circuits have been described (e.g., the circuit of the bacteriophage lambda). These circuits are not exactly Boolean circuits as I have defined them. The difference is that regulatory circuits may have a large number of feedbacks. This makes the description of the computation more complicated, and such circuits may be unstable and may oscillate. However, from the point of view of complexity of computations, they have the same computational power as circuits without feedback.
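To see how feedback can lead to oscillation, here is a toy sketch in Python. It is not a model of any real regulatory circuit: the three-gene ring in which each gene is repressed by the previous one, and the assumption that all genes update simultaneously in discrete steps, are both made up for illustration.

```python
from typing import Callable, List, Tuple

State = Tuple[int, int, int]


def repressor_ring(state: State) -> State:
    """Hypothetical circuit: a is repressed by c, b by a, and c by b.

    Each gene is active in the next step iff its repressor is inactive now
    (negation is the only Boolean function used).
    """
    a, b, c = state
    return (1 - c, 1 - a, 1 - b)


def simulate(update: Callable[[State], State], state: State, steps: int) -> List[State]:
    """Apply the update rule repeatedly and record every visited state."""
    trajectory = [state]
    for _ in range(steps):
        state = update(state)
        trajectory.append(state)
    return trajectory


trajectory = simulate(repressor_ring, (1, 0, 0), 6)
```

Starting from `(1, 0, 0)`, the circuit never settles down; it returns to its initial state after six steps and then repeats the same cycle.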

Computations can also occur on a much higher level. Consider the process of adaptation of organisms of some species to a changing environment. We can view such a group of organisms as a system that receives information from the environment in which the organisms live and reacts to it by adapting the organisms to the particular state of the environment. Such a system, viewed as a computational device, works as follows. It stores information in the strings of DNA which are present in each individual organism. The information from the environment enters the system by means of removing certain strings from the population (expressed more poetically, Nature tells the system which genomes she likes). The system processes information by editing the strings of DNA.

Let us recall that there are two basic ways of editing genetic information.
1. Mutations produce some small random changes of the strings of DNA. They occur with low frequency.
2. In crossing over, or recombination, pairs of DNA strings exchange some segments.18 This is the essence of sexual reproduction.

A possible explanation of the role of mutations is that they are useful for producing new genes which are not present in the DNA strings and which are more beneficial for survival than those that they replace. The role of crossing over is assumed to be in spreading the genes in the population. When crossing over is present, a beneficial gene spreads quickly and before long almost every organism has this gene. If several new beneficial genes appear in the population, an organism that has them all will appear fairly soon. If there were no crossing over, it would take a much longer time to select organisms with many good genes.

The adaptation is viewed as the process that reduces the number of occurrences of genes that are bad and increases the number of beneficial ones. But now suppose that the expression of a gene is controlled by a complex circuit of control genes. Then the selection mechanism acts on all genes from the circuit. The resulting process is thus more complicated. It is quite possible that this gives much more computational power to the system, but I am not aware of any research done in this direction. In any case this gives at least one advantage: the possibility of switching off the expression of a gene in the population without removing the gene from the DNA strings. This is important because crossing over does not produce new genes and it would take a long time to re-create the gene by mutations once it is lost.

What has been researched is the power of the editing operation of crossing over [226]. Assuming crossing over is not a completely random process, such a system can have very strong computational power. This can be shown in the following computing system motivated by genetics.

The system consists of many strings of the same length, where a particular type of strings may occur several times. It operates in discrete steps; in every step we randomly form pairs of strings and perform the crossing-over operation described below. I will assume that the number of (copies of) strings is even, so that it is possible to match all strings.

In every crossing over the two strings involved are cut in only one spot, the same in both strings. Then two corresponding pieces are switched and reconnected. (Thus the first segment of the first string is connected with the second segment of the second string and the first segment of the second string is connected with the second segment of the first string.) The key assumption is that the spot where the strings are cut is determined by a short context around it. This means that we have a set of rules that determine where the crossing occurs. Such a rule can be, for example:

If the first string contains a segment AAAGGG and the second contains TTTCCC, then cut the first string between AAA and GGG and the second between TTT and CCC. (Then connect AAA to CCC, and TTT to GGG.)

Thus the system is completely determined by the length of strings, the number of strings and a set of crossing-over rules. In order to perform a computation, one sets all strings to a particular form (or forms) which encode the input data. Then we let the system evolve and observe what happens. If the rules are properly chosen and sufficiently many steps are done, the majority of all strings may encode the output value of the function that we wanted to compute.
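One step of such a system can be sketched as follows. This is only an illustration in Python, under assumptions that go beyond the text: a single rule is used, only the first occurrence of each context is considered, the rule is tried in one order for each random pair, and the names `crossover`, `step` and `RULE` are invented here.

```python
import random
from typing import List, Tuple

# The example rule from the text: cut between AAA and GGG in the first
# string and between TTT and CCC in the second one.
RULE = ("AAA", "GGG", "TTT", "CCC")


def crossover(s1: str, s2: str, left1: str, right1: str,
              left2: str, right2: str) -> Tuple[str, str]:
    """Apply one context rule; return the strings unchanged if it does not apply."""
    i = s1.find(left1 + right1)
    j = s2.find(left2 + right2)
    if i < 0 or j < 0:
        return s1, s2
    cut1, cut2 = i + len(left1), j + len(left2)
    if cut1 != cut2:
        # The text requires the cut to be at the same spot in both strings.
        return s1, s2
    return s1[:cut1] + s2[cut2:], s2[:cut2] + s1[cut1:]


def step(population: List[str], rule: Tuple[str, str, str, str],
         rng: random.Random) -> List[str]:
    """Randomly pair all strings (even population size) and cross each pair."""
    pool = list(population)
    rng.shuffle(pool)
    result: List[str] = []
    for k in range(0, len(pool), 2):
        result.extend(crossover(pool[k], pool[k + 1], *rule))
    return result
```

For example, `crossover("CAAAGGGT", "GTTTCCCA", *RULE)` joins `AAA` to `CCC` and `TTT` to `GGG`, giving the pair `"CAAACCCA"` and `"GTTTGGGT"`.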

One can show that in this way the system can simulate Turing machines. But not only that, one can also show that if the number of strings is exponential, then the system can simulate a parallel machine. It is, of course, not realistic to assume populations of exponential size, so the relevant conclusion is rather that the system has the potential to work as a parallel machine. This means that if the population is large then it may solve some problems faster than a sequential machine can.

Apparently not much is known about the spots where crossing over happens and how random this process is. It has been observed, however, that crossing over is not completely random. Some short sequences of nucleic bases have been identified that make crossing over more likely there than elsewhere. What is much better understood are the transcription processes from DNA to messenger RNA and from messenger RNA to proteins. Since these are complex processes controlled by what is written on the strings of DNA and RNA, it is conceivable that the crossing over process could be a highly complex mechanism as well.

In conclusion I should mention, at least briefly, attempts to use biological elements, such as DNA molecules, for computing. Though the subject is interesting, I do not find it relevant enough to be discussed here at length. The only reason why this approach may help to compute faster is that the components used in the devices are smaller than in the current electronic computers. Thus it is only a different way of miniaturization, not a conceptually new type of computer. It is likely that further miniaturization of electronic components will produce components comparable in size to such macromolecules.

### Notes

1. Parallel machines. There is a host of models of parallel computers in the literature. Most of them are (most likely) weaker than what I considered on previous pages, but some are even stronger. The one that I briefly described is essentially the machine proposed by W.J. Savitch and M.J. Stimson [255]. There is agreement among researchers in parallel computations that the true parallelism should enable machines to compute in polynomial time what is computed in polynomial space by sequential machines (Turing machines). This is called the Parallel Computation Thesis.

Such strong models of parallel computers are, however, interesting only from the theoretical point of view. For designing efficient algorithms, we rather need machines that can solve nontrivial problems in sublinear time. For Turing machines, sublinear time is uninteresting: the machine cannot even read the whole input. Thus we need different models. The focus is not on enlarging the class of efficiently computable sets and functions, but on finding faster algorithms for what we already know to be computable in polynomial time.

The standard model used for this purpose is PRAM, the Parallel Random Access Machine. I will not introduce this concept, since one can estimate the power of PRAMs by Boolean circuits. If we ignore uniformity (PRAM is a uniform model of computation, circuits are nonuniform), then the time of PRAMs roughly corresponds to the depth of circuits, and the number of processors to the size of circuits (more precisely, size divided by depth). The depth of a circuit is the length of the longest path from an input node to an output node. Thus the problems for which one can gain a substantial reduction of time by using parallel computers are those which can be computed by polynomial size circuits with sublinear depth. The smallest nontrivial depth restriction is logarithmic ($$c\cdot\log n$$, for $$c$$ a constant and $$n$$ the length of the input). We conjecture that there are functions computable by polynomial size circuits, but not computable by polynomial size circuits of logarithmic depth. Again, this is a wide open problem.

2. Threshold circuits. For a lot of functions it has been shown that they can be computed by threshold circuits of constant depth and polynomial size. These functions include the common numerical functions, such as addition, multiplication, division, sine, exponentiation, square roots and others. When the output should be a real number, it is computed with exponential precision. Furthermore, such circuits also compute some functions that are conjectured to be pseudorandom. Thus an exponential lower bound on the size of threshold circuits computing an explicitly defined function seems a very difficult problem, as it would either refute some conjecture about pseudorandomness, or it would be a non-natural proof in the sense of Razborov-Rudich. There are also some fairly simple functions that we do not know how to compute by bounded depth and polynomial size threshold circuits; in particular, the connectivity of graphs.

If we want to explain the function of the brain using threshold circuits we have to take into account the problem of precision. We cannot expect any actual device to be able to compute with precise real numbers. Let us see what precision is needed in threshold circuits.

The first observation is that given a definition of a threshold function, we can replace real weights and a real threshold by rational weights and a rational threshold. To see it, imagine the $$2^n$$ input strings of the threshold function as the vertices of the $$n$$-dimensional cube in $$\mathbb {R}^{n}$$. If the function is defined by
$$\sum_{i=1}^n a_ix_i\geq b, \tag{5.5}$$
then think of the function as separating the accepted inputs from the rejected inputs by the hyperplane $$\sum_{i=1}^{n} a_{i}x_{i}= b$$. Thus we only need to move the hyperplane slightly so that it still separates the same vertices and has rational coefficients. Once we have rational coefficients, we can multiply the inequality (5.5) by a suitable natural number and we get all coefficients integral. Having integral coefficients, we can measure the precision required by the gate by the size of the coefficients (their absolute values).
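The second step, passing from rational to integral coefficients, can be checked on a small example. The sketch below is in Python and covers only that step (not the perturbation of real weights to rational ones); the function names and the sample weights are made up, and `math.lcm` with several arguments requires Python 3.9 or later.

```python
from fractions import Fraction
from itertools import product
from math import lcm


def gate(weights, b, x):
    """Threshold gate: 1 if sum_i weights[i] * x[i] >= b, else 0."""
    return 1 if sum(a * xi for a, xi in zip(weights, x)) >= b else 0


def to_integer_gate(weights, b):
    """Multiply a rational inequality by the least common multiple of all denominators."""
    fracs = [Fraction(w) for w in weights] + [Fraction(b)]
    scale = lcm(*(f.denominator for f in fracs))
    ints = [int(f * scale) for f in fracs]
    return ints[:-1], ints[-1]


# Hypothetical rational gate, used only as an example.
weights = [Fraction(1, 2), Fraction(-1, 3), Fraction(3, 4)]
b = Fraction(1, 6)
int_weights, int_b = to_integer_gate(weights, b)

# Both gates agree on all 2^3 vertices of the cube.
same = all(gate(weights, b, x) == gate(int_weights, int_b, x)
           for x in product((0, 1), repeat=3))
```

Here the common multiplier is 12, so the gate becomes $$6x_1-4x_2+9x_3\geq 2$$; multiplying by a positive number does not change which inputs satisfy the inequality.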

In general the coefficients have to be exponentially large. It is unrealistic to assume that neurons can compute with such a precision. However, such high precision threshold gates can be avoided: one can show that a given circuit with general threshold gates can be transformed into a circuit with threshold gates that use only coefficients of polynomial size, and this transformation increases the size of the circuit only polynomially and the depth only by one. If we allow a slightly larger increase of depth, by a constant factor, then it suffices to use very special gates. These gates are the majority functions of possibly negated inputs. In terms of coefficients it means that we take $$a_i=\pm 1$$, for $$i=1,\dots,n$$, and $$b=n/2$$.

Let me now address the question of reliability, which is another thing that could negatively influence the power of actual threshold circuits. I will show that with threshold circuits we can also cope with this problem very well.

It is impossible to construct circuits that could tolerate arbitrary errors; one has to make some assumptions about the nature of errors. One reasonable assumption is that a threshold gate makes an error with probability at most 1/4 and if the inputs are such that they sum to a value far from the threshold, it makes an error with very small probability. This means that if gate $$g$$ is defined by the inequality (5.5), then it operates reliably on inputs $$(x_1,\dots,x_n)$$ for which either $$\sum_i a_ix_i\ll b$$ or $$\sum_i a_ix_i\gg b$$.

In such a case we can replace every threshold gate by a small circuit that operates always with high reliability. The circuit is simply
$$\mathit{MAJ}_m(g_1,\dots ,g_m),$$
where $$\mathit{MAJ}_m$$ is the majority gate and $$g_1,\dots,g_m$$ are copies of the simulated gate $$g$$. Suppose $$g$$ makes an error with probability $$\varepsilon$$, $$0<\varepsilon<1/2$$. Then if we have an input $$(x_1,\dots,x_n)$$ for which $$g$$ should be 0, on average $$\varepsilon m$$ of the gates $$g_1,\dots,g_m$$ will output 1 and the rest will output 0. If we have an input $$(x_1,\dots,x_n)$$ for which $$g$$ should be 1, then it will be the other way around. More importantly, the law of large numbers ensures that in the first case the probability that $$m/2$$ or more gates $$g_1,\dots,g_m$$ will output 1 is exponentially small, and symmetrically in the second case the probability that fewer than $$m/2$$ gates will output 1 is exponentially small. (Exponentially means exponentially in $$m$$.) Thus if $$m$$ is sufficiently large, we can replace gate $$g$$ by this circuit, and it will make errors with only slightly larger probability than the gate $$\mathit{MAJ}_m$$ itself. The errors add up as more gates are simulated, but starting with a very small probability of error we still get a reasonably reliable circuit.
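How quickly the error of the majority vote decreases can be computed exactly from the binomial distribution. The following Python sketch assumes, beyond what the text states, that the copies $$g_1,\dots,g_m$$ err independently and that the majority gate itself is perfect; the function name is chosen for illustration.

```python
from math import comb


def majority_error(eps: float, m: int) -> float:
    """Probability that at least m/2 of m independent copies err.

    Each copy errs with probability eps. This bounds the probability that
    the majority vote is wrong in either of the two cases from the text.
    """
    k_min = (m + 1) // 2  # smallest integer k with k >= m/2
    return sum(comb(m, k) * eps**k * (1 - eps)**(m - k)
               for k in range(k_min, m + 1))
```

With $$\varepsilon=1/4$$, three copies already reduce the error from 0.25 to `majority_error(0.25, 3)`, which equals 10/64, and with larger `m` the value keeps shrinking, in line with the exponential decrease claimed above.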

3. Neural networks. Threshold circuits can be viewed as a very special kind of neural networks. In neural networks the gates are, as a rule, smooth approximations of threshold functions. The research into neural networks focuses on learning algorithms, which are algorithms that adapt the weights of the gates using examples of input and output values of an unknown function $$f$$ in order to train the network to compute $$f$$. From the point of view of complexity theory, they are not more powerful than threshold circuits, unless one uses very special functions as the gates.

## 5.4 Quantum Computations

Let us recall that the Church-Turing thesis is the conjecture that every computable function is computable on a Turing machine (or an equivalent device). The concept of computability used in the thesis is not a precise mathematical notion; it refers to the intuitive concept of computability, which we usually associate with a process that can be physically realized at least in principle. So let us now consider the thesis from the point of view of physics. If we imagine that computations are done by mechanical devices, the thesis seems to be very likely true. But physics has much more to offer than mechanics. The traditional concept of an algorithm is based on the idea that an algorithm performs a finite number of elementary operations. This is a very ‘mechanistic’ approach. Let us abandon such presumptions and simply ask which functions can arise in physical processes. Then a computation can be viewed as a physical experiment in which we set the initial conditions according to the input data and get the output as the result of the experiment. At the early stages of computer science such devices were proposed. They were called analog computers in contrast to digital computers. For example, we can compute addition of two real numbers by combining the corresponding voltages. These devices suffered from very poor precision and therefore they were soon abandoned (a mechanical device of such a kind that survived the longest was the slide rule). As modern computers developed, more computations were done also in physics. Then the problem reappeared, but from the opposite side. The question was whether there are physical processes that we cannot simulate using computers. In 1982 the American physicist Richard Feynman brought up this problem in connection with quantum physics.

There are several reasons why quantum physics should be studied from this point of view. First, in quantum physics one can perform experiments with extremely high precision. The second general reason is that the quantum world is so much different from our experience. The main reason, however, why quantum computing is so popular nowadays is that it seems that it can really help us to compute things that we are not able to compute with classical means.

In spite of sometimes looking very bizarre, quantum physics does not contain phenomena that are not computable by ordinary Turing machines. Thus from the point of view of pure computability quantum physics does not give us new concepts. However, if we take into account the complexity of computations it seems that we can really gain something. In his seminal paper Feynman gave arguments why computer simulations of quantum phenomena may need exponential time and thus may be infeasible [73]. He also suggested that we may be able to construct quantum computers which would be able to perform such simulations, hence they would be more powerful than classical computers. (The word ‘classical’ is used to distinguish concepts not based on quantum physics from those that use quantum physics.)

The basic principles needed for quantum computations can be explained without much mathematics. In quantum physics it is possible to form combinations of states. Formally, if $$S_1,\dots,S_k$$ are possible states of a given system $$\mathcal{S}$$ and $$a_1,\dots,a_k$$ are nonzero complex numbers satisfying a certain restriction, the system can also be in the compound state $$a_1|S_1\rangle+\cdots+a_k|S_k\rangle$$. This is called a linear superposition of the states $$S_1,\dots,S_k$$ and the numbers $$a_1,\dots,a_k$$ are called amplitudes. The meaning of the linear superposition is, roughly speaking, that the system $$\mathcal{S}$$ is, in a way, in all the states $$S_1,\dots,S_k$$ at the same time. To distinguish the states $$S_i$$ from nontrivial superpositions, I will call them basis states (it should be noted that this is not an absolute concept, it depends on the way we describe the system). What makes the simulation of quantum phenomena difficult is that k, the number of possible states, can be very large. For example, suppose the system consists of n memory registers, each register holding one bit. Then the number of all possible states is $$2^{n}$$. Thus to simulate a quantum computation with n quantum bits we would need to keep track of $$2^{n}$$ complex amplitudes.
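To get a feeling for this growth, here is a small illustrative calculation of the memory a classical simulation would need just to store the amplitudes. The function name and the figure of 16 bytes per amplitude (two 64-bit floating point numbers) are assumptions made for this sketch only.

```python
def amplitude_memory_bytes(n_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Bytes needed to store all 2**n complex amplitudes of n quantum bits.

    Assumption: each amplitude is stored as two 64-bit floats (16 bytes).
    """
    return (2 ** n_qubits) * bytes_per_amplitude


for n in (10, 30, 50):
    print(n, amplitude_memory_bytes(n))
```

Under this assumption 30 quantum bits already need 16 gibibytes, and 50 quantum bits need 16 pebibytes.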

The possibility of having a linear superposition of an exponential number of states suggests that quantum computers could compute like truly parallel machines. Instead of having exponentially many processors, one for every process, we could use one processor to work on exponentially many tasks at the same time. But it is not so simple. Suppose that we want to solve a search problem, for instance, to find a factor of a composite number N. A naive approach would be to put the memory of a computer into the superposition that includes all numbers M, 1<M<N and let the computer try to divide N by M. Thus we obtain a superposition of the states of the computer such that at least one of the states contains the solution. This can be done, at least theoretically, but the problem is how to get the solution from the superposition? As I will explain below, the amplitudes determine the probability that we can get a particular state from the superposition. In such a superposition they are exponentially small. In fact, in this way we would not get more than if we just picked M at random and tested if it divides N.

In order to obtain more than one can get by using only randomized algorithms, one has to use the fact that the amplitudes do not have to be positive real numbers. The crucial property is that some amplitudes can cancel out, which results in increasing the others. A quantum algorithm that achieves more than a probabilistic one has to be tricky: it has to reduce the amplitudes of unwanted results while increasing those of the wanted ones. This has been accomplished only in a few cases so far. Thus quantum computations are able to use parallelism to some extent, but it seems that only in some limited way. In particular, we are not able to show that all NP problems are solvable in polynomial time on quantum computers, and we rather conjecture that this is not the case.

The two most important problems that are solvable on quantum computers in polynomial time, but for which no polynomial time classical algorithm is known, are factoring of composite numbers and computing discrete logarithms. Both algorithms were found by Peter Shor in 1994 [268]. These problems play a key role in cryptography, where one uses their apparent hardness. Hence constructing a quantum computer would have rather destructive consequences: we would have to abandon the RSA protocol and others. (But keep in mind that we are not able to prove the hardness of these functions anyway; thus it is not excluded that we can break these protocols even using classical computers.)

Another quantum algorithm, found by L.K. Grover in 1996, is not polynomial time but is very general [109]. This is an algorithm for solving search problems. It has the remarkable property that to find a solution in a search space of size N, the algorithm needs only $$c\sqrt{N}$$ steps, for some constant c. For problems that we are able to solve only by brute force search, this is a significant speed up. Since the search space is typically exponential in the size of inputs, this means that a quantum computer of the same speed as a classical computer would be able to handle almost twice as large input data.
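The $$c\sqrt{N}$$ behaviour can be observed on a small classical simulation that keeps track of all N amplitudes. This is only a sketch: the simulation itself takes time proportional to N per step, so it illustrates the number of quantum steps, not a speed up. The function name, the marked index and the choice of about $$(\pi/4)\sqrt{N}$$ iterations (the usual choice for a single solution) are assumptions of the example.

```python
import math


def grover_success_probability(n_items: int, marked: int, iterations: int) -> float:
    """Simulate Grover's search classically and return P(observing `marked`)."""
    # Start in the uniform superposition: every item has amplitude 1/sqrt(N).
    amps = [1.0 / math.sqrt(n_items)] * n_items
    for _ in range(iterations):
        # Oracle step: flip the sign of the amplitude of the marked item.
        amps[marked] = -amps[marked]
        # Diffusion step: reflect every amplitude about the mean amplitude.
        mean = sum(amps) / n_items
        amps = [2.0 * mean - a for a in amps]
    return amps[marked] ** 2


n_items = 64
steps = math.floor(math.pi / 4 * math.sqrt(n_items))  # about (pi/4) * sqrt(N)
print(steps, grover_success_probability(n_items, marked=5, iterations=steps))
```

For 64 items, 6 steps suffice to make the marked item almost certain to be observed, whereas picking an item at random succeeds with probability 1/64.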

In the following decade a number of generalizations of these algorithms and some new ones have been found. I will not discuss them here because they are rather technical results and none of them has presented a breakthrough comparable to Shor’s algorithm. A lot of work has been done in related areas of quantum information and quantum cryptography, which also will not be treated in this book (except for a brief remark in Notes).

### A Brief Visit in the Quantum World

Several important features of quantum physics can be explained on a simple experimental set-up called the Mach-Zehnder interferometer. It consists of a source of coherent light, such as a laser, which sends a beam to a beam splitter, say a half-silvered mirror. After the beam splits into two, the two beams are reflected to the same point at another beam splitter. There are two detectors behind the second beam splitter placed in the direction of the two beams, see Fig. 5.5. Each beam should split into two on the second beam splitter, thus one may expect that both detectors should detect light. However, if all the components are accurately placed, only detector D 1 records light. If the splitters and mirrors did not absorb any light, then the intensity of the beam reaching detector D 1 would be the same as at the source.

This can easily be explained using the fact that light is a special type of electromagnetic wave. Each of the two beams arriving at the beam splitter BS 2, one from M 0 and one from M 1, splits there into a pair of beams, one going to D 0 and one going to D 1. The two resulting beams going to D 1 have the same phase, so they add up, whereas the two beams going to D 0 have opposite phases, so they cancel out. These effects are called positive (constructive) interference and negative (destructive) interference, respectively.

Now assume that we gradually decrease the intensity of the light emitted from the source, say, using filters that absorb light. If light was just waves, we should detect smaller and smaller intensity at D 1. What happens in reality is, however, different. If the detector is sensitive enough, at some point it will not register decreasing intensity of light, but instead it will record pulses whose frequency will decrease with the decreasing intensity of the emitted light. This is not caused by the interferometer; it would be the same if we aimed the source directly to the detector. The explanation is that light consists of quanta, which we call photons. For each wavelength, this quantum is uniquely determined, in other words, the energy of the photon is determined by the wavelength (it is inversely proportional to it). For a given frequency, it is not possible to send a smaller amount of energy than the energy of the photon of that wavelength.

This is the wave-particle duality. We can explain light by waves, as well as the kinematics of particles, but if a phenomenon has features of both, we need a new theory. Indeed, if the source sends a single particle to the interferometer, then, according to classical notions, the particle has to choose which way it passes the interferometer. We can confirm it by temporarily putting two detectors in place of the two mirrors (or at any place on the two paths). Then we register a photon always at only one detector, which suggests that the particle did not split into two. But how can we have interference on the second beam splitter if only one particle arrives there?

Suppose we block one of the paths between the two beam splitters, say, we remove mirror M 1. Then, in terms of waves, there is no interference on the second beam splitter, hence both detectors detect light, each detecting one quarter of the original intensity (see Fig. 5.6). But now suppose that the source emits a single photon. With probability 1/2, it will choose the path that is not blocked and after reaching the second beam splitter it will go to one of the detectors with equal probability. Compare this with the original situation in which the photon always goes to detector D 1; in particular, this is also true of the photons that go via the lower path. Thus if we think of a photon as a particle that always has a definite position, we get a contradiction. It looks like the photon senses if mirror M 1 is present in spite of going a different route; if the mirror is present it always goes to D 1, otherwise it sometimes goes to D 0.

It has been proposed that in this way we are able to send a signal somewhere without sending any physical object there. In the setting of the Mach-Zehnder interferometer we can send such a “signal” from the position of mirror M 1 to detector D 0. To send a signal we simply remove the mirror. Then we argue as follows. If the mirror is present, no signal reaches D 0, so there is no communication present. If the mirror is removed, only photons that use the lower path reach D 0 and these are photons that did not reach the position of mirror M 1, so we can say that we have nothing to do with them. If, moreover, detector D 0 controls some action that we are not supposed to do, we can claim that we have not caused it.

This is just playing with the interpretation of what is going on in the Mach-Zehnder interferometer. The next example, the Elitzur-Vaidman bomb test, however shows something that we can actually do using quantum physics, but not without it [66]. For this example, we need something that is sensitive to light and is destroyed by it. In the original setting it is a bomb connected with a light detector so that the bomb explodes whenever light is detected. If you want to do it experimentally I suggest using a piece of film instead; it will be safe and will serve the purpose equally well.

In the test we assume that at the place of mirror M 1 there is either the original mirror, or the thing that is sensitive to light, say, the film. Our goal is to determine the presence of the film without destroying it. We will do it using detector D 0. We already know what will happen:
1. if the film is not present, we have the standard setting of the Mach-Zehnder interferometer, so D 0 never records a photon;
2. if the film is present, then after the beam splitter BS 1,
   a. with probability 1/2, the photon will go up, the film will be destroyed, and no detector detects the photon;
   b. with probability 1/2, it will go down avoiding the film, and after reflecting from M 0 and reaching BS 2 it will,
      i. with overall probability 1/4, go up and reach detector D 1, and
      ii. with overall probability 1/4, go down and reach detector D 0.

The point is that we only detect the photon at D 0 when the film is present, and when this happens the film is not destroyed (case 2.(b) ii). Thus, with probability 1/4, we can determine the presence of the film using light and without destroying it. Using more complicated settings one can detect the film without destroying it with probability arbitrarily close to 1.
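The case analysis above can be checked numerically. The following sketch assumes a simplified model: each beam splitter acts on the pair of path amplitudes as the transformation that maps (a0, a1) to ((a0+a1)/√2, (a0−a1)/√2), the mirrors contribute no phase, and the film acts as an observation of the upper path. The labelling of paths and detectors by indices and all function names are choices made for the example.

```python
import math

S = 1.0 / math.sqrt(2.0)


def beam_splitter(state):
    """Map path amplitudes (a0, a1) to ((a0 + a1)/sqrt(2), (a0 - a1)/sqrt(2))."""
    a0, a1 = state
    return (S * (a0 + a1), S * (a0 - a1))


def interferometer(film_present: bool):
    """Return the probabilities of the three outcomes for a single photon.

    Labelling assumed for this sketch: index 1 is the upper path through the
    place of mirror M1 (where the film may be), index 0 is the lower path via
    M0; after the second beam splitter index 0 leads to D1 and index 1 to D0.
    """
    a0, a1 = beam_splitter((1.0, 0.0))      # first beam splitter BS1
    p_film_destroyed = 0.0
    survive = 1.0
    if film_present:
        # The film observes the upper path: the photon is absorbed with
        # probability |a1|^2; otherwise the state collapses to the lower path.
        p_film_destroyed = abs(a1) ** 2
        survive = 1.0 - p_film_destroyed
        a0, a1 = 1.0, 0.0
    b0, b1 = beam_splitter((a0, a1))        # second beam splitter BS2
    return {
        "film destroyed": p_film_destroyed,
        "D1": survive * abs(b0) ** 2,
        "D0": survive * abs(b1) ** 2,
    }


print(interferometer(film_present=False))
print(interferometer(film_present=True))
```

Up to rounding, the first line shows that without the film all photons reach D1, and the second line reproduces the probabilities 1/2, 1/4 and 1/4 of the cases 2.(a), 2.(b) i and 2.(b) ii.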

Good scientists should not be satisfied with data they cannot fully explain; they should be curious what is really going on. So, is there a way to watch what the photon actually does? Similar experiments can be done with particles that have nonzero mass and travel at speeds lower than the speed of light, thus it is possible to detect the presence of a particle without deviating it too much off its course. What happens then is that whenever we set up the experiment so that we know which way the particle goes, the interference disappears. It is like observing a magician: if we do not know the trick, it is magic, but as soon as we learn the trick, the magic disappears.

### The Quantum Bit

In order to understand quantum computations, we do not need any physics. It suffices to learn a few basic rules of the game called quantum computing and then we can play. Moreover, once we understand quantum computing, we can model interesting quantum phenomena on quantum circuits. Such models are so far restricted to thought experiments, since we do not have a physical realization of quantum computers yet.

The best way to view quantum computations is as a generalization of the matrix model introduced in Chap.  (see page 137). In that model a column is interpreted as a memory location and rows correspond to discrete time moments. Thus a row holds information about the current content of the memory. A particular matrix model corresponds to an algorithm or a circuit. It is determined by rules that posit how to rewrite a row to the next one. The crucial condition is that the rules must be simple; therefore we allow only rules that change a small number of the entries.

The quantum version of the matrix model is based on replacing the usual bits by quantum bits. What is a quantum bit? A quantum bit (more fashionably called a qubit) is simply a linear superposition of the two classical bits, 0 and 1. Formally, it is the expression
$$a|0\rangle +b|1\rangle,$$
where a and b are complex numbers such that
$$|a|^2+|b|^2=1.$$
(5.6)
I put 0 and 1 in these strange brackets as it is customary in quantum physics to denote the states in this way. It is a useful notation introduced by Paul Dirac; an expression |x〉 is called a ket vector, where ‘ket’ is the second half of the word ‘bracket’. To give some meaning to this expression it is good to state the first rule.

### Rule 1

If we observe a superposition of states $$a_1|S_1\rangle+\cdots+a_k|S_k\rangle$$, then we see state $$S_i$$ with probability $$|a_i|^2$$.

This rule implicitly says that we cannot see superpositions; we can only see basis states. The process of looking at a quantum system is usually referred to as measurement because in general the result can be a real number. Here we will consider measurements that can give one of a finite set of values. The standard interpretation of what is going on in measurements is that the system suddenly collapses from a superposition to a basis state.

Thus, in particular, if we observe the quantum bit $$a|0\rangle+b|1\rangle$$, we get 0 with probability $$|a|^2$$ and 1 with probability $$|b|^2$$ (where |z| denotes the absolute value, the modulus of a complex number z). This explains the condition $$|a|^2+|b|^2=1$$; the total probability must be 1. So far it is not clear why we need complex numbers. This will become clearer after I state the second rule, but before doing so let us talk more about the quantum bit. Whereas a classical bit takes one of two discrete values, a quantum bit is determined by two complex numbers a, b. Thus we interpret it as a vector in the two dimensional complex linear space $$\mathbb {C}^{2}$$. The condition on the sum of the squares of the absolute values is simply the condition that the length of this vector is equal to 1.
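Rule 1 is easy to express as a short program. The sketch below computes the probabilities from a list of amplitudes and then samples one observation; the function names and the particular quantum bit used at the end are illustrative assumptions.

```python
import math
import random


def measurement_probabilities(amplitudes):
    """Rule 1: basis state i is observed with probability |a_i|^2."""
    probs = [abs(a) ** 2 for a in amplitudes]
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("amplitudes must have squared moduli summing to 1")
    return probs


def observe(amplitudes, rng=random):
    """Sample the index of the basis state seen when the state is observed."""
    probs = measurement_probabilities(amplitudes)
    return rng.choices(range(len(probs)), weights=probs)[0]


qubit = (1 / math.sqrt(2), 1j / math.sqrt(2))   # a|0> + b|1> with a complex b
print(measurement_probabilities(qubit))
print(observe(qubit))
```

Note that the imaginary amplitude of |1〉 has no influence on the probabilities; only its modulus matters when we observe.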

The next thing that we need to know is how a system changes in time. We envisage that quantum computers will work in discrete steps, like the classical ones, thus we will focus on discrete time. Let us see how a single quantum bit can change. Formally, we ask what transformations of $$\mathbb {C}^{2}$$ are those that correspond to physical processes. The answer is linear transformations; linearity is the crucial property of quantum physics. There is only one additional restriction: the transformations must preserve the length of vectors. (In fact, we need it only for vectors of length 1, but linearity automatically implies that the lengths of all vectors are preserved.) Linear transformations satisfying this condition are called unitary.

### Rule 2

A transition from one superposition to another is a unitary transformation.

Let us consider a couple of examples. Consider the following unitary transformation H:
$$\begin{array}{lllll} |0\rangle &\mapsto & \frac{1}{\sqrt{2}}|0\rangle &+& \frac{1}{\sqrt{2}}|1\rangle\\ ~&~&~&~&~\\ |1\rangle &\mapsto & \frac{1}{\sqrt{2}}|0\rangle &-& \frac{1}{\sqrt{2}}|1\rangle . \end{array}$$
Suppose that the initial state is |0〉. After applying H we obtain
$$\begin{array}{lll} \frac{1}{\sqrt{2}}|0\rangle &+&\frac{1}{\sqrt{2}}|1\rangle. \end{array}$$
If we observe (perform a measurement on) this state, we obtain 0 or 1 with equal probability. If instead of observing, we apply H again, we obtain the following state
$$\begin{array}{l} \frac{1}{\sqrt{2}}\Bigl(\frac{1}{\sqrt{2}}|0\rangle +\frac{1}{\sqrt{2}}|1\rangle\Bigr) + \frac{1}{\sqrt{2}}\Bigl(\frac{1}{\sqrt{2}}|0\rangle -\frac{1}{\sqrt{2}}|1\rangle\Bigr) = |0\rangle. \end{array}$$
This is essentially what the Mach-Zehnder interferometer does. The first application of H is the first beam splitter. After a photon passes through it, it is in the superposition of two states, one taking the upper route, the other taking the lower one. After the second beam splitter we get the original basis state. If we observed the state after the first application of H, it would collapse to |0〉 or to |1〉, and then the second application of H would put it into a superposition. (A more precise description of the events in the Mach-Zehnder interferometer is given in Notes.)
A more concise representation of unitary transformations is by matrices. The above transformation H is represented by the matrix
$$\left ( \begin{array}{r@{\quad }r} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\[6pt] \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{array} \right ).$$
(5.7)
An application of the transformation to a vector is simply matrix-vector multiplication. In our example two applications of H give the original state, which can be expressed using matrix multiplication as HH=I, with I denoting the unit matrix.

An important property of unitary transformations is that they are invertible. Hence every quantum process can be done in a reverse order. This seemingly makes it impossible to do classical computations on quantum computers, as classical computers use irreversible gates, but in fact it is not a problem; there is an efficient way to transform classical computations into reversible ones. I will get to this later.

As the next example, consider the negation as the only nontrivial reversible operation on one bit. It is defined by the matrix
$$\left ( \begin{array}{c@{\quad }c} 0 & 1\\ 1 & 0 \end{array} \right ).$$
What is more interesting is that in the quantum world we also have a square root of the negation:
$$\left ( \begin{array}{r@{\quad }r} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\[6pt] -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{array} \right ).$$
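You can check that this matrix multiplied with itself gives the matrix of the negation up to the sign of one entry: the product swaps |0〉 and |1〉 and puts a minus sign at one of them, so if we apply it to a basis state and observe, we see the negated bit.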
The next is an example of a unitary transformation that preserves the bits.
$$\left ( \begin{array}{c@{\quad }c} 1 & 0\\ 0 & \mathrm {i}\end{array} \right ).$$
If we apply it to a quantum bit and observe, we do not see any difference; the probabilities of observing zero and one remain the same. Yet such transformations are important in combinations with others.
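The claims about these one-bit transformations can be verified mechanically. The following sketch uses plain nested lists for 2×2 matrices; the names ROT and PHASE for the last two matrices are labels chosen for the example, not standard notation from the text.

```python
import math

S = 1 / math.sqrt(2)
H = [[S, S], [S, -S]]            # the transformation H, matrix (5.7)
NOT = [[0, 1], [1, 0]]           # negation
ROT = [[S, S], [-S, S]]          # the matrix offered as a square root of negation
PHASE = [[1, 0], [0, 1j]]        # leaves |0> fixed, multiplies |1> by i


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]


def close(a, b, tol=1e-12):
    """Entrywise comparison of two 2x2 matrices up to rounding."""
    return all(abs(a[i][j] - b[i][j]) <= tol for i in range(2) for j in range(2))


def is_unitary(a):
    """A matrix U is unitary exactly when U times its conjugate transpose is I."""
    return close(matmul(a, dagger(a)), [[1, 0], [0, 1]])


print(all(is_unitary(m) for m in (H, NOT, ROT, PHASE)))
print(close(matmul(H, H), [[1, 0], [0, 1]]))
print(matmul(ROT, ROT))
```

All four matrices pass the unitarity check and HH is the identity. The last product is, up to rounding, the matrix with rows (0, 1) and (−1, 0), which agrees with the negation except for one sign.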

A quantum bit is not a mere idealization. There are many experiments that give precisely two possible results. In every such case we can interpret the measured physical quantity as a quantum bit. One that is easiest to visualize is the spin of the electron. Spin in quantum physics is a property of particles which is the quantum version of classical rotation. Let us fix an axis, that is, an oriented line through the electron. Then it can spin around it in two possible ways, the speed being always the same. By convention we say that in one case the spin is up, in the other it is down. For the given axis, the electron can be in a basis state up, or down, or in any superposition of these two. In the latter case measurements will sometimes give the value of spin up, sometimes down. What is interesting is that given such a superposition we can find an axis for which this is a basis state. Hence we can imagine the spin as an arrow attached to the electron. If we measure in the direction of the arrow we get always value up, if we measure in the opposite direction we get always value down, for other directions, we get values up and down with probabilities depending on the angle between them and the arrow.

A direction in three-dimensional space is determined by two real numbers. This does not seem to fit well with a quantum bit which is determined by two complex numbers. Two complex numbers are given by four real numbers and we have one equation (5.6) that they should satisfy, which gives three degrees of freedom. The explanation is that for a quantum bit a|0〉+b|1〉 only the ratio a:b has an interpretation in the real world. If we have a description of a quantum system, we can multiply everything by a complex unit (a complex number c whose absolute value is 1) and everything will work the same. Thus the mathematical description has an extra parameter that has no physical interpretation.

### Quantum Circuits

We need to study a system consisting of more than one quantum bit in order to understand quantum circuits. The generalization from one quantum bit is not difficult, but some caution is necessary. For example, one may be tempted to say that n quantum bits is an element of the direct product of n copies of $$\mathbb {C}^{2}$$, thus it is an element of $$\mathbb {C}^{2n}$$. This is not true, we need a space of much higher dimension.

To get the right concept of n quantum bits, we must realize what the basis states are. If we observe n quantum bits we should see n classical bits. Thus a basis state is a sequence of n classical bits, and a general state is a linear superposition of them. Formally, n quantum bits are the following expression
$$\sum_{e_1,\dots ,e_n\in\{ 0,1\}} a_{e_1,\dots ,e_n} |e_1,\dots ,e_n\rangle,$$
where $$a_{e_{1},\dots ,e_{n}}$$ are complex numbers satisfying
$$\sum_{e_1,\dots ,e_n\in\{ 0,1\}} |a_{e_1,\dots ,e_n}|^2 =1,$$
and $$|e_1,\dots,e_n\rangle$$ denotes the basis state where the bits are $$e_1,\dots,e_n$$. Notice that such a state is determined by $$2^{n}$$ complex numbers, hence we are in the $$2^{n}$$-dimensional complex linear space $$\mathbb {C}^{2^{n}}$$. This is what makes the simulation of quantum circuits by classical computers difficult; if the dimension were only 2n it would be easy.

Incidentally, a sequence of n quantum bits is an element of a certain kind of product of n copies of $$\mathbb {C}^{2}$$, but it is not the usual direct product, it is the tensor product (see Notes).
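The difference between the two kinds of product is visible in a few lines of code. The sketch below builds the joint state of several independent quantum bits as a tensor product of their amplitude vectors; the helper names and the convention that the first bit is the most significant one are assumptions of the example.

```python
import math


def tensor(u, v):
    """Tensor (Kronecker) product of two amplitude vectors."""
    return [a * b for a in u for b in v]


def product_state(qubits):
    """Joint state of independent quantum bits, first bit most significant."""
    state = [1.0]
    for q in qubits:
        state = tensor(state, q)
    return state


s = 1 / math.sqrt(2)
plus = [s, s]        # (|0> + |1>)/sqrt(2)
zero = [1.0, 0.0]    # |0>

state = product_state([plus, zero, plus])
print(len(state))    # 2**3 amplitudes, not 2*3
print(state)
```

States built in this way are rather special. A general vector of length $$2^{n}$$, for instance the two-bit state with amplitude $$1/\sqrt{2}$$ at |00〉 and at |11〉 and zero elsewhere, cannot be written as such a product of single quantum bits.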

The transition from one state of n quantum bits to another is again by unitary transformations. Thus every computation step is such a transformation and the function that the quantum circuit computes is the composition of these unitary transformations. If we want to write down such a unitary transformation explicitly, we need a huge matrix, a matrix of dimensions $$2^{n}$$ by $$2^{n}$$. As in classical computations, we want to decompose a given unitary transformation into a product of some simple elementary ones. The minimal number of elementary unitary transformations needed to express a given unitary transformation U as a product is the complexity of U.

Thus what remains is to say what the elementary unitary operations are. As in classical circuits, so also in quantum circuits the elementary operations are defined to be those that operate on a small number of quantum bits, in other words, the operations that change only a constant number of bits. This is a reasonable proposal and it is generally accepted as the right one. However, it is harder to justify it than in classical computations. In quantum systems very distant particles can be entangled (the Einstein-Podolsky-Rosen pairs) and there are systems that are more localized, but consist of large numbers of entangled particles (the Bose-Einstein condensates); both phenomena have been demonstrated experimentally. Nevertheless, it is reasonable to assume that the elementary operations act only on a small number of bits. How a unitary transformation is applied only to some bits is best seen in examples.

Suppose that the initial state is, say,
$$|00000\rangle.$$
Apply the unitary transformation defined by the matrix H (see (5.7) above) to the first quantum bit. Then we obtain
$$\begin{array}{c} \frac{1}{\sqrt{2}}|00000\rangle + \frac{1}{\sqrt{2}}|10000\rangle. \end{array}$$
(5.8)
Let us now apply the same transformation to the second quantum bit. We get
$$\begin{array}{c} \frac{1}{{2}}|00000\rangle + \frac{1}{{2}}|01000\rangle + \frac{1}{{2}}|10000\rangle + \frac{1}{{2}}|11000\rangle. \end{array}$$
Clearly, after applying this to all bits we obtain a superposition of all possible basis states with the same amplitudes.
As an example of a unitary transformation on two bits consider the quantum version of the classical Boolean function that maps a pair of bits (x,y) to the pair $$(x,x\oplus y)$$, where $$\oplus$$ denotes addition modulo 2. This gate is called controlled not since the second bit is negated if and only if the first bit is 1. I will denote it by CNOT. The matrix of this unitary transformation is
$$\left ( \begin{array}{c@{\quad }c@{\quad }c@{\quad }c} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1\\ 0 & 0 & 1 & 0 \end{array} \right ).$$
(5.9)
It is a permutation matrix, as are all matrices of classical reversible transformations. Suppose we apply it to the first two bits of the state (5.8). We thus get
$$\begin{array}{c} \frac{1}{\sqrt{2}}|00000\rangle + \frac{1}{\sqrt{2}}|11000\rangle. \end{array}$$
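The examples above can be reproduced by a small simulator that stores only the basis states with nonzero amplitude. This is a sketch under my own conventions: a state is a dictionary from bit strings to amplitudes, bit positions are counted from 0 on the left, and the function names are invented for the example.

```python
import math
from collections import defaultdict

S = 1 / math.sqrt(2)
H = [[S, S], [S, -S]]   # matrix (5.7); column j holds the image of |j>


def prune(state, tol=1e-12):
    """Drop basis states whose amplitude cancelled out."""
    return {b: a for b, a in state.items() if abs(a) > tol}


def apply_one_bit_gate(state, gate, pos):
    """Apply a 2x2 gate to the bit at index `pos` of every basis state."""
    new_state = defaultdict(complex)
    for bits, amp in state.items():
        old = int(bits[pos])
        for new in (0, 1):
            target = bits[:pos] + str(new) + bits[pos + 1:]
            new_state[target] += gate[new][old] * amp
    return prune(new_state)


def apply_cnot(state, control, target):
    """Negate the target bit of each basis state whose control bit is 1."""
    new_state = defaultdict(complex)
    for bits, amp in state.items():
        if bits[control] == "1":
            flipped = "1" if bits[target] == "0" else "0"
            bits = bits[:target] + flipped + bits[target + 1:]
        new_state[bits] += amp
    return prune(new_state)


state = {"00000": 1.0}
state = apply_one_bit_gate(state, H, 0)      # the state (5.8)
both = apply_one_bit_gate(state, H, 1)       # H applied to the second bit too
entangled = apply_cnot(state, 0, 1)          # CNOT on the first two bits of (5.8)
print(sorted(both))
print(sorted(entangled))
```

The dictionary stays small here only because few basis states have nonzero amplitude; after applying H to all n bits it would have $$2^{n}$$ entries.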

We can draw schemas of quantum circuits in very much the same way as ordinary circuits. The special property of quantum circuits is that the number of wires going into a gate always equals the number of wires going out because the number of bits that it processes has to be preserved throughout the entire computation. But whereas electronic circuits are very much like these diagrams, the physical realizations of quantum circuits are quite different. We cannot send quantum bits by wires, which is only a minor problem compared to other hurdles that researchers in this field have to cope with.

Let us consider a very simple quantum circuit in Fig. 5.7. This circuit works with two quantum bits. We can think of the quantum bits as stored in two memory registers and certain operations are applied to them. In this circuit we first apply a unary quantum operation to the first register, then a binary quantum operation to the first and the second registers, and finally we apply again a unary quantum operation to the first register. The unary gates are both H, defined by the matrix (5.7), and the binary gate is the CNOT, defined by the matrix (5.9). Let the input to the circuit be 00. Then the computation proceeds as follows:
$$\begin{array}{c} |00\rangle \\ \downarrow\\ \frac{1}{\sqrt{2}}|00\rangle +\frac{1}{\sqrt{2}}|10\rangle\\ \downarrow\\ \frac{1}{\sqrt{2}}|00\rangle +\frac{1}{\sqrt{2}}|11\rangle\\ \downarrow\\ \frac{1}{2}|00\rangle +\frac{1}{2}|10\rangle+ \frac{1}{2}|01\rangle -\frac{1}{2}|11\rangle. \end{array}$$
After the computation is done, we look at the output bits. According to Rule 1, we observe one of the four possibilities 00,10,01,11, each with probability 1/4.

This circuit demonstrates an important phenomenon: the role of entanglement of quantum bits. Notice that the first output bit is 0 or 1, each with probability 1/2. We have already observed that applying H twice gives the identity mapping. Further, CNOT does not change the first bit. Thus we could expect that starting with 0 in the first register, we always get 0 in the first register. The reason why we do not is that between the two applications of H, the first bit gets entangled with the second bit (in the specific situation considered here with the second bit originally being 0, the first bit is simply copied to the second register). You can observe it also on the computation. The terms that would cancel each other if there weren’t CNOT interposed do not cancel out here because they are distinguished by the second quantum bit.
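The effect of the interposed CNOT can be checked by running the circuit with and without it. In this sketch a state of two quantum bits is a list of four amplitudes in the order |00〉, |01〉, |10〉, |11〉; this ordering and the function names are assumptions of the example.

```python
import math

S = 1 / math.sqrt(2)


def h_on_first(state):
    """Apply H to the first of two quantum bits; index = 2*first + second."""
    a00, a01, a10, a11 = state
    return [S * (a00 + a10), S * (a01 + a11), S * (a00 - a10), S * (a01 - a11)]


def cnot(state):
    """CNOT, matrix (5.9): swap the amplitudes of |10> and |11>."""
    a00, a01, a10, a11 = state
    return [a00, a01, a11, a10]


def first_bit_is_one(state):
    """Probability of observing 1 in the first register (Rule 1)."""
    return abs(state[2]) ** 2 + abs(state[3]) ** 2


start = [1.0, 0.0, 0.0, 0.0]                       # |00>
with_cnot = h_on_first(cnot(h_on_first(start)))    # the circuit of Fig. 5.7
without_cnot = h_on_first(h_on_first(start))       # the same circuit, CNOT removed
print(first_bit_is_one(with_cnot), first_bit_is_one(without_cnot))
```

Up to rounding, the first number is 1/2 and the second is 0: without CNOT the amplitudes of the states with 1 in the first register cancel, with CNOT they belong to different basis states and both survive.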

This example also bears on the fundamental question of what happens when a measurement is done. There are a variety of different explanations; the most commonly accepted one is the following. We assume that quantum laws are universally applicable, hence they apply also to measurements. A measurement is then a process in which the observed entities get entangled with the measurement devices. If an observation is done by a human, then eventually the state of their brain gets entangled with the observed entities.

In our tiny example we can view the first register as an experimental object and the second as a measuring device. (We can also assume that it is not just a device but a live observer.) First we transform |0〉 in the first register into a superposition of |0〉 and |1〉. Then the application of CNOT is a measurement. After the measurement we have the following superposition.
$$\begin{array}{c} \frac{1}{\sqrt{2}}|00\rangle +\frac{1}{\sqrt{2}}|11\rangle. \end{array}$$
Let us look at it from the point of view of the measuring device. The device can be in two states, but in each of the states of the device, the value of the bit in the first register is unique. So from its point of view the superposition has collapsed to a basis state.

This brings us back to the Mach-Zehnder interferometer. The interposed gate CNOT corresponds to a detector that determines which way the photon went. If it is present, we do not obtain interference.

Some people are sceptical about the possibility of constructing quantum computers that would work with a sufficiently large number of quantum bits so that they would be able to solve problems that classical computers cannot. However, no persuasive general argument against quantum computers has been presented so far. One of the objections that has been raised concerns the small amplitudes with which quantum computers will have to work. Indeed, suppose that the computer used a superposition of all $$2^{1000}$$ combinations of 1000 bits. If all the amplitudes had the same absolute value, then this absolute value would be $$2^{-500}$$. Then the argument says that no physical instrument can have such a precision. What is wrong with this argument? Firstly, we can easily design experiments in which we have such superpositions: just send a packet of 1000 photons through the Mach-Zehnder interferometer. Secondly, the argument completely misses the point. If we had such a precise instrument we would not need quantum circuits and intricate quantum algorithms. The trick of quantum computing is that it is possible to obtain something measurable, that is, a state with a relatively large amplitude, by combining many states with extremely small amplitudes. Adding amplitudes is similar to adding probabilities. In efficient probabilistic algorithms exponentially small probabilities add up to large probabilities, such as 1/2.

### The Quantum Computing Thesis

Is there any quantum analog of the Church-Turing Thesis? One can re-state the Church-Turing Thesis using quantum Turing machines, but as I said (already in Chap. ), as far as computability is concerned we do not get more than we do with classical Turing machines. This is because the advantage of quantum Turing machines is only in their ability to perform some parallel computations. If time and space of computations are not limited, we can simulate parallel computations by sequential ones. So the advantage is apparent only if we consider the complexity of computations. But then we can ask a version of the thesis with limited computational resources: Are quantum circuits the best possible computational device? The precise meaning of this question is:

Can every physical instrument be simulated by at most a polynomially larger quantum circuit?

Since we are not aware of any phenomena that would require more complex computations, we conjecture that the answer is positive. This is the Quantum Computing Thesis.

It is unlikely that one can answer such a question before we have a unified theory of all physical phenomena. But maybe one can at least show that we cannot get more computational power from quantum theory than we have in quantum circuits. If this is true, then quantum circuits (if they are ever constructed) will not only be a useful computational device, but they will also be useful experimental devices for testing quantum theory.

However, even before quantum circuits with a sufficiently large number of quantum bits are constructed, we can at least use the concept of a quantum circuit for thought experiments, as quantum circuits can easily be used to model complex situations. When thinking about them we can focus on information theoretical aspects and we are not distracted by physical phenomena that are specific to physical experiments. Above I have shown a small example of circuits for the problem of measurement, but one can study much larger circuits.

### Reversible Computations

Quantum circuits are not generalizations of classical circuits. Recall that quantum circuits are reversible, which property is, in general, not satisfied by classical circuits. Although one can efficiently simulate classical circuits by quantum circuits, it is not a trivial task. Fortunately, reversibility is the only property that we need—a reversible classical circuit can be directly interpreted as a quantum circuit.

What does it precisely mean for a computation to be reversible? I will talk about computations in general, as the essence is the same for circuits and Turing machines. A computation is a sequence of states. A transition from a given state to the next one is done by an elementary operation. In particular, the next state is uniquely determined by the previous one. But in general, the previous state does not have to be determined uniquely from the current state; some information may be lost. (A side effect of the loss of information is the heat that processors have to dissipate.) In a reversible computation the previous state must be uniquely determined by the current state. But this is not enough; we want to be able to actually reverse the entire computation, that is, to start with the output and compute backwards to the input. Therefore, we require that, given a state, the previous state should be computable by an elementary operation.

It is not difficult to show that every computation can be turned into a reversible one without substantially increasing the running time, or the circuit size. The basic trick is very simple: record all the history of the computation. It works like the editor I am now using. If I erase something by mistake, it is not a disaster; I can simply invoke the command undo because what I erased on the screen remains stored somewhere in the memory of the computer.
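The history trick can be sketched in a few lines of Python. This is only an illustration under assumptions: the elementary operation `step` (integer halving, which forgets the lowest bit) and the function names are hypothetical and are not taken from the text.

```python
def step(state):
    """One irreversible elementary operation: halving drops the lowest bit."""
    return state // 2

def run_with_history(x, steps):
    """Run `steps` operations, recording every state before it is overwritten."""
    history = []
    state = x
    for _ in range(steps):
        history.append(state)   # remember what is about to be lost
        state = step(state)
    return state, history

def undo(state, history):
    """Reverse the computation by popping the recorded states one by one."""
    while history:
        state = history.pop()
    return state

final, hist = run_with_history(45, 3)
print(final, hist)        # 5 [45, 22, 11]
print(undo(final, hist))  # 45
```

The price of reversibility here is the extra memory for `history`, which grows with the running time.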

If we wanted only to show that quantum computers can simulate classical computers, this would suffice. However, what we also need is to know whether we can use well-known classical algorithms as subroutines in quantum algorithms. A reversible algorithm can always be performed by a quantum circuit, but if used inside of a quantum algorithm the additional bits that the algorithm produces may cause problems. In particular, the simulation based on recording the full history has a serious problem: it creates a lot of garbage. In classical computers we can erase all the data that we do not need for the rest of the computation, or we can simply ignore it. In quantum computations we can neither erase nor ignore it. The garbage data contains information that may result in entanglements at places where we do not want it. Recall that truly quantum phenomena can occur only when some information is not present. (For example, the Mach-Zehnder interferometer does not work if the information about the path of the particle is recorded.) Thus, if possible, the redundant bits should be eliminated from reversible computations.

Therefore, we need a better simulation. Again, another simple trick suffices to show that we can erase all the history except for the input data. More formally, if we have an algorithm for computing
$$x\mapsto f(x),$$
then there is a reversible algorithm for computing
$$x\mapsto \bigl(x,f(x)\bigr)$$
whose running time is not essentially longer. The whole point is to erase the history of the computation gradually and in the reverse order. The crucial observation is that if we erase the last item of the history, we can restore it very easily because it is determined by the previous item. Hence the action of erasing the last item is reversible! Thus we can continue until only the input data x remains on the history record.
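A minimal sketch of this erasing procedure, under assumptions: the elementary operation `step` is a hypothetical, non-injective map on small integers, the history is a Python list, and reversible erasure of an item is modelled by XORing it with the value recomputed from its predecessor.

```python
def step(state):
    """Hypothetical elementary operation; it is not one-to-one."""
    return (state * state + 1) % 101

def forward(x, steps):
    """Compute forwards, keeping the full history x = s0, s1, ..., s_steps."""
    tape = [x]
    for _ in range(steps):
        tape.append(step(tape[-1]))
    return tape

def erase_last(tape):
    """Erase the last history item reversibly: it equals step(previous item)."""
    tape[-1] ^= step(tape[-2])   # XOR with a recomputable value leaves 0
    assert tape[-1] == 0
    tape.pop()                   # the cell is blank again

def reversible_f(x, steps):
    """Return (x, f(x)) with all intermediate history erased."""
    tape = forward(x, steps)
    output = 0
    output ^= tape[-1]           # reversible copy of the result to a fresh register
    while len(tape) > 1:         # erase the history in the reverse order
        erase_last(tape)
    return tape[0], output

print(reversible_f(7, 3))
```

Every operation used (XOR with a value determined by other registers) is its own inverse, which is why the whole procedure can be run backwards.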

What about the input bits x, can we get rid of them too? A necessary condition is, clearly, that f is one-to-one, otherwise we would lose information. The answer to this question is, again, not difficult. I will state it in terms of polynomial time algorithms. This theorem is due to C.H. Bennett [19].

### Theorem 41

Given a one-to-one function f, one can compute $$x\mapsto f(x)$$ reversibly in polynomial time if and only if one can compute both f and its inverse function $$f^{-1}$$ in polynomial time.

The forward direction is trivial. The proof of the opposite direction is based on the fact mentioned before the theorem and a simple idea. For the sake of symmetry, let us denote f(x) by y and express our assumption as follows: we can compute in polynomial time $$x\mapsto y$$ and $$y\mapsto x$$. This implies that we can also compute reversibly in polynomial time x→(x,y) and y→(y,x). Clearly, we can switch x and y in (x,y). Since the computation y→(x,y) is reversible, we can compute (x,y)→y reversibly in polynomial time. Combining the computations x→(x,y) and (x,y)→y, we get a reversible polynomial time computation of $$x\mapsto y$$.
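The idea of the proof can be illustrated with a small sketch. The concrete function is an assumption made only for the example: multiplication by a constant `C` modulo `N`, which is one-to-one on 0,…,N−1 when `C` and `N` are coprime and whose inverse is multiplication by the modular inverse of `C`. Registers are modelled as integers and reversible updates as XOR.

```python
N = 35                   # hypothetical modulus
C = 3                    # coprime to N, so x -> C*x mod N is one-to-one
C_INV = pow(C, -1, N)    # modular inverse (needs Python 3.8 or newer)

def f(x):
    return (C * x) % N

def f_inv(y):
    return (C_INV * y) % N

def compute_without_input(x):
    """Turn the register pair (x, 0) into (0, f(x)) using only reversible updates."""
    reg_x, reg_y = x, 0
    reg_y ^= f(reg_x)        # x -> (x, y), as in the fact before the theorem
    reg_x ^= f_inv(reg_y)    # (x, y) -> y: the input is uncomputed via the inverse
    return reg_x, reg_y

print(compute_without_input(4))   # (0, 12)
```

The second update is the reverse of the computation y→(x,y); it clears the input register precisely because `f_inv(f(x)) == x`.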

The theorem, in particular, implies that in order for f to be computable by a polynomial size quantum circuit (without having to store anything but the output value f(x)), it suffices that f and its inverse function $$f^{-1}$$ are computable in polynomial time (on a classical Turing machine). On the other hand, any one-way function is an example of a function for which this is not possible.

### Quantum Algorithms

If a quantum computer is ever built, it will most likely be a physical realization of the quantum circuits, as described in previous sections. Hence the best way to define an algorithm is by means of quantum circuits. But an algorithm is not a circuit; it is an idea how to compute a function. Thus when presenting a quantum algorithm, one should first explain the essence by a less formal description and only then give the circuits for the procedures used in the algorithm.

We do not know any exact quantum algorithms that are faster than classical ones. The exactness means that when computing a Boolean function f for a given input x, we obtain the output f(x) with amplitude 1, which means that the measurement of the output always gives the string of bits f(x). Thus it is possible that quantum circuits that compute Boolean functions exactly are not more powerful than reversible classical circuits. The quantum algorithms that apparently are superior to any classical algorithms are not exact; they produce the required output only with a sufficiently large probability. So they behave like probabilistic algorithms, but their essence is different. The output is a linear superposition of strings of bits, one of which is the output value f(x). The string that we need should have a large amplitude, so that we get it with large probability. As in probabilistic algorithms, ‘large’ means at least 1/2.

Quantum algorithms are usually presented as a combination of classical and quantum computation. Typically, there is a preprocessing classical phase, then a quantum computation, and finally, a classical postprocessing phase. The main reason for presenting quantum algorithms in this way is that it is easier to prove for a classical algorithm that it is correct (in most cases only well-known algorithms are used in the classical parts of the quantum algorithms, thus one does not have to prove their correctness at all). Also it is good to know which particular part of the algorithm uses quantum effects in an essential way. Another reason is that researchers would like to demonstrate that nontrivial quantum computations are possible. However, the number of quantum bits that the current experimental methods are able to realize is very small. Therefore, it is an advantage to strip a quantum algorithm of everything that does not have to be quantum. That said, it is always possible to perform the entire algorithm on a quantum circuit.

However, not everything that looks classical can be taken out of quantum computations. The typical structure of the quantum phase of quantum algorithms is that we first apply a quantum operation to form a linear superposition of many basis states, then we apply a classical reversible algorithm to each of these states, and then we apply a quantum operation again. One may get the impression that the middle part should be easier to realize, since it is classical. But one should always bear in mind that the algorithm is working with a linear superposition of states also in this part of the computation and we need to preserve the quantum nature of the superposition. Thus one cannot use classical electronic components; we need to compute with quantum bits also there.

To get a feel for quantum algorithms, I will briefly sketch the most interesting and the most important quantum algorithm, which is Shor’s algorithm for factoring natural numbers. Shor’s algorithm allows us to factor numbers in time that is polynomial in the length of the number. Recall that we do not know any polynomial time probabilistic algorithm for factoring and the commonly accepted conjecture is that no such algorithm exists. Furthermore, if successfully implemented, Shor’s algorithm would break many of the currently used cryptographic protocols.

Suppose we want to factor a composite number N. Think of N as a medium size number, which means that the number of digits is small enough so that we can compute with it, but N itself is so large that we cannot examine any substantial part of the set of numbers less than N. Several algorithms, running in exponential time, are based on the following simple observation attributed to Fermat. If we find two numbers a and b such that
$$a^2\equiv b^2\ \mathbin {\mathrm {mod}}N\quad\mbox{and}\quad a\not\equiv\pm b\ \mathbin {\mathrm {mod}}N,$$
then
$$(a+b) (a-b)\equiv 0\ \mathbin {\mathrm {mod}}N \quad\mbox{and}\quad a+b,a-b\not\equiv 0\ \mathbin {\mathrm {mod}}N.$$
Hence both a+b and a−b contain nontrivial factors of N. Thus having such a pair a, b we can find a factor of N by computing the greatest common divisor of a+b and N (or of a−b and N). The greatest common divisor can be computed in polynomial time using the ancient Euclid algorithm. The problem is, however, how to find such a pair a, b efficiently.
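A small numerical illustration of Fermat's observation follows. The numbers N=91, a=10, b=3 are chosen only for this example, and `factor_from_squares` is a hypothetical helper name.

```python
from math import gcd

def factor_from_squares(a, b, n):
    """Given a^2 = b^2 (mod n) and a != +-b (mod n), return a nontrivial factor of n."""
    assert (a * a - b * b) % n == 0                    # a^2 = b^2 (mod n)
    assert (a - b) % n != 0 and (a + b) % n != 0       # a != +-b (mod n)
    return gcd(a + b, n)                               # Euclid's algorithm

# 10^2 = 100 = 9 = 3^2 (mod 91), while 10 is neither 3 nor -3 modulo 91.
print(factor_from_squares(10, 3, 91))   # 13
print(gcd(10 - 3, 91))                  # 7
```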

One possibility is to find, for some c, its multiplicative period r in the ring of integers modulo N. The period r is the smallest positive integer such that $$c^{r}\equiv 1\ \mathbin {\mathrm {mod}}N$$. If r is even and if $$c^{r/2}\not\equiv -1\ \mathbin {\mathrm {mod}}N$$, then we can take $$a=c^{r/2}$$ and b=1. Since r is minimal such that $$c^{r}\equiv 1\ \mathbin {\mathrm {mod}}N$$, the condition $$c^{r/2}\not\equiv 1\ \mathbin {\mathrm {mod}}N$$ is also guaranteed and we get a factor of N using the above argument. One can show, using elementary number theory, that if we choose 1<c<N at random, then with probability at least 1/2 either the two conditions above are satisfied, or already c has a common factor with N. Hence, the only essential problem is to find the period.
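The reduction of factoring to period finding can be sketched as follows. The period is found here by brute force, which takes time exponential in the length of N; this classical loop merely stands in for the quantum step and is an assumption of the sketch, as are the function names.

```python
from math import gcd

def multiplicative_period(c, n):
    """Smallest r > 0 with c^r = 1 (mod n); assumes gcd(c, n) == 1. Brute force."""
    r, power = 1, c % n
    while power != 1:
        power = (power * c) % n
        r += 1
    return r

def factor_via_period(c, n):
    """Return a nontrivial factor of n, or None if this choice of c fails."""
    g = gcd(c, n)
    if g > 1:
        return g                  # c already has a common factor with n
    r = multiplicative_period(c, n)
    if r % 2 == 1:
        return None               # the period is odd
    a = pow(c, r // 2, n)
    if a == n - 1:
        return None               # c^(r/2) = -1 (mod n)
    return gcd(a + 1, n)          # a^2 = 1^2 (mod n) and a != +-1 (mod n)

print(multiplicative_period(7, 15))   # 4
print(factor_via_period(7, 15))       # 5
print(factor_via_period(14, 15))      # None
```

In practice one would repeat `factor_via_period` with fresh random values of c until it succeeds.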

We do not know how to compute multiplicative periods in polynomial time on classical machines, but it is possible to do it on quantum machines—which is the key component of Shor’s algorithm.

Let us forget about the factoring problem for a moment and focus on the problem of computing the period r of a function f defined on integers. The idea of the quantum algorithm for this problem comes from the branch of mathematics called harmonic analysis. It is well known that a tone can be decomposed into pure tones, tones whose form is a sinusoid. Mathematically this is a decomposition of a periodic function f(x), whose period divides 2π, into an infinite sum of the functions 1, sin x, cos x, sin 2x, cos 2x, … multiplied by suitable constants. The constant at a particular function of this set depends on the degree to which the function agrees or disagrees with f. In the physical world such a good agreement can be observed as various forms of resonance and interference. If we want to determine the pitch of a tone, we can play it near a string instrument and watch which string resonates. In particular, if
$$f(x)= a_1\sin x+b_1\cos x+a_2\sin 2x +b_2\cos 2x \dots$$
and the only nonzero coefficients are at sin ℓx and cos ℓx for which p divides ℓ, then we know that the period of the function is 2π/p.

To explain the main idea of the algorithm, we will consider the following simplified problem. Suppose we have to determine an unknown natural number p>2 which is represented by a regular p-gon with the center at the point (0,0) and vertices at the unit circle; otherwise its position is completely random. The information that we can get is a randomly chosen vertex of the polygon. Using classical means, we cannot infer anything about p because we do not know how the polygon is rotated, hence what we get is a random point on the unit circle. Now suppose we can get more samples of the vertices, but each time we ask for another vertex, the polygon is rotated by a random angle. Again, the information that we get are simply random points on the unit circle.

However, the quantum version of this problem is solvable. In the quantum setting the vertices of the polygon are not given to us randomly; instead we get a quantum superposition of all vertices of the p-gon, each vertex with the same amplitude $$1/\sqrt{p}$$. If we do a measurement on this quantum state, we get a random vertex. But we can first transform it to another quantum state and measure the new state. (Instead of applying the transformation we can also view it as doing a different measurement on the same state.) Then we can learn relevant information about p.

Mathematically, this means that if the lines through the vertices form angles $$k\frac{2\pi}{p}+h$$ with the x-axis, for k=0,1,…,p−1 and some real number h, then we get the state
$$\sum_{k=0}^{p-1} \frac{1}{\sqrt{p}}\Big|k\frac{2\pi}{p}+h\Big\rangle,$$
(5.10)
which represents the linear superposition of the vertices of the p-gon. Next we apply a suitable unitary transformation to obtain a state of the form
$$\sum_{\ell\ \mathrm{divisible}\ \mathrm{by}\ p} \alpha_\ell| \ell\rangle,$$
(5.11)
where we assume some upper bound M on ℓ in order to get a finite sum (and the sum of the squares of the absolute values of $$\alpha_\ell$$ is 1). If all these nonzero amplitudes $$\alpha_\ell$$ have the same absolute value, then, by measuring the state, we get a random multiple of p that is less than M. If we get enough samples, we obtain multiples of p from which we will be able to determine p with high probability. The key technical problem is how to compute (5.11) from (5.10). I will come back to it shortly.
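One simple way to determine p from several random multiples is to take their greatest common divisor. The sketch below simulates the measurement of (5.11) by a classical random number generator; that substitution, the parameter values, and the helper names are assumptions made only for illustration.

```python
import random
from functools import reduce
from math import gcd

def sample_multiple(p, bound, rng):
    """Stand-in for measuring (5.11): a uniformly random multiple of p below bound."""
    return rng.randrange(0, bound, p)

def recover_p(samples):
    """The gcd of the samples is a multiple of p and equals p with high probability."""
    return reduce(gcd, samples)

rng = random.Random(0)
samples = [sample_multiple(7, 7 * 1000, rng) for _ in range(20)]
print(recover_p(samples))
```

The gcd equals p exactly when the cofactors ℓ/p of the samples have no common divisor, which becomes very likely after a moderate number of samples.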

The general problem of finding a period of a periodic function f can be reduced to the above special case as follows. Suppose f is defined on integers and has period r. We take a number M that is sufficiently larger than r and consider the function f on the set {0,1,…,M−1}. Furthermore, I will assume that f takes on r distinct values and M is divisible by r. The latter assumption is not justified; I will explain later how to eliminate it. Now rather than being defined on integers, we can think of f as being defined on M points of the unit circle which form the regular M-gon with one point on the x-axis. Let w be one of the r values of f. The points on which the value of f is w form a regular p-gon with p=M/r. Thus getting a random sample of an argument and a value of f, (a,f(a)), is the same as getting a random vertex from one of these p-gons: the argument a tells us the point on the unit circle and the value f(a) tells us the ‘name’ of the polygon. Since M is exponentially large, the probability that we obtain vertices from the same polygon is negligible. (This explains why in the simplified problem above we rotate the polygon by a random angle each time we should get a new sample.) Thus using the solution for the problem about polygons, we get p, and then we compute the period of the function r=M/p.

This was a high level description of the quantum algorithm for finding periods with a lot of details omitted. Some of these are less important, some are essential. One of the inessential technicalities concerns the divisibility of M by r. Finding a random multiple of r is equivalent to finding r itself, thus we cannot assume that r divides M. But taking M not divisible by r causes only minor problems; we just do not get precise numbers and have to do some rounding to get integers.23

What is more important is how the main transformations can be realized by quantum circuits. The computation of the period of the function f starts with computing the linear superposition of all natural numbers a less than M. Since we have the freedom to take any M in a large range (we need $$M>2r^{2}$$ and M must have length bounded by a polynomial in the length of the input N), we take M to be a power of 2. Then such a superposition is the superposition of all binary strings of the length $$\log_2 M$$. This is fairly easy, as I showed on page 458.

The next step is to compute the linear superposition of all pairs (a,f(a)). To this end we only need to compute reversibly (a,f(a)) from a. As shown in the previous section, this can be done by a polynomial size reversible circuit if f is computable in polynomial time. The particular function that we use in factoring is $$f(x)=c^{x}\ \mathbin {\mathrm {mod}}N$$, which is known to be computable in polynomial time.
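The standard reason why modular exponentiation takes polynomial time is repeated squaring, which needs a number of multiplications proportional to the length of x rather than to x itself. The text does not say which method it has in mind, so the following is a sketch of one common choice, with a hypothetical function name.

```python
def mod_exp(c, x, n):
    """Compute c^x mod n by square and multiply, scanning the bits of x."""
    result = 1
    base = c % n
    while x > 0:
        if x & 1:                       # the current lowest bit of x is 1
            result = (result * base) % n
        base = (base * base) % n        # square for the next bit
        x >>= 1
    return result

print(mod_exp(7, 4, 15))   # 1
```

Python's built-in `pow(c, x, n)` computes the same value.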

The last transformation is called the discrete quantum Fourier transform. It is a transformation based on the complex function $$\mathrm{e}^{\mathrm {i}x}$$ defined, for all real numbers x, by
$$\mathrm{e}^{\mathrm {i}x} = \cos x + \mathrm {i}\sin x.$$
Viewed geometrically in the plane of complex numbers, as x goes from 0 to 2π, the values of $$\mathrm{e}^{\mathrm {i}x}$$ follow the unit circle (in the counter-clockwise direction). Thus one can easily see that if we sum the values of $$\mathrm{e}^{\mathrm {i}x}$$ for $$x=k\frac{2\pi}{p}+h$$ and k=0,1,…,p−1, we get 0, the center of gravity of the p-gon.
Now consider the function $$\mathrm{e}^{\mathrm {i}\ell x}$$ for some natural number ℓ>1. If ℓ is not divisible by p, and we sum the values of this function for $$x=k\frac{2\pi}{p}+h$$ and k=0,1,…,M−1, where M is a common multiple of p and ℓ, then, again, one can easily show by computation that the sum is 0. But if ℓ is divisible by p, then we sum the terms of the form
$$\mathrm{e}^{\mathrm {i}\ell(k\frac{2\pi}{p}+h)}=\mathrm{e}^{\mathrm {i}\ell k\frac{2\pi}{p}+\mathrm {i}\ell h}= \mathrm{e}^{\mathrm {i}\ell k\frac{2\pi}{p}} \mathrm{e}^{\mathrm {i}\ell h}= \bigl(\mathrm{e}^{\mathrm {i}2\pi\frac{\ell}{p}}\bigr)^k \mathrm{e}^{\mathrm {i}\ell h}=\mathrm{e}^{\mathrm {i}\ell h}$$
because $$\frac{\ell}{p}$$ is an integer and $$\mathrm{e}^{\mathrm {i}x}=1$$ if x is a multiple of 2π. Hence the terms add up to the nonzero value $$M\mathrm{e}^{\mathrm {i}\ell h}$$. Thus we get a nonzero value if and only if ℓ is divisible by p, which is the resonance principle mentioned above.
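The cancellation and the resonance can be checked numerically. The sketch below sums the values of the function over the vertices of a rotated p-gon, possibly going around it several times; the values p=5, h=0.7 and the function name are arbitrary choices for the example.

```python
import cmath

def polygon_sum(ell, p, h, rounds=1):
    """Sum of exp(i*ell*x) over x = k*2*pi/p + h for k = 0, ..., rounds*p - 1."""
    return sum(cmath.exp(1j * ell * (2 * cmath.pi * k / p + h))
               for k in range(rounds * p))

p, h = 5, 0.7
for ell in (2, 3, 5, 10):
    # Absolute value is (numerically) 0 unless p divides ell; then it is the number of terms.
    print(ell, round(abs(polygon_sum(ell, p, h)), 9))
```

Only the absolute value is printed, since the common phase factor does not affect whether the terms cancel.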

Let us now consider the Fourier transform applied to a p-gon on the unit circle. The classical Fourier transform consists of the sums of the above form for ℓ in a certain range. Shor’s quantum Fourier transform is a unitary transformation that replaces the vertices of the p-gon by the numbers ℓ, and their amplitudes are the sums considered above, suitably normalized. Thus we get the quantum state (5.11) which enables us to determine p. Let us just recall that in the factoring algorithm, instead of zeros and non-zeros, we only have small amplitudes and large amplitudes respectively because we cannot assume the divisibility of M by p.
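The effect of the transform on a polygon state can be simulated classically for small sizes. The sketch below assumes the idealized case where r divides M, represents the p-gon as an equal superposition of the points offset, offset+r, offset+2r, … among M points, and uses a direct quadratic-time discrete Fourier transform instead of a quantum circuit. All names and parameter values are illustrative.

```python
import cmath
import math

def fourier_amplitudes(state):
    """Unitary discrete Fourier transform of a state vector (direct O(M^2) sum)."""
    m = len(state)
    return [sum(state[a] * cmath.exp(2j * cmath.pi * a * ell / m) for a in range(m))
            / math.sqrt(m) for ell in range(m)]

def polygon_state(m, r, offset):
    """Equal superposition of the p = m/r points a with a = offset (mod r)."""
    p = m // r
    return [1 / math.sqrt(p) if a % r == offset else 0.0 for a in range(m)]

M, r, offset = 32, 4, 3          # p = M/r = 8
probs = [abs(alpha) ** 2 for alpha in fourier_amplitudes(polygon_state(M, r, offset))]
support = [ell for ell, pr in enumerate(probs) if pr > 1e-9]
print(support)                   # [0, 8, 16, 24]: exactly the multiples of p below M
```

The unknown rotation (`offset`) only changes the phases of the amplitudes, not the set of values of ℓ that can be observed.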

Having a nice formula for a unitary transformation does not guarantee that it can be computed by a polynomial size circuit. Indeed, designing a quantum circuit for discrete quantum Fourier transform is a nontrivial task. I will describe it in Notes.

This amazing algorithm and other applications of the quantum Fourier transform are, unfortunately, good only for very special problems. Ideally, we would like to show that all problems from a complexity class, such as NP, can be computed by quantum machines in polynomial time. According to our experience so far, that is very unlikely. It seems that the problems that are solvable by quantum machines in polynomial time must have a very special nature, or they are already efficiently solvable by classical machines (see The hidden subgroup problem on page 475).

If we do not insist on having a polynomial time algorithm and are satisfied with any improvement in the running time, then it is different. Grover’s algorithm helps us to solve search problems for which we do not have any nontrivial algorithm, only the thorough search of all instances. Such is, for example, the problem of finding a satisfying assignment for a Boolean formula. For this problem, all known algorithms run in time $$2^{n}$$ (unless the formula is of a special form). Grover showed that using quantum machines we can speed up the search quadratically. Thus in the case of Boolean formulas with n variables, instead of searching all $$2^{n}$$ assignments, we can find a satisfying assignment in time $$c\sqrt{2^{n}}$$, for some constant c, which is $$c2^{n/2}$$. If quantum computers were built and had the same speed as classical ones, this would enable us to solve instances of this problem that are twice as large.

Another example of a potential application of Grover’s algorithm is searching for a secret key. A quadratic reduction of the time for this problem could be a substantial help, but there is a simple remedy: take keys that are twice as long.

This algorithm is also an example of how we can prove in a particular setting that quantum computations help. The setting is that we use a black box model in which the box acts as an “oracle” that tells us whether the given string of bits is right or not. If we use classical computations, we can only check systematically, or randomly, all inputs, since we do not know how the box computes and we are not allowed to open it. If we search systematically, we need to search all strings in the worst case; if we do it probabilistically, we will check half of them on average. But if we do it in the quantum way, the number of steps needed is approximately only the square root of the size of the search space.
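The text does not describe how Grover's algorithm works. For orientation, here is a sketch of its usual textbook form (a phase-flipping oracle followed by inversion about the mean, repeated about (π/4)·√(2ⁿ) times), simulated classically on a vector of amplitudes. The simulation itself takes exponential time; what it illustrates is the number of oracle queries a quantum machine would make. The function names and the marked index 200 are assumptions of the example.

```python
import math

def grover_search(n_bits, is_marked):
    """Simulate Grover's iteration; return (most likely index, its probability, queries)."""
    size = 2 ** n_bits
    amps = [1 / math.sqrt(size)] * size                 # equal superposition
    queries = int(math.floor(math.pi / 4 * math.sqrt(size)))
    for _ in range(queries):
        # One oracle query: flip the sign of the amplitudes of the marked strings.
        amps = [-a if is_marked(i) else a for i, a in enumerate(amps)]
        # Inversion about the mean amplifies the marked amplitudes.
        mean = sum(amps) / size
        amps = [2 * mean - a for a in amps]
    probs = [a * a for a in amps]
    best = max(range(size), key=probs.__getitem__)
    return best, probs[best], queries

best, prob, queries = grover_search(8, lambda i: i == 200)
print(best, queries, prob > 0.99)   # 200 12 True
```

With 256 candidate strings, 12 queries suffice, compared with 128 on average for a classical random search.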

### Notes

1. A more precise explanation of the Mach-Zehnder interferometer. I will denote by |↗〉 and |↘〉 a photon flying in the corresponding directions. When the photon is reflected from the beam splitter (half-silvered mirror) or from a mirror, then not only |↗〉 changes to |↘〉 and vice versa, but also the amplitude rotates by i. Thus the sequence of states is as follows:
$$\begin{array}{c@{\quad}c@{\quad}c@{\quad}c@{\quad}c} |\!\nearrow\rangle & \mapsto & \begin{array}{c@{\quad}c@{\quad}c} \frac{1}{\sqrt{2}}\ |\! \nearrow\rangle & & \frac{1}{\sqrt{2}}\mathrm {i}\ |\!\searrow\rangle \\ + & \mapsto & + \\ \frac{1}{\sqrt{2}}\mathrm {i}\ |\!\searrow\rangle & & - \frac{1}{\sqrt{2}}\ |\!\nearrow\rangle\\ \end{array} & \mapsto & -\ |\!\nearrow\rangle \end{array}$$
The three unitary transformations are defined by the following matrices:
$$\left ( \begin{array}{c@{\quad}c} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\mathrm {i}\\[6pt] \frac{1}{\sqrt{2}}\mathrm {i}& \frac{1}{\sqrt{2}} \end{array} \right ) \quad \left ( \begin{array}{c@{\quad}c} 0 & \mathrm {i}\\ \mathrm {i}& 0 \end{array} \right ) \quad \left ( \begin{array}{c@{\quad}c} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\mathrm {i}\\[6pt] \frac{1}{\sqrt{2}}\mathrm {i}& \frac{1}{\sqrt{2}} \end{array} \right ).$$
The state of the photon is not identical after passing through the interferometer; its amplitude is rotated by −1, but the detector cannot distinguish such states.
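The sequence of states above can be verified by multiplying out the three matrices. The sketch below uses plain Python complex numbers and represents |↗〉 by the vector (1, 0) and |↘〉 by (0, 1); this encoding and the names are assumptions of the example.

```python
import math

S = 1 / math.sqrt(2)
BEAM_SPLITTER = [[S, S * 1j], [S * 1j, S]]
MIRRORS = [[0, 1j], [1j, 0]]

def apply(matrix, state):
    """Multiply a 2x2 matrix by a 2-component state vector."""
    return [sum(matrix[i][j] * state[j] for j in range(2)) for i in range(2)]

state = [1, 0]                                   # photon flying up and to the right
for gate in (BEAM_SPLITTER, MIRRORS, BEAM_SPLITTER):
    state = apply(gate, state)

# Up to rounding, the result is -1 times the initial state.
print([complex(round(z.real, 9), round(z.imag, 9)) for z in state])
```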

2. The mathematics of quantum circuits. The tensor product is a natural concept defined in every category of structures. Here I will use an explicit definition for vector spaces, which should be more comprehensible for those who have never heard about it. For two complex vector spaces V and W of dimensions c and d, the tensor product is a space Z of dimension cd together with a certain bilinear mapping from the set-theoretical product V×W into Z. Z is denoted by V⊗W; the mapping is also denoted by ⊗. If we choose bases $$v_1,\dots,v_c$$ of V and $$w_1,\dots,w_d$$ of W, then the vectors $$v_i\otimes w_j$$ for i=1,…,c and j=1,…,d form a basis of V⊗W. Furthermore, we will assume that the mapping ⊗ is chosen so that it preserves unit lengths, that is, the tensor product of two vectors of length 1 has length 1. The tensor product of more than two vector spaces is defined in the same way.

Let us apply this concept to quantum bits. We define a string of n quantum bits to be a vector of length 1 of the tensor product of n copies of the two-dimensional complex space $$\mathbb {C}^{2}$$, which is denoted by $$(\mathbb {C}^{2})^{\otimes n}$$. In $$\mathbb {C}^{2}$$ we take one orthonormal basis and denote its elements by |0〉,|1〉. This determines the following basis of the tensor product of n such spaces: the set of the vectors $$|i_1\rangle\otimes\cdots\otimes|i_n\rangle$$, where $$i_1,\dots,i_n\in\{0,1\}$$. In order to simplify notation, we abbreviate $$|i_1\rangle\otimes\cdots\otimes|i_n\rangle$$ by $$|i_1,\dots,i_n\rangle$$.

Given two linear mappings L:V→V′ and K:W→W′, their tensor product L⊗K:V⊗W→V′⊗W′ is defined in a natural way: we define (L⊗K)(u⊗v)=L(u)⊗K(v) and extend it to the whole space V⊗W by linearity.

Let $$U:\mathbb {C}^{2}\otimes \mathbb {C}^{2}\to \mathbb {C}^{2}\otimes \mathbb {C}^{2}$$ be a unitary transformation; we think of U as representing a binary quantum gate (a quantum gate that works with two quantum bits). Let n>2 and suppose we apply U to the first two quantum bits of the string of n quantum bits. Then, as a unitary transformation of the whole space $$(\mathbb {C}^{2})^{\otimes n}$$, it corresponds to $$U\otimes I_{n-2}$$, where $$I_{n-2}$$ denotes the identity mapping on $$(\mathbb {C}^{2})^{\otimes (n-2)}$$ (which is the tensor product of n−2 copies of the identity mapping on $$\mathbb {C}^{2}$$). If this unitary transformation is applied to two non-consecutive quantum bits, then we would have to extend our notation to express it in such a way, but it is clear that such a mapping is the same up to a permutation of the terms of the tensor product $$(\mathbb {C}^{2})^{\otimes n}$$. Similarly, if we have a unitary transformation $$U':\mathbb {C}^{2}\to \mathbb {C}^{2}$$, i.e., a transformation of one quantum bit, the corresponding transformation of the whole space is $$U'\otimes I_{n-1}$$ (up to permuting the terms in the tensor product).
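In coordinates, the tensor product of linear mappings is the Kronecker product of their matrices. The following sketch builds the matrix of a unary gate acting on the first of n quantum bits; the choice of the bit-flip matrix as the gate, the ordering of basis vectors (the first quantum bit is the most significant bit of the index), and the helper names are assumptions of the example.

```python
from functools import reduce

def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[a_ij * b_kl for a_ij in row_a for b_kl in row_b]
            for row_a in a for row_b in b]

I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]      # a unary gate (bit flip), used only as an example

def on_first_qubit(gate, n):
    """Matrix of gate (x) I_{n-1} on the space of n quantum bits."""
    return reduce(kron, [gate] + [I2] * (n - 1))

def apply(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

basis_000 = [1, 0, 0, 0, 0, 0, 0, 0]               # the basis state |0,0,0>
print(apply(on_first_qubit(X, 3), basis_000))      # [0, 0, 0, 0, 1, 0, 0, 0], i.e. |1,0,0>
```

The matrix has dimension 2ⁿ, so this explicit form is useful only for very small n.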

We will restrict ourselves to the circuits that use at most binary gates; these transformations will be our elementary operations. So given an arbitrary unitary transformation $$T:(\mathbb {C}^{2})^{\otimes n}\to(\mathbb {C}^{2})^{\otimes n}$$, we want to know the minimal number k of elementary transformations into which it can be decomposed. This k is the quantum circuit complexity of T.

This is a very clean and natural mathematical concept, but what we actually need is a little different. There are two modifications that we have to add.

1. As in classical computations, we should expect that in order to compute efficiently, we will often need more memory bits than just those that store the input bits. Thus when computing some $$T:(\mathbb {C}^{2})^{\otimes n}\to(\mathbb {C}^{2})^{\otimes n}$$ we should be allowed to use more than n quantum bits. As usual we will assume that the registers that do not contain input bits are initially set to 0.

2. We should keep in mind that the unitary transformation computed by a quantum circuit only serves to compute a Boolean function. The input data that we want to use are basis states, actual strings of zeros and ones, and what we can read from the output is again one of the basis states. It would be natural to require that the unitary transformation T computed by a quantum circuit must map a basis state to a basis state (thus T would be the quantum extension of a permutation of the set $$\{0,1\}^{n}$$). Unfortunately, this also seems to be too severe a restriction. Presently, we do not have any example of a quantum circuit of this type that would be significantly smaller than known classical circuits computing the same function. Thus we allow the circuit to output a superposition from which we get the required output value with a reasonable probability. Like in the case of probabilistic circuits, it suffices if the probability is at least $$1/n^{c}$$, where n is the input size and c is a constant, because it enables us to boost the probability to become close to 1 by repeating the computation a polynomial number of times.

3. Quantum complexity classes. It seems natural to define Quantum Polynomial Time, QP, as the class of sets computable by quantum Turing machines in polynomial time. (I have not defined quantum Turing machines, but it is not difficult to imagine what they should be.) In the definition of QP we require that the input-output behavior of a quantum Turing machine be deterministic. Formally, it means that the machine computes a function $$F:\Sigma^{*}\to\{0,1\}$$ in such a way that for a given input $$x\in\Sigma^{*}$$, it outputs F(x) with an amplitude whose absolute value is 1.

This class is not as natural as one would expect. The problem is that when we require quantum Turing machines to compute precisely, we do not get universal machines. Consider just a unary quantum gate. There are infinitely many such gates and it is impossible to simulate them using a finite number of finite-dimensional unitary transformations with absolute precision. The main reason, however, why researchers do not like this class is that it does not seem to help us to compute faster than using classical deterministic Turing machines. Whether or not QP=P is an open problem, but it is even hard to conjecture what is true.

The most important class is the Bounded Error Quantum Polynomial Time, BQP. This is an analog of the probabilistic class BPP, where instead of probabilistic Turing machines we use quantum Turing machines. The condition for accepting an input is the same: if we observe the output of the machine, we see an accepting state with probability at least 2/3 if the input is in the computed set, and with probability at most 1/3 if it is not. This can easily be restated in terms of the amplitudes of the accepting and rejecting states.

BQP contains BPP because instead of using r random bits we can take the quantum superposition of all strings of zeros and ones of length r with equal amplitudes. This can be done using r unary gates H (see page 455) and the rest of the computation is done by a reversible circuit. The commonly accepted conjecture is that BQP contains more sets than BPP. This is supported by the fact that we have polynomial time bounded error quantum algorithms for factoring and the discrete logarithm, whereas no such algorithms are known if we only use probabilistic algorithms. The two problems concern computations of functions, but they can easily be reduced to sets. For example, the problem to determine, for given numbers N and i, if the ith bit of the smallest nontrivial factor of N is 0 is equivalent to factoring integers. This problem is in NP as we can guess the complete factorization of N. However, we do not believe that the whole NP is contained in BQP and we rather conjecture that no NP-complete problem is contained in BQP.

Concerning the upper bounds, the best upper bound on BQP that one can state using the complexity classes defined in this book is BQP⊆PSPACE. Thus we cannot exclude that BQP contains sets that are outside NP; for all we know, it might even contain PSPACE-complete problems.

In quantum computing it is often more convenient to work with circuits rather than Turing machines. Both classes QP and BQP can be defined using uniform families of quantum circuits.

4. A quantum algorithm for linear equations. Let A be an N×N matrix and b a vector of length N. Suppose we want to solve the system of linear equations given by A and b, which means that we want to find a vector x such that Ax=b. A classical algorithm needs time at least $$N^{2}$$ because it has to read the input data.

Now suppose that the matrix is huge, but we have an efficient algorithm to compute the entries of A and b. Moreover, suppose that we only need to know some properties of the solution. Then, in principle, we may be able to compute these properties in time essentially less than N. A.W. Harrow, A. Hassidim and S. Lloyd [113] found a quantum algorithm that, for certain matrices and certain properties of solutions, can solve the task more efficiently than any known classical algorithm. Moreover, they proved that their algorithm is faster than any classical algorithm if P ≠ BQP.

The essence of their algorithm is to compute the quantum state
$$\tfrac{1}{\sqrt{\sum_{i=1}^N |x_i|^2}}\sum _{i=1}^N x_i|i\rangle,$$
where $$x_1,\dots,x_N$$ is a solution. Note that we need only $$\lceil\log_2 N\rceil$$ bits to represent the indices i. For some matrices, the state can also be computed by quantum circuits of size polynomial in log N. Although the complete information about the solution (except for the normalizing factor) is present in the state, we can get very little from this state. A simple measurement gives an index i with probability proportional to $$|x_i|^2$$. So we only learn that $$x_i\neq 0$$ (or that, with high certainty, $$|x_i|$$ is not very small). More sophisticated measurements, which may require further quantum computations, can produce information that is hard to obtain by classical means.
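To see how little a simple measurement reveals, here is a minimal Python sketch. The solution vector `x` is a made-up example (no matrix is actually solved), and the helper names are illustrative. It normalizes the vector so that the squared amplitudes sum to 1 and lists the probability of observing each index.

```python
import math

def solution_state(x):
    """Return the amplitudes x_i / ||x|| of the normalized state sum_i x_i |i>."""
    norm = math.sqrt(sum(abs(xi) ** 2 for xi in x))
    return [xi / norm for xi in x]

def measurement_distribution(x):
    """Probability of observing each index i when the state is measured."""
    return [abs(a) ** 2 for a in solution_state(x)]

x = [3.0, 0.0, -4.0, 0.0]                 # hypothetical solution of some Ax = b, N = 4
qubits = math.ceil(math.log2(len(x)))     # quantum bits needed for the index i
probs = measurement_distribution(x)
print(qubits, probs)
```

The output distribution is concentrated on the indices 0 and 2; neither the signs nor the exact magnitudes of the entries can be read off from a single observation.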

5.

How much information is in one quantum bit? A quantum bit is, by definition, a system from which we can get only two possible values (more precisely it is a system and a particular measurement). Thus the basic tenet is that one quantum bit carries the same amount of information as the classical bit. This is in spite of the infinitely many states in which the quantum bit can be. Yet, a schema has been devised by C.H. Bennett and S.J. Wiesner in which this rule is seemingly violated [20].

Suppose Alice wants to send two bits to Bob. The natural way is simply to send two bits to Bob. Bennett and Wiesner showed that instead Bob can send one quantum bit to Alice and then Alice will need to send only one quantum bit to Bob. Thus it looks as if the quantum bit sent by Alice carries information of two classical bits. But notice that if we interpret the law about quantum bits corresponding to classical bits more liberally, namely, that they have to exchange two quantum bits in order to exchange two bits of information, then the law is not violated.

Let us try to deduce what Alice has to do. Suppose Alice succeeds in sending two bits using a single quantum bit. Then the above law could be violated in a different way. Remember that Alice received one bit from Bob, so the total exchange of bits could be three bits. Should Alice not break the law, she must dispose of the bit that she received. The only way she can do it is to send it back to Bob. Alas, now her task seems even harder—she has to send three bits using one quantum bit!

The clue is actually in sending Bob’s quantum bit back. The trick is to modify Bob’s bit without looking at it and send it back. There are four possible ways to modify it because Alice can flip the bit and switch the sign of the amplitude. Since Bob keeps his own copy of the quantum bit, he can determine in which way the bit was modified. Four possibilities means two bits.

In quantum physics we are never sure that the idea is correct, unless we check it by a mathematical argument. The schema works as follows. Bob prepares the superposition of two bits
$$\begin{array}{c} \frac{1}{\sqrt{2}}|00\rangle + \frac{1}{\sqrt{2}}|11\rangle. \end{array}$$
This means that the two bits are completely entangled. Then he sends the first bit to Alice and keeps the second. Alice applies one of the following four unitary mappings to the received quantum bit
$$\begin{array}{cccc} \left (\begin{array}{c@{\quad }c} 1 & 0\\ 0 & 1 \end{array} \right ) & \left (\begin{array}{c@{\quad }c} 1 & 0\\ 0 & -1 \end{array} \right ) & \left (\begin{array}{c@{\quad }c} 0 & 1\\ 1 & 0 \end{array} \right ) & \left (\begin{array}{c@{\quad }c} 0 & -1\\ 1 & 0 \end{array} \right ). \end{array}$$
Then she sends the bit back to Bob. Thus Bob has one of the following four linear superpositions
$$\begin{array}{c@{\hspace*{.3cm}}c@{\hspace*{.3cm}}c@{\hspace*{.3cm}}c} \frac{1}{\sqrt{2}}|00\rangle \,{+}\, \frac{1}{\sqrt{2}}|11\rangle, & \frac{1}{\sqrt{2}}|00\rangle \,{-}\, \frac{1}{\sqrt{2}}|11\rangle, & \frac{1}{\sqrt{2}}|10\rangle \,{+}\, \frac{1}{\sqrt{2}}|01\rangle, & \frac{1}{\sqrt{2}}|10\rangle \,{-}\, \frac{1}{\sqrt{2}}|01\rangle. \end{array}$$
As these vectors are orthogonal, Bob can determine them by a measurement. Equivalently, there exists a unitary transformation that maps these vectors to basis states of two bits |00〉,|01〉,|10〉,|11〉.
It is always instructive to represent such schemas by quantum circuits, see the circuit in Fig. 5.8. The ternary gate applies one of the four unitary transformations to the third quantum bit of its input bits. The choice of the unitary transformation is controlled by the first two input bits, in a similar way as in the control not gate. This gate can be replaced by two binary gates. I leave it to the reader to figure out the unitary transformations computed by the gates. Once you know what they should do, it is easy.
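The schema can also be checked by a direct simulation. The Python sketch below represents a two-bit state as the list of amplitudes of |00〉, |01〉, |10〉, |11〉 and uses the four matrices given above; the assignment of pairs of classical bits to matrices and all function names are assumptions made for the illustration. Bob's decoding is modelled by comparing the returned state with the four possible superpositions, which works because they are mutually orthogonal.

```python
import math

S = 1 / math.sqrt(2)

# The four unitary maps from the text, indexed by Alice's two classical bits
# (this particular indexing is an arbitrary choice).
ENCODINGS = {
    (0, 0): [[1, 0], [0, 1]],
    (0, 1): [[1, 0], [0, -1]],
    (1, 0): [[0, 1], [1, 0]],
    (1, 1): [[0, -1], [1, 0]],
}

def apply_to_first_bit(u, state):
    """Apply a 2x2 matrix to the first quantum bit of a two-bit state.

    The state is a list of amplitudes for |00>, |01>, |10>, |11>.
    """
    out = [0.0] * 4
    for new_first in range(2):
        for old_first in range(2):
            for second in range(2):
                out[2 * new_first + second] += (
                    u[new_first][old_first] * state[2 * old_first + second]
                )
    return out

def alice_encode(bits):
    """State of the pair after Alice modified and returned the first bit."""
    entangled = [S, 0.0, 0.0, S]  # (|00> + |11>) / sqrt(2), prepared by Bob
    return apply_to_first_bit(ENCODINGS[bits], entangled)

def bob_decode(state):
    """Identify the message by projecting onto the four possible states."""
    for bits in ENCODINGS:
        candidate = alice_encode(bits)
        overlap = sum(c * a for c, a in zip(candidate, state))
        if abs(abs(overlap) - 1) < 1e-9:
            return bits
    raise ValueError("state is not one of the four expected superpositions")

for message in ENCODINGS:
    print(message, bob_decode(alice_encode(message)))
```

Each message is recovered exactly, although only one quantum bit travelled from Alice to Bob.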

6.

More details about computing the period of a function. Let f be a periodic function defined on integers whose period is less than some number N. The length of N (which is approximately $$\log_2 N$$) will be our input size. We assume that f is efficiently computable, namely, it is computable by a classical deterministic algorithm that runs in polynomial time.

Let M be a number which is a power of 2, is sufficiently larger than N, and its length is polynomial in the length of N.

The first step is to compute the linear superposition of all numbers 0,1,…,M−1, all with the same amplitudes:
$$\frac{1}{\sqrt{M}}\sum_{a=0}^{M-1} |a,0 \rangle.$$
The 0 in |a,0〉 indicates that we use more registers for quantum bits than just those for a, and they are initially set to 0. (I have already explained how to produce such a superposition, see page 458.)
Next we compute f(a) and put it into the free memory registers. So the next state is given by the following expression:
$$\frac{1}{\sqrt{M}}\sum_{a=0}^{M-1} \big|a,f(a)\big\rangle. \tag{5.12}$$
Again, we already know how to do it using a reversible, hence also quantum circuit (see page 461).
The last step is the quantum Fourier transform. This is the unitary transformation defined by
$$|x\rangle\ \mapsto\ \frac{1}{\sqrt{M}}\sum_{\ell=0}^{M-1} \mathrm{e}^{\frac{2\pi \mathrm {i}}{M}\ell x}\ |\ell\rangle.$$
Thus applying it to the first number of the state in which we left our quantum computer (5.12), we get
$$\frac{1}{{M}}\sum_{a=0}^{M-1} \sum_{\ell=0}^{M-1} \mathrm{e}^{\frac{2\pi{ \mathrm {i}}}{M}\ell a}\ \big|\ell,f(a)\big\rangle. \tag{5.13}$$
Here it is important that no information about the computation of f(a) was present before applying the Fourier transform, except for a and f(a).
When observing this state (in terms of quantum physics, measuring this state) we see a pair (ℓ,d), where d=f(a) for some a, with probability equal to the square of the absolute value of the amplitude of this state. In order to compute the amplitude at |ℓ,d〉, we have to add all terms with |ℓ,f(a)〉 in which f(a)=d. Thus we get
$$\frac{1}{{M}}\sum_{a,\ f(a)=d} \mathrm{e}^{\frac{2\pi{ \mathrm {i}}}{M}\ell a} \ |\ell,d\rangle.$$
Let r be the period of f; then f(a)=d is equivalent to a=kr+s for some s, 0≤s<r, determined by d. Further, let p=M/r and h=2πℓs/M. Then the amplitude at |ℓ,d〉 is
$$\frac{1}{{M}}\sum_{k} \mathrm{e}^{\frac{2\pi{ \mathrm {i}}}{M}\ell(kr+s)}= \frac{1}{{M}}\sum_{k} \mathrm{e}^{\mathrm {i}(2\pi\frac{\ell}{p}k+h)}.$$

If r divides M, then p is an integer, and the analysis of this expression is easy (we have already done it). In such a case we see those ℓ that are random multiples of p. Having sufficiently many such numbers we can determine p by taking the greatest common divisor of them, whence we also get r.
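For small parameters the easy case can be verified by brute force. The Python sketch below computes, for every value in the first register, the probability of observing it in the state (5.13); it is a classical calculation of the amplitudes, not a quantum algorithm, and the function `a % r` with M = 16 and r = 4 is a hypothetical example. Instead of sampling, it simply collects all values with nonzero probability and takes their greatest common divisor.

```python
import cmath
from math import gcd

def outcome_probabilities(f, M):
    """Probability of observing each l in the first register of state (5.13)."""
    probs = []
    for l in range(M):
        amplitude = {}  # amplitude at |l, d> for each value d = f(a)
        for a in range(M):
            phase = cmath.exp(2j * cmath.pi * l * a / M)
            amplitude[f(a)] = amplitude.get(f(a), 0) + phase / M
        probs.append(sum(abs(z) ** 2 for z in amplitude.values()))
    return probs

M, r = 16, 4
probs = outcome_probabilities(lambda a: a % r, M)
observed = [l for l, pr in enumerate(probs) if pr > 1e-9]
print(observed)  # multiples of p = M / r

# Recover p as the gcd of the observed values, then r = M / p.
p = 0
for l in observed:
    p = gcd(p, l)
print(M // p)
```

In this example each of the r multiples of p occurs with probability 1/r, and all other values have probability zero up to rounding.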

If r does not divide M, which is typically true, the analysis is slightly more complicated. We take p to be the smallest integer larger than M/r. Since p≠M/r, we get with non-negligible probability also some ℓ which is not a multiple of p; so taking the greatest common divisor does not work. Therefore, we need an argument that would determine r with a sufficiently large probability from a single ℓ. Suppose we obtain an ℓ which is a multiple of p, say ℓ=dp. Then ℓ/M=dp/M is very close to d/r. Also we have a good probability that d and r are mutually prime. In such a case, applying the theory of rational approximations to ℓ/M will produce the fraction d/r, in particular it will give us r. This is the way to analyze the algorithm; to do it formally, however, requires some computations and the use of some results from number theory. I omit these details.
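The rational approximation step is purely classical and easy to try out. In the Python sketch below the observed value is taken to be the integer nearest to dM/r (a simplifying assumption for the illustration), and the best approximation with a small denominator is computed with the `limit_denominator` method of the standard `fractions.Fraction` class. The numbers M = 1024, N = 16, r = 12 and d = 5 are hypothetical, and the function name is illustrative.

```python
from fractions import Fraction

def period_from_measurement(l, M, N):
    """Guess the period r < N from one observed l close to d * M / r.

    Returns the denominator of the best rational approximation of l / M
    with denominator below N; this equals r when d and r are coprime.
    """
    return Fraction(l, M).limit_denominator(N - 1).denominator

M, N = 1024, 16   # M is a power of 2 larger than N**2
r, d = 12, 5      # hidden period and a multiplier coprime to it
l = round(d * M / r)
print(l, period_from_measurement(l, M, N))
```

If d and r share a common factor, the same computation returns only a proper divisor of r (for example, the observation 512 yields the denominator 2), which is why the probability that d and r are mutually prime matters.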

It is, however, important to realize that I skipped an essential part which is to show that the quantum Fourier transform can be computed by a polynomial size circuit. I will confine myself to defining the quantum gates that are used in the circuit in general, and drawing the circuit for $$M=2^4$$, see Fig. 5.9.
One gate that we need is the unary gate defined by the matrix H, see (5.7), page 455. The others are binary gates defined by the matrices
$$S_k= \left ( \begin{array}{c@{\quad }c@{\quad }c@{\quad }c} 1&0&0&0\\ 0&1&0&0\\ 0&0&1&0\\ 0&0&0&\mathrm{e}^{\mathrm {i}\pi/2^{k}} \end{array} \right ),$$
for 0<k<m. The transformation $$S_k$$ does nothing if at least one bit is 0; if both bits are 1, it rotates the phase by the angle $$\pi/2^k$$. The key for understanding the circuit is the following formula of the quantum Fourier transform. Let $$x_1,\dots,x_m$$ denote the bits of the number x, with $$x_m$$ being the most significant bit. Then the transform can be defined by
$$|x_1,\dots ,x_m\rangle\mapsto \frac{1}{\sqrt{M}}\bigotimes _{j=1}^m \bigl(|0\rangle+ \mathrm{e}^{\pi \mathrm {i}\sum_{t=0}^{j-1}x_{m-t}/2^t}|1\rangle \bigr).$$
I leave the verification to the reader.

7.

The hidden subgroup problem. Having a set of quantum algorithms only for some particular problems is unsatisfactory from the point of view of theory. We would rather like to know what is the essential property of problems that makes them solvable by quantum computers in polynomial time. The so far best candidate for such a characterization is the hidden subgroup problem. This is the class of problems defined as follows. Given a finitely presented group G and an unknown subgroup H, one should find a set of generators of H. The group G is given quite explicitly, which means that we can decide in polynomial time if a string of bits is an element of the group, and we can compute the products and inverse elements in polynomial time. The hidden subgroup is determined by a function f which is constant on the cosets of H and takes on different values on different cosets.

The most important polynomial time quantum algorithms fall into this class. Let us consider the period finding problem as an example. The group G is the additive group of integers $$\mathbb {Z}$$. The function f is the function whose period we want to compute. Suppose r is the period and f takes on r different values. Then the hidden subgroup is $$r\mathbb {Z}$$, the set of multiples of r. This subgroup is generated by r (or by −r), thus finding a set that generates it is equivalent to finding r.
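The defining property of the hiding function can be tested mechanically on a finite range. The Python sketch below uses the hypothetical example f(a) = 2^a mod 15, whose period is 4, and checks that two integers get the same value exactly when they lie in the same coset of the candidate subgroup; the function name `hides_subgroup` and the bound 50 are illustrative choices.

```python
def hides_subgroup(f, r, bound=50):
    """Check on 0..bound-1 that f(a) == f(b) exactly when r divides a - b."""
    return all(
        (f(a) == f(b)) == ((a - b) % r == 0)
        for a in range(bound)
        for b in range(bound)
    )

def f(a):
    """Hypothetical example: powers of 2 modulo 15."""
    return pow(2, a, 15)

print(hides_subgroup(f, 4))  # the multiples of 4 form the hidden subgroup
print(hides_subgroup(f, 2))  # f is not constant on the cosets of the multiples of 2
print(hides_subgroup(f, 8))  # f does not separate the cosets of the multiples of 8
```

Only the true period passes the test: a candidate subgroup that is too large fails constancy on cosets, and one that is too small fails the requirement of different values on different cosets.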

In the polynomial time quantum algorithms that we have, the group G is, moreover, commutative. Furthermore, one can solve the hidden subgroup problem in quantum polynomial time for every commutative group. This is very interesting, but not surprising, considering the fact that all commutative groups are products of cyclic groups, and the hidden subgroup problem for cyclic groups is essentially the period finding problem. The quantum Fourier transform can be defined for every finite group and for some noncommutative groups polynomial size quantum circuits have been found. Yet we do not have any nontrivial polynomial time quantum algorithm for the hidden subgroup problem for a class of noncommutative groups. More recently G. Kuperberg [174] found a subexponential (more precisely, running in time $$2^{c\sqrt{n}}$$) quantum algorithm for dihedral groups, which confirms the belief that noncommutative groups may also be tractable.

One of the few natural combinatorial problems that are in NP, but apparently are neither in P, nor NP-complete, is the graph isomorphism problem. It is the problem, for two given graphs, to determine if they are isomorphic. This problem can be presented as a hidden subgroup problem for a noncommutative group, hence we would get a polynomial time quantum algorithm for this problem if we showed that the general hidden subgroup problem is solvable in polynomial time on quantum machines.

8.

Quantum cryptography. When quantum computers are built, we will need to develop new crypto-systems. It is possible that these systems will use coding and decoding functions that are efficiently computable only by quantum computers.

Current quantum cryptography focuses on a different approach. Suppose we want to send a message x consisting of n bits. Let r be a random sequence of zeros and ones. Then we can encode x by the pair of strings (r,s), where s=x⊕r, the bitwise sum of x and r modulo 2. If we send only one of the two strings, then a potential eavesdropper cannot learn any information because it is a random string. Now, it is possible to send quantum bits in such a way that the receiving party always detects any attempt to tamper with the message. This can be used to design protocols for secure communication. The basic idea of such a protocol is as follows. Suppose Alice wants to send x to Bob. Alice first sends r. If Bob determines that the message has not been tampered with, he confirms it to Alice. If Alice determines that Bob’s confirmation is authentic, then she sends the second part s. The details are more complicated and would take us too far afield.
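The classical part of this idea, splitting x into two strings each of which is random on its own, is easy to write down. The following Python sketch covers only that part; the quantum transmission and the tamper detection are not modelled, and the function names are illustrative.

```python
import secrets

def split_message(x):
    """Split the bit list x into (r, s), where r is random and s = x XOR r."""
    r = [secrets.randbits(1) for _ in x]
    s = [xi ^ ri for xi, ri in zip(x, r)]
    return r, s

def recover_message(r, s):
    """Recombine both strings: x = r XOR s."""
    return [ri ^ si for ri, si in zip(r, s)]

x = [1, 1, 0, 1, 0, 0, 0, 1]
r, s = split_message(x)
print(r, s, recover_message(r, s) == x)
```

Since r is uniformly random, so is s; only a party holding both strings can reconstruct x.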

The advantage of this approach is that one can prove that it is secure. Thus unlike the classical cryptographic protocols, the security of quantum cryptography is not based on any unproved assumptions. Such communication has been experimentally demonstrated.

9.

The many-worlds interpretation of quantum physics. The classical (also called Copenhagen, or Bohr) interpretation of quantum physics is based on distinguishing between two phenomena: (1) a freely evolving system and (2) a measurement. A freely evolving system is governed by the Schrödinger equation. The essence of this equation is that the infinitesimal changes of the system are linear. The discrete approximation that we have considered here when discussing quantum computations is based on applying unitary transformations. Such a system is, in general, in a linear superposition of several states. A measurement is a process in which the system abruptly and randomly collapses to one of the states from the superposition. The probability of collapsing to a particular state is given by the square of the absolute value of the amplitude of that state. I was using this interpretation when explaining the basics of quantum physics needed for quantum computations.

While nobody has any problem with the first part, the other part, the measurement, raises a lot of questions: what is a measurement? what is a measuring apparatus? is human presence necessary? and others. Among these the main one is: why can we not apply the same rules as we use for freely evolving systems also to measuring instruments and observers?

A solution in which measurement and observers obey the same laws as freely evolving systems was proposed by Hugh Everett III in his PhD thesis in 1956. The main idea is that a measurement is the process in which a measuring apparatus becomes entangled with the measured state.

Example Suppose an apparatus $$\mathcal{A}$$ has three states $$A_?$$, $$A_0$$ and $$A_1$$. The state $$A_?$$ is the initial state; the two states $$A_0$$ and $$A_1$$ are the states in which $$\mathcal{A}$$ shows 0 and 1, respectively. Suppose that we want to measure the quantum bit $$2^{-1/2}|0\rangle+2^{-1/2}|1\rangle$$. The initial state of the system consisting of the apparatus and the quantum bit is
$$|A_?\rangle\otimes\bigl( 2^{-1/2}|0\rangle + 2^{-1/2}|1\rangle\bigr)= 2^{-1/2}|A_?\rangle|0\rangle + 2^{-1/2}|A_?\rangle|1\rangle. \tag{5.14}$$
After the measurement we get
$$2^{-1/2}|A_0\rangle|0\rangle + 2^{-1/2}|A_1\rangle|1\rangle. \tag{5.15}$$

Observers, however, do not see the apparatus in a superposition; they only see one of the possible states. The reason is that observers also get entangled with the states of apparatuses. This presupposes that we allow superpositions of people in different states, an unacceptable idea for some philosophers. The typical argument against this interpretation is: why don’t we see the superposition of people in different states? The answer is again: because of the entanglement. By watching a person we get immediately entangled with his or her state.

The complete picture of a measurement is more complex. First, the apparatus gets entangled with the system and then very quickly the world around gets entangled with it too because the macroscopic apparatus interacts very strongly with the matter around it. In particular, human observers also get entangled, most likely, already before they read the data. The entanglement then spreads through the universe. This is the explanation of the “collapse of the wave function”. This, however, requires postulating the existence of alternative universes because now we have to talk about superpositions of whole chunks of the world, which is an even more controversial idea.

If we interpret $$\mathcal{A}$$ in the example above as the rest of the world when the quantum bit is taken away, we can use the same formulas (5.14) and (5.15). This is popularly explained as the world $$A_?$$ splitting into two worlds $$A_0$$ and $$A_1$$. This is not quite precise because what actually splits is only the part without the quantum bit. (Furthermore, whether some part of the universe splits or joins depends on the basis that we use to describe it.)

Most of the arguments against this interpretation point out that we do not observe superpositions of macroscopic objects and, therefore, it is unlikely that observers, not to say the whole universe, can be a superposition of several different states. The absence of such phenomena can, however, be easily explained by the concept of environmental decoherence. Decoherence means that a system that starts in a superposition of states gets spontaneously entangled with the environment and thus loses its quantum nature. This is because it is impossible to completely isolate a system in a superposition of states from the environment. Even individual elementary particles kept separately in vacuum interact with the matter outside. The bigger the object is, the stronger the interaction is and the stronger the interaction is, the shorter the time is before the superposition collapses to one of the states. The decoherence time is extremely short already for systems with a small number of atoms. Therefore, we do not observe quantum phenomena when large objects are involved.

In this interpretation it is also possible to explain the probabilities of collapsing to a particular state in the measurement process. The alternative universes in the superposition have amplitudes and the square of the absolute value of the amplitude of a universe U determines the probability that we are in U. Since we always are only in one of the alternative worlds, we cannot test these probabilities. What we can only do is to assume that we are in a world that is not “very unlikely”. This sounds rather suspicious, but as a matter of fact, this is the standard assumption in all probabilistic reasonings. I will show in an example how one can use it to justify the probabilities of measurements.

Example Suppose we study the quantum bit $$2^{-1/2}|0\rangle+2^{-1/2}|1\rangle$$. The probability 1/2 that we get 0 (or 1) manifests itself only if we perform many measurements. So suppose that we measure n non-entangled copies of this quantum bit. When measuring these quantum bits we obtain very likely $$n/2\pm\sqrt{n}$$ ones, according to the law of large numbers. The many-worlds explanation of this fact is that the total amplitude of the worlds in which the string has $$n/2\pm\sqrt{n}$$ ones is in absolute value close to 1. So it is likely that we will end up in one of these worlds.
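How close to 1 this total weight is can be computed exactly, because each of the $$2^n$$ outcome strings has squared amplitude $$2^{-n}$$. The Python sketch below (the function name is illustrative) sums the binomial coefficients for the strings whose number of ones lies within $$\sqrt{n}$$ of n/2.

```python
from math import comb, sqrt

def probability_near_half(n):
    """Probability that n measurements of the quantum bit give between
    n/2 - sqrt(n) and n/2 + sqrt(n) ones."""
    lo, hi = n / 2 - sqrt(n), n / 2 + sqrt(n)
    favourable = sum(comb(n, k) for k in range(n + 1) if lo <= k <= hi)
    return favourable / 2 ** n

for n in (16, 100, 400, 1600):
    print(n, round(probability_near_half(n), 4))
```

For these values of n the result stays above 0.95, as expected, since $$\sqrt{n}$$ is two standard deviations of the number of ones.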

10.

Quantum proofs. ‘Quantum proofs’ usually refers to interactive quantum protocols. There are a number of results showing that very likely one can define larger complexity classes by allowing quantum states and measurements instead of mere randomness. Here I will briefly describe a concept of non-interactive quantum proofs introduced in [228]. This concept was defined for propositional logic, but it can be generalized to first order logic. The basic idea is to allow the linear superposition of strings of formulas in the proof.

Our aim is to define a kind of proof systems that corresponds to the usual proof systems in propositional logic that are based on axiom schemas and deduction rules. Such systems are called Frege systems and we will learn more about them in the next chapter.

A quantum deduction rule is a unitary transformation U on a (small) finite set of strings of propositions S with the following property. If Γ,Δ∈S and Δ occurs in $$U|\Gamma\rangle$$ with nonzero amplitude, then the propositions of Δ logically follow from those in Γ. One can show that a quantum deduction rule is invertible, not only as a linear operator, but also logically. This means that $$U^{-1}$$ is also a deduction rule.

In order to define a quantum proof, we have to view proofs as a rewriting process. The initial state is always the same, say, a string of ⊤s, where ⊤ is a constant for truth. In each step we rewrite the string by a deduction rule. A proposition is proved if it appears in the string obtained in this way.

In the quantum setting we apply quantum deduction rules and thus at each step we have a linear superposition of strings of propositions. To prove a proposition ϕ we only need to have ϕ occurring with a sufficiently large amplitude. To formalize the rewriting process we use quantum circuits. So, formally, a quantum proof is a quantum circuit with certain properties.

It is not difficult to show that such a proof encodes classical proofs of essentially the same size as the circuit (of the same propositions). However, there are arguments that support the conjecture that in general it is impossible to extract a classical proof from a quantum proof in polynomial time. It is conceivable that this is impossible even using quantum circuits. So it is theoretically possible that we will be able to prove a proposition with a quantum circuit, but we will not be able to produce a classical proof, although we will know that there is one that is not too long.

11.

Communication with extraterrestrials—continued from page 80. Suppose we already could construct quantum computers. Then a reasonable message to send out would be to say “We have quantum computers”. It will tell the recipients that our science is quite advanced, at least in physics and mathematics. One of the messages sent out, the Arecibo Message, contained a similar thing: a message that we know what life is based on—a picture of a double helix.

Leaving aside whether it is a reasonable thing to do, it is an interesting problem how to formulate such a message. We would like to send a message that would show the solution of a problem that can only be solved using a quantum computer. Such a problem, as we believe, is factoring a large random number. But if we sent a large number with its factors, how would they know that we did not make it up? We can randomly generate two large primes and compute their product without a quantum computer. If we communicated bilaterally, it would be simple: we would ask them to send us a number to be factored. But the assumption is that they are too far away.

My proposal is to send the complete factorizations of all composite numbers in an interval [n,n+a]. The number n must be large enough, a should be small, so that we can do all these factorizations, but not too small. It seems that it is not possible to make up such an interval with all factorizations without being able to factor efficiently.
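The content of such a message is easy to specify in code. In the Python sketch below, trial division stands in for the quantum factoring algorithm, so it is only feasible for toy values such as n = 1000 and a = 10; these values and the function names are assumptions made for the illustration.

```python
def factorize(m):
    """Prime factorization of m >= 2 by trial division, as a sorted list."""
    factors = []
    d = 2
    while d * d <= m:
        while m % d == 0:
            factors.append(d)
            m //= d
        d += 1
    if m > 1:
        factors.append(m)
    return factors

def interval_message(n, a):
    """Factorizations of all composite numbers in the interval [n, n + a]."""
    message = {}
    for m in range(n, n + a + 1):
        factors = factorize(m)
        if len(factors) > 1:      # primes have a single factor and are skipped
            message[m] = factors
    return message

print(interval_message(1000, 10))
```

A recipient can verify the message without being able to factor: multiply the listed factors, test them for primality, and check that the numbers missing from the list are prime.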

## 5.5 Descriptional Complexity

There is a type of complexity that is essentially different from what we have considered so far. Consider finite strings of symbols, finite graphs, finite algebras or in general any finite mathematical structures. Can one define the complexity of such entities? When defining the complexity of computations, we study a dynamical process, which quite naturally needs time and space to be performed. We clearly need something different when we consider static objects. For such objects, the most natural thing is to define the complexity to be the length of the shortest description. One can show in many examples that this is a good concept. For instance, we tend to think of symmetric objects as simpler than nonsymmetric and, indeed, one can use symmetries to give a more concise description. To determine an equilateral triangle we need only one number, whereas for a general triangle we need three. We often associate beauty with symmetry, and thus also with the possibility of short description. It is not only flowers whose symmetry we admire, but also theories. A theory whose equations are short and manifest symmetries seems to reflect the reality much better than a long list of apparently unrelated and complicated axioms.

The concept of descriptional complexity seems intuitively clear, but one has to be a bit careful and define it precisely. As we know, speaking vaguely about descriptions leads to Berry’s paradox (the paradox of the least number that cannot be described by a sentence with at most 100 letters, see page 38). Therefore it is necessary to say clearly what the descriptions are. In general, we may use any formal language which is sufficiently universal to describe all entities that we are interested in. Each of these choices may define a different concept and there are almost no principles that would guide us which we should choose. In such a situation the natural thing is to take the simplest formal system which is universal. A sufficiently universal system should, among other things, be able to describe computations. So why not just focus on computations? Let the formal system used for descriptions be Turing machines (or anything that is equivalent to them).

Thus we are naturally led to the concept of algorithmic complexity. The algorithmic complexity of an object x is the size of the simplest algorithm that produces the object. This concept possesses several properties that make it interesting for further investigations. But one should bear in mind that many of the results proved about algorithmic complexity hold true for other versions of the descriptional complexity based on different formal systems.

The concept of algorithmic complexity was conceived in the 1960s independently by three mathematicians: G.J. Chaitin [39], A.N. Kolmogorov [160] and R.J. Solomonoff [277]. Most researchers refer to algorithmic complexity as the Kolmogorov complexity and I will follow this tradition.

### The Algorithmic Complexity of Strings

As I said it is necessary to give a precise definition of the Kolmogorov complexity in order to avoid paradoxes, but not only because of that. If we want to state theorems, we need precise mathematical concepts. But as in other parts of this book, I will only try to convey the most important ideas and avoid inessential technicalities.

Our first convention will be restricting the class of studied structures to finite binary strings (sequences of zeros and ones). Finite binary strings are universal structures in the sense that they can encode any finite structures. Unlike in some other situations, for Kolmogorov complexity the particular way we code other structures is irrelevant; the only requirement is that the coding must be computable. Thus we only need to develop the theory for binary strings, keeping in mind that we can always apply it to arbitrary structures.

The next thing to choose is a computational model. We can take any model that defines all computable functions. I will use Turing machines, which is the most common approach, but this is not essential.

A key result used in algorithmic complexity is the existence of universal Turing machines. Recall that a universal Turing machine is a machine that can simulate all Turing machines (see page 130). This is a key concept in Kolmogorov complexity and therefore we have to state precisely what it means.

### Definition 15

U is a universal Turing machine, if for every Turing machine M, there exists a binary string p such that for every input string x, U will halt on the input px if and only if M will halt, and if they halt then both output the same string.

This needs some explanation. We use here the convention that Turing machines compute only with binary strings, thus the ‘code’ of M must also be a binary string. The expression px is the concatenation of strings p and x, the string obtained by writing x after p. We cannot use additional symbols to separate p and x, but we can assume that the coding of Turing machines M ↦ p is chosen so that one can always determine where p ends and x starts.

Notice that we consider all Turing machines, not only those that halt on every input. This is an annoying complication, but it cannot be avoided—the class of Turing machines that always halt does not contain a universal one.

Let me also recall that a real computer is a good approximation of this theoretical concept. An input to a computer also can be split into two parts: a program p and data x to be processed by the program. The difference is that a real computer, of course, cannot work with arbitrarily large data.

Now we are ready to define the Kolmogorov complexity with respect to a universal Turing machine U.

### Definition 16

The Kolmogorov complexity of a binary string y is the length of the shortest string x such that U eventually halts on the input x and outputs the string y.

Thus Kolmogorov complexity is a function that associates natural numbers to binary strings; it will be denoted by $$C_U(y)$$.

The first simple observation is that $$C_U(y)$$ is defined for all finite strings y and it is at most the length of y plus a constant. This is because the string y can always be a part of the program which simply prints the string. In the programming language C, for example, we can use a statement such as `printf("11010001010111");` if we need to print the string 11010001010111. In a computer this program is represented by a string of bits.

In this way we have defined infinitely many different measures of complexity, one for each universal Turing machine U. Can we show that one of them is the right one? That would require one universal Turing machine to be distinguished from others by a special property. One possibility would be to posit that it is the smallest universal Turing machine. But how should we measure the size? The size depends on the particular formalization of Turing machines and then we have a problem again: what is the right formalization? What is even worse is that finding the smallest universal Turing machine seems to be a hopeless task.

Fortunately, the problem of having infinitely many different measures is not as bad as it looks at first glance. Although we have infinitely many different measures of Kolmogorov complexity, they do not differ much. It is not difficult to prove that every two universal Turing machines $$U_1$$ and $$U_2$$ give rise to measures that differ at most by an additive constant. Thus for long strings the difference is negligible. This is called the Invariance Theorem and it is one of the main results that justify the naturalness of the concept of the Kolmogorov complexity.

The proof of this basic result is very easy. Given $$U_1$$ and $$U_2$$, let p be the string by which we can simulate $$U_2$$ on $$U_1$$. Thus if $$U_2(x)$$ is defined ($$U_2$$ halts on the input x) and equals y, then also $$U_1(px)=y$$. Hence, the Kolmogorov complexity of y with respect to $$U_1$$ cannot be larger than the Kolmogorov complexity of y with respect to $$U_2$$ plus the length of p. This is expressed by the inequality
$$C_{U_1}(y)\leq C_{U_2}(y)+c,$$
where c is the length of p.
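The inequality above bounds the difference in one direction only. Exchanging the roles of the two machines gives the other direction, and the two bounds together give the two-sided statement. In the following sketch, $$c_{12}$$ and $$c_{21}$$ are names introduced here for the lengths of the two simulating strings:
$$C_{U_1}(y)\leq C_{U_2}(y)+c_{12},\qquad C_{U_2}(y)\leq C_{U_1}(y)+c_{21},$$
$$\bigl|C_{U_1}(y)-C_{U_2}(y)\bigr|\leq \max(c_{12},c_{21}).$$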

### Incompressibility and Randomness

Let us assume that one reference universal Turing machine U is fixed and let us only use this machine from now on. Thus we can also suppress the subscript at C(x).

The next basic result concerns the existence of strings with large Kolmogorov complexity. It asserts that, for every given length n, there is a string x of length n whose Kolmogorov complexity is at least n. Such strings are, quite naturally, called incompressible. If x is incompressible, it means that there is no way to encode it by a shorter string. We know that we can always program U to print x, which gives an upper bound n+c. Thus incompressible strings achieve this bound up to the constant c.

Again the proof of this theorem is very easy. It uses a counting argument that is not dissimilar to the proof of the existence of Boolean functions with exponential circuit complexity. We count the number of strings that have Kolmogorov complexity less than n and show that this number is less than the number of all strings of length n. The calculation is trivial. The number of all binary strings of length less than n is
$$1+2+4+\cdots+2^{n-1}=2^n - 1< 2^n.$$
Since every string with Kolmogorov complexity less than n is coded by a string whose length is less than n, we conclude that there must be at least one string of length n whose Kolmogorov complexity is at least n (see footnote 24).
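The counting argument can be watched at work on a very small scale. The sketch below uses a toy decoder `toy_machine`, which is an assumption made purely for illustration and is far from universal: a program starting with 0 prints the rest of the program, and a program starting with 1 prints the rest twice. The complexity of every string of length n with respect to this decoder is found by brute force.

```python
from itertools import product

def toy_machine(program: str) -> str:
    """Toy decoder (NOT universal): '0'+s prints s, '1'+s prints s twice."""
    if program == "":
        return ""
    flag, payload = program[0], program[1:]
    return payload if flag == "0" else payload + payload

def all_strings(max_len: int):
    """All binary strings of length 0..max_len, shortest first."""
    for length in range(max_len + 1):
        for bits in product("01", repeat=length):
            yield "".join(bits)

def complexities(n: int) -> dict:
    """Length of the shortest toy program for each output of length n."""
    best = {}
    # Programs of length n+1 suffice, because '0'+y always prints y.
    for p in all_strings(n + 1):
        y = toy_machine(p)
        if len(y) == n and y not in best:
            best[y] = len(p)  # first hit is shortest: programs come shortest first
    return best

n = 6
table = complexities(n)
compressible = [y for y, k in table.items() if k < n]
incompressible = [y for y, k in table.items() if k >= n]
```

For this decoder only the strings of the form yy are compressible. Whatever decoder is used, the compressible strings of length n cannot outnumber the $$2^n-1$$ programs of length less than n, so the list `incompressible` is never empty.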

Incompressible strings are very interesting objects. In order for a string x to be incompressible, there must be no regularity, no discernible pattern in the string. A simple example of a compressible string is a string of the form yy, a string whose first and second halves are the same. This can be defined by a program that instructs the machine to print y twice. Hence the Kolmogorov complexity of this string is at most half of its length plus a constant; the constant is the length of the part of the program that says ‘print the following string twice’. Recall that we used this idea for constructing a self-referential sentence, see page 273.

A remarkable property of incompressible strings is that they pass statistical tests of randomness. I will explain the idea on the simplest possible test. Consider binary strings of length n. The most basic property of random binary sequences is that they have approximately the same number of ones and zeros. (The difference between these two quantities is typically within $$\pm\sqrt{n}$$.) Suppose x is a string that has only n/4 ones. We may produce such a sequence by a random process, thus x may also look completely patternless, yet it can be compressed. The reason for that is that x belongs to a relatively small set that can be efficiently enumerated. The number of strings of length n with n/4 ones is approximately $$2^{0.8113n}$$. We can order them, say, lexicographically and enumerate them. Then, to specify one of them, we only need to give its number, which has at most 0.8113n binary digits. Hence the Kolmogorov complexity of every such string is at most 0.8113n+c, for some constant c.
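The constant 0.8113 is the binary entropy of 1/4, and the "number" of a string in the lexicographic enumeration can be computed directly. The following sketch checks both; the function names `binary_entropy`, `rank` and `unrank` are introduced here for illustration, and the choice n = 4000 is arbitrary.

```python
from math import comb, log2

def binary_entropy(p: float) -> float:
    """H(p) = -p log2 p - (1-p) log2 (1-p)."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

def rank(x: str) -> int:
    """Position of x among all strings of the same length and the same
    number of ones, in lexicographic order (starting from 0)."""
    n, k, r = len(x), x.count("1"), 0
    for i, bit in enumerate(x):
        if bit == "1":
            # Strings that agree with x so far but have '0' here come first.
            r += comb(n - i - 1, k)
            k -= 1
    return r

def unrank(r: int, n: int, k: int) -> str:
    """Inverse of rank: the r-th string of length n with k ones."""
    bits = []
    for i in range(n):
        zeros_first = comb(n - i - 1, k)  # strings with '0' at position i
        if r < zeros_first:
            bits.append("0")
        else:
            bits.append("1")
            r -= zeros_first
            k -= 1
    return "".join(bits)

n = 4000
exponent = log2(comb(n, n // 4)) / n      # close to H(1/4) for large n

x = "0001001000010001"                    # 16 bits, 4 ones
index = rank(x)
index_bits = (comb(16, 4) - 1).bit_length()  # bits needed to write any index
```

Here `index` together with the length and the number of ones is a complete description of `x`, and `unrank` plays the role of the fixed decoding program whose length is absorbed in the constant c.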

Thus the concept of incompressible strings enables us to specify particular strings that look completely random. This is not possible using only concepts from classical probability theory, where all strings of a given length have the same status. In probability theory we have to talk about properties of random strings (or other random structures) indirectly. For instance, when we need to express formally that ‘a random string has approximately the same number of ones and zeros’, we say that ‘the probability that a randomly chosen string has approximately the same number of ones and zeros tends to 1 as n→∞’. So in probability theory we can talk about properties of random strings, but not about a concrete random string.

This reminds one of the possibility to talk about infinitely small/large numbers in nonstandard analysis. Whereas the concepts of infinitely small/large numbers can be treated only indirectly in classical analysis, in nonstandard analysis we can define such numbers. But Kolmogorov complexity gives us more than just the possibility to speak about random strings. It enables us to develop a new type of probability theory, algorithmic probability. In this theory it is possible, for example, to solve a problem which cannot be treated using classical probability theory—the problem of prediction from given data. The setting is very much like the questions in IQ tests: we are given a finite sequence of symbols that are assumed to be an initial part of an infinite sequence and we should predict the next symbol.

For example, given a string
$$01010101010101010101010101010101010101010$$
the natural answer is that the next symbol is 1 because we conjecture that the infinite string is the string of alternating zeros and ones. According to classical probability theory, all strings are equally probable, thus we can only say that the next symbol is 0 or 1, each with probability 1/2. The algorithmic approach is not to assume that all strings have equal probability, but to look for regularities that can be described by an algorithm. But again there are many algorithms that are consistent with given data, so we need a principle by which we choose one. The principle is to choose the algorithm with the shortest program. In terms of universal Turing machines, we look for the shortest binary string p such that, given p as the input, the universal machine U will print an infinite sequence starting with the given finite string. In the example above it seems clear that every program, say in the programming language C, that will extend the sequence by printing 0 must be longer than the simple program that prints a sequence of alternating zeros and ones.

Although it only looks like a rule of thumb, it is possible to justify this principle formally. It would take us too far afield to explain the necessary parts of the theory, therefore I will consider only a very special situation and sketch a simplified argument. Suppose X is an infinite sequence of zeros and ones defined by a program q. Suppose that we get the elements of the sequence X successively one after another and our task is to find a program that prints X (it does not have to be q itself). In order to make this example simpler, let us also assume that we have a computable upper bound t(n) on the time that q needs to compute the nth digit of the infinite sequence.

To solve this problem we take an enumeration $$p_1,p_2,p_3,\dots$$ of all programs. Then we try $$p_1$$ on longer and longer initial segments of X. We keep $$p_1$$ as long as it runs within the time limit t(n) and prints the same bits as there are in X. When we discover that it runs longer, or prints a different bit, we discard it and test $$p_2$$ in the same way. In this way we may try many programs, but after a finite number of steps we will arrive at q or at a program that behaves like q. From that point on we will keep this program, although we will never be sure about it.

In other words, we make conjectures that X is defined by $$p_1$$, then by $$p_2$$, etc. We disprove some conjectures, but eventually we arrive at the correct conjecture that will be confirmed by all finite segments of X. Replace the word program by theory and X by experimental data and you get a description of how science should develop. In reality it is, of course, a much more complex process.

In this solution of the problem the lengths of programs are not mentioned at all. The only thing we need is that the enumeration is complete—it contains all programs so that we do not miss q. Notice, however, that the most natural way to enumerate programs is to order them by their lengths (and those of equal lengths arbitrarily). So the strategy to try the shortest program that has not been disproved as the current conjecture is just one of the possible strategies that ensure the completeness of the enumeration. This is the essence of why the rule of the shortest program works.
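The strategy of keeping the first conjecture that has not yet been disproved can be sketched as follows. This is only an illustration under strong simplifying assumptions: the enumeration of all programs is replaced by a short hypothetical list `candidates` of Python functions, each of which always halts, so the time limit t(n) plays no role, and the list is assumed to contain a program for the observed sequence.

```python
from typing import Callable, List, Sequence

Program = Callable[[int], int]  # maps a position n (from 0) to the n-th bit

def learn_in_the_limit(candidates: Sequence[Program],
                       data: Sequence[int]) -> List[int]:
    """For each observed prefix of data, the index of the current conjecture."""
    current = 0
    history = []
    for n in range(len(data)):
        # Discard conjectures that disagree with some bit seen so far.
        while any(candidates[current](i) != data[i] for i in range(n + 1)):
            current += 1
        history.append(current)
    return history

# A hypothetical enumeration, ordered roughly by simplicity.
candidates: List[Program] = [
    lambda n: 0,                       # 000000...
    lambda n: 1,                       # 111111...
    lambda n: n % 2,                   # 010101...
    lambda n: 1 - n % 2,               # 101010...
    lambda n: 1 if n % 3 == 0 else 0,  # 100100...
]

X = [1, 0, 0, 1, 0, 0, 1, 0]           # initial segment of the unknown sequence
history = learn_in_the_limit(candidates, X)
```

After three bits the conjecture stops changing, but, as in the text, nothing in the run tells the learner that it will never change again.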

### Noncomputability of the Kolmogorov Complexity

After good news there is also bad news: the Kolmogorov complexity of finite binary strings, the function $$x\mapsto C(x)$$, is not algorithmically computable. Thus it is a useful theoretical instrument, but in practice we cannot use it. One may still hope to at least approximate the function, but the situation is much worse. Not only can we not compute C(x), but in fact there is only a finite number of strings whose Kolmogorov complexity we can determine! (See footnote 25.)

Let me first give an intuitive explanation of why the Kolmogorov complexity is not computable. Essentially, it is a consequence of the noncomputability of the halting problem. Recall that the halting problem is to decide if for a given Turing machine T and a finite string x, the machine T will eventually stop when started on x. A consequence of the undecidability of the halting problem is that if we take one universal Turing machine U, then the halting problem is still undecidable. Suppose we want to compute C(x) for a string of length n. We know that C(x)≤n+c, for some constant c (that we can compute), thus we only need to run the universal Turing machine U on all strings of length ≤n+c, which is a finite number of strings. If n is small, we may be lucky and find p such that U(p)=x and on all shorter strings U halts and prints strings different from x. Then we know that C(x) is the length of p. But there are strings on which U does not stop. Suppose q is such a string and suppose that C(x) is larger than the length of q. If we are sufficiently patient, we will find the shortest p such that U(p)=x, but in order to know that p is the shortest one, we must be sure that U does not halt on q and that is undecidable.

In mathematics many quantities cannot be computed by an algorithm, but often we can determine them using mathematics; we can prove that they are equal to particular numbers. However, even that is a problem if we are to determine the Kolmogorov complexity of a string.

### Theorem 42

Let T be a sound theory axiomatized by a finite list of axioms. Then there exists a natural number $$k_T$$ such that for no concrete string x is T able to prove that x has Kolmogorov complexity larger than $$k_T$$.

The theorem is also true for recursively axiomatized theories (theories axiomatized by an infinite list of axioms for which it is algorithmically decidable whether a given sentence belongs to the list), but for the sake of avoiding distracting technicalities, I have only stated the weaker version.

The first reaction to this theorem is that there must be something wrong. Just a moment ago I proved that for every n, there exists at least one incompressible string. The proof was very easy and certainly there is a true theory T in which it can be done. So if we take $$n>k_T$$, we get a contradiction. Or do we really get it? The gist is that the theorem talks about concrete strings. Indeed, a fairly weak theory T is able to formally prove that there exist incompressible strings. Then, if we take a specific number n, we can list all $$2^n$$ strings of length n and T proves that one of them is incompressible. We can reduce the list by proving in T that some strings are compressible. But for a sufficiently large n ($$n>k_T$$), we cannot reduce the list to a single item; this is what the theorem says.

The proof of this theorem is very simple. Let $$\mathrm{Unpr}_T$$ be the unpredictable program for theory T defined on page 287. Recall that for every concrete string x, it is consistent with T that $$\mathrm{Unpr}_T$$ prints x as the output. Let $$p_T$$ be a binary encoding of $$\mathrm{Unpr}_T$$ for the universal Turing machine U. We take some natural encoding so that the above property of $$\mathrm{Unpr}_T$$ is preserved for $$U(p_T)$$. Thus, in particular, for every concrete string x, it is consistent with T that its Kolmogorov complexity C(x) is at most the length of $$p_T$$. So we can take $$k_T$$ equal to the length of $$p_T$$ and the theorem is proved.

I will sketch a second proof, which is a simple application of Berry’s paradox. Let $$|T|$$ be the length of a natural binary encoding of the axioms of T. We need to take $$k_T$$ a little larger than $$|T|$$; it will be clear from the proof that $$k_T=|T|+c$$ for a sufficiently large constant c (independent of T) will do. Suppose that there is at least one string x for which T proves that $$C(x)>k_T$$. Then we can write a program p that finds such a string. The program will systematically generate all proofs of T and, for each proof, it will check if it is a proof that a specific string x has $$C(x)>k_T$$. As soon as it finds such a string, p prints it and halts. It is clear that the essential part of the program will be the description of the axioms of T. The rest will be the same for all theories, hence it can be bounded by a constant. Since T proves only true sentences, the string that p finds has Kolmogorov complexity larger than $$k_T$$, but the string is also defined by a program that has the length at most $$k_T$$. This is a contradiction (the same as in Berry’s paradox).

A striking property of the second proof, which is the reason why I gave it, is that it does not use self-reference. (In the first proof self-reference was used to define the unpredictable program.) This is in contradiction with the commonly accepted presumption that every proof of the Incompleteness Theorem has to use self-reference. In this proof the only part that could be regarded as being related to self-reference is the fact that the program p should look for proofs that some string has Kolmogorov complexity larger than the length of p. Thus the program in a way refers to its length. Should this be called self-reference?

Let us rather look for similarities between Gödel’s proof and this proof. Recall that the main trick in writing self-referential sentences was doubling the text—writing essentially the same text twice. As we already observed, this is a simple way to generate compressible strings, strings whose Kolmogorov complexities are less than their lengths. And this is, indeed, the essence of the trick: since the sentence should refer to itself, it should be possible to describe the sentence by a text that is shorter than the whole sentence. Thus if the two proofs of the incompleteness theorem do not share self-reference, they at least share something from Kolmogorov complexity.

Notice that Theorem 42 also implies that Kolmogorov complexity is not computable. Indeed, suppose that there were a Turing machine M that would compute C(x), given x as the input. Let T be the theory axiomatized by some basic axioms of set theory plus the axiom:

M computes C.

Then to prove that a specific string x has the Kolmogorov complexity k we would only need to formalize the computation of M on the input x. Thus T would prove C(x) to be equal to the numeral expressing its value for every given string. This would be in contradiction with Theorem 42 and the fact that there are strings of arbitrarily high Kolmogorov complexity.

For a true finitely axiomatized theory T let $$k_T$$ denote the least number that satisfies the condition in Theorem 42. Chaitin proposed to use this number to measure the strength of theories. It is certainly an interesting parameter. Contrary to what the proofs of Theorem 42 may suggest, $$k_T$$ is not the Kolmogorov complexity of the set of axioms of T. For example, if the set of axioms of T contains some axioms that do not influence what T proves about the Kolmogorov complexity, we can omit them and the new theory will have the same parameter. If S is stronger than T, we clearly have $$k_T\leq k_S$$. It may also happen that there is a stronger theory with a much more concise presentation than T.

It would be very useful to scale the theories that we are working with (Peano Arithmetic, Zermelo-Fraenkel set theory and its extensions by large cardinal axioms etc.) using a numerical parameter such as $$k_T$$. Unfortunately, there is no algorithm for computing $$k_T$$ for specific theories; we can only prove some bounds. Classical proof theory gives a scale based on constructive ordinals (see page 510). These ordinals can be determined for Peano Arithmetic and some extensions of it, but insurmountable complications arise when we want to do it for set theories.

Theorem 42 has an interesting corollary. Let us first observe there is no infinite algorithmic process that would enable us to extend a given sufficiently strong consistent theory S indefinitely. By ‘extending indefinitely’ I mean that the theory resulting after adding all infinitely many axioms is not contained in a recursively axiomatized consistent theory T. This observation that such a process does not exist is easy—the process itself produces a recursively enumerable axiomatization and every such axiomatization can easily be transformed into a recursive one. (If we do not insist that the theory T should use the same language as S, we can even find a finitely axiomatized theory T with this property.) However, this is not true if we allow the process to be random and allow an error with small probability. Here is the precise statement, where again I state it only for finitely axiomatized theories.

### Theorem 43

For every sufficiently strong, sound and finitely axiomatized theory S and every ε>0, there exists an algorithm (formally, a Turing machine M) with the following property. Given access to a source of random bits, M produces an infinite set of axioms such that with probability at least 1−ε, the resulting extension T of S is a sound theory, but it is not contained in any consistent recursively axiomatized theory (see footnote 26).

The idea of the proof is simple. Given S, although it is not possible to compute $$k_S$$ precisely, it is possible to compute an upper bound $$k\geq k_S$$. If we now take n sufficiently larger than k and take a random sequence r of this length, we can make the probability that the sequence has Kolmogorov complexity at most k arbitrarily small. Whence the axiom C(r)>k will be true with high probability. Since $$k\geq k_S$$, it will be a new axiom. (For more details of the proof see Notes.)

What conclusion should we draw for the foundations of mathematics? The axioms C(r)>k do not seem to be useful. Further, the way mathematicians discover new axioms is not completely random—they always have some reasons for adding a particular axiom. Yet randomness is involved in almost every discovery. The theorem above should be viewed as a theoretical possibility that the presence of randomness in the process of discovering new axioms may enable us to extend set theory beyond any limits.

Kolmogorov complexity can also be used to classify empirical theories. Suppose we have experimental data D and some theories $$T_i$$ that explain the data. Which theory should we choose as the best? Now we cannot focus only on the Kolmogorov complexity of theories because the “agnostic” theory saying that “all data are just random numbers” is consistent with any data and is likely to have the smallest complexity. We have to take into account also the Kolmogorov complexity of data. Therefore J. Rissanen [245] proposed the following principle:

### The Minimum Description Length Principle

The best theory T is the one that minimizes the sum of the length of the description of T plus the length of the description of data D using theory T.

The idea behind it is that typical data have both some randomness and some regularity. We would like to separate these two by having a theory that would describe the regularities and a string of random bits that would encode the entropy contained in data. The principle forces us to optimize the choice of the theory: if we pick too simple a theory, then the description of data will be complicated; if we try to make the description of data too simple, the entropy contained in them will have to be moved to the theory.

Needless to say, the success of applying this concept in practice depends very much on how one interprets the notion of description.
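To make the trade-off concrete, here is a small sketch of a two-part description. Every modelling choice in it is an assumption made for illustration and not Rissanen's formulation: a "theory" is a coin with probability `p_one` of producing 1, taken from a grid with k binary digits; the description of the theory is charged k bits; the description of the data given the theory is charged the ideal code length $$-\log_2$$ of the probability of the data.

```python
from math import log2

def data_cost(bits: str, p_one: float) -> float:
    """Ideal code length (in bits) of `bits` for a coin with P(1) = p_one."""
    ones = bits.count("1")
    zeros = len(bits) - ones
    return -(ones * log2(p_one) + zeros * log2(1.0 - p_one))

def best_two_part_code(bits: str, max_precision: int = 8):
    """Minimize (k bits for the theory) + (cost of the data given the theory).

    Returns a triple (total length, k, p_one)."""
    best = None
    for k in range(max_precision + 1):
        for j in range(2 ** k):
            p_one = (j + 0.5) / 2 ** k   # grid with k binary digits, never 0 or 1
            total = k + data_cost(bits, p_one)
            if best is None or total < best[0]:
                best = (total, k, p_one)
    return best

skewed = "0001" * 50       # 200 bits, one quarter of them ones
balanced = "01" * 100      # 200 bits, half of them ones

best_skewed = best_two_part_code(skewed)
best_balanced = best_two_part_code(balanced)
```

For the skewed data the one-digit theory `p_one = 0.25` wins: a more precise parameter costs more than it saves. For the balanced data the "agnostic" theory with k = 0 wins, although the string 0101... is perfectly regular; the class of coin theories simply cannot express that regularity, which illustrates how much depends on the chosen notion of description.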

### Notes

1. Proof of Theorem 43, more details. If we do not insist that each time we add an axiom it be not provable from the theory constructed so far, we can define a sequence of axioms independently of the theory S. For a given ε>0, pick a c sufficiently large and for every natural number n≥1, generate randomly a bit string $$r_n$$ of length 2n+c. We can choose c so that the probability that $$C(r_n)\leq n$$ is at most $$\varepsilon 2^{-n}$$. Hence the probability that $$C(r_n)>n$$ for all n≥1 is at least 1−ε.
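The choice of c can be made explicit. The following sketch of the estimate uses only the counting from the incompressibility argument: there are fewer than $$2^{n+1}$$ binary strings of length at most n, hence fewer than $$2^{n+1}$$ strings y with C(y)≤n, while there are $$2^{2n+c}$$ strings of length 2n+c.
$$\Pr\bigl[C(r_n)\leq n\bigr]<\frac{2^{n+1}}{2^{2n+c}}=2^{1-c}\cdot 2^{-n},$$
$$\Pr\bigl[C(r_n)\leq n \mbox{ for some } n\geq 1\bigr]<2^{1-c}\sum_{n\geq 1}2^{-n}=2^{1-c}\leq\varepsilon \quad\mbox{whenever}\quad c\geq 1+\log_2(1/\varepsilon).$$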

Theorem 42 was stated for sound theories, but in fact one only needs consistent theories. Such a stronger version of this theorem implies that the set of all sentences $$C(r_n)>n$$ is not contained in any consistent recursively axiomatized theory.

A related question is whether it is possible to obtain by such a process a complete extension of a given theory. It has been proved that this is impossible for Peano Arithmetic [142], hence also for set theories in which Peano Arithmetic is interpretable. Let me stress that here I am talking about all consistent extensions, not only about the one that consists of all true arithmetical sentences.

2. The prefix-free Kolmogorov complexity. The concept of Kolmogorov complexity that we defined above seems very natural, but it still has some undesirable properties. Let us see what their source is. When we define computations of a Turing machine on finite strings, we assume that they are delimited in some way, so that the machine knows where the word starts and where it ends. The usual convention is that there is an additional symbol, viewed as a blank, and the squares of the tape outside of the string contain this symbol (this is just an awkward way of saying that they are blank). Thus the machine can recognize the ends of the input string by reading the blank symbols. This apparently innocent technical detail has quite an important effect on the Kolmogorov complexity. The point is that if we mark the ends of the input string, then we give the machine additional information. One way to see this is to imagine that on the Turing machine tape we can only use the two symbols 0 and 1, no blank symbols, and we still need to delimit the input string. Then we must use some coding that necessarily increases the length.

Thus the usual way of presenting input to a Turing machine gives more information than it should. The amount of additional information bits is small, but if we want to compute precisely, this is an essential error. Can we filter out the additional information about the length of the input? It seems that a quite satisfactory answer to this question has been found. The basic idea is that we should restrict the class of Turing machines to those that only read symbols of the input string and never look outside. As a Turing machine typically needs much more space than the input length, we must use several tapes. We obviously need an input tape and a separate work tape, but it is also natural to have an output tape. Then we can assume that the head on the input tape moves only in one direction; say, starting on the first symbol it moves right. Thus we ensure that the machine never goes left from the input word.

It is trickier to ensure that the machine does not go right from the input word. In fact, this is possible only because we can ignore inputs on which the machine does not print anything because it never stops. If the machine does not print any output on a given string, we can ignore it; it does not make any difference how much information it got. Thus, given a Turing machine T, we change it to a machine T′ that is the same as T, except that if T′ leaves the input string (which means it reads a blank symbol), then it goes into an infinite loop and hence does not print any output. Further, we stipulate that if T stops before reading the entire input, we will treat it as if it did not produce any output.

Let T′ be such a Turing machine and let L be the set of inputs on which it stops and produces an output. Observe that if $$x,y\in L$$, then x is never a proper initial segment of y; we say x is not a prefix of y. Such sets are called prefix codes. Their advantage is that we can transmit the words from the code without any separating messages. Such Turing machines are, therefore, called prefix Turing machines. Kolmogorov complexity based on prefix Turing machines renders the intuitive notion of descriptive complexity much better. One can show that the basic results remain true in the new context. In particular, there exist universal prefix Turing machines and thus we also have the Invariance Theorem.
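The advantage of prefix codes mentioned above is easy to demonstrate. In the following sketch the particular code `{"0", "10", "110", "111"}` is a hypothetical example chosen for illustration; the function `split_stream` cuts a concatenation of code words back into words while reading the bits once, from left to right, without any separators.

```python
def is_prefix_free(words) -> bool:
    """True if no word is a proper initial segment of another word."""
    ordered = sorted(set(words))
    # In sorted order a word is immediately followed by one of its
    # extensions whenever it has any, so adjacent pairs suffice.
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))

def split_stream(stream: str, code) -> list:
    """Cut a concatenation of code words into the individual words."""
    words, current = [], ""
    for bit in stream:
        current += bit
        if current in code:        # unambiguous because the code is prefix-free
            words.append(current)
            current = ""
    if current:
        raise ValueError("stream ends in the middle of a code word")
    return words

code = {"0", "10", "110", "111"}
message = ["0", "10", "110", "0", "111"]
stream = "".join(message)          # "0101100111", no separators
decoded = split_stream(stream, code)
```

With a set such as `{"0", "01"}`, which is not a prefix code, the reader could not tell after the first 0 whether a word has ended.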

3. Inductive reasoning. Inductive reasoning was a problem that puzzled philosophers for a long time. On the one hand, it is clear that humans use it, on the other hand, there seemed to be no way to show the possibility of such reasoning formally. For deductive reasoning, there was logic, but there was no “inductive logic”. What was missing was the concept of computability. If we allow observable phenomena to be governed by completely arbitrary rules, then we surely have no chance to discover them. If, however, we restrict the rules to computable ones, we may eventually discover every such rule using a sufficient number of examples.

I have already considered a simple example in which the data is given by a computable process. In that example the process of producing zeros and ones was completely deterministic. This is called learning because once we learn a concept (definable by an algorithm), prediction becomes easy. In general, we would like to make predictions also in the case of processes involving randomness.

An extreme case is the uniform probability distribution. (This means that zeros and ones are equally likely and their appearances at different positions in the sequence are independent.) In this case we would also like to be able to discover what is going on, namely, that the sequence is completely random. An example that combines randomness with a strong dependence is given by sequences generated by repeating the following action: toss a coin, if you get heads write zero, if you get tails write two ones. The problem of predicting the next element in the sequence has the following solution:

a. if the current finite sequence ends with an odd number of ones, the next element is 1;

b. otherwise the next element is zero or one, both with probability 1/2.
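The two rules can be tested against a simulation of the process. The sketch below is an illustration only; the function names, the number of tosses and the seed of the pseudorandom generator are arbitrary choices made here.

```python
import random

def generate(num_tosses: int, rng: random.Random) -> list:
    """Heads: write 0. Tails: write two ones."""
    out = []
    for _ in range(num_tosses):
        out.extend([0] if rng.random() < 0.5 else [1, 1])
    return out

def trailing_ones(seq) -> int:
    """Number of ones at the end of seq."""
    count = 0
    for bit in reversed(seq):
        if bit != 1:
            break
        count += 1
    return count

def predict(prefix) -> float:
    """Probability that the next element is 1, by rules (a) and (b)."""
    return 1.0 if trailing_ones(prefix) % 2 == 1 else 0.5

def check_rule_a(seq) -> bool:
    """True if every prefix with an odd run of final ones is followed by 1."""
    return all(seq[i] == 1
               for i in range(len(seq))
               if predict(seq[:i]) == 1.0)

sample = generate(2000, random.Random(0))

# Rule (b): among the positions where the prediction is 1/2,
# the observed frequency of ones should be close to 1/2.
undecided = [sample[i] for i in range(len(sample))
             if predict(sample[:i]) == 0.5]
frequency = sum(undecided) / len(undecided)
```

Rule (a) holds for every sequence the process can produce, whereas rule (b) is a statement about frequencies and can only be confirmed approximately on a finite sample.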

In the 1960s, Solomonoff pioneered an approach to the problem of prediction based on algorithmic probability [277]. His results and subsequent work of other researchers clearly demonstrate that prediction is possible. I will only state a simple version of his main result.

First we need to formalize the concept of an infinite process that produces a binary infinite sequence. Let us denote the set of infinite binary sequences by $$\{0,1\}^\omega$$. A probability measure on $$\{0,1\}^\omega$$ is a mapping μ defined on certain subsets of $$\{0,1\}^\omega$$ and taking on values in the real interval [0,1]. Let us leave aside the question for which subsets of $$\{0,1\}^\omega$$ the measure μ is defined, as it plays little role in what we are going to discuss. The conditions that μ has to satisfy in order to be called a probability measure are that the measure of the entire space is 1 and that it is σ-additive, which means that the measure of the union of a countable family of disjoint sets is the sum of their measures.

We have already used the uniform measure on $$\{0,1\}^\omega$$, which is a formalization of the process of randomly tossing a coin infinitely many times, when we discussed forcing (page 363). This measure is determined by the condition that, for every n and every finite sequence x of length n, the measure of the set of infinite sequences extending x is $$1/2^n$$. Let $$\mathbf{u}$$ denote the uniform probability measure and $$\varGamma_x$$ the set of infinite sequences extending x. Then the defining condition for $$\mathbf{u}$$ is
$$\mathbf{u}(\varGamma_x)=2^{-n},$$
for x of length n.
In general, every measure on $$\{0,1\}^\omega$$ is determined in such a way. Thus we only need to know the values $$\mu(\varGamma_x)$$ for all finite sequences x. The conditions defining measures reduce to
$$\mu(\varGamma_\varLambda)=1,$$
where Λ denotes the empty sequence, and
$$\mu(\varGamma_{x})=\mu(\varGamma_{x0})+\mu( \varGamma_{x1}).$$
Hence we can identify measures with functions defined on finite binary sequences satisfying the two conditions above. Let us simplify notation by writing μ(x) for $$\mu(\varGamma_x)$$. In terms of probability, μ(x) is the probability that the random infinite sequence starts with x, where randomness is with respect to the probability measure μ.
The prediction problem that we are interested in is: for a given finite sequence x, to determine (or at least to approximate) the probabilities that the sequence will continue with 0, respectively with 1. Formally, this means that we want to determine the conditional probabilities μ(x0|x) and μ(x1|x) defined by
$$\mu(x0|x)=\frac{\mu(x0)}{\mu(x)}\quad \mbox{and} \quad \mu(x1|x)=\frac{\mu(x1)}{\mu(x)}.$$
Since μ(x0|x)+μ(x1|x)=1, it suffices to have one of the two probabilities.
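As an illustration of these definitions, the coin process described earlier in this note (heads: write zero, tails: write two ones) determines a measure that can be computed exactly. The sketch below is written under that assumption; the function names `mu` and `conditional` are introduced here, and exact rational arithmetic is used so that the two defining conditions can be checked without rounding.

```python
from fractions import Fraction
from itertools import product

def mu(x: str) -> Fraction:
    """Probability that a sequence produced by the coin process starts with x."""
    prob = Fraction(1)
    mid_block = False              # True right after the first 1 of a block "11"
    for bit in x:
        if mid_block:
            if bit == "0":
                return Fraction(0)  # a single 1 is never followed by 0
            mid_block = False       # the second 1 of the block costs nothing
        else:
            prob /= 2               # a fresh toss decides this bit
            mid_block = (bit == "1")
    return prob

def conditional(x: str, bit: str) -> Fraction:
    """mu(x bit | x); only defined when mu(x) > 0."""
    return mu(x + bit) / mu(x)

def additive_up_to(max_len: int) -> bool:
    """Check mu(x) = mu(x0) + mu(x1) for all x of length at most max_len."""
    return all(mu(x) == mu(x + "0") + mu(x + "1")
               for n in range(max_len + 1)
               for x in ("".join(b) for b in product("01", repeat=n)))
```

For example, `conditional("01", "1")` equals 1 and `conditional("011", "1")` equals 1/2, in agreement with the two prediction rules stated for this process.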

The assumption is that we get longer and longer segments of a random infinite sequence and the goal is to use this knowledge to make better and better predictions.

Note that we cannot repeat the experiment by starting over with another random infinite sequence. Therefore we cannot learn μ; if we ever can do anything, we can only learn a part of it.

The key idea is to restrict the class of measures to computable ones, but since μ is a real valued function we must define what computability means. It means that there exists an algorithm (a Turing machine) that for a given string x and a natural number k computes a pair of integers p, q such that |μ(x)−p/q|≤1/k. Now we are ready to state Solomonoff’s result.

Theorem 44. There exists a measure M such that for every computable measure μ, the following is true. Let X be a random infinite sequence and let $$X_n$$ denote the initial segment of length n. Then with probability 1,
$$\lim_{n\to\infty} \frac{\mathbf{M}(X_n0|X_n)}{\mu(X_n0|X_n)}=1.$$
The probability of this event is with respect to measure μ.

This theorem does not say anything about computability of M or the rate of convergence, still it is a striking result. The meaning of the theorem is that we do not have to try various hypotheses: there is one universal measure that suffices. How can a single measure approximate all computable measures? The point is in the last sentence of the theorem. It means that the sequences X are chosen according to the probability distribution μ. In order to get intuition, recall the problem of predicting a sequence defined by an algorithm on page 484. One can view it as a simplified version of the theorem above, where each measure μ has full weight on a single infinite sequence. The algorithm that solves this problem is the measure M that satisfies the theorem for such special measures μ.
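A finite analogue may help to see how one measure can serve many hypotheses at once. The sketch below is not the measure of Theorem 44; it is a weighted mixture of nine hypothetical coin measures, with every modelling choice an assumption made for illustration. The data are a fixed periodic sequence with three ones in every ten positions, standing in for a sequence produced by a coin with probability 0.3 of a one; this is harmless here because the mixture's predictions depend only on the counts of zeros and ones.

```python
def mixture_probability_of_one(thetas, bits) -> float:
    """Probability that the next bit is 1 under a mixture of coins.

    Each coin has P(1) = theta; all coins start with equal weight and the
    weights are updated by Bayes' rule after every observed bit."""
    weights = [1.0 / len(thetas)] * len(thetas)
    for bit in bits:
        # Multiply each weight by the probability its coin gave to `bit`.
        weights = [w * (t if bit == 1 else 1.0 - t)
                   for w, t in zip(weights, thetas)]
        total = sum(weights)
        weights = [w / total for w in weights]   # renormalize, avoids underflow
    return sum(w * t for w, t in zip(weights, thetas))

thetas = [i / 10 for i in range(1, 10)]          # coins 0.1, 0.2, ..., 0.9
true_probability_of_zero = 0.7                   # the coin with theta = 0.3

def ratio_after(n: int) -> float:
    """Mixture prediction of a zero divided by the true probability of a zero."""
    data = [1 if i % 10 in (0, 3, 6) else 0 for i in range(n)]
    return (1.0 - mixture_probability_of_one(thetas, data)) / true_probability_of_zero

errors = [abs(ratio_after(n) - 1.0) for n in (10, 100, 2000)]
```

The ratio of the mixture's prediction to the true conditional probability approaches 1 as more data arrive, although the mixture was never told which of the nine coins is the right one.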

Concerning computability, one can prove that M cannot be computable. It is, however, possible to generalize the concept of a computable measure so that we can compute universal measures at least approximately. We say that ν is a semimeasure, if instead of the equalities above it satisfies
$$\nu(\varLambda)\leq 1\quad \mbox{and} \quad \nu(x0)+\nu(x1)\leq \nu(x).$$
We say that ν is semicomputable from below, if there exists an algorithm which for every x computes a nondecreasing sequence of rational numbers converging to ν(x). Then one can show that there exists a semimeasure m that is universal, in the sense of Theorem 44, for semimeasures semicomputable from below, and m is also semicomputable from below. It follows that we can, at least theoretically, approximate m(x0|x) with arbitrary precision. Thus Theorem 44, generalized in this way, gives us at least a theoretical possibility to compute predictions.
One can show several connections between these concepts and Kolmogorov complexity, of which I will mention only one. One can prove that there exists a constant c such that
$$2^{-K(x)-c}\leq \mathbf{m}(x)\leq 2^{-K(x)+c}.$$

4. Using incompressibility instead of randomness. The existence of incompressible strings is proved by a probabilistic proof (moreover, we know that for sufficiently long strings it cannot be proved constructively). Once we have such strings, we can use them to avoid probabilistic arguments in proofs such as Erdős’ bound on the Ramsey number. Instead of estimating probabilities, we estimate Kolmogorov complexities. Such proofs are often conceptually simpler because they use formulas with fewer quantifier alternations.
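As an illustration, here is a sketch of how such an argument may go for the Ramsey bound. The particular encoding and the $$O(\log n)$$ overhead terms are assumptions of this sketch of the standard argument, not a quotation of a specific proof. Let G be a graph on n vertices whose adjacency string x, of length $$N=\binom{n}{2}$$, is incompressible, that is, $$K(x)\geq N$$. Suppose that G contains a clique or an independent set S of size k. Then x can be recovered from the list of the vertices of S, one bit saying whether S is a clique or an independent set, and the bits of x for the pairs of vertices not both in S. Together with a self-delimiting encoding of n and k this gives

$$K(x)\leq N-\binom{k}{2}+k\lceil\log_2 n\rceil+O(\log n).$$

Comparing this with $$K(x)\geq N$$ we get

$$\binom{k}{2}\leq k\lceil\log_2 n\rceil+O(\log n),\quad \mbox{hence}\quad k\leq 2\log_2 n+O(1).$$

Thus a graph with an incompressible adjacency string has no clique and no independent set of size larger than $$2\log_2 n+O(1)$$, which is the bound of Erdős up to an additive constant. No probability was estimated; the only fact used about x is a single inequality about its Kolmogorov complexity.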

## Footnotes

1. In Russian a single word, perebor, which means picking over, is used for this type of algorithm.

2. This number is called RSA-129.

3. This is a simplified version of the Traveling Salesperson Problem, in which we may have different costs associated with different connections.

4. Assuming f is time constructible, sets computable in time f(n) are computable in space f(n)/log f(n); see [132].

5. One can prove that this is an inessential restriction.

6. I deviate from the standard notation, which is P/poly.

7. Note that the input size is $$2^n$$, hence polynomial time means $$2^{cn}$$ for some constant c.

8. This is for circuits in the basis of all binary connectives; o(n) denotes a term of a lower order than n.

9. To give an explicit construction of a Ramsey graph is one of the many famous Erdős problems for which he offered money prizes.

10. In fact one can extract a specific transcendental number from this proof. We can compute solutions of algebraic equations with rational coefficients to arbitrary precision and we can enumerate them (possibly with repetitions). Thus, applying the diagonal trick, we can define the digits of a transcendental number by an algorithm. Similarly, we can enumerate all Boolean functions of n variables and take the first one whose complexity exceeds a given bound. We do not consider such definitions explicit, as the defined entity is chosen by a process that has little to do with the property that we need to ensure.

11. In general, they are only determined up to functional equivalence; I will ignore this subtlety.

12. A related concept had been studied in the research area of algorithmic randomness before the mathematical definition of pseudorandom generators was introduced. In that area the aim was to define and study countably infinite sequences that share properties with typical random sequences.

13. Strictly speaking, the encoding and decoding keys are not only the numbers e and d, but the pairs (e,N) and (d,N).

14. Recently, chips that integrate several cores have been introduced. Computers equipped with such processors can run some processes in parallel.

15. I will only be concerned with the parts of the brain that are responsible for cognitive processes or motor actions.

16. In fact, cn log n size, for c a constant.

17. Also signals between neurons are transmitted chemically. There is a tiny gap between a synapse and another neuron, the synaptic cleft. An electric signal arriving at a synapse causes the release of neurotransmitters into the gap, which in turn triggers, or inhibits, an electric signal in the adjacent neuron. As the gaps are very small, this does not cause much time delay.

18. I prefer to use ‘crossing-over’ since it is less ambiguous than ‘recombination’. Recombination often refers to all possible editing operations that ever occur. For example, sometimes a segment of the string is inverted, but this is rather an error, like a mutation, and as such it may be useful, but most often it is detrimental.

19. One may suggest that a photon is not just a point, but rather a wave packet, and a part of this packet may touch mirror $$M_1$$ while the center of the packet will be at $$M_0$$. But this explanation also does not work, since there is no restriction on the distances in the interferometer. Even if the distances were cosmic, it would behave exactly the same way.

20. We could use alphabets larger than two, but it would be just an unnecessary complication.

21. We will not need bra vectors.

22. We cannot see the superposition because the measuring apparatus can produce only one of k possible values. But we can set the apparatus differently and then what was previously a superposition may become a basis state. In particular, we can detect in the new setting what was originally a superposition.

23. Here is an intuitive explanation. If r does not divide M, the p-gons are not quite regular: one edge is shorter, so we only know that $$\frac{M}{p}<r<\frac{M}{p-1}$$. But if $$M>2r^2$$, this interval is shorter than 1, hence r is determined uniquely.

24. An attentive reader may have noticed that incompressibility is not invariant with respect to different choices of the universal Turing machine. An incompressible string of length n for one machine may have Kolmogorov complexity n − c for a different machine, where c is a nonnegative constant (depending only on the pair of machines). But if n is very large with respect to c, the difference between having Kolmogorov complexity n or n − c is not essential.

25. Using a fixed set of axioms.

26. Here I exceptionally deviate from the convention that theories in this book are always recursively axiomatizable.

## References

[1] Ajtai, M.: $$\varSigma_{1}^{1}$$ formulae on finite structures. Ann. Pure Appl. Log. 24, 1–48 (1983)
[2] Agrawal, M., Kayal, N., Saxena, N.: PRIMES is in P. Ann. Math. 160(2), 781–793 (2004)
[7] Arora, S., Lund, C., Motwani, R., Sudan, M., Szegedy, M.: Proof verification and the hardness of approximation problems. J. ACM 45(3), 501–555 (1998)
[9] Arora, S., Safra, S.: Probabilistic checking of proofs: A new characterization of NP. J. ACM 45(1), 70–122 (1998)
[11] Babai, L.: Trading group theory for randomness. In: Proc. 17th ACM Symp. on Theory of Computing, pp. 421–429 (1985)
[17] Baker, T.P., Gill, J., Solovay, R.: Relativizations of the P = ? NP question. SIAM J. Comput. 4(4), 431–442 (1975)
[19] Bennett, C.H.: Logical reversibility of computation. IBM J. Res. Dev. 17(6), 525–532 (1973)
[20] Bennett, C.H., Wiesner, S.J.: Communication via one- and two-particle operators on Einstein-Podolsky-Rosen states. Phys. Rev. Lett. 69, 2881–2884 (1992)
[25] Blum, L., Shub, M., Smale, S.: On a theory of computation and complexity over the real numbers: NP-completeness, recursive functions and universal machines. Bull. Am. Math. Soc. 21, 1–46 (1989)
[26] Blum, L., Cucker, F., Shub, M., Smale, S.: Complexity and Real Computation. Springer, Berlin (1997)
[27] Blum, N.: A boolean function requiring 3n network size. Theor. Comput. Sci. 28, 337–345 (1984)
[39] Chaitin, G.J.: On the simplicity and speed of programs for computing infinite sets of natural numbers. J. ACM 16(3), 407–422 (1969)
[40] Cheng, Q.: Straight-line programs and torsion points on elliptic curves. Comput. Complex. 12(1), 150–161 (2003)
[48] Cook, S.A.: The complexity of theorem proving procedures. In: Proc. 3rd Annual ACM Symposium on Theory of Computing, pp. 151–158 (1971)
[66] Elitzur, A.C., Vaidman, L.: Quantum mechanical interaction-free measurements. Found. Phys. 23, 987–997 (1993)
[67] Erdős, P.: Some remarks on the theory of graphs. Bull. Am. Math. Soc. 53, 292–294 (1947)
[73] Feynman, R.: Simulating physics with computers. Int. J. Theor. Phys. 21(6–7), 467 (1982)
[83] Furst, M.L., Saxe, J.B., Sipser, M.: Parity, circuits, and the polynomial-time hierarchy. Math. Syst. Theory 17(1), 13–27 (1984)
[87] Gál, A., Hansen, K.A., Koucký, M., Pudlák, P., Viola, E.: Tight bounds on computing error-correcting codes by bounded-depth circuits with arbitrary gates. In: Proc. STOC 2012, pp. 479–494 (2012)
[100] Gödel, K. (ed.): Collected Works: Volume V. Correspondence, H.-Z. Feferman, S., Dawson, J.W., Goldfarb, W., Parsons, C., Sieg, W. (eds.). Oxford University Press, London (2003)
[109] Grover, L.K.: A fast quantum mechanical algorithm for database search. In: Proc. 28th Annual ACM Symposium on the Theory of Computing, pp. 212–218 (1996)
[113] Harrow, A.W., Hassidim, A., Lloyd, S.: Quantum algorithm for linear systems of equations. Phys. Rev. Lett. 103, 150502 (2009)
[114] Hartmanis, J., Stearns, R.E.: On the computational complexity of algorithms. Trans. Am. Math. Soc. 117, 285–306 (1965)
[123] Hilbert, D.: Über die Endlichkeit des Invariantensystems für binäre Grundformen. Math. Ann. 33, 223–226 (1889)
[132] Hopcroft, J., Paul, W.J., Valiant, L.G.: On time vs. space. J. ACM 24(2), 332–337 (1977)
[136] Impagliazzo, R., Wigderson, A.: P = BPP unless E has subexponential circuits: Derandomizing the XOR lemma. In: Proc. 29th STOC, pp. 220–229 (1997)
[142] Jockusch, C.G., Soare, R.I.: $$\varPi^{0}_{1}$$ classes and degrees of theories. Trans. Am. Math. Soc. 173, 33–56 (1972)
[145] Kabanets, V., Impagliazzo, R.: Derandomizing polynomial identity tests means proving circuit lower bounds. Comput. Complex. 13(1–2), 1–46 (2004)
[156] Koblitz, N.: A Course in Number Theory and Cryptography. Springer, New York (1987)
[159] Kollár, J., Rónyai, L., Szabó, T.: Norm-graphs and bipartite Turán numbers. Combinatorica 16(3), 399–406 (1996)
[160] Kolmogorov, A.: On tables of random numbers. Sankhya, Ser. A 25, 369–375 (1963)
[174] Kuperberg, G.: A subexponential-time quantum algorithm for the dihedral hidden subgroup problem. SIAM J. Comput. 35(1), 170–188 (2005)
[176] Ladner, R.E.: On the structure of polynomial time reducibility. J. ACM 22, 155–171 (1975)
[182] Levin, L.: Universal’nye perebornye zadachi. Probl. Inf. Transm. 9(3), 265–266 (1973). (Russian)
[191] Margulis, G.A.: Explicit constructions of expanders. Probl. Pereda. Inf. 9(4), 71–80 (1973)
[198] Miller, G.L.: Riemann’s hypothesis and tests for primality. J. Comput. Syst. Sci. 13(3), 300–317 (1976)
[203] Mulmuley, K., Sohoni, M.: Geometric complexity theory I: An approach to the P vs. NP and related problems. SIAM J. Comput. 31(2), 496–526 (2001)
[209] Nisan, N., Wigderson, A.: Hardness vs. randomness. J. Comput. Syst. Sci. 49(2), 149–167 (1994)
[226] Pudlák, P.: Complexity theory and genetics: The computational power of crossing over. Inf. Comput. 171, 201–223 (2001)
[228] Pudlák, P.: Quantum deduction rules. Ann. Pure Appl. Log. 157, 16–29 (2009)
[234] Rabin, M.O.: Digital signatures and public-key functions as intractable as factorization. MIT Laboratory of Computer Science Technical Report 212 (1979)
[238] Razborov, A.: Lower bounds for the monotone complexity of some boolean functions. Dokl. Akad. Nauk SSSR 281(4), 798–801 (1985). (In Russian; English translation in: Sov. Math. Dokl. 31, 354–357 (1985))
[239] Razborov, A.: Lower bounds on the size of bounded-depth networks over a complete basis with logical addition. Mat. Zametki 41(4), 598–607 (1987). (In Russian; English translation in: Math. Notes Acad. Sci. USSR 41(4), 333–338 (1987))
[240] Razborov, A.: On the method of approximation. In: Proc. of the 21st ACM STOC, pp. 169–176 (1989)
[243] Razborov, A., Rudich, S.: Natural proofs. J. Comput. Syst. Sci. 55(1), 24–35 (1997)
[244] Reisch, S.: Hex ist PSPACE-vollständig (Hex is PSPACE-complete). Acta Inform. 15, 167–191 (1981)
[245] Rissanen, J.: Modeling by the shortest data description. Automatica 14, 465–471 (1978)
[246] Rivest, R., Shamir, A., Adleman, L.: A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM 21(2), 120–126 (1978)
[255] Savitch, W.J., Stimson, M.J.: Time bounded random access machines with parallel processing. J. ACM 26(1), 103–118 (1979)
[257] Schönhage, A., Strassen, V.: Schnelle Multiplikation grosser Zahlen. Computing 7, 281–292 (1971)
[259] Schwartz, J.: Fast probabilistic algorithms for verification of polynomial identities. J. ACM 27, 701–717 (1980)
[264] Shannon, C.E.: The synthesis of two-terminal switching circuits. Bell Syst. Tech. J. 28, 59–98 (1949)
[268] Shor, P.W.: Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM J. Comput. 26(5), 1484–1509 (1997)
[277] Solomonoff, R.: A Preliminary Report on a General Theory of Inductive Inference. Report V-131, Cambridge, Ma., Zator Co. (1960)
[282] Solovay, R.M., Strassen, V.: A fast Monte-Carlo test for primality. SIAM J. Comput. 6(1), 84–85 (1977)
[286] Strassen, V.: Gaussian elimination is not optimal. Numer. Math. 13, 354–356 (1969)
[292] Tsfasman, M.A., Vlăduţ, S.G., Zink, T.: Modular curves, Shimura curves and Goppa codes, better than Varshamov-Gilbert bound. Math. Nachr. 104, 13–28 (1982)
[295] Uhlig, D.: On the synthesis of self-correcting schemes from functional elements with a small number of reliable elements. Mat. Zametki 15(6), 937–944 (1974)
[297] Valiant, L.: The complexity of computing the permanent. Theor. Comput. Sci. 8, 189–201 (1979)
[320] Zippel, R.E.: Probabilistic algorithms for sparse polynomials. In: Proc. EUROSAM’79. Springer Lecture Notes in Computer Science, vol. 72, pp. 216–226 (1979)