Abstract
We provide a detailed estimate of the logical resource requirements of the quantum linear-system algorithm (Harrow et al. in Phys Rev Lett 103:150502, 2009), including the recently described elaborations and application to computing the electromagnetic scattering cross section of a metallic target (Clader et al. in Phys Rev Lett 110:250504, 2013). Our resource estimates are based on the standard quantum-circuit model of quantum computation; they comprise circuit width (related to parallelism), circuit depth (total number of steps), the number of qubits and ancilla qubits employed, and the overall number of elementary quantum gate operations, as well as more specific gate counts for each elementary fault-tolerant gate from the standard set \(\{ X, Y, Z, H, S, T, \text{ CNOT } \}\). To perform these estimates, we used an approach that combines manual analysis with automated estimates generated via the Quipper quantum programming language and compiler. Our estimates pertain to the explicit example problem size \(N=332{,}020{,}680\), beyond which, according to a crude big-O complexity comparison, the quantum linear-system algorithm is expected to run faster than the best known classical linear-system-solving algorithm. For this problem size, a desired calculation accuracy \(\varepsilon =0.01\) requires an approximate circuit width of 340 and a circuit depth of order \(10^{25}\) if oracle costs are excluded, and a circuit width and circuit depth of order \(10^8\) and \(10^{29}\), respectively, if the resource requirements of oracles are included, indicating that the commonly ignored oracle resources are considerable. In addition to providing detailed logical resource estimates, it is also the purpose of this paper to demonstrate explicitly (using a fine-grained approach rather than relying on coarse big-O asymptotic approximations) how these impressively large numbers arise with an actual circuit implementation of a quantum algorithm.
While our estimates may prove to be conservative as more efficient advanced quantum-computation techniques are developed, they nevertheless provide a valid baseline for research targeting a reduction of the algorithmic-level resource requirements, implying that a reduction by many orders of magnitude is necessary for the algorithm to become practical.
1 Introduction
Quantum computing promises to efficiently solve certain hard computational problems for which it is believed no efficient classical algorithms exist [1]. Designing quantum algorithms with a computational complexity superior to that of their best known classical counterparts is an active research field [2]. The quantum linear-system algorithm (QLSA), first proposed by Harrow et al. [3], afterward improved by Ambainis [4], and recently generalized by Clader et al. [5], is appealing because of its great practical relevance to modern science and engineering. This quantum algorithm solves a large system of linear equations, under certain conditions, exponentially faster than any current classical method.
The basic idea of QLSA, essentially a matrix-inversion quantum algorithm, is to convert a system of linear equations, \(A{\mathbf {x}}={\mathbf {b}}\), where A is a Hermitian^{Footnote 1} \(N\times N\) matrix over the field of complex numbers \({\mathbb {C}}\) and \({\mathbf {x}},{\mathbf {b}}\in {\mathbb {C}}^N\), into an analogous quantum-theoretic version, \(A\left| x\right\rangle =\left| b\right\rangle \), where \(\left| x\right\rangle , \left| b\right\rangle \) are vectors in a Hilbert space \({\mathscr {H}} =({\mathbb {C}}^2)^{\otimes n}\) corresponding to \(n=\lceil \log _2N\rceil \) qubits and A is a self-adjoint operator on \({\mathscr {H}}\), and then to use various quantum-computation techniques [1, 6,7,8] to solve for \(\left| x\right\rangle \).
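The Hermiticity requirement (see footnote 1) is not restrictive: a non-Hermitian system can be embedded into a Hermitian one of twice the size via the standard block construction \(C=\left( {\begin{matrix} 0 &{} A \\ A^\dagger &{} 0 \end{matrix}}\right) \). The following minimal classical sketch of this embedding is our own illustration (not code from the paper); all names are ours.

```python
import numpy as np

def hermitian_embedding(A, b):
    """Embed A x = b into C y = d with C = [[0, A], [A^dag, 0]] Hermitian."""
    n = A.shape[0]
    Z = np.zeros((n, n), dtype=complex)
    C = np.block([[Z, A], [A.conj().T, Z]])
    d = np.concatenate([b, np.zeros(n, dtype=complex)])
    return C, d

# Example: a non-Hermitian 2x2 system
A = np.array([[1.0, 2.0], [0.0, 1.0]], dtype=complex)
b = np.array([1.0, 1.0], dtype=complex)
C, d = hermitian_embedding(A, b)
assert np.allclose(C, C.conj().T)       # C is Hermitian by construction
y = np.linalg.solve(C, d)               # y = (0, x): solution sits in lower block
assert np.allclose(A @ y[2:], b)        # the lower block solves the original system
```

Solving the doubled system \(Cy=d\) with \(y=(0,{\mathbf {x}})\) recovers the original solution, which is why the Hermitian case treated by QLSA covers the general one.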
Extended modifications of QLSA have also been applied to other important problems (cf. [2]), such as least-squares curve-fitting [9], solving linear differential equations [10], and machine learning [11]. Recent efforts demonstrating small-scale experimental implementations of QLSA [12, 13] have further highlighted its popularity.
1.1 Objective of this work
The main objective of this paper is to provide a detailed logical resource estimate (LRE) analysis of QLSA based on its further elaborated formulation [5]. Our analysis also aims, in particular, at including the commonly ignored resource requirements of oracle implementations. In addition to providing a detailed LRE for a large practical problem size, another important purpose of this work is to demonstrate explicitly, i.e., using a fine-grained approach rather than relying on big-O asymptotic approximations, how the concrete resource counts accumulate in an actual quantum-circuit implementation of a quantum algorithm.
Our LRE is based on an approach that combines manual analysis with automated estimates generated via the programming language Quipper and its compiler. Quipper [14, 15] is a domain-specific, higher-order, functional language for quantum computation, embedded in the host language Haskell. It allows automated quantum-circuit generation and manipulation; equipped with a gate-count operation, Quipper offers a universal automated LRE tool. We demonstrate how Quipper’s powerful capabilities have been exploited for the purpose of this work.
We underline that our research contribution is not merely to provide the LRE results, but also to demonstrate how a concrete resource estimation can be done for a quantum algorithm used to solve a practical problem of large size. Finally, we would also like to emphasize the modular nature of our approach, which allows us to incorporate future work as well as to assess the impact of prospective advancements in quantum-computation techniques.
1.2 Context and setting of this work
Our analysis was performed within the scope of a larger context: the IARPA Quantum Computer Science (QCS) program [16], whose goals were to achieve an accurate estimation and, moreover, a significant reduction of the computational resources required to implement quantum algorithms for practically relevant problem sizes on a realistic quantum computer. The work presented here was conducted as part of our general approach to tackling the challenges of the IARPA QCS program: the PLATO project,^{Footnote 2} which stands for “Protocols, Languages and Tools for Resource-efficient Quantum Computation.”
The QCS program BAA [17] presented a list of seven algorithms to be analyzed. For the purpose of evaluating the work, the algorithms were specified in “government-furnished information” (GFI) using pseudo-code to describe purely quantum subroutines and explicit oracles, supplemented by Python or MATLAB code to compute parameters or oracle values. While this IARPA QCS program GFI is not available as published material,^{Footnote 3} the Quipper code developed as part of the PLATO project to implement the algorithms and used for our LRE analyses is available as published library code [18, 19]. In our analyses, we found the studied algorithms to cover a wide range of quantum-computation techniques. Additionally, with the algorithm parameters supplied for our analyses, we have seen a wide range of complexities as measured by the total number of gate operations required, including some that could not be executed within the expected life of the universe under current predictions of what a practical quantum computer would be like when it is developed.
This approach is consistent with the one commonly used in computer science for algorithm analysis. There are at least two reasons for looking at large problem sizes. First, in classical computing, we have often been wrong in trying to predict how computing resources will scale across periods of decades. We can expect to make more accurate predictions in some areas of quantum computing because we are dealing with basic physical properties that are relatively well studied. However, disruptive changes may still occur.^{Footnote 4} Thus, in computer science, one likes to understand the effect of scale even when it goes beyond what is currently considered practical. The second reason for considering very large problem sizes, even those beyond a practical scale, is to develop the level of abstraction necessary to cope with them. The resulting techniques are not tied to a particular size or problem and can then be adapted to a wide range of algorithms and sizes. In practice, some of our original tools and techniques were developed while expecting smaller algorithm sizes. Developing techniques that enable us to cope with large algorithm sizes resulted in speeding up the analysis for small algorithm sizes as well.
Our focus in this paper is the logical part of the quantum algorithm implementation. More precisely, here we examine only the algorithmic-level logical resources of QLSA and do not account for all the physical overhead costs associated with techniques to enable a fault-tolerant implementation of this algorithm on a realistic quantum computer under real-world conditions. Such techniques include particularly quantum control (QC) protocols and quantum error correction (QEC) and/or mitigation codes. Nor do we take into account quantum communication costs required to establish interactions between two distant qubits so as to implement a two-qubit gate between them. These additional physical resources will depend on the actual physical realization of a quantum computer (ion traps, neutral atoms, quantum dots, superconducting qubits, photonics, etc.) and also include various other costs, such as those due to physical qubit movements in a given quantum-computer architecture, their storage in quantum memories, etc. The resource estimates provided here are for the abstract logical quantum circuit of the algorithm, assuming no errors due to real-world imperfections, no QC or QEC protocols, and no connectivity constraints for a particular physical implementation.
Determining the algorithmic-level resources is a very important and indispensable first step toward a complete analysis of the overall resource requirements of each particular real-world quantum-computer implementation of an algorithm, for the following reasons. First, it helps to understand the structural features of the algorithm, and to identify the actual bottlenecks of its quantum-circuit implementation. Second, it helps to differentiate between the resource costs that are associated with the algorithmic logical-level implementation (which are estimated here) and the additional overhead costs associated with physically implementing the computation in a fault-tolerant fashion, including quantum-computer-technology-specific resources. Indeed, the algorithmic-level LRE constitutes a lower bound on the minimum resource requirements that is independent of which QEC or QC strategies are employed to establish fault tolerance, and independent of the physics details of the quantum-computer technology. For this reason, it is crucial to develop techniques and tools for resource-efficient quantum computation even at the logical quantum-circuit level of the algorithm implementation. The LRE for QLSA provided in this paper will serve as a baseline for research into the reduction of the algorithmic-level minimum resource requirements.
Finally, we emphasize that our LRE analysis addresses only the resource requirements for a single run of QLSA; it does not account for the fact that the algorithm needs to be run many times, followed by sampling, in order to achieve an accurate and reliable result with high probability.
1.3 Review of previous work
The key ideas underlying QLSA [3,4,5] can be briefly summarized as follows; for a detailed description, see Sect. 3. The preliminary step consists of converting the given system of linear equations \(A{\mathbf {x}}={\mathbf {b}}\) (with \({\mathbf {x}},{\mathbf {b}}\in {\mathbb {C}}^N\) and A a Hermitian \(N\times N\) matrix with \(A_{ij}\in {\mathbb {C}}\)) into the corresponding quantum-theoretic version \(A\left| x\right\rangle =\left| b\right\rangle \) over a Hilbert space \({\mathscr {H}}=({\mathbb {C}}^2)^{\otimes n}\) of \(n=\lceil \log _2N\rceil \) qubits. It is important to formulate the original problem such that the operator \(A:{\mathscr {H}}\rightarrow {\mathscr {H}}\) is self-adjoint, see footnote 1.
Provided that oracles exist to efficiently compute A and prepare the state \(\left| b\right\rangle \), the main task of QLSA is to solve for \(\left| x\right\rangle \). According to the spectral theorem for self-adjoint operators, the solution can be formally expressed as \(\left| x\right\rangle =A^{-1}\left| b\right\rangle =\sum _{j=1}^N(\beta _j/\lambda _j)\left| u_j\right\rangle \), where \(\lambda _j\) and \(\left| u_j\right\rangle \) are the eigenvalues and eigenvectors of A, respectively, and \(\left| b\right\rangle =\sum _{j=1}^N\beta _j\left| u_j\right\rangle \) is the expansion of the quantum state \(\left| b\right\rangle \) in terms of these eigenvectors. QLSA is designed to implement this representation.
QLSA starts by preparing (in a multi-qubit data register) the known quantum state \(\left| b\right\rangle \) using an oracle for the vector \({\mathbf {b}}\). Next, the Hamiltonian evolution \(\exp (iA\tau /T)\), with A as the Hamilton operator, is applied to \(\left| b\right\rangle \). This is accomplished by using an oracle for the matrix A and Hamiltonian Simulation (HS) techniques [8]. The Hamiltonian evolution is part of the well-established technique known as the quantum phase estimation algorithm (QPEA) [6, 7], here employed as a sub-algorithm of QLSA to acquire information about the eigenvalues \(\lambda _j\) of A and store them in QPEA’s control register. In the next step, a single-qubit ancilla starting in state \(\left| 0\right\rangle \) is rotated by an angle inversely proportional to the eigenvalues \(\lambda _j\) of A stored in QPEA’s control register. Finally, the latter are uncomputed by the inverse QPEA, yielding a quantum state of the form \(\sum _{j=1}^N\beta _j\sqrt{1-C^2/\lambda ^2_j}\left| u_j\right\rangle \otimes \left| 0\right\rangle +\sum _{j=1}^N (C\beta _j/\lambda _j)\left| u_j\right\rangle \otimes \left| 1\right\rangle \), with the solution \(\left| x\right\rangle \) correlated with the value 1 in the auxiliary single-qubit register. Thus, if the latter is measured and the value 1 is found, we know with certainty that the desired solution of the problem is stored in the quantum amplitudes of the multi-qubit quantum register in which \(\left| b\right\rangle \) was initially prepared. The solution can then either be revealed by an ensemble measurement (a statistical process requiring the whole procedure to be run many times), or useful information can be obtained by computing its overlap \(\left| \langle R | x \rangle \right| ^2\) with a particular (known) state \(\left| R\right\rangle \) (corresponding to a specific vector \({\mathbf {R}}\in {\mathbb {C}}^N\)) that has been prepared in a separate quantum register [5].
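The state transformation just described can be emulated classically for a toy problem size, which makes the role of the eigendecomposition and of the constant C explicit. The following sketch is our own illustration (not the paper's code); the matrix and vector are random.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
M = rng.standard_normal((N, N))
A = (M + M.T) / 2                       # Hermitian (real symmetric) toy A
b = rng.standard_normal(N)
b = b / np.linalg.norm(b)               # normalized |b>

lam, U = np.linalg.eigh(A)              # eigenvalues lambda_j, eigenvectors u_j
beta = U.T @ b                          # beta_j = <u_j|b>
C = np.min(np.abs(lam))                 # C <= |lambda_j| keeps all amplitudes <= 1
amps = C * beta / lam                   # amplitudes of the "success" (|1>) branch
x = U @ (beta / lam)                    # sum_j (beta_j/lambda_j)|u_j> = A^{-1}|b>
assert np.allclose(A @ x, b)            # this vector indeed solves A x = b
p_success = np.sum(amps**2)             # probability of measuring the ancilla in |1>
assert 0 < p_success <= 1
```

Note how the success probability \(\sum _j C^2\beta _j^2/\lambda _j^2\) shrinks when \(\left| b\right\rangle \) has weight on large eigenvalues, which is why post-selection (or amplitude amplification, discussed below) matters.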
Harrow, Hassidim and Lloyd (HHL) [3] showed that, provided the matrix A is well-conditioned and sparse (or can efficiently be decomposed into a sum of sparse matrices), and provided the elements of the matrix A and the vector \({\mathbf {b}}\) can be computed efficiently, QLSA provides an exponential speedup over the best known classical linear-system-solving algorithm. The performance of any matrix-inversion algorithm depends crucially on the condition number \(\kappa \) of the matrix A, i.e., the ratio between A’s largest and smallest eigenvalues. A large condition number means that A is close to a matrix which cannot be inverted, referred to as “ill-conditioned”; the lower the value of \(\kappa \), the more “well-conditioned” A is. Note that \(\kappa \) is a property of the matrix A and not of the linear-system-solving algorithm. Roughly speaking, \(\kappa \) characterizes the stability of the solution \({\mathbf {x}}\) with respect to changes in the given vector \({\mathbf {b}}\). Further important parameters to be taken into account are the sparseness d (i.e., the maximum number of nonzero entries per row/column of the matrix A), the size N of the square matrix A, and the desired precision of the calculation, represented by the error bound \(\varepsilon \).
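The role of \(\kappa \) as a sensitivity amplifier can be illustrated with a small numerical sketch (our own example, not from the paper):

```python
import numpy as np

A = np.diag([100.0, 1.0])               # eigenvalues 100 and 1, so kappa = 100
kappa = np.linalg.cond(A)
assert np.isclose(kappa, 100.0)

b = np.array([1.0, 1.0])
x = np.linalg.solve(A, b)
db = np.array([0.0, 0.01])              # small perturbation of b
dx = np.linalg.solve(A, b + db) - x

# The relative error in x can be amplified by up to kappa relative to b's:
rel_b = np.linalg.norm(db) / np.linalg.norm(b)
rel_x = np.linalg.norm(dx) / np.linalg.norm(x)
assert rel_x <= kappa * rel_b           # standard perturbation bound holds
```

This is the classical statement of the stability remark above: the bound \(\Vert \delta {\mathbf {x}}\Vert /\Vert {\mathbf {x}}\Vert \le \kappa \, \Vert \delta {\mathbf {b}}\Vert /\Vert {\mathbf {b}}\Vert \) is tight for unfavorable choices of \({\mathbf {b}}\) and \(\delta {\mathbf {b}}\).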
In [3] it was shown that the number of operations required for QLSA scales as
\[ \widetilde{O}\left( \kappa ^2 d^2 \log (N)/\varepsilon \right) , \qquad \qquad (1) \]
while the best known classical linear-system-solving algorithm, based on the conjugate gradient method [20, 21], has the runtime complexity
\[ O\left( N d \kappa \log (1/\varepsilon )\right) , \qquad \qquad (2) \]
where, compared to \(O(\cdot )\), the \({\widetilde{O}}(\cdot )\) notation suppresses more slowly growing terms. Thus, it was concluded in [3] that, in order to achieve an exponential speedup of QLSA over classical algorithms, \(\kappa \) must scale, in the worst case, as \(\text{ poly }\log (N)\) with the size of the \(N\times N\) matrix A.
The original HHL-QLSA [3] has the drawback of being non-deterministic: accessing information about the solution is conditional on recording outcome 1 in a measurement of an auxiliary single qubit, thus in the worst case requiring many iterations until a successful measurement event is observed. To substantially increase the success probability of this measurement event, which indicates that the inversion \(A^{-1}\) has been successfully performed and the solution \(\left| x\right\rangle \) (up to normalization) has been successfully computed (i.e., the probability that the post-selection succeeds), HHL-QLSA includes a procedure based on quantum amplitude amplification (QAA) [22]. However, in order to determine the normalization factor of the actual solution vector \(\left| x\right\rangle \), the success probability of obtaining 1 must be “measured,” requiring many runs to acquire sufficient statistics. In addition, because access to the entire solution \(\left| x\right\rangle \) is impractical, as it is a vector in an exponentially large space, HHL suggested that information about the solution can be extracted by calculating the expectation value \(\left\langle x\right| {\hat{M}}\left| x\right\rangle \) of an arbitrary quantum-mechanical operator \({\hat{M}}\), corresponding to a quadratic form \({\mathbf {x}}^TM{\mathbf {x}}\) with some \(M\in {\mathbb {C}}^{N\times N}\) representing the feature of \({\mathbf {x}}\) that one wishes to evaluate. But such a solution readout is generally also a nontrivial task and would typically require the whole algorithm to be repeated numerous times.
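The advantage of amplitude amplification over naive repetition can be conveyed with a back-of-envelope sketch (our own; the probability value is arbitrary and purely illustrative):

```python
import math

# Expected effort to obtain the ancilla outcome 1, with and without QAA.
p = 1e-4                                # illustrative post-selection probability
runs_naive = 1 / p                      # naive repetition: ~O(1/p) runs on average
runs_qaa = math.ceil(math.pi / (4 * math.sqrt(p)))  # QAA: ~O(1/sqrt(p)) iterations
assert runs_naive == 10000
assert runs_qaa < runs_naive            # quadratic saving, here 79 vs 10000
```

The quadratic gap \(O(1/p)\) versus \(O(1/\sqrt{p})\) is what motivates building QAA into the algorithm despite its extra circuitry.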
In subsequent work, Ambainis [4] proposed using variable-time quantum amplitude amplification to improve the runtime of the HHL algorithm from \({\widetilde{O}}(\kappa ^2\log N)\) to \({\widetilde{O}}(\kappa \log ^3 \kappa \log N)\), thus achieving an almost optimal dependence on the condition number \(\kappa \).^{Footnote 5} However, this improvement of the runtime's dependence on \(\kappa \) was attained at the cost of substantially worsening its scaling in the error bound \(\varepsilon \).
The recent QLSA analysis by Clader, Jacobs and Sprouse (CJS) [5] incorporates useful generalizations to make the original algorithm more practical. In particular, a general method is provided for efficient preparation of the generic quantum state \(\left| b\right\rangle \) (as well as of \(\left| R\right\rangle \)). Moreover, CJS proposed a deterministic version of the algorithm by removing the post-selection step and demonstrating a resolution of the readout problem discussed above. This was achieved by introducing several additional single-qubit ancillae and using the quantum amplitude estimation (QAE) technique [22] to deterministically estimate the success probabilities of certain ancilla measurement events, in terms of which the overlap \(\left| \langle R | x \rangle \right| ^2\) of the solution \(\left| x\right\rangle \) with any generic state \(\left| R\right\rangle \) can be expressed after performing a controlled swap operation between the registers storing these vectors. Finally, CJS also addressed the condition-number scaling problem and showed how, by incorporating matrix preconditioning into QLSA, the class of problems that can be solved with exponential speedup can be expanded to matrices whose condition number scales worse than \(\kappa \sim \text{ poly }\log (N)\). With these generalizations aiming to improve the efficiency and practicality of the algorithm, CJS-QLSA was shown to have the runtime complexity^{Footnote 6}
\[ \widetilde{O}\left( \kappa \, d^7 \log (N)/\varepsilon ^2\right) , \qquad \qquad (3) \]
which is quadratically better in \(\kappa \) than the original HHL-QLSA. To demonstrate their method, CJS applied QLSA to computing the electromagnetic scattering cross section of an arbitrary object, using the finite-element method (FEM) to transform Maxwell’s equations into a sparse linear system [23, 24].
1.4 How does our approach differ from previous work?
In the previous analyses of QLSA [3,4,5], resource estimation was performed using “big-O” complexity analysis, which means that it addressed only the asymptotic behavior of the runtime of QLSA, with reference to a similar big-O characterization of the best known classical linear-system-solving algorithm. Big-O complexity analysis is a fundamental technique that is widely used in computer science to classify algorithms; indeed, it represents the core characterization of the most significant features of an algorithm, both in classical and in quantum computing. This technique is critical to understanding how the use of resources and time grows as the inputs to an algorithm grow. It is particularly useful for comparing algorithms in a way in which details, such as startup costs, do not eclipse the costs that become important for the larger problems where resource usage typically matters. However, this analysis assumes that those constant costs are dwarfed by the asymptotic costs for problems of interest, as has typically proven true for practical classical algorithms. In the QCS program, we set out to additionally learn (1) whether this assumption holds true for quantum algorithms, and (2) what the actual resource requirements would be, as part of starting to understand what would be required for a quantum computer to be practical.
In spite of its key relevance for analyzing algorithmic efficiency, a big-O analysis is not designed to provide a detailed accounting of the resources required for any specific problem size. That is not its purpose; rather, it focuses on determining the asymptotic leading-order behavior of a function, and does not account for the constant factors multiplying the various terms in the function. In contrast, in our case we are interested, for a specific problem input size, in detailed information on such aspects as the number of qubits required, the size of the quantum circuit, and the runtime required for the algorithm. These aspects, in turn, are critical to evaluating the practicality of actually implementing the algorithm on a quantum computer.
Thus, in this work we report a detailed analysis of the number of qubits required, the quantity of each type of elementary quantum logic gate, the width and depth of the quantum circuit, and the number of logical time-steps needed to run the algorithm—all for a realistic set of parameters \(\kappa , d\), N, and \(\varepsilon \). Such a fine-grained approach to concrete resource estimation may help to identify the actual bottlenecks in the computation, on which algorithm optimizations should particularly focus. Note that this is similar to the practice in classical computing, where we would typically use techniques like runtime profiling to determine algorithmic bottlenecks for the purpose of program optimization. It almost goes without saying that the big-O analyses in [3,4,5] and the more fine-grained LRE analysis approach presented here are both valuable and complement each other.
Two more differences are worth mentioning. Unlike previous analyses of QLSA, our LRE analysis also includes the resource requirements of oracle implementations. Finally, this work leverages novel universal automated circuit-generation and resource-counting tools (e.g., Quipper) that are currently being developed for resource-efficient implementations of quantum computation. As such, our work advances efforts and techniques toward practical implementations of QLSA and other quantum algorithms.
1.5 Main results of this work
We find that surprisingly large logical gate counts and circuit depth would be required for QLSA to exceed the performance of a classical linear-system-solving algorithm. Our estimates pertain to the specific problem size \(N=332{,}020{,}680\). This explicit example problem size has been chosen such that QLSA and the best known classical linear-system-solving method are expected to require roughly the same number of operations to solve the problem, assuming equal algorithmic precisions; it is obtained by comparing the corresponding big-O estimates, Eqs. (3) and (2). Thus, beyond this “crossover point” the quantum algorithm is expected to run faster than any classical linear-system-solving algorithm. Assuming an algorithmic accuracy \(\varepsilon =0.01\), gate counts and circuit depth of order \(10^{29}\) or \(10^{25}\) are found, depending on whether or not we take the resource requirements for oracle implementations into account, while the number of qubits used simultaneously amounts to \(10^8\) or 340, respectively. These numbers are several orders of magnitude larger than we had initially expected from the big-O analyses in [3, 5], indicating that the constant factors (which are not included in the asymptotic big-O estimates) must be large. This suggests that more research is needed into whether asymptotic analysis needs to be supplemented, particularly when comparing quantum to classical algorithms.
To get an idea of our results’ implications, we note that the practicality of implementing a quantum algorithm can be strongly affected by the number of qubits and quantum gates required. For example, the algorithm’s runtime crucially depends on the circuit depth. With a circuit depth on the order of \(10^{25}\), and with gate operation times of 1 ns (as an example), the computation would take approx. \(3\times 10^8\) years. And such large resource estimates arise for the solely logical part of the algorithm implementation, i.e., even assuming perfect gate performance and ignoring the additional physical overhead costs (associated with QEC/QC to achieve fault tolerance and with the specifics of quantum-computer technology). In practice, the full physical resource estimates will typically be even larger by several orders of magnitude.
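The runtime figure quoted above follows from simple arithmetic, which the following snippet checks:

```python
# 10^25 sequential logical steps at an (assumed) 1 ns per step.
depth = 1e25
seconds = depth * 1e-9                  # total runtime in seconds
years = seconds / (365.25 * 24 * 3600)  # convert using the Julian year
assert 2.9e8 < years < 3.3e8            # approx. 3 x 10^8 years, as stated
```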
One of the main purposes of this paper is to demonstrate how the impressively large LRE numbers arise and to explain the actual bottlenecks in the computation. We find that the dominant resource-consuming part of QLSA is Hamiltonian Simulation and the accompanying quantum-circuit implementations of the oracle queries associated with the Hamiltonian matrix A. Indeed, to accurately implement each run of the Hamiltonian evolution as part of QPEA, one requires a large time-splitting factor of order \(10^{12}\) when utilizing the Suzuki higher-order integrator method including Trotterization [8, 25, 26]. And each single time-step involves numerous oracle queries for the matrix A, where each query’s quantum-circuit implementation contributes a further factor of several orders of magnitude to the gate count. Hence, our LRE results suggest that the resource requirements of QLSA are to a large extent dominated by the numerous oracle-A queries and their associated resource demands. Finally, our results also reveal a lack of parallelism: the algorithmic structure of QLSA is such that most gates must be performed successively rather than in parallel.
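To convey why the time-splitting factor is so large, the following rough sketch uses the generic Suzuki product-formula scaling \(r \sim t^{1+1/2k}/\varepsilon ^{1/2k}\) for an order-2k integrator, with all constants set to 1 (our own simplification, not the paper's exact accounting; the evolution time below is illustrative, not the paper's value):

```python
import math

def trotter_slices(t, eps, k):
    """Rough number of time slices r ~ t^(1+1/2k) / eps^(1/2k), constants dropped."""
    return math.ceil(t ** (1 + 1 / (2 * k)) / eps ** (1 / (2 * k)))

# QPEA requires long evolution times (t grows with kappa and 1/eps), so even a
# modest-looking t drives r to enormous values:
t = 1e7                                 # illustrative total evolution time
r = trotter_slices(t, eps=0.01, k=2)    # order-4 Suzuki integrator
assert r > 1e9                          # already beyond a billion slices
```

Since every slice queries the oracle for A several times, and each query itself decomposes into many elementary gates, the multiplicative build-up toward \(10^{25}\) depth becomes plausible.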
Our LRE results are intended to serve as a baseline for research into the reduction of the logical resource requirements of QLSA. Indeed, we anticipate that our estimates may prove to be conservative as more efficient quantum-computation techniques become available. However, these estimates indicate that, for QLSA to become practical (i.e., for its real-world implementation to be viable for relevant problem sizes), a resource reduction by many orders of magnitude is necessary (as is suggested, e.g., by the \(\sim 3\times 10^8\) years for the optimistic runtime estimate given current knowledge).
1.6 Outline of the paper
This paper is organized as follows. In Sect. 2 we identify the resources to be estimated and expand on our goals and techniques used. In Sect. 3 we describe the structure of QLSA and elaborate on its coarsegrained profiling with respect to resources it consumes. Section 4 demonstrates our quantum implementation of oracles and the corresponding automated resource estimation using our quantum programming language Quipper (and compiler). Our LRE results are presented in Sect. 5 and further reviewed in Sect. 6. We conclude with a brief summary and discussion in Sect. 7.
2 Resource estimation
As mentioned previously, the main goal of this work is to find concrete logical resource estimates for QLSA, as accurately as possible, for a problem size for which the quantum algorithm and the best known classical linear-system-solving algorithm are expected to require runtimes of a similar order of magnitude, and beyond which the former provides an exponential speedup over the latter. An approximation for this specific “crossover point” problem size can be derived by comparing the coarse big-O runtime estimates of the classical and quantum algorithms, provided, respectively, by Eqs. (2) and (3), assuming the same algorithmic computation precision \(\varepsilon \), and the same \(\kappa \) and d values.^{Footnote 7} For instance, choosing the accuracy \(\varepsilon =0.01\) and presuming \(d\approx 10\) yields the approximate value \(N_{\mathrm{{cross}}}\approx 4\times 10^7\) for the crossover point. The specified example problem that has been subject to our LRE analysis has the somewhat larger size \(N=332{,}020{,}680\), while the other relevant parameters have the values \(\kappa =10^4, d=7\), and \(\varepsilon =10^{-2}\).
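A toy version of such a crossover computation can be sketched as follows. This is our own simplification, not the paper's derivation: we equate an HHL-type quantum cost \(\kappa ^2 d^2 \ln (N)/\varepsilon \) with the classical conjugate-gradient cost \(N d \kappa \ln (1/\varepsilon )\), with all big-O constants set to 1, and solve for N by fixed-point iteration.

```python
import math

kappa, d, eps = 1e4, 10.0, 0.01

# Equality  kappa^2 d^2 ln(N)/eps = N d kappa ln(1/eps)  rearranges to the
# fixed-point equation  N = kappa d ln(N) / (eps ln(1/eps)):
N = 1e6                                 # initial guess
for _ in range(50):
    N = kappa * d * math.log(N) / (eps * math.log(1 / eps))

assert 1e7 < N < 1e8                    # of order 4 x 10^7 under these assumptions
quantum = kappa**2 * d**2 * math.log(N) / eps
classical = N * d * kappa * math.log(1 / eps)
assert abs(quantum - classical) / classical < 1e-6   # costs indeed coincide at N
```

The iteration converges quickly because the map \(N \mapsto c\ln N\) is strongly contracting for large N; under these assumed cost forms it lands near the quoted \(4\times 10^7\).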
Logical resources to be tracked are: the overall number of qubits (whereby we track data qubits and ancilla qubits separately); circuit width (i.e., the maximum number of qubits in use at any one time, which also corresponds to the maximum number of “wires” in the algorithm’s circuit); circuit depth (i.e., the total number of logical steps, specifying the length of the longest path through the algorithm’s circuit assuming maximum parallelism); the number of elementary (1- and 2-qubit) gate operations (thereby tracking the quantity of each particular type of gate operation); and “T-depth” (i.e., the total number of logical steps containing at least one T-gate operation, meaning the total number of T-gate operations that cannot be performed in parallel but must be implemented successively in series). While we are not considering the costs of QEC in this paper, it is nevertheless important to know that, when QEC is considered, the T gate, as a non-transversal gate, has a much higher per-gate resource cost than the transversal gates X, Y, Z, H, S, and CNOT, and thus contributes more to the algorithm's resources than the latter. It is for this reason that we call out the T-depth separately.
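These quantities can be made concrete with a minimal bookkeeping sketch (our own illustration; the tiny circuit and its representation as parallel "moments" are arbitrary choices, not the paper's data structures):

```python
from collections import Counter

# Each gate is (name, qubits); a "moment" groups gates acting in parallel.
circuit = [
    [("H", (0,)), ("H", (1,))],         # moment 1: two H gates in parallel
    [("CNOT", (0, 1))],                 # moment 2
    [("T", (1,)), ("X", (2,))],         # moment 3: contains a T gate
    [("T", (0,))],                      # moment 4: contains a T gate
]

counts = Counter(name for moment in circuit for name, _ in moment)
width = 1 + max(q for moment in circuit for _, qs in moment for q in qs)
depth = len(circuit)                    # number of logical time-steps
t_depth = sum(1 for moment in circuit if any(n == "T" for n, _ in moment))

assert counts["H"] == 2 and counts["T"] == 2 and counts["CNOT"] == 1
assert width == 3 and depth == 4 and t_depth == 2
```

Note that T-depth (2) is smaller than the T count only when T gates share a moment; here the two T gates sit in different moments, so T-depth counts both.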
Note that the analysis in this paper involves only the abstract algorithmic-level logical resources; i.e., we ignore all additional costs that must be taken into account when implementing the algorithm on a fault-tolerant real-world quantum computer, namely resources associated with techniques to avoid, mitigate, or correct errors that occur due to decoherence and noise. More specifically, here we omit the overhead resource costs associated with various QC and QEC strategies. We furthermore assume no connectivity constraints, thus ignoring resources needed to establish fault-tolerant quantum communication channels between two distant (physically remotely located) qubits which need to interact in order to implement a two-qubit gate such as a CNOT in the course of the algorithm implementation. Besides being an indispensable first step toward a complete resource analysis of any quantum algorithm, focusing on the algorithmic-level resources allows setting a lower limit on resource demands that is independent of the details of QEC approaches and physical implementations, such as qubit technology.
To be able to represent large circuits and determine estimates of their resource requirements, we take advantage of repetitive patterns and the hierarchical nature of circuit decomposition down to elementary quantum gates, together with the associated coarse-grained profiling of logical resources. For example, we generate “templates” representing circuit blocks that are reused frequently. These templates capture both the quantum circuits of the corresponding algorithmic building blocks (subroutines or multi-qubit gates) and their associated resource counts. As an example, it is useful to have a template for the Quantum Fourier Transform (or its inverse) acting on n qubits; for other templates, see Fig. 2 and “Appendix 2.” The cost of a subroutine may thereby be measured in terms of the number of specified gates, data qubits, ancilla uses, etc., and/or in terms of the calls of lower-level sub-subroutines and their associated costs. Furthermore, the cost may vary depending on the input argument values of the subroutine. Many of the intermediate steps represent multi-qubit gates that are frequently used within the overall circuit. Such intermediate representations can therefore also improve the efficiency of the data representation. Accordingly, each higher-level circuit block is decomposed in a hierarchical fashion, in a series of steps, down to elementary gates from the standard set \(\{ X, Y, Z, H, S, T, \text{ CNOT } \}\), using the decomposition rules for circuit templates (see “Appendices 1 and 2” for details).
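The hierarchical roll-up of template costs can be sketched as follows; the template names and gate counts here are purely illustrative placeholders, not taken from the paper's circuits:

```python
from collections import Counter

# Illustrative template table: each template lists its direct elementary-gate
# costs plus (callee, multiplicity) pairs for lower-level subroutines.
templates = {
    "QFT":      {"gates": Counter({"H": 4, "S": 3, "T": 6}), "calls": []},
    "Grover":   {"gates": Counter({"X": 2, "CNOT": 5}), "calls": [("QFT", 2)]},
    "TopLevel": {"gates": Counter(), "calls": [("Grover", 100)]},
}

def total_cost(name):
    """Recursively accumulate elementary-gate counts for a template."""
    t = templates[name]
    total = Counter(t["gates"])
    for callee, mult in t["calls"]:
        for gate, n in total_cost(callee).items():
            total[gate] += mult * n
    return total

print(dict(total_cost("TopLevel")))
```

The multiplicative call counts are exactly how the large iterative factors of QPEA, QAEA, and Trotterization compound into the overall gate counts.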
Indeed, QLSA works with many repetitive patterns of quantum circuits involving numerous iterative operations, repeated a large number of times. Repetitive patterns arise from well-established techniques such as Quantum Phase Estimation, Quantum Amplitude Estimation, and Hamiltonian Simulation based on the Suzuki higher-order integrator decomposition and Trotterization. These techniques involve large iterative factors, contributing many orders of magnitude to the resource requirements, in particular to the circuit depth; it is these large iterative factors that explain the very large gate counts and circuit depth we obtain.
It is useful to differentiate between the resources associated with the “bare algorithm,” excluding oracle implementations, and those which also include the implementation of oracles. In order to perform the LRE, we chose an approach which combines manual analysis for the bare algorithm, ignoring the cost of oracle implementations (see Sect. 3), with automated resource estimates for the oracles generated via the Quipper programming language and compiler (see Sect. 4). Whereas a manual LRE analysis was feasible for the bare algorithm, thus allowing a better understanding of its structural “profiling” as well as a check on the reliability of the automated resource counts, it was not feasible (or too cumbersome) for the oracle implementations. Hence, an automated LRE was indispensable for the latter. The Quipper programming language is thereby demonstrated to serve as a universal automated resource estimation tool.
3 Quantum linear-system algorithm and its profiling
3.1 General remarks
QLSA computes the solution of a system of linear equations, \(A{\mathbf {x}}={\mathbf {b}}\), where A is a Hermitian \(N\times N\) matrix over \({\mathbb {C}}\) and \({\mathbf {x}},{\mathbf {b}}\in {\mathbb {C}}^N\). For this purpose, the (classical) linear system is converted into the corresponding quantum-theoretic analogue, \(A\left| x\right\rangle =\left| b\right\rangle \), where \(\left| x\right\rangle , \left| b\right\rangle \) are vectors in a Hilbert space \({\mathscr {H}}=({\mathbb {C}}^2)^{\otimes n}\) corresponding to \(n=\lceil \log _2N\rceil \) qubits and A is a Hermitian operator on \({\mathscr {H}}\). Note that, if A is not Hermitian, we can define \({\bar{A}}:=\bigl ({\begin{matrix} 0&{}A\\ A^\dagger &{}0 \end{matrix}} \bigr ), {\bar{\mathbf{b}}}:= ({\mathbf {b}},0)^T\), and \({\bar{\mathbf{x}}}:= (0,{\mathbf {x}})^T\), and restate the problem as \({\bar{A}}{\bar{\mathbf{x}}}={\bar{\mathbf{b}}}\) with a Hermitian \(2N\times 2N\) matrix \({\bar{A}}\) and \({\bar{\mathbf{x}}},{\bar{\mathbf{b}}}\in {\mathbb {C}}^{2N}\).
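The Hermitian embedding can be checked numerically; the following small pure-Python sketch (our own illustration) verifies that \({\bar{A}}\) is Hermitian and that \({\bar{A}}(0,{\mathbf {x}})^T=({\mathbf {b}},0)^T\) whenever \(A{\mathbf {x}}={\mathbf {b}}\):

```python
# Sketch of the Hermitian embedding: for non-Hermitian A, define
# Abar = [[0, A], [A^dagger, 0]], bbar = (b, 0)^T; then xbar = (0, x)^T
# solves Abar xbar = bbar whenever A x = b. Small 2x2 example.

def dagger(M):
    # conjugate transpose of a (list-of-lists) matrix
    return [[M[j][i].conjugate() for j in range(len(M))] for i in range(len(M[0]))]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

A = [[1 + 1j, 2], [0, 3 - 1j]]   # deliberately non-Hermitian
x = [1, 1j]
b = matvec(A, x)                 # b = A x

n = 2
Ad = dagger(A)
# Abar = [[0, A], [Ad, 0]] as a 4x4 block matrix
Abar = [[0] * n + A[i] for i in range(n)] + [Ad[i] + [0] * n for i in range(n)]
assert all(Abar[i][j] == Abar[j][i].conjugate()
           for i in range(2 * n) for j in range(2 * n))  # Hermitian

xbar = [0, 0] + x                # (0, x)^T
bbar = matvec(Abar, xbar)
assert bbar == b + [0, 0]        # Abar (0, x)^T = (b, 0)^T
print("Hermitian embedding verified")
```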
The basic idea of QLSA has been outlined in the Introduction. In what follows, we illustrate the structure of QLSA, including the recently proposed generalization [5], in more detail. In particular, we expand on its coarse-grained profiling with respect to the resources it consumes. Our focus in this section is the implementation of the bare algorithm, which accounts for oracles only in terms of the number of times they are queried. The actual quantum-circuit implementation of the oracles is presented in Sect. 4. Our overall LRE results are summarized in Sect. 5.
3.2 Problem specification
We analyze a concrete example which was demonstrated as an important QLSA application of high practical relevance in [5]: the linear system \(A{\mathbf {x}}={\mathbf {b}}\) arising from solving Maxwell’s equations to determine the electromagnetic scattering cross section of a specified target object via the Finite-Element Method (FEM) [23]. Applied in science and engineering as a numerical technique for finding approximate solutions to boundary-value problems for differential equations, FEM often yields linear systems \(A{\mathbf {x}}={\mathbf {b}}\) with highly sparse matrices—a necessary condition for QLSA. The FEM approach to solving Maxwell’s equations for scattering of electromagnetic waves off an object, as demonstrated in [5, 23, 24], introduces a discretization by breaking up the computational domain into small volume elements and applying boundary conditions at neighboring elements. Using finite-element edge basis vectors [24], the system of differential Maxwell’s equations is thereby transformed into a sparse linear system. The matrix A and vector \({\mathbf {b}}\) comprise information about the scattering object; they can be derived, and efficiently computed, from a functional that depends only on the chosen discretization and the boundary conditions which account for the scattering geometry. For details, see [5] and [23, 24] including their supplementary material.
Within the scope of the PLATO project, we analyzed a 2D toy problem: scattering of a linearly polarized plane electromagnetic wave \({{\varvec{E}}}(x,y)=E_0 {{\varvec{p}}}\exp [i({{\varvec{k}}}\cdot {{\varvec{r}}}-\omega t)]\), with magnitude \(E_0\), frequency \(\omega \), wave vector \({{\varvec{k}}}=k(\cos \theta {{\varvec{e}}_x}+ \sin \theta {{\varvec{e}}_y})\), and polarization unit vector \({{\varvec{p}}}={{\varvec{e}}_z}\times {{\varvec{k}}}/k\), where \({{\varvec{r}}}=x{{\varvec{e}}_x}+ y{{\varvec{e}}_y}\) is the position, off a metallic object with a two-dimensional scattering geometry. The scattering region can have an arbitrary shape. For our example problem, a simple square was specified, with edges parallel (or perpendicular) to the axes of the Cartesian x–y plane, and an incident field propagating in the x-direction (\(\theta =0\)) toward the square, as illustrated in Fig. 1. The receiver polarization, needed to calculate the far-field radar cross section of the scattered waves, has been assumed to be parallel to the polarization of the incident field.
For the sake of simplicity, for the FEM analysis we used a two-dimensional uniform finite-element mesh with square finite elements. Note that QLSA requires the matrix elements to be efficiently computable, a constraint which restricts the class of FEM meshes that can be employed. As a result of the local nature of the finite-element expansion of the scattering problem, the corresponding linear system has a highly sparse matrix A. For meshes with rectangular finite elements, the maximum number of nonzero elements in each row of A (i.e., the sparseness) is \(d=7\). Moreover, for regular grids, such as the one used for our analysis, we obtain a banded sparse matrix A, with a total of \(N_b=9\) bands.
The actual instructions for computing the elements of the linear system’s matrix A and vector \({\mathbf {b}}\), as well as of the vector whose overlap with the solution \({\mathbf {x}}\) is used to calculate the far-field radar cross section (see Sect. 3.3), are specified in our Quipper code for QLSA, see [18, 19]. The metallic scattering region is thereby given in terms of an array of scattering nodes denoted “scatteringnodes.” Here we briefly summarize the FEM dimensions and the values of all other system parameters that are necessary to reproduce the analysis. For all other details, we refer the reader to our QLSA Quipper code and its documentation in [18, 19].
The total numbers of FEM vertices in the x and y dimensions were \(n_x=12{,}885\) and \(n_y=12{,}885\), respectively, yielding \(N=n_x(n_y-1)+n_y(n_x-1)=332{,}020{,}680\) for the total number of FEM edges, which thus determines the number of edge basis vectors, and hence also the size of the linear system, in particular the size of the \(N\times N\) matrix A. The lengths of the FEM edges in the x and y dimensions were \(l_x=0.1\,m\) and \(l_y=0.1\,m\), respectively. The analyzed 2D scattering object was a square with edge length \(L=2\lambda \), which in our analysis was placed right in the center of the FEM grid. In our Quipper code for QLSA [18, 19] it is represented by the array “scatteringnodes” containing the corner vertices of the scattering region. The dimensions of the scattering region can also be expressed in terms of the number of vertices in the x and y directions; using \(\lambda =1\,m\) (see below), the scatterer was given by a \(200\times 200\) square area of vertices. The incident and scattered field parameters were specified as follows. The incident field amplitude, wave number and angle of incidence were set to \(E_0=1.0\, V/m\), \(k=2\pi \, m^{-1}\) (implying wavelength \(\lambda =1\,m\)) and \(\theta =0\), respectively. The receiver (for scattered field detection) was assumed to have the same polarization direction as the incident field and to be located along the x-axis (at angle \(\phi =0\)). The task of QLSA is to compute the far-field radar cross section with a precision specified in terms of the multiplicative error bound \(\varepsilon =0.01\).
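As a quick consistency check (our own sketch, not part of the paper's toolchain), the edge count N and the resulting data-register size \(n_2=\lceil \log _2(2N)\rceil \) used in Sect. 3.3 can be reproduced in a few lines:

```python
import math

# Edge count of an n_x-by-n_y grid of vertices (Sect. 3.2):
n_x = n_y = 12_885                       # FEM vertices per dimension
N = n_x * (n_y - 1) + n_y * (n_x - 1)    # total number of FEM edges
print(N)                                 # 332020680

# Size of the data register R2 holding a 2N-dimensional state:
n2 = math.ceil(math.log2(2 * N))
print(n2)                                # 30 qubits
```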
Finally, we remark that our example analysis does not include the matrix preconditioning that was also proposed in [5] to expand the number of problems that can achieve an exponential speedup over classical linear-system algorithms. With no preconditioning, the condition numbers of linear-system matrices representing a finite-element discretization of a boundary-value problem typically scale worse than polylog(N), which would be necessary to attain a quantum advantage over classical algorithms. Indeed, as was rigorously proven in [27, 28], FEM matrix condition numbers are generally bounded from above by \(O(N^{2/n})\) for \(n\ge 3\) and by \({\widetilde{O}}(N)\) for \(n=2\), with n the number of dimensions of the problem. For regular meshes, the bound \(O(N^{2/n})\) is valid for all \(n\ge 2\). In our 2D toy problem, \(n=2\) and the mesh is regular, implying that the condition number is bounded by O(N). However, we used the much smaller value \(\kappa = 10^4\) from the IARPA GFI to perform our LRE. This “guess” can be motivated by an estimate for the lower bound of \(\kappa \) that we obtained numerically.^{Footnote 8}
3.3 QLSA: abstract description
The generalized QLSA [5] is based on two well-known quantum algorithm techniques: (1) the Quantum Phase Estimation Algorithm (QPEA) [6, 7], which uses the Quantum Fourier Transform (QFT) [1] as well as Hamiltonian Simulation (HS) [8] as quantum computational primitives, and (2) the Quantum Amplitude Estimation Algorithm (QAEA) [22], which uses Grover’s search-algorithm primitive. The purpose of QPEA, as part of QLSA, is to gain information about the eigenvalues of the matrix A and move them into a quantum register. The purpose of the QAEA procedure is to avoid the use of nondeterministic (non-unitary) measurement and postselection processes by estimating the quantum amplitudes of the desired parts of quantum states, which occur as superpositions of a “good” part and a “bad” part.^{Footnote 9}
QLSA requires several quantum registers of various sizes, which depend on the problem size N and/or the precision \(\varepsilon \) to which the solution is to be computed. We denote the jth quantum register by \(R_j\), its size by \(n_j\), and the quantum state corresponding to register \(R_j\) by \(\left| \psi \right\rangle _j\) (where \(\psi \) is a label for the state). Table 1 lists all logical qubit registers employed by QLSA, specified by their size as well as purpose. The register size values chosen (provided in the GFI within the scope of the IARPA QCS program) correspond to the problem size \(N=332{,}020{,}680\) and algorithm precision \(\varepsilon =0.01\).
For example, the choice \(n_0=\lceil \log _2 M\rceil =14\) for the size of the QAE control register can be explained as follows. According to the error analysis of Theorem 12 in [22], using QAEA the modulus squared \(0\le \alpha \le 1\) of a quantum amplitude can be estimated within \(\pm \varepsilon \alpha \) of its correct value^{Footnote 10} with a probability at least \(8/\pi ^2\) for \(k=1\) and with a probability greater than \(1-\frac{1}{2(k-1)}\) for \(k\ge 2\), if the QAE control register’s Hilbert space dimension M is chosen such that (see [22])
\(\left| {\tilde{\alpha }}-\alpha \right| \le \frac{2\pi k\sqrt{\alpha (1-\alpha )}}{M}+\frac{k^2\pi ^2}{M^2}\le \varepsilon \alpha \qquad (4)\)
where \({\tilde{\alpha }}~(0\le {\tilde{\alpha }} \le 1)\) denotes the output of QAEA. Moreover, if \(\alpha =0\), then \({\tilde{\alpha }}=0\) with certainty, and if \(\alpha =1\) and M is even, then \({\tilde{\alpha }}=1\) with certainty. Corollary (4) can be viewed as a requirement that determines the necessary value of M, yielding (for \(\alpha \not =0\))
\(M\ge \left\lceil \frac{k\pi }{\varepsilon \alpha }\left[ \sqrt{\alpha (1-\alpha )+\varepsilon \alpha }+\sqrt{\alpha (1-\alpha )}\right] \right\rceil \qquad (5)\)
The RHS of this expression is strictly decreasing in \(\alpha \), tending to \(\frac{k\pi }{\sqrt{\varepsilon \alpha }}\) as \(\alpha \) becomes close to 1, whereas for \(\alpha \ll 1\) we have \(M\ge \lceil \frac{k\pi }{\varepsilon \sqrt{\alpha }}[(1-\frac{\alpha }{2})+(1-\frac{\alpha -\varepsilon }{2})]\rceil \approx \lceil \frac{2k\pi }{\varepsilon \sqrt{\alpha }}\rceil \). Hence, we take \(M\ge \lceil \frac{2k\pi }{\varepsilon \sqrt{\alpha }}\rceil \), so as to account for all possibilities. Moreover, we want QAEA to succeed with a probability close to 1, allowing failure only with a small error probability \(\wp _{\mathrm{{err}}}\). According to Theorem 12 in [22], this indeed can be achieved when \(1-\frac{1}{2(k-1)}\ge 1-\wp _{\mathrm{{err}}}\), i.e., for \(k\ge \lceil 1+\frac{1}{2\wp _{\mathrm{{err}}}}\rceil \), and thus for
\(M\ge \left\lceil \frac{2\pi }{\varepsilon \sqrt{\alpha }}\left\lceil 1+\frac{1}{2\wp _{\mathrm{{err}}}}\right\rceil \right\rceil \qquad (6)\)
While we may assume any value for the failure probability, for the sake of simplicity we here choose \(\wp _{\mathrm{{err}}}=\varepsilon \), which is also the desired precision of QLSA. Unless \(\alpha \) is very small, this justifies our choice \(M=2^{\lceil \log _2( 1/{\varepsilon ^2)}\rceil }\). A similar requirement for the value of M was also proposed in the supplementary material of [5]. In our example computation, \(\varepsilon =0.01\), and so we have \(n_0=14\). Note that small \(\alpha \) values require an even larger value for the QAE control register size in order to ensure that the estimate \({\tilde{\alpha }}\) is within \(\pm \varepsilon \alpha \) of the actual correct value with a success probability greater than \(1\varepsilon \).
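The register sizing above can be sketched numerically (our own illustration; the error bound checked is that of Theorem 12 in [22], evaluated here for an \(\alpha \) close to 1):

```python
import math

# Sketch of the QAE control-register sizing from Sect. 3.3, with the
# failure probability set to p_err = eps as in the text.
eps = 0.01
k = math.ceil(1 + 1 / (2 * eps))            # k >= ceil(1 + 1/(2 p_err))
M = 2 ** math.ceil(math.log2(1 / eps ** 2)) # register dimension: 2**14
n0 = int(math.log2(M))                      # QAE control-register size: 14
print(k, M, n0)

# Theorem-12-style error bound for this (k, M), checked for alpha near 1:
# |alpha_est - alpha| <= 2*pi*k*sqrt(alpha*(1-alpha))/M + (k*pi/M)**2.
alpha = 0.9
err = 2 * math.pi * k * math.sqrt(alpha * (1 - alpha)) / M + (k * math.pi / M) ** 2
print(err <= eps * alpha)
```

For very small \(\alpha \), the same check fails for this M, mirroring the caveat in the text that small \(\alpha \) values require an even larger control register.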
As a first step, QLSA prepares the known quantum state \(\left| b\right\rangle _2=\sum _{j=0}^{N-1} b_j \left| j\right\rangle _2\) in a multi-qubit quantum data register \(R_2\) consisting of \(n_2=\lceil \log _2(2N)\rceil \) qubits. This step requires numerous queries (see details below) of an oracle for vector \({\mathbf {b}}\). Moreover, as pointed out in [5], efficient quantum state preparation of arbitrary states is in general not always possible. However, the procedure proposed in [5] can efficiently generate the state
\(\left| b_T\right\rangle _{2,6}=\sin \phi _b\left| b\right\rangle _2\otimes \left| 1\right\rangle _6+\cos \phi _b\left| {\tilde{b}}\right\rangle _2\otimes \left| 0\right\rangle _6\qquad (7)\)
where the multi-qubit data register \(R_2\) contains (as a quantum superposition) the desired arbitrary state \(\left| b\right\rangle \) entangled with a 1 in an auxiliary single-qubit register \(R_6\), as well as a garbage state \(\left| {\tilde{b}}\right\rangle \) (denoted by the tilde) entangled with a 0 in register \(R_6\). To generate the state (7), in addition to the data register \(R_2\) and the single-qubit auxiliary register \(R_6\), two further computational registers \(R_4\) and \(R_5\) are employed, each consisting of \(n_4\) auxiliary qubits. The latter registers are used to store the magnitude and phase components, which in [5] are denoted as \(b_j\) and \(\phi _j\), respectively, that are computed each time the oracle b is queried. Which component (\(j=1,2,3, \dots \)) to query is thereby controlled by the data register \(R_2\). The quantum circuit for state preparation [Eq. (7)] is shown in Sect. 3.4.3, Fig. 13. Following the oracle b queries, a controlled-phase gate is applied to the auxiliary single-qubit register \(R_6\), controlled by the calculated value of the phase carried by quantum register \(R_5\); in addition, the single-qubit register \(R_6\) is rotated conditioned on the calculated value of the amplitude carried by quantum register \(R_4\). Uncomputing registers \(R_4\) and \(R_5\) involves further oracle b calls, leaving registers \(R_2\) and \(R_6\) in the state (7) with \(\sin ^2\phi _b= \frac{C_b^2}{2N}\sum _{j=0}^{2N-1}b_j^2\) and \(\cos ^2\phi _b=\frac{1}{2N}\sum _{j=0}^{2N-1}\left( 1-C_b^2b_j^2\right) \), where \(C_b=1/{\text{ max }(b_j)}\), cf. [5].
As a second step, QPEA is employed to acquire information about the eigenvalues \(\lambda _j\) of A and store them in a multi-qubit control register \(R_1\) consisting of \(n_1=\lceil \log _2 T\rceil \) qubits, where the parameter T characterizes the precision of the QPEA subroutine and is specified in Table 1. This high-level step consists of several hierarchy levels of lower-level subroutines decomposing it down to a fine-grained structure involving only elementary gates. More specifically, the controlled Hamiltonian evolution \(\sum _{\tau =0}^{T-1}\left( \left| \tau \right\rangle \left\langle \tau \right| \right) _1\otimes \left[ \exp (iA\tau t_0/T)\right] _2\otimes \mathbbm {1}_6\) with A as the Hamiltonian is applied to the quantum state \(\left| \phi \right\rangle _1\otimes \left| b_T\right\rangle _{2,6}\). Here, similar to the presentation in [3], a time constant \(t_0\) such that \(t=\tau t_0/T\le t_0\) has been introduced for the purpose of minimizing the error for a given condition number \(\kappa \) and matrix norm \(\Vert A\Vert \). As shown in [3], for the QPEA to be accurate up to error \(O(\varepsilon )\), we must have \(t_0\sim O({\kappa }/{\varepsilon })\) if \(\Vert A\Vert \sim O(1)\). Accordingly, we define \(t_0:= \Vert A\Vert \kappa /\varepsilon \). The application of \(\exp (iA\tau t_0/T)\) to the data register \(R_2\) is thereby controlled by the \(n_1\)-qubit control register \(R_1\) prepared in the state \(\left| \phi \right\rangle _1=H^{\otimes n_1}\left| 0\right\rangle ^{\otimes n_1}=\frac{1}{\sqrt{T}}\sum _{\tau =0}^{T-1}\left| \tau \right\rangle _1\) (with H denoting the Hadamard gate). The controlled Hamiltonian evolution is subsequently followed by a QFT of register \(R_1\) to complete the QPEA.
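The mechanism can be illustrated with a toy phase-estimation sketch (ours, not the paper's circuit): for an eigenstate input, the controlled evolution leaves the control register holding phases \(e^{i\lambda \tau t_0/T}\), and the inverse QFT, simulated here as an inverse discrete Fourier transform, peaks at \({\tilde{\lambda }}=\lambda t_0/2\pi \):

```python
import cmath
import math

# Toy sketch of QPEA readout on an eigenstate (A assumed diagonal on |u>):
# after the controlled evolution, the control register carries amplitudes
# exp(i*lam*tau*t0/T)/sqrt(T); the inverse QFT peaks at lam*t0/(2*pi).
T = 16
lam = 3.0                      # an eigenvalue of A
t0 = 2 * math.pi * 5 / lam     # chosen so lam*t0/(2*pi) = 5 exactly

amps = [cmath.exp(1j * lam * tau * t0 / T) / math.sqrt(T) for tau in range(T)]

# inverse QFT on the control register = inverse discrete Fourier transform
out = [sum(amps[tau] * cmath.exp(-2j * math.pi * k * tau / T)
           for tau in range(T)) / math.sqrt(T) for k in range(T)]
probs = [abs(a) ** 2 for a in out]
lam_tilde = probs.index(max(probs))
print(lam_tilde)               # 5, i.e. lam * t0 / (2*pi)
```

When \(\lambda t_0/2\pi \) is not an integer, the distribution instead peaks near the closest T-bit approximation, which is the source of the QPEA precision requirement on T.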
The Hamiltonian quantum state evolution is accomplished by repeatedly querying an oracle for matrix A and by HS techniques [8], which in particular include the decomposition of the Hamiltonian matrix into a sum
\(A=\sum _{j=1}^{m}A_j\qquad (8)\)
of submatrices, each of which ought to be 1-sparse, as well as the Suzuki higher-order integrator method and Trotterization [25, 26]. In the general case, an arbitrary sparse matrix A with sparseness d can be decomposed into \(m=6d^2\) 1-sparse matrices \(A_j\) using the graph-coloring method, see [8]. However, a much simpler decomposition is possible for the toy-problem example considered in this work. Indeed, a uniform finite-element grid has been used to analyze the problem specified in the GFI. For uniform finite-element grids, the matrix A is banded; furthermore, the number and location of the bands are given by the geometry of the scattering problem. Hence, the simplest way to decompose the Hamiltonian matrix [Eq. (8)] is to break it up by band into \(m=N_b\) submatrices, with \(A_j\) denoting the jth nonzero band of matrix A and \(N_b\) denoting the overall number of its bands. For the square finite-element grid used in the analyzed example, \(N_b = 9\). Moreover, because the locations of the bands are known, this decomposition method requires only time of order O(1). Given the matrix decomposition (8), it is then necessary to implement the application of each individual one-sparse Hamiltonian from this decomposition to the actual quantum state of the data register \(R_2\). This “Hamiltonian circuit” can be derived by a procedure resembling the techniques of the quantum-random-walk algorithm [30] and is discussed in more detail in Sect. 3.4.5.
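A small sketch (our own, for a toy tridiagonal matrix rather than the \(N_b=9\) FEM matrix) of this band-wise decomposition into 1-sparse pieces:

```python
# Band-wise decomposition A = sum_j A_j: each A_j keeps one nonzero band
# (one diagonal offset), hence has at most one nonzero per row (1-sparse).
def band_offsets(A):
    n = len(A)
    return sorted({j - i for i in range(n) for j in range(n) if A[i][j] != 0})

def band_part(A, off):
    n = len(A)
    return [[A[i][j] if j - i == off else 0 for j in range(n)] for i in range(n)]

A = [[2, 1, 0, 0],
     [1, 2, 1, 0],
     [0, 1, 2, 1],
     [0, 0, 1, 2]]            # tridiagonal toy example: N_b = 3 bands

parts = [band_part(A, off) for off in band_offsets(A)]
print(len(parts))             # 3 one-sparse submatrices

# each band submatrix is 1-sparse ...
assert all(sum(1 for x in row if x != 0) <= 1 for P in parts for row in P)
# ... and the bands sum back to A
assert all(sum(P[i][j] for P in parts) == A[i][j]
           for i in range(4) for j in range(4))
```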
After the QPEA has been accomplished, including the QFT of register \(R_1\), the joint quantum state of registers \(R_1, R_2\) and \(R_6\) becomes, approximately,
\(\sum _{j=1}^{N}\left| {\tilde{\lambda }}_j\right\rangle _1\otimes \left( \sin \phi _b\,\beta _j\left| u_j\right\rangle _2\otimes \left| 1\right\rangle _6+\cos \phi _b\,{\tilde{\beta }}_j\left| u_j\right\rangle _2\otimes \left| 0\right\rangle _6\right) \qquad (9)\)
where \(\lambda _j\) and \(\left| u_j\right\rangle \) are the eigenvalues and eigenvectors of A, respectively, \(\left| b\right\rangle _2=\sum _{j=1}^N\beta _j\left| u_j\right\rangle _2\) and \(\left| {\tilde{b}}\right\rangle _2=\sum _{j=1}^N{\tilde{\beta }}_j\left| u_j\right\rangle _2\) are the expansions of the quantum states \(\left| b\right\rangle _2\) and \(\left| {\tilde{b}}\right\rangle _2\), respectively, in terms of these eigenvectors, and \({\tilde{\lambda }}_j:=\lambda _jt_0/2\pi \).
As a third step, a further single-qubit ancilla in register \(R_7\) is employed, initially prepared in the state \(\left| 0\right\rangle _7\) and then rotated by an angle inversely proportional to the value stored in register \(R_1\), yielding the overall state:
where \(C:=1/\kappa \) is chosen such that \(C/\lambda _j<1\) for all j, which holds because \(\kappa =\lambda _{\mathrm{{max}}}/\lambda _{\mathrm{{min}}}\).
Finally, the eigenvalues stored in register \(R_1\) are uncomputed, by the inverse QFT of \(R_1\), inverse Hamiltonian evolution on \(R_2\) and \(H^{\otimes n_1}\) on \(R_1\), leaving registers \(R_1, R_2, R_6\), and \(R_7\) in the state
Ignoring register \(R_1\) and collecting all terms that are not entangled with the term \(\left| 1\right\rangle _{6}\otimes \left| 1\right\rangle _{7}\) into a “garbage state” \(\left| \varPhi _0\right\rangle _{2,6,7}\), the common quantum state of registers \(R_2, R_6\), and \(R_7\) can be written as, see [5]:
\(\sin \phi _b\sin \phi _x\left| x\right\rangle _2\otimes \left| 1\right\rangle _6\otimes \left| 1\right\rangle _7+\sqrt{1-\sin ^2\phi _b\sin ^2\phi _x}\,\left| \varPhi _0\right\rangle _{2,6,7}\qquad (12)\)
where
\(\left| x\right\rangle _2=\frac{C}{\sin \phi _x}\sum _{j=1}^{N}\frac{\beta _j}{\lambda _j}\left| u_j\right\rangle _2\qquad (13)\)
is the normalized solution to \(A\left| x\right\rangle =\left| b\right\rangle \) stored in quantum data register \(R_2\), and \(\sin ^2\phi _x:=C^2\sum _{j=1}^N\beta _j^2/\lambda ^2_j\). Note that the solution vector [Eq. (13)] in register \(R_2\) is correlated with the value 1 in the auxiliary register \(R_7\). Hence, if register \(R_7\) is measured and the value 1 is found, we know with certainty that the desired solution of the problem is stored in the quantum amplitudes of the quantum state of register \(R_2\). The solution can then either be revealed by an ensemble measurement (a statistical process requiring the whole procedure to be run many times), or useful information can be obtained by computing its overlap \(\left| \left\langle R | x\right\rangle \right| ^2\) with a particular (known) state \(\left| R\right\rangle \) (corresponding to a specific vector \({\mathbf {R}}\in {\mathbb {C}}^N\)) that has been prepared in a separate quantum register. To avoid non-unitary postselection processes, CJS-QLSA [5] employs QAEA.^{Footnote 11}
With respect to the particular application example analyzed here, namely solving Maxwell’s equations for a scattering problem using the FEM technique, we are interested in the radar scattering cross section (RCS) \(\sigma _{\mathrm{{RCS}}}\), which can be expressed in terms of the modulus squared of a scalar product, \( \sigma _{\mathrm{{RCS}}}=\frac{1}{4\pi }|{\mathbf {R}}\cdot {\mathbf {x}}|^2\), where \({\mathbf {x}}\) is the solution of \(A{\mathbf {x}}={\mathbf {b}}\) and \({\mathbf {R}}\) is an N-dimensional vector whose components are computed by a 2D surface integral involving the corresponding edge basis vectors and the radar polarization, as outlined in detail in [5]. Thus, to obtain the cross section using QLSA, we must compute \(\left| \left\langle R | x\right\rangle \right| ^2\), where \(\left| R\right\rangle \) is the quantum-theoretic representation of the classical vector \({\mathbf {R}}\). It is important to note that, whereas \(\left| R\right\rangle \) and \(\left| x\right\rangle \) are normalized to 1, the vectors \({\mathbf {R}}\) and \({\mathbf {x}}\) are in general not normalized and carry units. Hence, after computing \(\left| \left\langle R | x\right\rangle \right| ^2\), units must be restored to obtain \(|{\mathbf {R}}\cdot {\mathbf {x}}|^2\).
As for \(\left| b\right\rangle \), the preparation of the quantum state \(\left| R\right\rangle \) is imperfect. Employing the same preparation procedure that has been used to prepare \(\left| b_T\right\rangle \), but with oracle R instead of oracle b, we can prepare the entangled state
\(\left| R_T\right\rangle _{3,8}=\sin \phi _r\left| R\right\rangle _3\otimes \left| 1\right\rangle _8+\cos \phi _r\left| {\tilde{R}}\right\rangle _3\otimes \left| 0\right\rangle _8\qquad (14)\)
where the multi-qubit quantum data register \(R_3\) consisting of \(n_3=\lceil \log _2(2N)\rceil \) qubits contains (as a quantum superposition) the desired arbitrary state \(\left| R\right\rangle \) entangled with value 1 in an auxiliary single-qubit register \(R_8\), as well as a garbage state \(\left| {\tilde{R}}\right\rangle \) (denoted by the tilde) entangled with value 0 in register \(R_8\). Moreover, the amplitudes squared are given as \(\sin ^2\phi _r= \frac{C_R^2}{2N}\sum _{j=0}^{2N-1}R_j^2\) and \(\cos ^2\phi _r=\frac{1}{2N}\sum _{j=0}^{2N-1}\left( 1-C_R^2R_j^2\right) \), where \(C_R=1/{\text{ max }(R_j)}\), cf. [5]. As outlined in [5], the state (14) is adjoined to the state (12) along with a further ancilla qubit in single-qubit register \(R_9\) that has been initialized to the state \(\left| 0\right\rangle _9\). Then, a Hadamard gate is applied to the ancilla qubit in register \(R_9\), a controlled-swap operation is performed between registers \(R_2\) and \(R_3\) controlled on the value of the ancilla qubit in register \(R_9\), and this is finally followed by a second Hadamard transformation of the ancilla qubit in register \(R_9\). After a few simple classical transformations, the algorithm can compute the scalar product between \(\left| x\right\rangle \) and \(\left| R\right\rangle \) as, cf. [5]:
where \(P_{1110}\) and \(P_{1111}\) denote the probability of measuring a “1” in the three ancilla registers \(R_6, R_7\) and \(R_8\) and a “0” or “1” in ancilla register \(R_9\), respectively. Finally, after restoring the units to the normalized output of QLSA, the RCS in terms of quantities received from the quantum computation is, cf. [5]:
where \(\sin \phi _{r0}:=P^{\frac{1}{2}}_{1110}\sin \phi _r\) and \(\sin \phi _{r1}:=P^{\frac{1}{2}}_{1111}\sin \phi _r\).
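The Hadamard/controlled-swap/Hadamard step at the heart of this overlap computation can be illustrated with a small statevector sketch (our own toy simulation of a standard swap test, not the paper's full circuit with its ancilla registers):

```python
import math

# Toy swap-test simulation: H on an ancilla, SWAP of two data registers
# controlled on the ancilla, H again; then P(0) - P(1) = |<R|x>|^2.
def swap_test(x, R):
    d = len(x)
    # joint state indexed (ancilla, i, j), ancilla initialized to |0>
    psi = {(0, i, j): x[i] * R[j] for i in range(d) for j in range(d)}
    psi.update({(1, i, j): 0 for i in range(d) for j in range(d)})

    def hadamard(psi):
        out = {}
        for (a, i, j), amp in psi.items():
            for b in (0, 1):
                out[(b, i, j)] = out.get((b, i, j), 0) \
                    + amp * (-1) ** (a * b) / math.sqrt(2)
        return out

    psi = hadamard(psi)
    # controlled swap: exchange the two data registers when ancilla = 1
    psi = {(a, (j if a else i), (i if a else j)): amp
           for (a, i, j), amp in psi.items()}
    psi = hadamard(psi)
    p0 = sum(abs(amp) ** 2 for (a, _, _), amp in psi.items() if a == 0)
    return 2 * p0 - 1          # = |<R|x>|^2 for normalized real/complex inputs

x = [1 / math.sqrt(2), 1 / math.sqrt(2)]
R = [1.0, 0.0]
print(round(swap_test(x, R), 6))   # |<R|x>|^2 = 0.5
```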
It is important to note that, because all the employed state preparation and linear-system-solving operations are unitary, the four amplitudes \(\sin \phi _b, \sin \phi _x, \sin \phi _{r0}\) and \(\sin \phi _{r1}\) that are needed for the computation of the RCS according to Eq. (16) can be estimated nearly deterministically (with error \(\varepsilon \)) using QAEA, which allows avoiding nested nondeterministic subroutines involving postselection.^{Footnote 12} Yet, there is a small probability of failure, which means that QLSA can occasionally output an estimate \({\tilde{\sigma }}_{\mathrm{{RCS}}}\) that is not within the desired precision range of the actual correct value \(\sigma _{\mathrm{{RCS}}}\). The failure probability is always nonzero but can be made negligible.^{Footnote 13}
3.4 QLSA: algorithm profiling and quantumcircuit implementation
The high-level structure of QLSA [5] is captured by the tree diagram depicted in Fig. 2. It consists of several high-level subroutines hierarchically comprising (i) “Amplitude Estimation” (first level), (ii) “State Preparation” and “Solve for x” (second level), (iii) “Hamiltonian Simulation” (third level), and several further sublevel subroutines, such as “HsimKernel” and “Hmag,” that are used as part of HS. Figure 2 illustrates the coarse-grained profiling of QLSA for the purpose of an accurate LRE of the algorithm, demonstrating the use of repetitive patterns, i.e., templates representing algorithmic building blocks that are reused frequently. Representing each algorithmic building block in terms of a quantum circuit thus yields a step-by-step hierarchical circuit decomposition of the whole algorithm down to elementary quantum gates and measurements. The cost of each algorithmic building block is thereby measured in terms of the number of calls of lower-level subroutines or directly in terms of the number of specified elementary gates, data qubits, ancilla uses, etc.
To obtain an accurate LRE of QLSA, we thus need to represent each algorithmic building block in terms of a quantum circuit that then enables us to count elementary resources. In what follows, we present quantum circuits for selected subroutines of QLSA. Well-known circuit decompositions of common multi-qubit gates (such as the Toffoli gate, multi-controlled NOTs, and the W gate) and their associated resource requirements are discussed in the “Appendix.”
3.4.1 The “main” function QLSA_\(\mathbf {\text{ main }}\)
The task of the main algorithm “QLSA_\(\mathbf {\text{ main }}\)” is to estimate the radar cross section for the FEM scattering problem specified in the GFI, using the quantum amplitude estimation subalgorithms “AmpEst_\({\phi _b}\),” “AmpEst_\({\phi _x}\)” and “AmpEst_\({\phi _r}\)” to approximately compute the angles corresponding to the probability amplitudes \(\sin (\phi _b), \sin (\phi _x), \sin (\phi _{r0})\) and \(\sin (\phi _{r1})\):
\(\phi _b \leftarrow \) AmpEst_\({\phi _b}(\text{ Oracle }\_\mathbf{b})\)
\(\phi _x \leftarrow \) AmpEst_\({\phi _x}(\text{ Oracle }\_\mathbf{A}, \text{ Oracle }\_\mathbf{b})\)
\(\phi _{r0} \leftarrow \) AmpEst_\({\phi _r}(\text{ Oracle }\_\mathbf{A}, \text{ Oracle }\_\mathbf{b}, \text{ Oracle }\_\mathbf{R}, 0)\)
\(\phi _{r1} \leftarrow \) AmpEst_\({\phi _r}(\text{ Oracle }\_\mathbf{A}, \text{ Oracle }\_\mathbf{b}, \text{ Oracle }\_\mathbf{R}, 1)\)
where in the last two lines “0” and “1” refer to the probability of measuring value 0 or 1 on the ancilla qubit in register \(R_9\), respectively. It then uses these probability amplitudes (or rather their corresponding probabilities) to calculate an estimate of the radar cross section \(\sigma _{\mathrm{{RCS}}}=\sigma _{\mathrm{{RCS}}}(\phi _b, \phi _x, \phi _{r0}, \phi _{r1})\) according to Eq. (16); this part uses only classical computation. The result of the whole computation ought to be as precise as specified by the multiplicative error term \(\pm \varepsilon \sigma _{\mathrm{{RCS}}}\), where the desired (given) accuracy parameter in our analysis has the value \(\varepsilon =0.01\). The LRE of the complete QLS algorithm is thus obtained as the sum of the LREs of the four calls of the quantum amplitude estimation subalgorithms employed by QLSA_\(\mathbf {{main}}\).
3.4.2 Amplitude estimation subroutines
In this subsection we present the quantum circuits of the three Amplitude Estimation subroutines “AmpEst_\({\phi _b}\),” “AmpEst_\({\phi _x}\)” and “AmpEst_\({\phi _r}\),” which are called by “QLSA_\(\mathbf {{main}}\)” to compute estimates of the angles \(\phi _b, \phi _x, \phi _{r0}\) and \(\phi _{r1}\) that are needed to obtain an estimate for the RCS \(\sigma _{\mathrm{{RCS}}}\).
Subroutine AmpEst_\({\phi _b}\) This subroutine computes an estimate for the angle \(\phi _b\), which determines the probability amplitude of success \(\sin (\phi _b)\) for the preparation of the quantum state \(\left| b\right\rangle \) in register \(R_2\), see Eq. (7). Its algorithmic structure is represented by the circuits depicted in Figs. 3, 4 and 5. It employs the subroutine “StatePrep_\({{\mathbf {b}}}\),” which prepares the state [Eq. (7)], and a Grover iterator whose construction is illustrated by the circuit in Fig. 5.
Subroutine AmpEst_\({\phi _x}\) This subroutine computes an estimate for the angle \(\phi _x\), which, together with the previously computed angle \(\phi _b\), determines the probability amplitude of success, \(\sin (\phi _b)\sin (\phi _x)\), of computing the solution state \(\left| x\right\rangle \) in register \(R_2\), see Eq. (12). Its algorithmic structure is represented by the circuits depicted in Figs. 6, 7 and 8. It involves subroutine “StatePrep_\({{\mathbf {b}}}\),” which prepares the quantum state (7), the subroutine “Solve_x,” which implements the actual “solve-for-x” procedure that incorporates all required lower-level subroutines such as those needed for Hamiltonian Simulation, and a Grover iterator whose construction is given in Fig. 8.
Subroutine AmpEst_\({\phi _r}\) This subroutine computes an estimate for the angle \(\phi _{r0}\) or \(\phi _{r1}\), respectively, which, together with the previously computed angles \(\phi _b\) and \(\phi _x\), determines the probability amplitude of successfully computing the overlap integral \(\left\langle R | x\right\rangle \). Its algorithmic structure is represented by the circuits depicted in Figs. 9, 10, 11 and 12. It involves subroutines “StatePrep_\({{\mathbf {b}}}\)” and “StatePrep_\({{\mathbf {R}}}\),” which prepare the quantum states (7) and (14), respectively, the subroutine “Solve_x,” which implements the actual “solve-for-x” procedure, furthermore a swap protocol that is required for computing an estimate of \(\left\langle R | x\right\rangle \), and finally a Grover iterator whose construction is given by the quantum circuit in Fig. 12.
3.4.3 State preparation subroutine
The state preparation subroutine “StatePrep” is used to generate the quantum states \(\left| b_T\right\rangle \) and \(\left| R_T\right\rangle \) in Eqs. (7) and (14) from given classical vectors \({\mathbf {b}}\) and \({\mathbf {R}}\) using the corresponding oracles and controlled-phase and rotation gates. The circuit for generating \(\left| b_T\right\rangle \) is depicted in Fig. 13. A similar circuit is used to generate \(\left| R_T\right\rangle \), obtained by replacing Oracle b with Oracle R. The subroutines “CPhase” and “CRotY” and their associated resource counts are discussed in Appendix “Controlled phase: \(\text{ CPhase }({\mathbf {c}}; \phi _0,f)\)” and “Controlled-RotY: \(\text{ CRotY }({\mathbf {c}}, {\mathbf {t}}; \phi _0, f)\),” respectively. The implementation of Oracles b and R is analyzed in Sect. 4.
3.4.4 Solve_x subroutine
Subroutine “Solve_x \(({\mathbf {x}}, {\mathbf {s}}; \text{ Oracle }\_A)\)” is the actual linear-system-solving procedure, i.e., it implements the “solve-for-x” transformation. More concretely, it takes as input the state \(\left| b_T\right\rangle _{2,6}\) (see Eq. (7)) that has been prepared in registers \(R_2, R_6\), and computes the state given in Eq. (12), which contains the solution state \(\left| x\right\rangle _2=A^{-1}\left| b\right\rangle _2\) in register \(R_2\) with success-probability amplitude \(\sin (\phi _b)\sin (\phi _x)\). The arguments of this subroutine are \({\mathbf {x}}\) and \({\mathbf {s}}\), corresponding to the input states in data register \(R_2\) and single-qubit ancilla register \(R_7\); furthermore, \(\text{ Oracle }\_A\) occurs in the argument list to indicate that it is called by Solve_x to implement the HS lower-level subroutines. Note that “Solve_x” does not act on register \(R_6\).
The quantum circuit for “Solve_x” is shown in Fig. 14. It involves the lower-level subroutines “HamiltonianSimulation” (see Fig. 15), QFT, “IntegerInverse,” and their Hermitian conjugates, respectively, as well as the controlled rotation “CRotY,” which is defined and analyzed in Appendix “Controlled-RotY: \(\text{ CRotY }({\mathbf {c}}, {\mathbf {t}}; \phi _0, f)\).”
3.4.5 Hamiltonian Simulation subroutines
As part of QPEA, the Hamiltonian Simulation subroutines implement the unitary transformation \(\exp (iA\tau t_0/T)\), which is applied to register \(R_2\) after registers \(R_2\) and \(R_6\) have been prepared in the quantum state \(\left| b_T\right\rangle _{2,6}\); this Hamiltonian evolution is controlled by the HS control register \(R_1\), and the Hamiltonian is specified by Oracle A.
For a thorough HS analysis, see [8] and further references therein. The band-by-band decomposition of the banded matrix A into a sum of submatrices, according to Eq. (8), and the Suzuki higher-order integrator method [26] with order \(k=2\) and Trotterization [25] are all accomplished by subroutine “HamiltonianSimulation\(({\mathbf {x}}, {\mathbf {t}}; \text{ Oracle } A)\),” whose implementation is illustrated in Figs. 16 and 17. The Suzuki–Trotter time-splitting factor, here denoted by r, can be determined by the formula, cf. [8]:
where \(t=\tau t_0/T\le t_0\) is the length of time over which the Hamiltonian evolution must be simulated, and \(\Vert A\Vert \) is the norm of the Hamiltonian matrix. As was shown in [3], to ensure algorithmic accuracy up to error bound \(\varepsilon \) for subalgorithm “Solve_x,” we must have \(t_0\sim O(\kappa /\varepsilon )\). In our analysis, the time constant for Hamiltonian Simulation was set to \(t_0=7\kappa /\varepsilon \), as suggested by the problem specification in the IARPA GFI. Inserting the values \(k=2, N_b=9, \varepsilon =0.01\) and \(\Vert A\Vert t\lesssim 7\times 10^6\) into Eq. (17) yields the approximate value \(r\lesssim 8\times 10^{11}\). However, to ensure accuracy \(\varepsilon \) not only for the Hamiltonian-evolution simulation but also for each of the three Amplitude Estimation subroutines that employ subalgorithm “Solve_x” in \((2^{n_0+1}-1)\) calls, respectively, see Fig. 2, we would typically require a much smaller target accuracy for the implementation of the Hamiltonian evolution. Assuming that errors always add up, an obvious choice would be \(\varepsilon '=\varepsilon /(2^{n_0+1}-1)\), which, when inserted into Eq. (17) in place of \(\varepsilon \), yields \(r\approx 6.35\times 10^{12}\). This is, however, a fairly conservative and unnecessarily large estimate. Following the suggestions in the GFI, for the purpose of our LRE analysis we have used the somewhat smaller (average) value \(r= 2.5\times 10^{12}\), which is roughly obtained by using the average Hamiltonian-evolution time \(t_0/2\) rather than the maximum HS time \(t_0\) in Eq. (17).
Furthermore, the application of a controlled one-sparse Hamiltonian transformation to an arbitrary input state in register \(R_2\) uses techniques resembling a generalization of the quantum-random-walk algorithm [30]. Its implementation is the task of the two lower-level subroutines “HsimKernel\(({\mathbf {t}}, {\mathbf {x}}, \text{ band }, \text{ timestep }, \text{ Oracle } A)\)” and “Hmag\(({\mathbf {x}}, {\mathbf {y}}, \text{ m }, \phi _0)\),” which are illustrated by the circuits in Figs. 18 and 19, respectively.
3.4.6 Oracle subroutines
A quantum oracle is commonly considered a unitary “black box” labeled \(U_f\) which, given the value \({\mathbf {x}}\) of an n-qubit input register \({\mathscr {R}}_1\), efficiently and unitarily computes the value of a function \(f:\{0,1\}^n\rightarrow \{0,1\}^m\) and stores it in an m-qubit auxiliary register \({\mathscr {R}}_2\) that has initially been prepared in state \(\left| 0\right\rangle ^{\otimes m}\): \(U_f \left| {\mathbf {x}}\right\rangle _{{\mathscr {R}}_1}\left| 0\right\rangle ^{\otimes m}_{{\mathscr {R}}_2} = \left| {\mathbf {x}}\right\rangle _{{\mathscr {R}}_1}\left| f({\mathbf {x}})\right\rangle _{{\mathscr {R}}_2}\).
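On computational-basis states this action can be mimicked classically. The following Python sketch (the function f here is our own illustration, not from the paper) shows the standard XOR convention \(\left| x\right\rangle \left| y\right\rangle \mapsto \left| x\right\rangle \left| y\oplus f(x)\right\rangle \), which is self-inverse and hence reversible:

```python
def f(x):
    # Illustrative function f: {0,1}^n -> {0,1}^m, here f(x) = x^2 mod 16.
    return (x * x) % 16

def oracle(f, x, y):
    # Basis-state action of U_f: |x>|y> -> |x>|y XOR f(x)>.
    # With y = 0 this stores f(x) in the auxiliary register, as in the text.
    return (x, y ^ f(x))

print(oracle(f, 3, 0))              # stores f(3) = 9: (3, 9)
print(oracle(f, *oracle(f, 3, 5)))  # applying U_f twice is the identity: (3, 5)
```

The self-inverse property is what later allows ancilla registers to be uncomputed and discarded, as discussed in Sect. 4.3.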
In our analysis, oracles must be employed for the purpose of state preparation (Oracle b or Oracle R) and Hamiltonian Simulation (Oracle A); they need to be constructed from mappings between the FEM global edge indices and the quantities defining the linear system, matrix A and vector \({\mathbf {b}}\), as well as the “measurement vector” \({\mathbf {R}}\) that is used to compute the RCS.
In theoretical work, oracle implementations are usually left unspecified. The efficiency of oracular algorithms is commonly characterized in terms of their query complexity, assuming each query is given by an efficiently computable function. In practice, however, oracle implementations must be accounted for. Our analysis aims to comprise all resources, including those needed to implement the required oracles. Their automated implementation using the programming language Quipper and its compiler is elaborated on in Sect. 4. Here we briefly discuss the high-level tasks of these oracle functions. Their resource estimates are presented in “Appendix 3.”
Oracle b is used to prepare quantum state \(\left| b_T\right\rangle _{2,6}\), see Eq. (7) and Fig. 13. Its task is accomplished by subroutine “Oracle_\({\mathbf {b}}({\mathbf {x}},{\mathbf {m}},{\mathbf {p}})\),” which takes as input the quantum state of the \(n_2\)-qubit register \(R_2\) (argument \({\mathbf {x}}\); spanning the linear-system global edge indices), computes the corresponding magnitude value \(b_j\) and phase value \(\phi _j\), and stores them in the two auxiliary computational registers \(R_4\) and \(R_5\) (labeled by arguments \({\mathbf {m}}\) and \({\mathbf {p}}\)), each consisting of \(n_4\) ancilla qubits and initialized (and later terminated) to states \(\left| 0\right\rangle ^{\otimes n_4}\), respectively.
Oracle R is used to prepare quantum state \(\left| R_T\right\rangle _{3,8}\) in Eq. (14). Its task is accomplished by subroutine “Oracle_\({\mathbf {R}}({\mathbf {x}},{\mathbf {m}},{\mathbf {p}})\),” which takes as input the quantum state of the \(n_2\)-qubit register \(R_3\) (argument \({\mathbf {x}}\); spanning the FEM global edge indices), computes the corresponding magnitude value \(r_j\) and phase value \(\phi ^{(r)}_j\), and stores them in the two \(n_4\)-qubit auxiliary computational registers \(R_4\) and \(R_5\) (labeled by arguments \({\mathbf {m}}\) and \({\mathbf {p}}\)), each initialized (and later terminated) to states \(\left| 0\right\rangle ^{\otimes n_4}\), respectively.
Oracle A is needed to compute the matrix A of the linear system; it is employed as part of the HS subroutine “HsimKernel” to specify the one-sparse Hamiltonian that is to be applied. This high-level task is accomplished by the “Oracle_\(A({\mathbf {x}},{\mathbf {y}},{\mathbf {z}};\text{ band },\text{ argflag })\)” subroutine, which takes as input the quantum state of the \(n_2\)-qubit register \(R_2\) (argument \({\mathbf {x}}\); spanning the linear-system global edge indices) and returns the connected Hamiltonian node index, storing it in an \(n_2\)-qubit ancilla register \(R_{12}\) (labeled by argument \({\mathbf {y}}\)); furthermore, it accesses Hamiltonian bands through the integer argument “band” and, depending on the value of the integer variable \(\text{ argflag }\in \{0,1\}\), computes the corresponding Hamiltonian magnitude or phase value, respectively, and stores it in the corresponding auxiliary \(n_4\)-qubit register \({\mathbf {z}}\in \{{\mathbf {m}},{\mathbf {p}}\}\).
4 Automated resource analysis of oracles via the programming language Quipper
The logical circuits required to implement the Oracles A, b, and R were generated using the quantum programming language Quipper and its compiler. Quipper is also equipped with a gatecount operation, which enables performing automated LRE of the oracle implementations.
Our approach is briefly outlined as follows. Oracles A, b and R were provided to us in the IARPA QCS program GFI in terms of MATLAB functions, which return matrix and vector elements defining the original linearsystem problem. The task was to implement them as unitary quantum circuits. We used an approach that combines “Template Haskell” and the “classicaltoreversible” functionality of Quipper, which are explained below. This approach offers a general and automated mechanism for converting classical Haskell functions into their corresponding reversible unitary quantum gates by automatically generating their inverse functions and using them to uncompute ancilla qubits.
This section starts with a short elementary introduction to Quipper. We then demonstrate how Quipper allows automated quantum-circuit generation and manipulation and indeed offers a universal automated LRE tool. We finally discuss how Quipper’s capabilities have been exploited for the purpose of this work, namely to achieve automated LRE of the oracles’ circuit implementations.
4.1 Quipper and the circuit model
The programming language Quipper [14, 15] is a domain-specific, higher-order, functional language for quantum computation. A snippet of Quipper code is essentially the formal description of a circuit construction. Being higher-order, it permits the manipulation of circuits as first-class citizens. Quipper is embedded in the host language Haskell and builds upon the work of [31,32,33,34,35].
In Quipper, a circuit is given as a typed procedure with an input type and an output type. For example, the Hadamard and the NOT gates are typed with
They input a qubit and output a qubit. The keyword Circ is of importance: it says that when executed, the function will construct a circuit (in this case, a trivial circuit with only one gate).
Quantum datatypes in Quipper are recursively generated: Qubit is the type of quantum bits; (A,B) is a pair of an element of type A and an element of type B; (A,B,C) is a 3-tuple; () is the unit type: the type of the empty tuple; [A] is a list of elements of type A.
If a program has multiple inputs, we can either place them in a tuple or use the curry notation (\(\rightarrow \)). For instance, the program
takes three inputs of type A, B and C and outputs a result of type D, while at the same time producing a circuit. Using the curry notation, the same program can also be written as
where D is the type of the output. We use the program by placing the inputs on the right, in order:
The meaning is the following: prog a is a function of type B -> C -> Circ D, waiting for the rest of the arguments; prog a b is a function of type C -> Circ D, waiting for the last argument; finally, prog a b c is the fully applied program. If a program has no input, it simply has the type Circ B, where B is the type of its output.
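The currying discipline itself is independent of quantum circuits. A small Python analogue (our own illustration; the Circ wrapper is omitted and the arithmetic body is arbitrary) contrasts the tupled and curried forms and shows partial application:

```python
# Tupled form: one argument that is a 3-tuple.
def prog_tupled(abc):
    a, b, c = abc
    return a + b * c

# Curried form: prog(a) waits for b; prog(a)(b) waits for c.
def prog(a):
    return lambda b: (lambda c: a + b * c)

partial = prog(1)              # analogous to "prog a": awaits remaining arguments
print(prog_tupled((1, 2, 3)))  # 7
print(prog(1)(2)(3))           # 7
print(partial(2)(3))           # 7
```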
Using the introduced notation, we can type the controlledNOT gate:
and initialization and measure:
To illustrate explicitly how quantum circuits are generated with Quipper, let us use a well-known example: the EPR-pair generation, defined by the transformation \(\left| 0\right\rangle \otimes \left| 0\right\rangle \rightarrow 1/\sqrt{2}\left( \left| 0\right\rangle \otimes \left| 0\right\rangle +\left| 1\right\rangle \otimes \left| 1\right\rangle \right) \). The Quipper code which creates such an EPR pair can be written as follows:
The generated circuit is presented in Fig. 20, and each line is shown with its corresponding action. Line 1 defines the type of the piece of code: Circ means that the program generates a circuit, and (Qubit,Qubit) indicates that two quantum bits are going to be returned. Line 2 starts the actual coding of the program. Lines 3 to 6 are the instructions generating new quantum bits and performing gate operations on them, while Line 7 states that the newly created quantum bits q1 and q2 are returned to the user.
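The transformation realized by this circuit can be cross-checked with a small state-vector simulation (plain Python, not Quipper; entirely our own illustration). A Hadamard on the first qubit followed by a CNOT indeed maps \(\left| 00\right\rangle \) to the EPR state:

```python
from math import sqrt

# Amplitudes of a 2-qubit state in the order |00>, |01>, |10>, |11>.
def hadamard_q1(s):
    # Hadamard on the first qubit: mixes the |0x> and |1x> amplitudes.
    a00, a01, a10, a11 = s
    r = 1 / sqrt(2)
    return [r * (a00 + a10), r * (a01 + a11), r * (a00 - a10), r * (a01 - a11)]

def cnot(s):
    # CNOT with the first qubit as control: swaps |10> and |11>.
    a00, a01, a10, a11 = s
    return [a00, a01, a11, a10]

state = cnot(hadamard_q1([1.0, 0.0, 0.0, 0.0]))
print(state)  # [0.707..., 0.0, 0.0, 0.707...]: the EPR state (|00> + |11>)/sqrt(2)
```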
Quipper is a higher-order language, that is, functions can be inputs and outputs of other functions. This allows one to build quantum-specific circuit-manipulation operators. For example,
inputs a circuit and a qubit, and outputs the same circuit controlled by the qubit. It fails at runtime if any non-controllable gates were used. The following two lines are therefore equivalent:
The function classical_to_reversible, presented in Sect. 4.4, is another example of a high-level operator.
The last feature of Quipper useful for automated generation of oracles is the subroutine (or box) feature. The operator box provides macros at the circuit level: it permits reusing the same piece of code several times within a circuit without writing down the list of gates each time. When a particular piece of circuit is used several times, this makes the in-memory representation of the circuit more compact and therefore more manageable, in particular for resource estimation.
4.2 Quipper-generated resource estimation
The previous section showed how a program in Quipper is essentially a description of a circuit. The execution of a given program generates a circuit, and logical resource estimation is achieved simply by completing the program with a gate-count operation at the end of the circuit-generation process. Instead of, say, sending the gates to a quantum coprocessor, the program merely counts them. Quipper comes equipped with this functionality.
4.3 Regular versus reversible computation
An oracle in quantum computation is a description of a classical structure on which the algorithm acts: a graph, a matrix, etc. An oracle is then usually presented in the form of a regular, classical function f from n to m bits encoding the problem. It is left to the reader to make this function into the unitary of Fig. 21 acting on quantum bits.
Provided that the function f is given as a procedure and not as a mere truth table, there is a known efficient strategy to build \(U_f\) out of the description of f [36].
The strategy consists of two steps. First, construct the circuit \(T_f\) of Fig. 22. Such a circuit can be built in a compositional manner as follows. Suppose that f is given in terms of g and h: \(f(x) = h(g(x))\). Then, provided that \(T_g\) and \(T_h\) are already built, \(T_f\) is the circuit in Fig. 23. NOT and AND suffice to write any Boolean function f: these are the base cases of the construction. The gate \(T_{\mathrm{NOT}}\) is the controlled-NOT, and the gate \(T_{\mathrm{AND}}\) is the Toffoli gate.
Once the circuit \(T_f\) is built, the circuit \(U_f\), shown in Fig. 24, is simply the composition of \(T_f\) and a fanout, followed by the inverse of \(T_f\). At the end of the computation, all the ancillas are back to 0: they are no longer entangled and can be discarded without jeopardizing the overall unitarity of \(U_f\).
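This two-step construction can be simulated classically. The following Python sketch (the wire layout and choice of f are our own illustration) builds \(T_f\) and \(U_f\) for \(f(x_1,x_2,x_3)=x_1\wedge x_2\wedge x_3\) from Toffoli and CNOT gates acting on bit vectors, and confirms that the ancillas return to 0:

```python
# Wires: 0,1,2 = inputs; 3 = scratch ancilla; 4 = result of T_f; 5 = output.

def toffoli(s, c1, c2, t):
    # Toffoli: target ^= c1 AND c2 (reversible and self-inverse).
    s[t] ^= s[c1] & s[c2]

def cnot(s, c, t):
    # CNOT: target ^= control (reversible and self-inverse).
    s[t] ^= s[c]

def t_f(s):
    # T_f built compositionally: x1 AND x2 into wire 3, then AND x3 into wire 4.
    toffoli(s, 0, 1, 3)
    toffoli(s, 3, 2, 4)

def u_f(s):
    # U_f = T_f, fanout of the result into wire 5, then T_f in reverse
    # (each primitive is self-inverse, so reversing the gate order undoes T_f).
    t_f(s)
    cnot(s, 4, 5)
    toffoli(s, 3, 2, 4)
    toffoli(s, 0, 1, 3)

s = [1, 1, 1, 0, 0, 0]
u_f(s)
print(s)  # [1, 1, 1, 0, 0, 1]: ancillas 3 and 4 back to 0, wire 5 holds f = 1
```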
4.4 Quipper and Template Haskell
As the transformation sending a procedure f to a circuit \(T_f\) is compositional, it can be automated. We use a feature of the host language Haskell to perform this transformation automatically: Template Haskell. In a nutshell, it allows one to manipulate a piece of code within the language, produce a new piece of code, and inject it into the program. Another (slightly misleading) way of putting it is that it is a type-safe method for macros. In any case, it allows one to do exactly what we showed in the previous section: function composition is transformed into circuit composition, and every subfunction \(\texttt {f}:A\rightarrow B\) is replaced with its corresponding circuit, whose type is \(A \rightarrow \texttt {Circ}~B\): a function that inputs an object of type A, builds a (piece of) circuit, and outputs B. For example, the code
computing the conjunction of the three input variables x, y and z is turned into a function
computing the circuit in Fig. 25. Notice how the input wires are not touched and how the result is just one among many output wires. One can just as easily encode addition on binary integers.
As Quipper is a high-level language, it naturally allows circuit manipulation. In particular, one can perform the meta-operation classical_to_reversible, sending the circuit \(T_f\) to \(U_f\), of type
provided that A and B are essentially lists of qubits, and that \(T_f\) only consists of classical reversible gates: NOTs, cNOTs, ccNOTs, etc.
In the case of our my_and function, it produces the circuit in Fig. 26, which has the correct shape. One can easily check that the out wire is correctly set.
4.5 Encoding oracles
The oracles of QLSA were given to us as a set of MATLAB functions as part of the IARPA QCS program GFI. These functions computed the matrix A and the vectors b and R of [5]. They did not use any particular library, so directly translating them into Haskell was straightforward. As the MATLAB code came with a few tests to validate the implementation, we were able to validate our translation by running these tests in Haskell.
The main difficulty was not translating the MATLAB code into Quipper, but rather encoding by hand the real arithmetic and analytic functions that were used. Figure 27 shows a snippet of translated Haskell code: it is a nontrivial operation using trigonometric functions. Another part of the oracle also uses arctan.
To be processable by Template Haskell, all the arithmetic and analytic operations had to be written from scratch on integers encoded as lists of Bool. We used a fixed-point encoding: integers were coded as 32 bits plus one sign bit, and real numbers as a 32-bit integer part and a 32-bit mantissa, plus one sign bit. We could have chosen floating-point arithmetic, but the operations would have been much more involved, and the corresponding generated circuit would have been even bigger.
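A minimal sketch of this fixed-point format (our own Python illustration; the paper's actual Quipper code operates on lists of Bool wires) encodes a real number as a sign bit followed by 32 integer and 32 fraction bits:

```python
# One sign bit, a 32-bit integer part, and a 32-bit fractional part (mantissa).
INT_BITS = 32
FRAC_BITS = 32

def encode(x):
    # Sign bit followed by 64 magnitude bits (MSB first), as a list of bools,
    # mirroring the list-of-Bool encoding described in the text.
    m = round(abs(x) * 2 ** FRAC_BITS)
    bits = [bool((m >> i) & 1) for i in range(INT_BITS + FRAC_BITS - 1, -1, -1)]
    return [x < 0] + bits

def decode(bits):
    sign, mag = bits[0], bits[1:]
    m = 0
    for b in mag:
        m = 2 * m + int(b)
    value = m / 2 ** FRAC_BITS
    return -value if sign else value

print(decode(encode(3.25)))  # 3.25 (exactly representable in this format)
print(len(encode(3.25)))     # 65 bits: 1 sign + 32 integer + 32 fraction
```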
We made heavy use of the subroutine facility of Quipper: All of the major operations are boxed, that is, appear only once in the internal structure representing the circuit. This allows manageable processing (e.g., printing, or resource counting). As an example, the circuit for Oracle R of QLSA is shown in Fig. 28.
4.6 Compactness of the generated oracles
Our strategy for generating circuits with Template Haskell is efficient in the following sense: the size of the generated quantum circuit is exactly the same as the number of steps in the classical program. For example, if the classical computation consists of n conjunctions and m negations, the generated quantum circuit consists of n Toffoli gates and m CNOT gates.
The advantage of this technique is that it is fully general: with this procedure, any classical computation can be turned into an oracle in an efficient manner.
Optimizing oracle sizes As we show in this paper, the sizes of the generated oracles are quite impressive. In the current state of our investigations, we believe that, even with hand-coding, these numbers could only be improved upon by a factor of 5, or perhaps at most a factor of 10. We think that accomplishing a greater reduction beyond these moderate factors would require a drastic change in the generation approach and techniques.
The reason why we think it is possible to achieve the mentioned moderate optimization is the following. Although the oracles we deal with in this work are specified and tailored to the particular problem we have been analyzing, they are also general in the sense that they are made of smaller algorithms (e.g., adders, multipliers ...). The reversible versions of these algorithms have been studied for a long time, and quite efficient proposals have been made. An analysis of the involved resources shows that for the addition of n-bit integers, the number of gates involved in the automatically generated adder gate \(T_f\) is \(\lesssim 25n\) and the number of ancillas is \(\lesssim 8n\). A hand-made reversible adder can be constructed [37] with, respectively, \(\lesssim 5n\) gates and \(\le n\) ancillas. If one found a way to reuse these circuits in place of our automatically generated adders, it would reduce the oracle sizes. However, it could only do so by a relatively small factor; the total number of gates would still be daunting.
Despite this drawback, our method is versatile and able to provide circuits for any desired function f without further elaborate analysis.
5 Results
Our LRE for QLSA for problem size \(N=332{,}020{,}680\) is summarized in Table 2. The following comments explain this table and our assumptions.
Unlike with QEC protocols, where the distinction between “data qubits” and “ancilla qubits” is clear, here this distinction is somewhat ambiguous; indeed, all qubits involved in the algorithm are initially prepared in state \(\left| 0\right\rangle \), and some qubits that we call ancilla qubits exist from the start to the end of a full quantum-computation part (such as, e.g., the single-qubit registers \(R_6,\, R_8\)). We regard qubits which carry the data of the linear-system problem and store its solution at the end of the quantum computation as data qubits; they constitute the quantum data registers \(R_2\) and \(R_3\), see Table 1. All other qubits, including those of the QAE and HS control registers \(R_0\) and \(R_1\) as well as of the computational registers \(R_4\) and \(R_5\), are considered ancilla qubits.
It is important to note that the overall QLS algorithm consists of four independent quantum-computation parts, namely the four calls of “AmpEst” subalgorithms, see Fig. 2, while the top-level function “QLSA_main” performs a classical calculation of the RCS (by Eq. (16)) using the results \(\phi _b, \phi _x, \phi _{r0}, \phi _{r1}\) of its four quantum-computation parts. These four independent “AmpEst” subalgorithms can be performed either in parallel or sequentially, and the actual choice should be subject to time/space trade-off considerations. Here we assume a sequential implementation, so that data and ancilla qubits can be reused by the four amplitude estimation parts. Hence, the qubit counts provided in Table 2 represent the maximum number of qubits in use at a time, as required by the most demanding of the four independent “AmpEst” subalgorithms. The maximum overall number of qubits (data and ancilla) in use at a time is also the definition of circuit width. While with a sequential implementation we aim at minimizing the circuit width (space consumption), we can do so only at the cost of increasing the circuit depth (time consumption). The overall circuit depth is the sum of the depths of the four “AmpEst” subalgorithms. A brief look at Fig. 2 makes clear that the circuit depths are similarly large for “AmpEst_\({\phi _x}\)” and “AmpEst_\({\phi _r}\)” (where the latter is called twice), whereas the circuit depth of “AmpEst_\({\phi _b}\)” is negligible in comparison. Hence the overall circuit depth is roughly three times the circuit depth of subalgorithm “AmpEst_\({\phi _r}\).” We could just as well assume a parallel implementation of the four “AmpEst” calls. In this case the overall circuit depth would be roughly a factor of 3 smaller than in the former case. However, this circuit-depth decrease can only be achieved at the cost of incurring a circuit-width increase.
We would need up to four copies of the quantum registers listed in Table 1, and the required number of data and ancilla qubits in use at a time would be larger by a factor that is somewhat smaller than four.
QLSA has numerous iterative operations (in particular due to the Suzuki higher-order integrator method with Trotterization) involving ancilla-qubit “generation–use–termination” cycles, which are repeated over and over while computation is performed on the same end-to-end data qubits. Table 2 provides an estimate both for the number of ancilla qubits employed at a time and for the overall number of ancilla generation–use–termination cycles executed during the implementation of all four “AmpEst” subalgorithms. To illustrate the difference, we note that for some quantum-computer realizations the physical information carriers (carrying the ancilla qubits) can be reused, whereas for others, such as photon-based realizations, the information carriers are lost and have to be created anew.
Furthermore, the gate counts refer to the number of elementary logical gate operations, independent of whether these operations are performed using the same physical resources (lasers, interaction region, etc.) or not. The huge number of measurements results from the vast overall number of ancilla-qubit uses; after each use an ancilla has to be uncomputed and eventually terminated to ensure reversibility of the circuit. Finally, Table 2 distinguishes between the overall LRE that includes the oracle implementation and the LRE for the bare algorithm with oracle calls regarded as “for free” (excluding their resource requirements).
6 Discussion
6.1 Understanding the resource demands
Our LRE results shown in Table 2 suggest that the resource requirements of QLSA are to a large extent dominated by the quantum-circuit implementation of the numerous Oracle A queries and their associated resource demands. Indeed, accounting for oracle implementation costs yields resource counts that are several orders of magnitude larger than those obtained if oracle costs are excluded. While Oracle A queries have only slightly lower implementation costs than Oracle b and Oracle R queries, it is the number of queries that makes a substantial difference. As clearly illustrated in Fig. 2, Oracle A (required to implement the Hamiltonian transformation \(e^{iAt}\) with \(t\le t_0\sim O(\kappa /\varepsilon )\)) is queried many orders of magnitude more frequently than Oracles b and R, which are needed only for the preparation of the quantum states \(\left| b\right\rangle \) and \(\left| R\right\rangle \) corresponding to the column vectors \({\mathbf {b}}, {\mathbf {R}}\in {\mathbb {C}}^N\). Hence, the overall LRE of the algorithm depends very strongly on the Oracle A implementation. However, note that Oracles b and R contribute most to circuit width due to the vast number of ancilla qubits (\(\sim 3\times 10^8\)) they employ at a time, see Table 10 in “Appendix 3.”
The LRE for the bare algorithm, i.e., with oracle queries and the “IntegerInverse” function regarded as “for free” (excluding their resource costs), amounts to the order of magnitude \(10^{25}\) for gate count and circuit depth, still a surprisingly high number. In what follows, we explain how these large numbers arise, expanding in more detail on all the factors that contribute significantly to the resource demands. To do so, we make use of Fig. 2.
QLSA’s LRE is dominated by series of nested loops consisting of numerous iterative operations, see Fig. 2. The major iteration of circuits with similar resource demands occurs due to the Suzuki higher-order integrator method, including a Trotterization with a large time-splitting factor of order \(10^{12}\), to accurately implement each run of the HS as part of QPEA. Indeed, each single call of “HamiltonianSimulation” yields the iteration factor \(r=2.5\times 10^{12}\). This subroutine is called twice during the “Solve_x” procedure, and the latter is furthermore employed twice within the (controlled) Grover iterators in three of the four QAEAs. There are \(\sum _{j=0}^{n_0-1} 2^j=2^{n_0}-1=16{,}383\) controlled Grover iterators employed within each of the four QAEAs. Hence, the “HamiltonianSimulation” subroutine is employed \((2^{n_0}-1)\times 4\times 3=196{,}596\approx 2\times 10^5\) times altogether. Because each of its calls uses Trotterization with time-splitting factor \(2.5\times 10^{12}\) and a Suzuki higher-order integrator decomposition with order \(k=2\) involving a further factor 5, we already get a factor \(\sim 2.5\times 10^{18}\). Moreover, the lowest-order Suzuki operator is a product of \(2\times N_b=18\) one-sparse Hamiltonian propagator terms (where \(N_b=9\) is the number of bands in matrix A); each such term calls the “HsimKernel” function, with “band” and “timestep” as its runtime parameters. In addition, each call of HsimKernel employs Oracle A six times and furthermore involves 24 applications of the procedure “Hmag” controlled by the time register \(R_1\). Thus, in total, QLSA involves \(6\times 18\times 2.5\times 10^{18}\approx 2.7\times 10^{20}\) Oracle A queries and \(24\times 18\times 2.5\times 10^{18}\approx 10^{21}\) calls of controlled Hmag. Hence, even if subroutine Hmag consisted of a single gate and Oracle A queries were free, we would already have approximately \(10^{21}\) for gate count and circuit depth.
However, Hmag is a subalgorithm consisting of further subcircuits that implement the application of the magnitude component of a particular one-sparse Hamiltonian term to an arbitrary state. It consists of several W gates, Toffolis and controlled rotations. Hence, a further increase of the order of magnitude is incurred by the various decompositions of multi-controlled gates and/or rotation gates into the elementary set of fault-tolerant gates \(\{H, S, T, X, Z, \text{ CNOT }\}\), using the well-known decomposition rules outlined in “Appendix 2” (e.g., optimal-depth decompositions for Toffoli [38] and for controlled single-qubit rotations [39,40,41,42]). In our analysis, this yields a further factor \(\sim 10^4\). Thus, even if we exclude oracle costs, we have \(10^{21} \times 10^4 = 10^{25}\) for gate count and circuit depth for the bare algorithm, simply because of the large number of iterative processes (due to Trotterization and Grover-iterate-based QAE) combined with decompositions of higher-level circuits (such as multi-controlled NOTs) into elementary gates and single-qubit rotation decompositions (factors \(\sim 10^2\)–\(10^4\)).
If we include the oracle implementation costs, the dominant contribution to the LRE is that of Oracle A calls, because Oracle A is queried by a factor \(\sim 10^{15}\) more often than Oracle b, and by an even larger factor than Oracle R. Each Oracle A query’s circuit implementation has a gate count and circuit depth of order \(\sim 2.5\times 10^8\), see “Appendix 3.” With approx. \(2.7\times 10^{20}\) Oracle A queries, the LRE thus amounts to the order of magnitude \(\sim 10^{29}\).
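To leading order, the oracle-included depth is simply the product of the Oracle A query count and the per-query circuit depth; a rough sketch (the exact counts are in the tables):

```python
oracle_a_queries = 2.7e20   # total Oracle A queries (from the nested-loop count)
depth_per_query = 2.5e8     # gate count/depth of one Oracle A circuit ("Appendix 3")

total = oracle_a_queries * depth_per_query
assert 6e28 < total < 7e28  # ~10^29, dwarfing the bare algorithm's ~10^25
```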
Let us briefly summarize the nested loops of QLSA that dominate the resource demands, while other computational components contribute negligibly. The dominant contributions result from those series of nested loops which include Hamiltonian Simulation as the most resource-demanding bottleneck. The outer loops in these series are the first-level QAEA subroutines to find estimates for \(\phi _x, \phi _{r0}\), and \(\phi _{r1}\), each involving \(2^{n_0}-1=16{,}383\) controlled Grover iterators. Each Grover iterator involves several implementations of Hamiltonian Simulation based on the Suzuki Higher-Order Integrator decomposition and Trotterization with \(r\approx 10^{12}\) time-splitting slices. Each Trotter slice involves iterating over each matrix band, whereby the corresponding part of the Hamiltonian evolution is applied to the input state. Finally, for each band, several Oracle A implementations are required to compute the corresponding matrix elements; these moreover employ several arithmetic operations, each of which itself requires loops with computational effort scaling polynomially with the number of bits of precision.
6.2 Comparison with previous “big-O” estimations
As pointed out in the Introduction, we provide the first concrete resource estimation for QLSA, in contrast to the previous analyses [3, 5], which estimated the runtime of QLSA only in terms of its asymptotic behavior using the “big-O” characterization. As the latter is meant to indicate how the size of the circuit evolves with growing parameters, it is interesting to compare our concrete results for gate count and circuit depth with what one would expect according to the rough estimate suggested by the big-O (complexity) analysis. The big-O estimations proposed by Harrow et al. [3] and Clader et al. [5] have been briefly discussed in the Introduction and are given in Eqs. (1) and (3), respectively.
Complexity-wise, the parameters taken into account in the big-O estimations are the size N of the square matrix A, the condition number \(\kappa \) of A, the sparsity d, which is the number of nonzero entries per row/column of A, and the desired algorithmic accuracy given as error bound \(\varepsilon \). The choice of parameters made in this paper fixes these values to \(N=332{,}020{,}680\), \(\kappa =10^4\), \(d=7\), and \(\varepsilon =10^{-2}\). If one plugs them into Eqs. (1) and (3), one gets, respectively, \(\sim 4\times 10^{12}\) and \(\sim 2\times 10^{12}\).
Although these numbers are large, they are not even close to our estimates. This is due to the way a big-O estimate is constructed: it focuses only on a certain set of parameters, the others being treated as roughly independent of the chosen set. Indeed, the “function” provided as a big-O estimate only gives a trend of how the estimated quantity behaves as the chosen set of parameters goes to infinity (or to zero, in the case of \(\varepsilon \)). Hence, only the limiting behavior of the estimate can be predicted with high accuracy, when the chosen relevant parameters tend toward particular values or infinity, while the estimate is very rough for other values of these parameters. In particular, a big-O estimate hides a set of constant factors, which are unknown. In the case of QLSA, our LRE analysis does not reveal a trend; it only gives one point. Nonetheless, it shows that these factors are extremely large and that they must be carefully analyzed and taken into account for any potentially practical use of the algorithm.
Although the (unknown) constant factors implied by big-O complexity cannot be inferred from our LRE results obtained for just a single problem size, we can nevertheless consider which steps in the algorithm are likely to contribute most to these factors. With our fine-grained approach we found that, excluding the Oracle A resources, the accrued circuit depth \(\sim 10^{25}\) is roughly equal to \(3\times (2^{n_0}-1)\) Grover iterations (as part of the amplitude estimation loops for \(\phi _x, \phi _{r0}\), and \(\phi _{r1}\)), times \(4\times (2N_b)\times 5\times 2.5\times 10^{12}\) for the number of exponentials needed to implement the Suzuki–Trotter expansion (as part of implementing HS, which is employed twice in Solve_x, which is in turn employed twice in each Grover iterator), times a factor \(\sim 24\times 10^4\) coming from the circuits that implement, for each particular \(A_j\) in the decomposition [Eq. (8)], the corresponding part of the Hamiltonian state transformation. In terms of CJS big-O complexity, the circuit depth is \({\widetilde{O}}\left( \kappa {}d^7\log (N)/\varepsilon ^2 \right) \), which comes from \({\widetilde{O}}\left( 1/\varepsilon \right) \) QAE Grover iterations,^{Footnote 15} times \({\widetilde{O}}\left( d^4\kappa /\varepsilon \right) \) exponential operator applications to implement the Suzuki–Trotter expansion,^{Footnote 16} times \(O\left( \log N\right) \) Oracle A queries to simulate each query to any \(A_j\) in the decomposition [Eq. (8)], times the overhead of \(O(d^3)\) computational steps, including \(O(d^2)\) Oracle A queries, to estimate the preconditioner M of the linear system in order to prepare the preconditioned state \(M\left| b\right\rangle \), see [5]. Here it is appropriate to note, though, that the HHL and CJS runtime complexities given in Eqs. (1) and (3), respectively, neglect more slowly growing terms, as indicated by the tilde notation \({\widetilde{O}}(\cdot )\).
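The factor-by-factor accounting above multiplies out as stated; a quick numerical sketch of that product, using only the values quoted in the text:

```python
n0, Nb, k, r = 14, 9, 2, 2.5e12

grover_loops = 3 * (2**n0 - 1)                  # QAE loops for phi_x, phi_r0, phi_r1
exponentials = 4 * (2 * Nb) * 5**(k - 1) * r    # Suzuki-Trotter exponentials per loop
per_term_circuit = 24 * 1e4                     # circuit for one one-sparse term

depth = grover_loops * exponentials * per_term_circuit
assert 1.0e25 < depth < 1.1e25                  # reproduces the quoted depth ~10^25
```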
However, in a comparison with our empirical gate counts, we ought to also take those slowly growing terms into account. For instance, there is another factor of \((\kappa d^2/\varepsilon ^2)^{1/4}\approx 3\times 10^2\) contributing to the number of Suzuki–Trotter expansion slices, which was ignored in the \({\widetilde{O}}\) notation for the HHL and CJS complexities, while it was accounted for in our LRE. By inspecting and comparing (CJS big-O vs. our LRE) the orders of magnitude of the various contributing terms, we conclude that the big-O complexity is roughly two orders of magnitude smaller than our empirical counts for the Suzuki–Trotter expansion step. As for the QAE steps, our LRE count is \(\sim 5\times 10^4\), which is roughly two orders of magnitude higher than \(O(1/\varepsilon )\) and smaller than \(O(\kappa /\varepsilon )\), suggesting that \(O(1/\varepsilon )\) is too optimistic while \(O(\kappa /\varepsilon )\) is too conservative. Finally, the big-O complexity misses roughly five orders of magnitude that our fine-grained approach reveals for the circuit implementation of the Hamiltonian state transformation for each \(A_j\) at the lowest algorithmic level.
In order to understand what caused such large constant factors, we estimated the resources needed to run QLSA for a smaller problem size^{Footnote 17} while keeping the same precision (and therefore the same size for the registers holding the computed values). Specifically, we chose \(N=24\), while we kept the condition number and the error bound at the same values \(\kappa =10^4\) and \(\varepsilon =10^{-2}\), respectively. Despite the fact that the matrix A lost several orders of magnitude in size, the circuit width and depth ended up being of roughly the same order of magnitude as in Table 2.
What our results suggest is that the large constant factors arise as a consequence of the desired precision forcing us to choose large sizes for the registers, whereas the LRE is not notably impacted by a change in problem size N. This can intuitively be understood as follows. First, the total number of gates required for QLSA’s non-oracle part scales as \(O(\log N)\), cf. Eq. (3); hence, using \(N=24\) in place of \(N=332{,}020{,}680\) suggests an LRE reduction only by a moderate factor \(\sim 5\). Second, the LRE of the oracles is also mostly determined by the desired accuracy \(\varepsilon \). Each oracle query essentially computes a single (complex) value corresponding to a particular input from the set of all inputs. The oracles are oblivious to the problem size and to the actual value of each of their inputs. While oracles obtain actual input data from the data register \(R_2\) or \(R_3\), whose size \(n_2=n_3=\lceil \log _2(2N)\rceil \) clearly depends on N, these are not the registers that crucially determine the oracles’ sizes. What actually matters for the size of the generated quantum circuit implementing an oracle query is the size of the computational registers \(R_4\) and \(R_5\) used to compute and hold the output value of each particular oracle query. In our analysis, these registers have size \(n_4=65\), cf. Table 1; they were kept at the same size when computing QLSA’s LRE for the smaller problem size \(N=24\).
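A short calculation makes the register-size argument concrete (a sketch using the paper’s parameters; the point is that \(n_4=65\) is fixed by the precision, not by N):

```python
import math

def data_register_size(N):
    """Length n2 = ceil(log2(2N)) of the data registers R2, R3."""
    return math.ceil(math.log2(2 * N))

assert data_register_size(332_020_680) == 30   # full problem size
assert data_register_size(24) == 6             # reduced problem size

# The oracle circuits are dominated by the fixed-precision computational
# registers R4, R5 (n4 = 65 bits), identical for both problem sizes.
n4_full = n4_small = 65
assert n4_full == n4_small
```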
6.3 Lack of parallelism
Comparing the estimates for the total number of gates and the circuit depth reveals a distinct lack of parallelism^{Footnote 18} in the design of QLSA. As explained earlier, due to the highly repetitive structure of the algorithm primitives used, most of the gates have to be performed sequentially. Indeed, QLSA involves numerous iterative operations. The major iteration of circuits with similar resource requirements occurs due to the Suzuki Higher-Order Integrator method, which also involves Trotterization, using a large time-splitting factor of order \(10^{12}\) to accurately implement each run of the Hamiltonian-evolution simulation. In fact, the iteration factor imposed by Trotterization of the Hamiltonian propagator is currently a hard bound on the overall circuit depth and even the total LRE of QLSA, and it crucially depends on the targeted algorithmic precision \(\varepsilon \). The remarks in the following paragraph expand on this issue in more detail.
6.4 Hamiltonian-evolution simulation as the actual bottleneck and recent advancements
It is worth emphasizing that the quantum-circuit implementation of the Hamiltonian transformation \(e^{-iAt}\) using well-established HS techniques [8] constitutes the actual bottleneck of QLSA. Indeed, this step implies the largest contribution to the overall circuit depth; it is given by the factor \(r\times 5^{k-1} \times (2N_b)\), see Fig. 2, which is imposed by the Suzuki Higher-Order Integrator method together with Trotterization. According to Eq. (17) and the discussion following it, \(r\sim O\left( (N_b\kappa )^{1+1/(2k)}/ \varepsilon ^{1+1/k}\right) \). Thus, the key dependence of the time-splitting factor r is on the condition number \(\kappa \) and the error bound \(\varepsilon \) rather than on the problem size N. The dependence on the latter enters only through the number of bands \(N_b\) (in the general case, the number m of submatrices in the decomposition [Eq. (8)]), which can be small even for large matrix sizes, as is the case in our example. This feature explains why we get similar LRE results for \(N=332{,}020{,}680\) and \(N=24\) if \(\kappa \) and \(\varepsilon \) are kept at the same values for both cases and the number of bands \(N_b\) is small (see above).
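To illustrate how strongly r depends on \(\varepsilon \) (and how weakly on N), one can evaluate the scaling \(r \propto (N_b\kappa )^{1+1/(2k)}/\varepsilon ^{1+1/k}\) for our parameters; a sketch ignoring the constant prefactor hidden by the big-O:

```python
def r_scaling(Nb, kappa, eps, k=2):
    """Scaling of the Trotter time-splitting factor (constant prefactor dropped)."""
    return (Nb * kappa) ** (1 + 1 / (2 * k)) / eps ** (1 + 1 / k)

base = r_scaling(Nb=9, kappa=1e4, eps=0.01)

# Tightening eps by 10x inflates r by 10**(1 + 1/k) = 10**1.5 ~ 31.6 ...
assert abs(r_scaling(9, 1e4, 0.001) / base - 10**1.5) < 1e-6
# ... whereas N enters only through Nb, which is the same (Nb = 9)
# for N = 24 and N = 332,020,680 in our example.
```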
It is also important to note that there has been significant recent progress on improving HS techniques. Berry et al. [43] provide a method for simulating Hamiltonian evolution with complexity polynomial in \(\log (1/\varepsilon )\) (with \(\varepsilon \) the allowable error). Even more recent works by Berry et al. [44, 45] improve upon the results in [43], providing a quantum algorithm for simulating the dynamics of sparse Hamiltonians with complexity sublogarithmic in the inverse error. Compared to [44], the analysis in [45] yields a near-linear instead of superquadratic dependence on the sparsity d. Moreover, unlike the approach [43], the query complexities derived in [44, 45] are shown to be independent of the number of qubits acted on. Most importantly, all three approaches [43,44,45] provide an exponential improvement upon the well-established method [8] that our analysis is based on.^{Footnote 19} To account for these recent achievements, we estimate the impact they may have with reference to the baseline imposed by our LRE results. The modular nature of our LRE approach allows us to do this estimation. The following back-of-the-envelope evaluation shows that, for \(\varepsilon =0.01\), the advanced HS approaches [43, 44] and [45] may offer a potential reduction of circuit depth and overall gate count by orders of magnitude \(10^1\), \(\sim 10^4\), and \(\sim 10^5\), respectively.
Indeed, let us compare the scalings of the total number of one-sparse Hamiltonian-evolution terms required to approximate \(e^{-iAt}\) to within error bound \(\varepsilon =0.01\) for the prior approach [8] (used here) and the recent methods [43, 45]. In doing so, we arrive at the contrasting expressions (19)–(21)
for the three approaches [8, 43], and [45], respectively. In the first term, m denotes the number of submatrices in the decomposition [Eq. (8)]; in the general case, \(m=6d^2\); in our toy-problem analysis, \(m=N_b\). In the second and third terms, d is the sparsity of A and n is the number of qubits acted on, while c is a constant. In all three expressions, \(\Vert A\Vert \) is the spectral norm of the Hamiltonian A, which in our toy-problem example is time-independent. As stated in Sect. 3.4.5, for QLSA to be accurate within error bound \(\varepsilon \), we must have \(\Vert A\Vert t\sim O(\kappa /\varepsilon )\), cf. [3]. Using \(\Vert A\Vert t \le \Vert A\Vert t_0=7\kappa /\varepsilon \) and the parameter values \(m=N_b=9\), \(k=2\), \(d=7\), \(n=n_2=30\), and \(c\ge 1\), expression (19) yields \(\sim 7\times 10^{13}\), whereas the query complexity estimates (20) and (21) yield \(\gtrsim 5\times 10^{12}\) and \(\sim 5\times 10^8\), respectively. Hence, notably, the advanced results in [45] imply that an improvement of our LRE by an order of magnitude \(\sim 10^5\) seems feasible.
7 Conclusion
A key research topic of quantum-computer science is to understand what computational resources would actually be required to implement a given quantum algorithm on a realistic quantum computer, for the large problem sizes at which a quantum advantage would be attainable. Traditional algorithm analyses based on big-O complexity characterize algorithmic efficiency in terms of the asymptotic leading-order behavior and therefore do not provide a detailed accounting of the concrete resources required for any given specific problem size, which, however, is critical to evaluating the practicality of implementing the algorithm on a quantum computer. In this paper, we have demonstrated how such a concrete resource estimation can be performed.
We have provided a detailed estimate of the logical resource requirements of the quantum linear-system algorithm, which under certain conditions solves a linear system of equations, \(A{\mathbf {x}}={\mathbf {b}}\), exponentially faster than the best known classical method. Our estimates correspond to the explicit example problem size beyond which the quantum linear-system algorithm is expected to run faster than the best known classical linear-system solving algorithm. Our results have been obtained by a combination of manual analysis for the bare algorithm and automated resource estimates for oracles generated via the quantum programming language Quipper and its compiler. Our analysis shows that for a desired calculation accuracy \(\varepsilon =0.01\), an approximate circuit width 340 and circuit depth of order \(10^{25}\) are required if oracle costs are excluded, and a circuit width and circuit depth of order \(10^8\) and \(10^{29}\), respectively, if the resource requirements of oracles are taken into account, showing that the latter are substantial. We stress once again that our estimates pertain only to the resource requirements of a single run of the complete algorithm, while in fact multiple runs of the algorithm (followed by sampling) are necessary to produce a reliably accurate outcome.
Our LRE results for QLSA are based on well-established quantum-computation techniques and primitives [1, 6,7,8, 22] as well as on our approach to implementing oracles using Quipper. Hence, our estimates strongly rely on the efficiency of the applied methods and the chosen approach. Improvement upon our estimates can only be achieved by advancements enabling more efficient implementations of the utilized quantum-computation primitives and/or oracles. For example, as pointed out in Sect. 6, the most recent advancements of Hamiltonian-evolution simulation techniques [45] suggest that a substantial reduction of circuit depth and overall gate count by an order of magnitude \(\sim 10^5\) seems feasible. Likewise, more sophisticated methods to generate quantum-circuit implementations of oracles more efficiently may become available. We think, though, that significant improvements are going to come from inventing a better QLS algorithm, or more resource-efficient Hamiltonian-evolution simulation approaches, rather than from improvements to Quipper. While we believe that our estimates may prove to be conservative, they nevertheless provide a well-founded “baseline” for research into the reduction of the algorithmic-level minimum resource requirements, showing that a reduction by many orders of magnitude is necessary for the algorithm to become practical. Our modular approach to the analysis of extremely large quantum circuits reduces the cost of updating the analysis when improved quantum-computation techniques are discovered.
To give an idea of how long the algorithm would have to run at a minimum, let us suppose that, in the ideal case, all logic gates take the same amount of time \(\tau \) and have perfect performance, thus eliminating the need for QC and/or QEC. Then, for any assumed gate time \(\tau \), one can calculate a lower limit on the amount of time required for the overall implementation of the algorithm. For example, if \(\tau =1\) ns (a rather optimistic assumption; for other gate durations, one can plug in one’s own value), a circuit depth of order \(10^{25}\) (\(10^{29}\)) would correspond to a runtime of approx. \(3\times 10^8\) (\(3\times 10^{12}\)) years, which is comparable to or even exceeds the age of the Universe (estimated to be approx. \(13.8\times 10^9\) years). Even with the mentioned promising improvements by a factor \(\sim 10^5\) for the Hamiltonian-evolution simulation and by a factor \(\sim 10\) for the oracle implementations, we would still face runtimes of approx. \(3\times 10^2\) (\(3\times 10^{6}\)) years.
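The runtime conversion is straightforward sequential-depth arithmetic; a sketch under the text’s assumption \(\tau = 1\) ns:

```python
TAU = 1e-9                    # assumed gate time: 1 ns
SECONDS_PER_YEAR = 3.156e7    # ~365.25 days

def runtime_years(depth, tau=TAU):
    """Lower-bound wall-clock time for a purely sequential circuit of given depth."""
    return depth * tau / SECONDS_PER_YEAR

assert 3.0e8 < runtime_years(1e25) < 3.3e8       # oracle-free depth ~1e25
assert 3.0e12 < runtime_years(1e29) < 3.3e12     # oracle-included depth ~1e29
# With the combined ~1e5 (HS) and ~10 (oracle) improvements, the
# oracle-included case drops to the quoted ~3e6 years:
assert 3.0e6 < runtime_years(1e29 / 1e6) < 3.3e6
```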
Although our results are surprising when compared to a naive reading of the previous big-O estimations of the algorithm [3, 5], the difference can be explained by the factors hidden in the big-O estimation analyses: we infer that these factors come for the most part from the large register sizes, chosen because of the desired precision.
The moral of this analysis is that quantum algorithms are typically not designed with implementation in mind. Considering only the overall coarse complexity of a given algorithm does not automatically make it feasible. In particular, our analysis shows that bookkeeping parameters such as the sizes of registers have to be considered.
Our analysis highlights an avenue for future research: quantum programming languages and formal methods. In computer science, mature techniques have been developed over decades, and we ought to adapt and implement them for a fine-grained analysis of quantum algorithms, to pinpoint the various parameters in play and their relationships. In particular, these techniques may also make it possible to identify explicitly the actual bottlenecks of a particular implementation and to provide useful insights on what to focus on for optimizations: in the case of QLSA, for instance, the Hamiltonian-evolution simulation and the oracle implementations. Combining a fine-grained approach with asymptotic big-O analysis, a much fuller understanding of the bottlenecks in quantum algorithms emerges, enabling focused research on improved algorithmic techniques.
Notes
Note that, if A is not Hermitian, the problem can be restated as \({\bar{A}}{\bar{\mathbf{x}}}={\bar{\mathbf{b}}}\) with a Hermitian matrix \({\bar{A}}:=\begin{pmatrix} 0 & A\\ A^\dagger & 0 \end{pmatrix}\), see Sect. 3.
The aspect of PLATO most closely aligned with the topic of this paper was the understanding of the resources required to run a quantum algorithm followed by research into the reduction of those resources.
The GFI for QLSA was provided by Clader and Jacobs, the coauthors of the work [5] whose supplementary material includes a considerable part of that GFI.
At the time of ENIAC and other early classical computers, it seems unlikely that thinking about how the size of the computer could be reduced and its power increased would have led us to the invention of the transistor. Instead, we would have considered how vacuum tubes could be made smaller or designed to perform more complex operations.
In [3] it was also shown that the runtime cannot be made \(\text{ poly }\log (\kappa )\) unless \(\mathbf{BQP}=\mathbf{PSPACE}\), which, while not yet disproven, is considered highly unlikely in computational complexity theory. Hence, because \(\text{ poly }\log (\kappa )=o(\kappa ^\varepsilon )\) for all \(\varepsilon >0\), QLSA’s runtime is asymptotically also bounded from below as given by complexity \(\varOmega (\kappa ^{1-o(1)})\).
But note that, while the CJS runtime complexity [Eq. (3)] scales quadratically better in the condition number \(\kappa \) than the original HHL complexity [Eq. (1)], the former scales quadratically worse than the latter with respect to the parameters d and \(\varepsilon \). However, the two runtime complexities should not be directly compared, because the corresponding QLS algorithms achieve somewhat different tasks. Besides, it is our opinion that the linear scaling of the CJS runtime complexity in \(\kappa \) is based on an over-optimistic assumption in its derivation. Indeed, while CJS removed the QAA step from the HHL algorithm, they replaced it with the nearly equivalent QAE step, which we believe has a similar resource requirement, and may thus require up to \(O(\kappa /\varepsilon )\) iterations to ensure successful amplitude estimation within multiplicative accuracy \(\varepsilon \), in addition to the factor \(O(\kappa /\varepsilon )\) resulting from the totally independent QPEA step. See also our remark in footnote 11.
The condition number of a matrix A is defined by \(\kappa _p(A)=\Vert A\Vert _p\Vert A^{-1}\Vert _p\), where \(\Vert \cdot \Vert _p\) denotes the matrix norm that is used to induce a metric. Hence, the condition number is also a function of the norm that is used. The 1-norm \(\Vert \cdot \Vert _1\) and 2-norm \(\Vert \cdot \Vert _2\) are commonly used to define the condition number, and obviously \(\kappa _1\not =\kappa _2\) in general. But due to \(\Vert A\Vert _1/\sqrt{N}\le \Vert A\Vert _2\le \sqrt{N}\Vert A\Vert _1\) for \(N\times N\) matrices A, knowing the condition number for either of these two norms allows one to bound the other. Furthermore, if A is normal (i.e., diagonalizable with a spectral decomposition), then \(\kappa _2=\lambda _{\mathrm{{max}}}/\lambda _{\mathrm{{min}}}\), where \(\lambda _{\mathrm{{max}}}\) and \(\lambda _{\mathrm{{min}}}\) are the maximum and minimum eigenvalues of A. For a regular mesh of size \(h\), \(\kappa _2\) generally scales as \(O(h^{-2})\) [27,28,29]. Hence, because the number of degrees of freedom scales as \(N=O(h^{-n})\), \(\kappa _2\) is bounded by \(O(N^{2/n})\) (see [27, 28] for a rigorous proof). In our toy-problem, \(h\approx 0.1\) whereas \(N\approx 3\times 10^8\); thus it is not evident whether a guess for \(\kappa _2\) should be based on \(O(h^{-2})\) or O(N), as the two bounds indeed differ by many orders of magnitude. Besides, as our LRE analysis aims at achieving an optimistic (as opposed to an overly conservative) resource count for QLSA, it is more sensible to use the lower bound rather than the upper bound as a guess for \(\kappa _2\). Hence, we attempted to find an actual lower bound for \(\kappa _2\) numerically.
To this end, because an estimate for \(\kappa _1\) can be obtained with much less computational expense than one for \(\kappa _2\) for a given matrix of very large size, we used MATLAB and extrapolation techniques to attain a rough approximation of \(\kappa _1\) from the given code specifying the matrix of our toy-problem. We found a value \(\kappa _1\approx 10^7\). This allowed us to infer a rough estimate for the lower bound of \(\kappa _2\). Indeed, using the above relation between the matrix norms \(\Vert \cdot \Vert _1\) and \(\Vert \cdot \Vert _2\) for a square matrix, and realizing that both \(\Vert A\Vert _1\) and \(\Vert A\Vert _2\) have values of order O(1), we may conclude that \(\kappa _2\ge \kappa _1/\sqrt{N}\times O(1)\), which is of order approximately \(10^3\)–\(10^4\).
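A quick numerical check of this bound (a sketch; the O(1) norm factors are not included, so the bare ratio sits at the lower end of the quoted \(10^3\)–\(10^4\) range):

```python
import math

kappa_1 = 1e7        # MATLAB-extrapolated estimate of the 1-norm condition number
N = 332_020_680      # matrix size

kappa_2_lower = kappa_1 / math.sqrt(N)   # lower bound, up to O(1) norm factors
assert 5e2 < kappa_2_lower < 6e2         # ~5.5e2; O(1) factors lift this toward 1e3-1e4
```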
Let \(\left| \psi \right\rangle =\left| \psi _{\mathrm{{good}}}\right\rangle +\left| \psi _{\mathrm{{junk}}}\right\rangle \) be a superposition of the good and the junk components of a (normalized) quantum state \(\left| \psi \right\rangle \). The goal of QAEA [22] is to estimate \(\alpha :=\langle \psi _{\mathrm{{good}}} | \psi _{\mathrm{{good}}}\rangle \), i.e., the modulus squared of the amplitude of the desired good component.
Note that we hereby use a multiplicative error bound to represent the desired precision of QAEA’s computation.
Note that \(1/\lambda _{\mathrm{{max}}}\le \kappa \sin \phi _x\le 1/\lambda _{\mathrm{{min}}}\), which suggests that \(M\sim O(\kappa /\varepsilon )\) would be sufficient to estimate \(\alpha _x:=\sin ^2\phi _x\) with multiplicative error \(\varepsilon \), see corollary (6). This is a conservative estimate, and the implied cost for the QAE step is indeed higher by a factor \(O(\kappa )\) than that assumed by CJS in deriving the overall complexity [Eq. (3)].
However, it ought to be noted that, by the “principle of deferred measurements” (see [1]), for any quantum circuit involving measurements whose results are used to conditionally control subsequent quantum circuits, the actual measurements can always be deferred to the very end of the entire quantum algorithm, without in any way affecting the probability distribution of its final outcomes. In other words, measuring qubits commutes with conditioning on their postselected outcomes. Hence, any quantum circuit involving postselection can always be included as a subroutine using only pure states as part of a bigger algorithm with probabilistic outcomes. Nonetheless, in view of the resources used to achieve efficient simulation, measuring qubits as early as possible can potentially reduce the maximum number of simultaneously employed physical qubit systems, enabling the algorithm to be run on a smaller quantum computer. In addition, we emphasize here that, with a small amount of additional effort, QAEA can be designed such that its final measurement outcomes nearly deterministically yield the desired estimates. Note that a similar concept also applies to QAA in HHL-QLSA, which aims at amplifying the success probability.
The RCS in Eq. (16) is of the form \(\sigma _{\mathrm{{RCS}}}=C\frac{\alpha _1}{\alpha _2}(\alpha _3-\alpha _4)\), where C is a constant and \(\alpha _i~(i=1,\ldots ,4)\) are the moduli squared of four different quantum amplitudes to be estimated using QAEA. The QAE control register size \(n_0\) has been chosen such (see Table 1) that, with a success probability greater than \(1-\varepsilon \), respectively, the corresponding estimates are within \(\pm \varepsilon \alpha _i\) of the actual correct values, i.e., \({\tilde{\alpha }}_i=\alpha _i\pm \varepsilon \alpha _i\). It is straightforward to show that, with only a single run of each of the four QAEA subroutines, our estimate \({\tilde{\sigma }}_{\mathrm{{RCS}}}=C\frac{{\tilde{\alpha }}_1}{{\tilde{\alpha }}_2}({\tilde{\alpha }}_3-{\tilde{\alpha }}_4) \) for the RCS satisfies \({\tilde{\sigma }}_{\mathrm{{RCS}}}= \sigma _{\mathrm{{RCS}}}\pm \varepsilon \sigma _{\mathrm{{RCS}}}\pm \varepsilon \sigma _{\mathrm{{RCS}}}\pm \varepsilon \sigma _{\mathrm{{RCS}}}+O(\varepsilon ^2)\), and hence \(|{\tilde{\sigma }}_{\mathrm{{RCS}}}-\sigma _{\mathrm{{RCS}}}|\le 3\varepsilon \sigma _{\mathrm{{RCS}}}\), with a probability of at least \((1-\varepsilon )^4\approx 1-4\varepsilon \). Note that, to ensure \(|{\tilde{\sigma }}_{\mathrm{{RCS}}}-\sigma _{\mathrm{{RCS}}}|\le \varepsilon \sigma _{\mathrm{{RCS}}}\) with a probability close to 1, we actually should have chosen an even higher calculation accuracy for each of the four QAEA subroutines, achieved by using the larger QAE control register size \(n_0'=\lceil \log _2 M'\rceil \), where \(M'=2^{\lceil \log _2( 1/{\varepsilon '^2})\rceil }\), enabling estimations with the smaller error \(\varepsilon ':=\varepsilon /4\). However, we avoided these details in our LRE analysis, which aims at estimating the optimistic resource requirements that are necessary (though not necessarily sufficient) to achieve the calculation accuracy \(\varepsilon =0.01\) for the whole algorithm.
Technically, the type is \(\texttt {Circ}(A \rightarrow \texttt {Circ}~B)\). But this is only an artifact of the mechanical encoding.
However, see our remarks in footnotes 6 and 11 in which we pointed out that \(O(\kappa /\varepsilon )\) may be a more appropriate estimate for the complexity of the QAE loops.
For a d-sparse A, simulating \(\exp (-iAt)\) with additive error \(\varepsilon \) using the HS techniques of [8] requires a runtime proportional to \(d^4t(t/\varepsilon )^{o(1)}\equiv {\widetilde{O}}\left( d^4t\right) \), see [3, 8]. It is the phase estimation (as part of “Solve_x”), the dominant source of error, that requires taking \(t_0=O(\kappa /\varepsilon )\) for the various times \(t=\tau t_0/T\) defining the HS control register in order to achieve a final error smaller than \(\varepsilon \), see [3].
A smaller problem size is obtained by reducing the spatial domain size of the electromagnetic scattering FEM simulation, via reductions in the parameters \(n_x\) and \(n_y\), which represent the number of FEM vertices in the x and y dimensions. The immediate consequence is a reduction of the common length of the quantum data registers \(R_2\) and \(R_3\), i.e., \(n_2=\lceil \log _2(2N)\rceil \), where \(N=n_x(n_y-1)+(n_x-1)n_y\). Such a register-length reduction is expected to affect the resource requirements for all oracles as well as all subroutines that involve the data registers \(R_2\) and \(R_3\). In fact, the input registers to all oracles are of length \(n_2\), and shortening them has the potential of reducing the oracle sizes. However, we recounted the oracles’ resources using Quipper, with \(n_2=6\) in place of \(n_2=30\), and found that the only difference involves the number of ancillas and measurements required. When checking the resource change of the entire QLSA circuit, we found a negligible difference. Indeed, changes in \(n_2\) have relatively little effect on the resources of the bare algorithm (excluding oracle costs), because the dominant contribution to the resources in the non-oracle part is given by the time-splitting factor imposed by Hamiltonian-evolution simulation, which does not directly depend on \(n_2\). Besides, since the total number of operations required for QLSA’s non-oracle part has a complexity that scales logarithmically in N, see Eqs. (1) and (3), the resources for \(n_2=6\) in place of \(n_2=30\) are expected to diminish by just a relatively small factor \(\sim 5\).
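For concreteness, the vertex-count formula and the resulting register lengths can be checked directly (a sketch; \(n_x=n_y=4\) is one hypothetical mesh choice that reproduces \(N=24\)):

```python
import math

def problem_size(nx, ny):
    """Number of FEM unknowns: N = nx*(ny-1) + (nx-1)*ny."""
    return nx * (ny - 1) + (nx - 1) * ny

def n2(N):
    """Common data-register length n2 = ceil(log2(2N))."""
    return math.ceil(math.log2(2 * N))

assert problem_size(4, 4) == 24       # hypothetical reduced mesh
assert n2(24) == 6                    # reduced register length
assert n2(332_020_680) == 30          # full problem size
# log-N ratio ~6, consistent with the quoted moderate factor ~5:
assert math.log(332_020_680) / math.log(24) < 7
```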
One can get a sense of the amount of parallelism in the overall circuit by comparing the total number of gates of an algorithm to its circuit depth. In our analysis, these two quantities differ only by a factor of \(\sim \)1.33 if oracles are included, and by a factor of \(\sim \)1.01 if oracles are excluded; hence most of the gates are applied sequentially.
As discussed previously, our circuit implementations of oracles are essentially the execution trace of a classical program implementing an algorithm. Because the algorithms we used are purely sequential, the corresponding quantum circuits are not easily parallelizable on a global scale. The only possible optimizations are purely local. We therefore conclude that our computed circuit-depth and T-depth values are overestimates by some unknown small factor with respect to optimal-depth values.
References
Nielsen, M.A., Chuang, I.L.: Quantum Computation and Quantum Information. Cambridge University Press, Cambridge (2000)
Jordan, S.: Quantum Algorithm Zoo (2013). URL: http://math.nist.gov/quantum/zoo/
Harrow, A.W., Hassidim, A., Lloyd, S.: Quantum algorithm for linear systems of equations. Phys. Rev. Lett. 103, 150502 (2009)
Ambainis, A.: Variable time amplitude amplification and a faster quantum algorithm for solving systems of linear equations. arXiv:1010.4458 (2010)
Clader, B.D., Jacobs, B.C., Sprouse, C.R.: Preconditioned quantum linear system algorithm. Phys. Rev. Lett. 110, 250504 (2013)
Luis, A., Peřina, J.: Optimum phaseshift estimation and the quantum description of the phase difference. Phys. Rev. A 54, 4564 (1996)
Cleve, R., Ekert, A., Macchiavello, C., Mosca, M.: Quantum Algorithms Revisited. arXiv:quant-ph/9708016 (1997)
Berry, D.W., Ahokas, G., Cleve, R., Sanders, B.C.: Efficient quantum algorithms for simulating sparse Hamiltonians. Commun. Math. Phys. 270(2), 359 (2007)
Wiebe, N., Braun, D., Lloyd, S.: Quantum Data Fitting. Phys. Rev. Lett. 109, 050505 (2012)
Berry, D.W.: High-order quantum algorithm for solving linear differential equations. J. Phys. A Math. Theor. 47, 105301 (2014)
Lloyd, S., Mohseni, M., Rebentrost, P.: Quantum algorithms for supervised and unsupervised machine learning. arXiv:1307.0411 (2013)
Barz, S., Kassal, I., Ringbauer, M., Lipp, Y., Dakic, B., Aspuru-Guzik, A., Walther, P.: Solving systems of linear equations on a quantum computer. Sci. Rep. 4, 6115 (2014). doi:10.1038/srep06115
Cai, X.D., Weedbrook, C., Su, Z.E., Chen, M.C., Gu, M., Zhu, M.J., Li, L., Liu, N.L., Lu, C.Y., Pan, J.W.: Experimental quantum computing to solve systems of linear equations. Phys. Rev. Lett. 110, 230501 (2013)
Green, A., Lumsdaine, P.L., Ross, N.J., Selinger, P., Valiron, B.: Quipper: a scalable quantum programming language. In: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI’13, pp. 333–342 (2013)
Green, A., Lumsdaine, P.L., Ross, N.J., Selinger, P., Valiron, B.: An introduction to quantum programming in Quipper. In: Proceedings of the 5th International Conference on Reversible Computation, Lecture Notes in Computer Science, vol. 7948, pp. 110–124 (2013)
Intelligence Advanced Research Projects Activity (IARPA). Quantum Computer Science (QCS) Program (2010). URL http://www.iarpa.gov/index.php/researchprograms/qcs
Intelligence Advanced Research Projects Activity (IARPA). Quantum Computer Science (QCS) Program Broad Agency Announcement (BAA) (April 2010). URL http://www.iarpa.gov/index.php/researchprograms/qcs/baa
The Quipper Language (2013). URL http://www.mathstat.dal.ca/~selinger/quipper/
The Quipper System (2013). URL http://www.mathstat.dal.ca/~selinger/quipper/doc/
Shewchuk, J.R.: An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. Technical Report CMU-CS-94-125, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA (1994)
Saad, Y.: Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, Philadelphia (2003)
Brassard, G., Hoyer, P., Mosca, M., Tapp, A.: Quantum amplitude amplification and estimation. In: Quantum Computation and Quantum Information, vol. 305 (AMS Contemporary Mathematics, 2002), pp. 53–74 (2002)
Jin, J.M.: The Finite Element Method in Electromagnetics. Wiley, New York (2002)
Chatterjee, A., Jin, J.M., Volakis, J.L.: Edge-based finite elements and vector ABCs applied to 3D scattering. IEEE Trans. Antennas Propagat. 41, 221 (1993)
Trotter, H.: On the product of semigroups of operators. In: Proceedings of the American Mathematical Society, vol. 10, pp. 545–551 (1959)
Suzuki, M.: Fractal decomposition of exponential operators with applications to many-body theories and Monte Carlo simulations. Phys. Lett. A 146, 319 (1990)
Brenner, S.C., Scott, L.R.: The Mathematical Theory of Finite Element Methods. Springer, New York (2008)
Bank, R.E., Scott, L.R.: On the conditioning of finite element equations with highly refined meshes. SIAM J. Numer. Anal. 26(6), 1383 (1989)
Layton, W.: Highaccuracy finiteelement methods for positive symmetric systems. Comput. Math. Appl. 12A(4/5), 565 (1986)
Childs, A.M., Cleve, R., Deotto, E., Farhi, E., Gutmann, S., Spielman, D.A.: Exponential algorithmic speedup by a quantum walk. In: Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing, STOC’03, pp. 59–68 (2003)
Ömer, B.: Quantum Programming in QCL. Master’s thesis, Institute of Information Systems, Technical University of Vienna (2000)
Claessen, K.: Embedded Languages for Describing and Verifying Hardware. Ph.D. thesis, Chalmers University of Technology and Göteborg University (2001)
Altenkirch, T., Green, A.S.: The quantum IO monad. In: Gay, S., Mackie, I. (eds.) Semantic Techniques in Quantum Computation, pp. 173–205. Cambridge University Press, Cambridge (2009)
Selinger, P., Valiron, B.: A lambda calculus for quantum computation with classical control. Math. Struct. Comput. Sci. 16(3), 527 (2006)
Selinger, P., Valiron, B.: Quantum lambda calculus. In: Gay, S., Mackie, I. (eds.) Semantic Techniques in Quantum Computation, pp. 135–172. Cambridge University Press, Cambridge (2009)
Landauer, R.: Irreversibility and heat generation in the computing process. IBM J. Res. Dev. 5, 183 (1961)
Draper, T.G., Kutin, S.A., Rains, E.M., Svore, K.M.: A logarithmic-depth quantum carry-lookahead adder. Quantum Inf. Comput. 6, 351 (2006)
Mermin, N.D.: Quantum Computer Science: An Introduction. Cambridge University Press, Cambridge (2007)
Fowler, A.G.: Towards Large-Scale Quantum Computation. Ph.D. thesis, arXiv:quant-ph/0506126 (2005)
Fowler, A.G.: Constructing arbitrary Steane code single logical qubit fault-tolerant gates. Quantum Inf. Comput. 11, 867 (2011)
Matsumoto, K., Amano, K.: Representation of Quantum Circuits with Clifford and \(\pi /8\) Gates. arXiv:0806.3834 (2008)
Giles, B., Selinger, P.: Remarks on Matsumoto and Amano’s normal form for singlequbit Clifford+\({T}\) operators. arXiv:1312.6584 (2013)
Berry, D.W., Cleve, R., Somma, R.D.: Exponential improvement in precision for Hamiltonian-evolution simulation. arXiv:1308.5424v3 (2013)
Berry, D.W., Childs, A.M., Cleve, R., Kothari, R., Somma, R.D.: Exponential improvement in precision for simulating sparse Hamiltonians. In: Proceedings of the 46th ACM Symposium on Theory of Computing (STOC 2014), pp. 283–292 (2014)
Berry, D.W., Childs, A.M., Kothari, R.: Hamiltonian simulation with nearly optimal dependence on all parameters. In: Proceedings of the 56th IEEE Symposium on Foundations of Computer Science (FOCS 2015), pp. 792–809 (2015)
Pham, T.T., Meter, R.V., Horsman, C.: Optimization of the Solovay–Kitaev algorithm. Phys. Rev. A 87, 052332 (2013)
Giles, B., Selinger, P.: Exact synthesis of multiqubit Clifford+\({T}\) circuits. Phys. Rev. A 87, 032332 (2013)
Bocharov, A., Gurevich, Y., Svore, K.M.: Efficient decomposition of single-qubit gates into \({V}\) basis circuits. Phys. Rev. A 88, 012313 (2013)
Selinger, P.: Optimal ancilla-free Clifford+\({T}\) approximation of \({Z}\)-rotations. arXiv:1403.2975 (2014)
Kliuchnikov, V., Maslov, D., Mosca, M.: Asymptotically optimal approximation of single qubit unitaries by Clifford and \({T}\) circuits using a constant number of ancillary qubits. Phys. Rev. Lett. 110, 190502 (2013)
Kliuchnikov, V., Maslov, D., Mosca, M.: Fast and efficient exact synthesis of single qubit unitaries generated by Clifford and \({T}\) gates. Quantum Inf. Comput. 13(7–8), 607 (2013)
Selinger, P.: Efficient Clifford+\({T}\) approximation of singlequbit operators. arXiv:1212.6253 (2012)
Bennett, C.H.: Logical reversibility of computation. IBM J. Res. Dev. 17(6), 525 (1973)
Acknowledgements
This work was accomplished as part of the PLATO Project: “Protocols, Languages and Tools for resource-efficient Quantum Computation,” which was conducted within the scope of the IARPA Quantum Computer Science (QCS) program and derived some of its goals from that source. PLATO was performed jointly by Applied Communication Sciences (ACS), Dalhousie University, the University of Pennsylvania, Louisiana State University, Southern Illinois University, and the University of Massachusetts at Boston. We would like to thank all PLATO team members for insightful discussions. Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center Contract Number D12PC00527. The US Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
Additional information
Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the US Government.
This article is part of topical collection on Quantum Computer Science.
Appendices
Appendix 1: Single-qubit unitaries in terms of pre-specified elementary gates
1.1 Implementation according to work by A. Fowler
To convert any single-qubit unitary to a circuit in terms of a pre-specified set of gates \(\{ X, Y, Z, H, S, T\}\), we could use the famous Solovay–Kitaev algorithm, see, e.g., [1] and references therein. However, this approach can result in unnecessarily long, global-phase-correct approximating sequences, since the trace norm used in the Solovay–Kitaev theorem does not ignore global phases. Some optimizations of the Solovay–Kitaev algorithm are possible, see, e.g., [46]. For the single-qubit rotation gates, we base our estimates on work by A. Fowler (see [39], p. 125 and [40]). This work constructs optimal fault-tolerant approximations of single-qubit phase rotation gates
Fowler shows that a phase rotation by an angle of \(\pi /128\) can be approximated by a sequence of fault-tolerant gates with a distance measure
by choosing \(U_{46}\) as follows:
This sequence contains 23 H gates, 23 \(T~(\pi /8)\) gates and 13 S or \(S^\dagger \) gates. In general, the approximating sequence is of the form \(G_iTG_jT\dots \), where \(G_i,G_j\in {\mathscr {G}}\), a precomputed set of gates, which together with the Identity gate I form a group under multiplication \(\{I,G_1,G_2,\ldots ,G_{23}\}\). Here, \(G_1=H, G_2=X, G_3=Z, G_4=S, G_5=S^\dagger , G_6=XH, G_7=ZH, G_8=SH, G_9=S^\dagger H, G_{10}=ZX, G_{11}=SX, G_{12}=S^\dagger X, G_{13}=HS, G_{14}=HS^\dagger , G_{15}=ZXH, G_{16}=SXH, G_{17}=S^\dagger XH, G_{18}=HSH, G_{19}=HS^\dagger H, G_{20}=HSX, G_{21}=HS^\dagger X, G_{22}=S^\dagger H S, G_{23}=S H S^\dagger \). To represent the complete set of approximating sequences, Fowler includes \(G_{24}=T\).
The sequence given in Eq. (24) contains 46 \(G_j\) gates. The number of T gates is 23, or half the length of the approximating sequence in terms of \(G_j\) gates. The number of H gates in this particular sequence is also 23, and the remaining 13 of the 59 elementary gates are S (or \(S^\dagger \)) gates.
Fowler also investigated the approximation of arbitrary singlequbit gates
by sequences of gates from the group \({\mathscr {G}}\). 1000 random matrices were chosen, with \(\alpha , \beta \) and \(\theta \) drawn uniformly from \([0,2\pi )\). Optimal approximations \(U_l\) were constructed for each random matrix, and a line was fitted to the average distance \(\text{ dist }(U,U_l)\) plotted as a function of l. Fowler obtained the following fit for the average number l of single-qubit fault-tolerant gates required to obtain a fault-tolerant approximation of an arbitrary single-qubit unitary to within a given distance:
In other words, to obtain a distance \(\delta \) on average, we need on average \(l=-\frac{\log _{10}(\delta /0.292)}{0.0511}\) gates. For \(\delta =7.5\times 10^{-4}\), we obtain \(l=50.69\). Compare this to the exact result \(l=46\) for \(R_{\pi /128}\). Also, we note that 46 \(G_j\) gates correspond to 59 elementary gates, of which 23 are T gates. For 51 \(G_j\) gates, we would get 26 T gates, 26 H gates and 14 S gates by extrapolation, for a total of 65 gates.
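Fowler's fit can be checked numerically. A minimal sketch (Python), using the fit constants 0.292 and 0.0511 quoted above:

```python
import math

def fowler_length(delta, a=0.292, b=0.0511):
    """Average number of G_j gates to reach distance delta, from Fowler's
    fit dist ~ a * 10**(-b*l), i.e. l = -log10(delta/a)/b."""
    return -math.log10(delta / a) / b

# delta = 7.5e-4 reproduces the value l = 50.69 quoted in the text
l = fowler_length(7.5e-4)
print(round(l, 2))  # 50.69
```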
1.2 Plato implementation of gate sequence approximations
We have implemented a combination of Fowler’s method and the more recent single-qubit “normal form” representation by Matsumoto and Amano [41, 42] in Haskell, to find approximating sequences. With this Haskell implementation, for example, we found an approximating sequence for \(R_{\pi /256}\) with distance \(\delta =3.6\times 10^{-4}\) and sequence length 74:
This sequence consists of 28 (37.8%) T gates, 29 (39.2%) H gates, and 17 (23%) S gates. Smaller rotations tend to need longer sequences to reach the distance threshold \(\delta \) and/or to improve on the identity as the best approximation. Because our search algorithm for finding approximating sequences, like Fowler’s method, has exponential running time, finding a specific sequence to approximate a specific arbitrary rotation is not always feasible. Recent progress on this topic, aiming at optimal-depth single-qubit rotation decompositions [47,48,49,50,51,52], highlights the importance of this problem for quantum computing.
For our QLSA LRE we have made the following simple (and rather pessimistic) assumption: namely, that any arbitrary single-qubit rotation gate (a large number of such gates, with various angles of rotation, occurs in the implementation of QLSA) can be approximated using approximately 100 fault-tolerant gates from the standard set \(\{ X, Y, Z, H, S, T\}\) while also achieving the desired level of algorithmic accuracy (\(\varepsilon =0.01\)). This assumption indeed turned out to be fairly conservative for all rotation gates we found specific sequences for. Following the stable relative fractions of approximately 40% T gates, 40% H gates, and 20% S gates in the approximating sequences found, we roughly assume that, on average, each arbitrary rotation consists of 40 T gates, 40 H gates and 20 S gates.
Taking an implementation accuracy \(\varepsilon =0.01\) for each single-qubit rotation gate is not sufficient to guarantee accuracy \(\varepsilon =0.01\) for the entire algorithm. To achieve the latter, we would typically require a much smaller target accuracy for the implementation of single-qubit rotation gates. If the entire algorithm consists of \(n_R\) single-qubit rotations, requiring a target accuracy \(\varepsilon '=\varepsilon /n_R\) for each rotation would be an obvious choice. This is a fairly conservative error bound, though, as it presumes that all rotations are performed in sequence, with errors in different rotations adding up and never canceling each other out, and it disregards any parallelism in their implementations. In practice, errors may cancel each other out during the mostly sequential implementation of the gates. The LRE analysis of the bare algorithm excluding oracle resources revealed roughly \(n_R\approx 10^{23}\) single-qubit rotations (with nontrivial angles of rotation), most of which have to be performed sequentially, as implied by the distinct lack of parallelism in the design of QLSA. According to Fowler’s analysis, the number of standard gates needed on average to implement (decompose) a single-qubit rotation with accuracy \(\varepsilon '=\varepsilon /n_R\) is approximately \(l=-\frac{\log _{10}\bigl ((\varepsilon /n_R)/0.292\bigr )}{0.0511}\), cf. Eq. (26). Inserting the values \(n_R\approx 10^{23}\) and \(\varepsilon =0.01\) yields \(l\approx 480\), which is less than a factor of 5 larger than what we assumed for our LRE analysis.
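The factor-of-5 conclusion can be reproduced with the same fit. A hedged sketch (Python), assuming \(n_R=10^{23}\) and the 100-gate-per-rotation baseline of our LRE:

```python
import math

def gates_per_rotation(eps_total, n_rotations, a=0.292, b=0.0511):
    """Fowler-fit gate count for per-rotation accuracy eps' = eps/n_R."""
    eps_rot = eps_total / n_rotations
    return -math.log10(eps_rot / a) / b

# eps = 0.01 spread over ~1e23 rotations gives l ~ 480 gates per rotation,
# i.e. less than 5x the 100-gate sequences assumed in the LRE.
l = gates_per_rotation(0.01, 1e23)
print(round(l))  # 479
```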
Hence, while our LRE results in Table 2 provide gate counts for what is necessary (not sufficient) to achieve an accuracy \(\varepsilon =0.01\) for the entire algorithm, the more conservative error bound \(\varepsilon '=\varepsilon /n_R\) for the target rotation accuracy (to guarantee the accuracy \(\varepsilon \) for the whole algorithm) would yield estimates for H, S, and T gates as well as T-depth that are larger only by a factor \(\sim \)5. The overall gate count and overall circuit depth would also be increased by a slightly smaller factor close to 5.
Appendix 2: Circuits and resource estimates of lower-level subroutines and multi-qubit gates employed by QLSA
Here we review some well-known circuit decompositions of various multi-qubit gates in terms of the standard set of elementary gates \(\{ X, Y, Z, H, S, T, \text{ CNOT } \}\), together with their associated resource counts, that have been used for our QLSA LRE analysis.
1.1 Controlled-Z gate
The controlled-Z gate can be decomposed into two H gates and one CNOT according to Fig. 29.
1.2 Controlled-H gate
The controlled-H gate can be implemented in terms of standard gates by using the circuit equality given in Fig. 30. The single-qubit rotations employed in this implementation can be further decomposed into sequences consisting only of T, S and H gates: \(R_z(\pi )=T^4=S^2=Z\), \(R_z(-\pi )=S^{\dagger 2}=Z\), \(R_y(\pi /4)=SHTSHXZS\) and \(R_y(-\pi /4)= S^\dagger ZXH S^\dagger T^\dagger H S^\dagger \).
1.3 W gate
The “W gate” is a two-qubit gate whose action, as well as its implementation in terms of standard gates, is illustrated in Fig. 31. As described above for the controlled-H gate, the single-qubit rotations \(R_z(\pi ), R_z(-\pi ), R_y(\pi /4)\) and \(R_y(-\pi /4)\) can be further decomposed in terms of sequences consisting only of T, S and H gates.
1.4 Controlled rotations
Controlled single-qubit rotations \(R_z(\theta )\) can be implemented in terms of CNOTs and unconditional single-qubit rotations according to the circuit equality provided in Fig. 32. In the case of controlled single-qubit rotations \(R_y(\theta )\), we can use the circuit identity shown in Fig. 33. A similar implementation can be derived for controlled single-qubit rotations \(R_x(\theta )\). Moreover, doubly controlled rotations can be implemented in terms of Toffolis, CNOTs, and unconditional single-qubit rotations according to the circuit equality given in Fig. 34.
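The standard identity of this kind (which we assume is the one shown in Fig. 32, not reproduced here) reads, left to right: \(R_z(\theta /2)\) on the target, CNOT, \(R_z(-\theta /2)\), CNOT. It can be verified numerically:

```python
import numpy as np

def Rz(theta):
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

I2 = np.eye(2)
# CNOT with control on qubit 0 (most significant), target on qubit 1
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])

theta = 0.731  # arbitrary test angle
# Circuit read left to right; later gates multiply on the left
U = CNOT @ np.kron(I2, Rz(-theta / 2)) @ CNOT @ np.kron(I2, Rz(theta / 2))

controlled_Rz = np.eye(4, dtype=complex)
controlled_Rz[2:, 2:] = Rz(theta)
assert np.allclose(U, controlled_Rz)
```

The identity works because conjugating \(R_z(-\theta /2)\) by X flips its sign, so the two half-rotations cancel when the control is 0 and add up to \(R_z(\theta )\) when it is 1.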
1.5 Toffoli gate
The Toffoli gate (essentially a CCNOT) can be implemented (cf., e.g., [1]) by a circuit using 6 CNOT gates, 1 S gate, 7 T (or \(T^\dagger \)) gates and 2 Hadamard gates, with circuit depth 12, see Fig. 35.
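As a sanity check of such a decomposition, the well-known Nielsen–Chuang circuit (6 CNOT, 7 T/\(T^\dagger \), 2 H; the depth-12 variant counted above additionally uses 1 S gate) can be verified against the Toffoli unitary:

```python
import numpy as np

I = np.eye(2)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
T = np.diag([1, np.exp(1j * np.pi / 4)])
Td = T.conj().T

def on(gate, q):
    """Single-qubit gate on qubit q of 3 (qubit 0 most significant)."""
    ops = [I, I, I]
    ops[q] = gate
    return np.kron(np.kron(ops[0], ops[1]), ops[2])

def cx(c, t):
    """CNOT with control c and target t on 3 qubits."""
    U = np.zeros((8, 8))
    for s in range(8):
        bits = [(s >> (2 - k)) & 1 for k in range(3)]
        if bits[c]:
            bits[t] ^= 1
        U[bits[0] * 4 + bits[1] * 2 + bits[2], s] = 1
    return U

# Nielsen–Chuang decomposition, gates applied left to right
seq = [on(H, 2), cx(1, 2), on(Td, 2), cx(0, 2), on(T, 2), cx(1, 2),
       on(Td, 2), cx(0, 2), on(T, 1), on(T, 2), cx(0, 1), on(H, 2),
       on(T, 0), on(Td, 1), cx(0, 1)]
U = np.eye(8)
for g in seq:
    U = g @ U

CCX = np.eye(8)
CCX[6:, 6:] = [[0, 1], [1, 0]]  # flip target iff both controls are 1
assert np.allclose(U, CCX)
```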
1.6 Multi-controlled NOT
A multifold CNOT that is controlled by \(n\ge 3\) qubits can be implemented by \(2(n-2)+1\;\) Toffoli gates, which must be performed sequentially, employing \(n-2\) additional ancilla qubits [38]. Using the resources needed for Toffoli gates, we can infer the resource count of any multi-controlled NOT employing an arbitrary number of control qubits and a single target qubit, see Table 3.
1.7 Quantum Fourier Transform (QFT)
Both the Quantum Fourier Transform (QFT) and its inverse \(\hbox {QFT}^{-1}\) are employed in the implementation of QLSA. The QFT and its representation in terms of a quantum circuit are discussed in most introductory textbooks on quantum computation, see, e.g., [1]. A circuit implementation of \(\hbox {QFT}^{-1}\) is shown in Fig. 36, where we use the definition \(R_k:=\bigl ({\begin{matrix} 1&{}0\\ 0&{}\exp (2\pi i/2^k) \end{matrix}} \bigr )\).
Here we expand on the elementary resource requirements of the QFT (and its inverse \(\hbox {QFT}^{-1}\)). Let \(b\ge 2\) be the number of qubits the QFT (or its inverse) acts on, as in Fig. 36. Using the circuit decomposition rule for controlled rotations discussed in Appendix “Controlled rotations,” we can derive the circuit identity shown in Fig. 37. Using this circuit identity, we can express the logical resource requirements in terms of standard gates and unconditional \(R_k\) gates, see Table 4. The latter can then be implemented in terms of approximating sequences consisting only of fault-tolerant gates from the set \(\{ X, Y, Z, H, S, T\}\), as discussed in “Appendix 1.”
1.8 Controlled phase: \(\text{ CPhase }({\mathbf {c}}; \phi _0,f)\)
The task of the controlled-phase \(\text{ CPhase }({\mathbf {c}}; \phi _0,f)\), which is a lower-level algorithmic building block used in the implementations of the higher-level subroutines “StatePrep_b,” “StatePrep_R” and “HamiltonianSimulation” (see Fig. 2), is to apply a phase shift to a signed n-qubit input register \({\mathbf {c}}\), whereby the applied phase is controlled by \({\mathbf {c}}\) itself:
Note that the first \({\mathbf {c}}\)-register qubit \({\mathbf {c}}[0]\) signifies the least significant bit, corresponding to the minimum phase shift \(\phi _0\), whereas the qubit \({\mathbf {c}}[n-2]\) determines the most significant bit. Moreover, the last \({\mathbf {c}}\)-register qubit \({\mathbf {c}}[n-1]\) controls the sign of the applied phase. To enable inverse operations, the sign is conditionally flipped by a classical integer flag \(f\in \{0,1 \}\); for \(f=1\) the phase is inverted. The quantum circuit is provided in Fig. 38.
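The register convention just described can be sketched classically. The function below models only the semantics (applied phase \(=\) signed integer value of \(\mathbf {c}\) times \(\phi _0\), with the sign flipped by f), not the circuit of Fig. 38; the exact scaling convention is our assumption:

```python
import math

def cphase_angle(c_bits, phi0, f):
    """Phase applied by CPhase(c; phi0, f) to basis state |c>:
    c_bits[0..n-2] encode an unsigned magnitude (c_bits[0] least
    significant), c_bits[n-1] is a sign bit, and the classical flag f
    inverts the sign to implement inverse operations."""
    magnitude = sum(b << i for i, b in enumerate(c_bits[:-1]))
    sign = -1 if c_bits[-1] ^ f else 1
    return sign * magnitude * phi0

phi0 = math.pi / 16
assert cphase_angle([1, 0, 1, 0], phi0, 0) == 5 * phi0   # |c| = 5, positive
assert cphase_angle([1, 0, 1, 1], phi0, 0) == -5 * phi0  # sign bit set
assert cphase_angle([1, 0, 1, 1], phi0, 1) == 5 * phi0   # flag f flips sign
```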
When employed as part of the subroutine \(M=\hbox {Hmag}({\mathbf {x}}, {\mathbf {y}}, \text{ m }, \phi _0)\), the controlled-phase \(\text{ CPhase }({\mathbf {c}}; \phi _0,f)\) is additionally controlled by a single qubit \({\mathbf {t}}[j]\) that is part of the \(n_1\)-qubit HS control register \({\mathbf {t}}\), see Figs. 19 and 39. For the LREs of \(\text{ CPhase }({\mathbf {c}}; \phi _0,f)\) and of \(\text{ CPhase }\) further controlled by a single qubit \({\mathbf {t}}[j]\), we utilized the circuit decomposition rules discussed in the previous appendix sections. In particular, we used the rough (and rather conservative) assumption that, on average, every (unconditional) single-qubit rotation gate can be approximated by a sequence of approximately 100 fault-tolerant gates, with each sequence roughly consisting of 40 T gates, 40 H gates and 20 S gates, see “Appendix 1.” The LREs of the unconditional and conditional \(\text{ CPhase }\) are summarized in Tables 5 and 6.
1.9 Controlled-RotY: \(\text{ CRotY }({\mathbf {c}}, {\mathbf {t}}; \phi _0, f)\)
The task of the subroutine \(\text{ CRotY }({\mathbf {c}}, {\mathbf {t}}; \phi _0,f)\), which is used in the implementation of the higher-level subroutines “StatePrep_b,” “StatePrep_R” and “Solve_x,” is to apply a single-qubit rotation \(R_y(\theta )\) to a single-qubit target register \({\mathbf {t}}\), where the angle of rotation \(\theta \) is controlled by a signed n-qubit input register \({\mathbf {c}}\):
The first \({\mathbf {c}}\)-register qubit \({\mathbf {c}}[0]\) signifies the least significant bit, corresponding to the minimum angle of rotation \(\phi _0\), whereas the qubit \({\mathbf {c}}[n-2]\) determines the most significant bit. The sign of the applied rotation is controlled by the last \({\mathbf {c}}\)-register qubit \({\mathbf {c}}[n-1]\). In addition, it is conditionally flipped by a classical integer flag \(f\in \{0,1 \}\) to enable straightforward inverse operations. The quantum circuit is provided in Fig. 40. For the LRE of the subroutine \(\text{ CRotY }({\mathbf {c}}, {\mathbf {t}}; \phi _0,f)\), we utilized the circuit decomposition rules discussed in the previous appendix sections; our estimates are summarized in Table 7.
Appendix 3: Resource estimates for the oracles
Below we report our LRE results for some representative oracle queries; all other oracle queries have similar resource counts. These results depend on several choices: the internal representation of real and integer numbers, the details of the linear-system problem definition, and the method for generating oracles. For the internal representation of numbers, since every single operation had to be built from scratch, we used a fixed-point representation. Compared to a floating-point representation, it is simpler and therefore generates smaller circuits. The details of the linear-system problem definition constitute the core data of this particular implementation of QLSA; they were provided in the GFI, and we made no effort to modify them. Finally, the oracles were generated with an automated tool that turns a classical description of an algorithm into a reversible quantum circuit. We made this choice because we felt that it was the most natural (and practical) solution for the particular kind of oracles we were dealing with: general functions over real and complex numbers.
Quipper automatically generates recursive decompositions of oracles down to the level of gates such as initialization, termination, etc., and controlled-NOTs (controlled by at most one or two wires, each on either true or false). The rules for decomposing these gates into the standard-basis gates H, S, T, and X, and for calculating circuit depths and T-depths, are supplied manually. Our rules for the depths are very conservative: we assume sequential execution unless we know better strategies. Indeed, optimal-depth decompositions are known only for fairly small gates, such as the Toffoli gate. Hence we expect overestimates for both circuit depths and T-depths. These recursive gate-decomposition rules are coded in the symbolic programming software Mathematica for computing the final estimates.
Oracle A returns either the magnitude (argflag \(=\) False) or the phase (argflag \(=\) True) of the coupling weight, together with the connected node index, at the chosen matrix-decomposition-band index (from 1 to \(N_b=9\)). As there are many combinations, we show a representative sample and draw conclusions from it. As is evident from Table 8, the estimates for the different bands in the argflag \(=\) False cases all agree to the sub-one-percent level, or to three significant figures, with the exception of the number of qubits, which agrees only to within about three percent, or to two significant figures. Therefore any one of them can be taken as a representative for all argflag \(=\) False Oracle A resource estimates, and a representative table is presented. A similar phenomenon holds for all the argflag \(=\) True cases, and only a representative table is presented for them. As the gate decompositions are carried down to the basis-gate level, the number of ancillas and measurements should agree in every case, for each individual band index and argflag. This is indeed true in all cases for which we have performed resource counting. The two representative tables for argflag \(=\) False and argflag \(=\) True are presented in Table 9. Finally, the resource counts for Oracle r and for Oracle b are obtained similarly: Quipper gives logical resource estimates, and the recursive gate-decomposition rules are coded in the symbolic computing software Mathematica for computing the final estimates presented in Table 10.
One may wonder why our oracle implementations require such a huge number of auxiliary qubits and measurements, namely up to \(\sim \)10\(^8\) ancilla qubits and measurements for a problem size \(N\approx 3\times 10^8\). This is indeed a feature of our low-level implementation of the irreversible-to-reversible transformation, which is similar to the way “logical reversibility of computation” was proposed by Bennett in [53]. In essence, to ensure that the run of the entire computation can be unwound, the result of each of its elementary subcomputations is stored in an auxiliary qubit. When the final result has been computed, it is copied into a fresh quantum register, and the entire computation is reversed, with every subcomputation undone along the way, and the initial values “0” of the intermediate auxiliary qubits restored and verified by a measurement. The number of auxiliary qubits required is therefore directly proportional to the number of elementary computational steps, and thus to the number of gates in the oracle. The number of measurements needed to ensure reversibility of computation equals the number of ancilla qubits. One might argue that such an implementation is unnecessarily verbose. While we agree that there may be more efficient implementations (e.g., by using known efficient adders when performing addition), our proposed implementation is arguably not so inefficient, in the sense that the size of the circuit (and therefore also the number of auxiliary qubits) is directly proportional (and not, say, exponential) to the length of the classical computation that would compute the data. In particular, the size of the circuit for the oracle computing an element of the matrix A is linear in the number of bits required to store the size of the matrix. Hence, the Bennett construction we follow for our oracle implementations has good “theoretic properties” in the worst case.
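Bennett's compute-copy-uncompute pattern described above can be illustrated with a purely classical sketch: a multi-bit AND computed by a Toffoli chain, one fresh ancilla per elementary step, the result copied out with a CNOT, and the chain reversed so that every ancilla returns to 0. The function below is illustrative only, not our Quipper implementation:

```python
def bennett_and(inputs):
    """Compute-copy-uncompute for a multi-bit AND: each intermediate
    result is written into a fresh |0> ancilla (one per elementary step),
    the final value is copied out, and the computation is run backwards
    so every ancilla is restored to 0 (verifiable by measurement)."""
    anc = []
    # Compute: Toffoli chain; ancilla i holds the AND of the first i+2 inputs
    acc = inputs[0]
    for x in inputs[1:]:
        anc.append(acc & x)      # Toffoli writes into a fresh ancilla
        acc = anc[-1]
    out = 0 ^ acc                # CNOT copies the result to a fresh register
    # Uncompute: reverse the Toffoli chain (each Toffoli is self-inverse)
    acc = inputs[0]
    for i, x in enumerate(inputs[1:]):
        prev = anc[i]
        anc[i] ^= acc & x
        acc = prev
    return out, anc

# Ancilla count equals the number of elementary steps, as described above
out, anc = bennett_and([1, 1, 1, 0, 1])
assert out == 0 and anc == [0, 0, 0, 0]
```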
However, the overhead of implementing an arbitrary oracle this way is still considerable. Even so, the result is very useful as a first baseline resource estimate. There is scope for improvement, both from software tools and from better algorithm design using reversible gates.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Scherer, A., Valiron, B., Mau, S.-C. et al. Concrete resource analysis of the quantum linear-system algorithm used to compute the electromagnetic scattering cross section of a 2D target. Quantum Inf. Process. 16, 60 (2017). https://doi.org/10.1007/s11128-016-1495-5