1 Introduction

The seminal paper [19] is widely known for essentially establishing the field of quantum computing. However, a lesser known offshoot is biomolecular computing, alternatively known as DNA computing. The founding notions of this computational paradigm were presented in [2], where it was proposed that regular DNA strands can represent combinatorial inputs instead of the functions of a living organism. Then standard lab operations can be applied to these strands in order to extract strands containing a solution. As a concrete application, the TSP was solved exactly through a brute force methodology in a medium sized graph \(G \,=\, \left( {V,E}\right) \) in \({\mathrm {O}}\left( {\left| {V}\right| + \left| {E}\right| }\right) \) elementary operations and \({\mathrm {O}}\left( {\log \left| {V}\right| }\right) \) DNA strands, indicating the potential of massive parallelism. This was repeated in [32] with a different algorithmic approach though in the sequence of lab operations, establishing the fact that novel and efficient algorithms are also necessary in this computational paradigm.

The primary objective of this survey is to concisely summarize the principles and notions of the paradigm of biomolecular computing with an emphasis on the potential for massively parallel computations. The latter may well serve in the dawn of the big data and 5 V era as an unconventional inspiration for the designers of parallel algorithms or distributed systems.

The remaining of this survery is structured as follows. Section 2 summarizes the principal concepts of biomolecular computing, the connections to known computational models, and describes computational applications not examined elsewhere in this survey. The elementary operations, advantages, and disadvantages of the biomolecular paradigm are explained in Sect. 3. The most important application, namely TSP, is described in Sect. 4. The parallelism potential is explored in Sect. 5, while factors working against the computation scale up are investigated in Sects. 6 and 7. The main points are summarized in Sect. 8 and certain conclusions are drawn. Finally, Table 1 summarizes the survey notation.

Table 1. Survey notation.

2 Previous Work

As stated earlier, the groundwork for the paradigm of biomolecular computing was laid in [2] followed by [32]. Soon, notions and applications emerged as noted in the early surveys [40, 45, 47]. The computational power of the paradigm is explored in [6, 9] and its limits in [18]. The enormous potential for parallelism is highlighted in [21,22,23,24, 43]. The notion of self assembly was applied to this paradigm in [1, 31, 48]. According to this principle, which is reminiscent of non-supervised learning, DNA tiles in a test tube given proper mobility conditions and time can attach themselves to tiles or strands containing fully or partial complementary sequences without human intervention in two [50, 51] or three dimensions [28]. Issues pertaining to complexity of the biomolecular paradigm are examined from various viewpoints in [4, 29, 46].

Another way to examine the potential and the complexity of biomolecular computation is through the simulation of the operation of sequential, bounded fan-in Boolean circuits with DNA strands as first shown in [3, 36] and described in detail in the follow up work [39]. Questions regarding the complexity of constructing and evaluating the output of such circuits are addressed in [34, 35, 37], whereas self assembling circuits are investigated in [38].

Among the algorithmic applications of biomolecular computation are the brute force parallel solution of k-SAT in [13] and in [5], dynamic programming on the Cell Matrix architecture [49], and splicing systems [11]. Shortest path algorithms implemented in biomolecular elementary operations are presented in [33] and in [42]. Length bounded computing with DNA strands and its connections to space complexity are explored in [20]. An evolutionary algorithm also expressed these operations is described in [12].

Finally, steps regarding the implementation of a DNA computer are given in [25, 26, 41, 44], although these proposals vary. DNA operations can be also simulated over Neo4j with properties in edges corresponding to physical or chemical DNA properties in an approach similar to the one presented in [30] for implementing persistent data structures. Concerning software, a fully functional graphical computing environment for biomolecular computation is described in [10]. It also includes DNA C, a C variant which is a combination of the C constructs pertaining to integers with an extension implementing the fundamental operations of biomolecular computing. The exclusion of floating point arithmetic should not come as a surprise, since biomolecular computation is discrete in nature and has been so far applied only in combinatorial problems. Nonetheless, should the need arise, floating point numbers can be approximated fairly well by rationals or by continued fractions. For instance, the golden ratio \(\varphi \) is represented by the infinite fractionFootnote 1

$$\begin{aligned} \varphi \,=\, 1 + \frac{1}{1 + \frac{1}{1 + \frac{1}{1 + \ldots }}} \end{aligned}$$
(1)

3 Paradigm Notions

3.1 Definition

Knowledge transfer and inspiration between computer science and biology has been fruitful. This relationship has already resulted in bioinformatics, connectomics, and computational biology. One of them, with direct reference to the everyday laboratory handling of DNA strands, is the paradigm of biomolecular computation which can be formally defined as

Definition 1

Biomolecular computing is the art and science of using DNA strands as computational medium to appropriately encode candidate solutions to (possibly intractable for a Turing machine) combinatorial problems and using standard biological laboratory techniques in order to select an exact solution.

From Definition 1 follows that any DNA encoding corresponds strictly to candidate solutions of a combinatorial problem and not to the design and functions of a living being. Also, the difference from the fields of bioinformatics and computational biology should be clear. Although the inspiration, terminology, and implementation are biological, the paradigm is definitiely computational as the objective is to codify and efficiently solve instances of intractable, at least under the conventional Turing machine model, combinatorial problems.

3.2 Abstract DNA Operations

Under the biomolecular computation paradigm the elementary operations applied to the DNA strands in the test tube are the following.

  • Initialize (\(T_0\)): Crate a test tube \(T_0\) containing each admissible candidate solution for a combinatorial problem according to a probability distribution, usually the uniform one.

  • Copy(\(T_0\), \(T_1\)): Copy the contents of \(T_0\) to the tube of \(T_1\).

  • Merge(\(T_0\), \(T_1\)): Mix the contents of \(T_0\) and \(T_1\) to \(T_0\).

  • Detect(\(T_0\)): Examine whether \(T_0\) is empty.

  • Select(\(T_0\), \(s_0\)): Extract \(s_0\), if present, from \(T_0\).

  • Extract(\(\ell \)): Extract strands of length \(\ell \).

  • Cleavage(\(s_0\), \(\sigma _0\)): Slice \(s_0\) according to the shorter template strand \(\sigma _0\).

  • Anneal(\(s_0\)): Create double DNRA strands from a single one.

  • Denature(\(s_0\)): Create single DNA strands from a double one.

  • Ligate(\(T_0\)): Create bonds between double strands inside tube \(T_0\).

The basic storage unit in this paradigm is the test tube which may contain a fixed volume of DNA strands. The latter are not necessarily of the same type. In fact, the opposite is quite common as the tube contains the results of a sequence of lab operations applied to a set of candidate solutions. It is only after the end of these operations where the solution, if any, to the instance at hand is extracted and cultivated that a test tube may contain only copies of a single strand.

4 TSP

Perhaps the most well known biomolecular algorithm is the one shown in algorithm 1 for solving TSP for a graph \(G = \left( {V,E}\right) \). Notice this is a brute force method which can be used among others to discover community structure in graphs as in [15, 27]. However, this approach could not be a basis for heuristic techniques such as those in [16, 17]. The primary characteristic of algorithm 1 is that all paths of length j are created in j steps. Assume that each vertex is encoded with \(b_v\) and each edge with \(b_e\) bases. Each vertex \(v_k\) has a unique coding \(s_k\) in DNA bases, while the edge \(\left( {v_i, v_j}\right) \) is encoded as the concatenation of \(\bar{s_i}\) and \(\bar{s_j}\).

figure a

5 Parallelism

Theoretically, the biomolecular computation paradigm can be reduced to simulating a CREW PRAM, a version of RAM with \(P > 1\) processors and M memory locations where multiple processors can read the same memory address but only one can write any given time at a given memory address. The CREW PRAM operation set LOAD, READ, WRITE as well as the memory configuration of this machine can be simulated by a sequence of biomolecular operations as shown in [46] through a sequence of successive configurations of length \({\mathrm {O}}\left( {\log \left( {PM}\right) }\right) \) each. Starting from a random configuration, mixing a sequence of admissible configurations, at the expense of a multiplicatively growing volume, finally yields one superconfiguration of concatenated configurations representing a desired sequence of computation, assuming that one can be found. By repeatedly applying cleavage operations, the individual configurations are extracted.

Another way to quantify the actual parallelism offered by the biomolecular paradigm is to use Amdahl’s law as a benchmark

(2)

where \(\varGamma \) is the actual speedup, \(\gamma _0\) is the parallelizable part of the task, and s is the speedup of that part. Thus, the size of the parallelizable part plays a crucial role and essentially defines how much can be actually sped up as for infinite s

$$\begin{aligned} \varGamma \rightarrow \frac{\displaystyle 1}{\displaystyle 1-\gamma _0} \end{aligned}$$
(3)

6 Error Free Computations

6.1 Overview

Errors in the biomolecular computation can be classified into two broad categories. The first source of errors lies in the encoding of candidate solutions as there can be partial DNA matches which may eventually create false final solutions, in other words creating false positives. Moreover, the elementary operations described above may not be executed with absolute sucess due to a number of factors. This might create depending on the nature of the operations involved either false positives or false negatives.

6.2 Encoding

Concerning the solution encoding, an obvious approach might be to choose an encoding with controllable redundancy \(\beta _0\). Namely, each bit of information is expressed with \(\alpha _0 = 1+\beta _0\) bits in total and on average. Thus, the ratio \(\rho _0\) of original to redundant information is

(4)

For instance, in a graph problem with \(\left| {V}\right| \) if each vertex normally requires \(\left\lceil \log \left| {V}\right| \right\rceil \) bits to encode, then it would require \(\alpha _0\) times as much.

Regarding the number of false negatives or false positives, it can be argued that it is proportional to the partial matches of DNA strings. While there is a case where an intended match of two DNA strands \(s_1\) and \(s_2\) fails, strand mismatches denoted by \({s_1}\not \parallel {s_2}\) can be considered accurate for the most part. On the contrary, partial matches either from left or from right are with high probability indicators of failed operations, especially if the two strands overlap in only a few bases. For a single biomolecular step in the same test tube \(T_0\), let

$$\begin{aligned} u&\,=\, \left| {{s_1}\parallel {s_2}}\right| \\u_r\left( {k}\right)&\,=\, \left| {{s_1}\sqsupset {s_2}, {s_1}\cap {s_2} \,=\, k}\right| \\u_l\left( {k}\right)&\,=\, \left| {{s_1}\sqsubset {s_2}, {s_1}\cap {s_2} \,=\, k}\right| \\u_m&\,=\, \left| {{s_1}\sqcup {s_2}}\right| \\u_o&\,=\, \left| {{s_1}\not \parallel {s_2}}\right| \end{aligned}$$
(5)

denote respectively the number of DNA perfect matches, right and left matches in k bases, middle matches and no matches. Thus, in a single biomolecular step the ratios of true matches \(Q^+\) and true mismatches \(Q^-\) are

(6)

Another way to assess the reliability of a sequence of \(n_0\) biomolecular operations is the following. Let \(\pi _k\) be the success probability of each operation. The success probability \(\pi ^+\) by the product rule, assuming operation independence, is

(7)

Thus, if \(\mu _1 \,\le \, \pi _k \,\le \, \mu _2\), then the following bounds can be derived

$$\begin{aligned} e^{-n_0 \mu _2} \,\le \, \pi ^+ \,\le \, e^{-n_0 \mu _1} \end{aligned}$$
(8)

Perhaps a more accurate way to assess the average value of \(\pi ^+\) is to consider the geometric mean of the sequence

(9)

6.3 Trials

Due to the nature of the actual biological operations, they are not always executed with abolute correctness. This can be attributed to a number of reasons including the age, technology, and condition of lab equipment, the chemistry and quantity of DNA strands themselves, the nature of bonds between strands, and lab conditions such as radiation and electomagnetic pulses of various frequences. As a result, a strand encoding an invalid solution can emerge from the test tubes or a strand containing the solution can be missed in them [7, 8].

The geometric distribution models sequences of test outcomes where the probability of success is \(p_0\). Its probability mass function is defined as

(10)

The interpretation of X is that it models the number of failed attempts of an experiment before the single successful outcome of that experiment. The mean value of this distribution is readily calculated in equation (11).

(11)

The \({{\mathrm{logit}}}\left( {\cdot }\right) \) function is the inverse of the sigmoid function extensively used to train deep recurrent and tensor stack networks [14] expresses the logarithm of the odds of a single Bernoulli trial. Also is a special case of a link function to the generalized linear model and leads to the logistic regression, commonly found in deep learning applications.

The variance of X is its mean value scaled by the success probabilty \(p_0\). This is coherent with intuition, since the larger \(p_0\), the fewer attempts are required to achieve success and the number of failed attempts will be close to zero.

(12)

7 Volume Considerations

The major practical limitation of biomolecular computation is the volume necessary to represent each candidate solution. This is limited not only by the volume of the test tube, but also by the encoding which, in turn, implies a redundancy rate to ensure higher operation success probabilites. If \(V_0\) is the hard limit for the test tube volume, \(\alpha _0\) the encoding redundancy factor, n is the number of candidate solutions, and \(\ell _0\) the solution length, then

$$\begin{aligned} V_0 \,\ge \, \left( {a_0 n \ell _0}\right) ^{1 + \epsilon } \,\Leftrightarrow \, n \,\le \, \frac{1}{a_0 \ell _0} V_0^{\frac{1}{1+\epsilon _0}} \end{aligned}$$
(13)

where \(\epsilon _0\) is a constant which depends on a number of diverse factors such as lab temperature and tube technology. If a biomolecular computation requires \(J_0\) steps to complete, then the required volume grows exponentially. Then, the above limit should be modified as

$$\begin{aligned} n \,\le \, \frac{1}{a_0 \ell _0} V_0^{\frac{1}{J_0\left( {1+\epsilon _0}\right) }} \end{aligned}$$
(14)

8 Conclusions

This survey explores the foundations, applications, and limits of biomolecular computation which represents an alternative computational paradigm. The latter is based on the handling of potentially very long strands of ordinary DNA using standard biology lab operations such as annealing, generating a complementary strand, selecting a strand, or concatenating two strands. These strands do not codify the inner workings of any living organism. Instead, they contain a suitably selected representation of a computational problem. By selecting an appropriate representation of an input instance, a plethora of various output instances are created by a series of biological operations. Although a fraction of the abovementioned output instances may contain incomplete operations and must be removed by cleansing operations, their majority will contain with high probability a solution.