
BMC Bioinformatics, 14:S3

On the inversion-indel distance

  • Eyla Willing
  • Simone Zaccaria
  • Marília DV Braga
  • Jens Stoye
Open Access
Proceedings

Abstract

Background

The inversion distance, that is the distance between two unichromosomal genomes with the same content allowing only inversions of DNA segments, can be computed thanks to a pioneering approach of Hannenhalli and Pevzner in 1995. In 2000, El-Mabrouk extended the inversion model to allow the comparison of unichromosomal genomes with unequal contents, thus insertions and deletions of DNA segments besides inversions. However, an exact algorithm was presented only for the case in which we have insertions alone and no deletion (or vice versa), while a heuristic was provided for the symmetric case, that allows both insertions and deletions and is called the inversion-indel distance. In 2005, Yancopoulos, Attie and Friedberg started a new branch of research by introducing the generic double cut and join (DCJ) operation, that can represent several genome rearrangements (including inversions). Among others, the DCJ model gave rise to two important results. First, it has been shown that the inversion distance can be computed in a simpler way with the help of the DCJ operation. Second, the DCJ operation originated the DCJ-indel distance, that allows the comparison of genomes with unequal contents, considering DCJ, insertions and deletions, and can be computed in linear time.

Results

In the present work we put these two results together to solve an open problem, showing that, when the graph that represents the relation between the two compared genomes has no bad components, the inversion-indel distance is equal to the DCJ-indel distance. We also give a lower and an upper bound for the inversion-indel distance in the presence of bad components.

Background

The inversion distance problem in genome comparison searches for the minimum number of signed inversions (reversals) to transform one unichromosomal genome, represented as a signed permutation, into another one with the same gene content and without duplications. The inversion sorting problem requests a sequence of inversions that achieve this minimum number. Hannenhalli and Pevzner (1995) gave the first algorithm for calculating the inversion distance and solving the inversion sorting problem in polynomial time for two linear genomes [1]. Soon after (1997), it was shown that a similar result holds for circular genomes [2]. El-Mabrouk (2000) proposed an extension to include insertions and deletions (indels) to the model [3]. The author introduced an exact algorithm for computing the minimum number of inversion and indel events for the asymmetric case where additional genes are present in only one genome. The symmetric case was treated only heuristically, though.

The double cut and join (DCJ) is an abstract rearrangement operation, introduced by Yancopoulos et al. [4] in 2005, which can represent most large-scale mutation events that occur in genomes, such as inversions, translocations, fusions and fissions. If no restriction is imposed on the genome structure with respect to linear and/or circular chromosomes, the use of a simple graph data structure, the adjacency graph [5], leads to considerable algorithmic simplifications. For example, the inversion distance problem can be tackled via the DCJ model in linear time [6].

Yancopoulos and Friedberg [7] introduced insertions and deletions (indels) into the DCJ model but left open the design of an algorithm. This is non-trivial if an indel of consecutive DNA fragments is treated as a single event. In [8] the DCJ distance with indels was considered again, and a linear time algorithm has been proposed. In that paper, the cost of an indel is the same as that of an inversion, but generalizations are possible [9].

In this paper, we combine techniques from [6] and [8] in order to revisit the problem of computing the inversion distance with indels for unichromosomal circular genomes having unequal contents but without duplications. The paper is organized as follows. In the remainder of this section we give definitions and previous results used in this work. We will then use the relational diagram introduced in [10] and prove that, when the graph that represents the relation between the two compared genomes has no bad components, the inversion distance with indels equals the DCJ distance with indels, that can be computed in linear time. We then extend the definition of the component tree from [6] in order to give a lower and an upper bound for the inversion distance with indels in the presence of bad components.

Basic definitions

Each marker in a genome is an oriented DNA fragment. The representation of a marker g in a genome A can be the symbol $g$, if it is read in direct orientation in A, or the symbol $\bar{g}$, if it is read in reverse orientation. Let A be a unichromosomal circular genome, that is, a genome composed of a single circular chromosome. We represent A by a string s, obtained by the concatenation of all symbols in the chromosome of A, read in either of the two directions (we can build s starting at any marker). An example is given in Figure 1.
Figure 1

Graphic representation of the unichromosomal circular genomes A and B. Each arrow represents a marker and its orientation. The genome A, for example, could be represented by $(a\,w\,\bar{d}\,\bar{c}\,y\,b\,\bar{z}\,\bar{e}\,f\,x\,i\,j\,h\,g)$, or by $(c\,d\,\bar{w}\,\bar{a}\,\bar{g}\,\bar{h}\,\bar{j}\,\bar{i}\,\bar{x}\,\bar{f}\,e\,z\,\bar{b}\,\bar{y})$, or by any circular rotation of these strings.

Common and unique markers

In this work, duplicated markers are not allowed. Given two unichromosomal circular genomes A and B, possibly with unequal contents, let $\mathcal{G}$, $\mathcal{A}$ and $\mathcal{B}$ be three disjoint sets, such that $\mathcal{G}$ is the set of common markers which occur once in A and once in B, $\mathcal{A}$ is the set of markers which occur only in A, and $\mathcal{B}$ is the set of markers which occur only in B. The markers in sets $\mathcal{A}$ and $\mathcal{B}$ are also called unique markers. For $A = (a\,w\,\bar{d}\,\bar{c}\,y\,b\,\bar{z}\,\bar{e}\,f\,x\,i\,j\,h\,g)$ and $B = (a\,s\,b\,c\,d\,u\,v\,e\,f\,g\,h\,i\,t\,j\,r)$, we have $\mathcal{G} = \{a, b, c, d, e, f, g, h, i, j\}$, $\mathcal{A} = \{w, x, y, z\}$ and $\mathcal{B} = \{r, s, t, u, v\}$.
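The partition into common and unique markers can be sketched in a few lines of Python. This is only an illustration: the textual encoding (a leading `-` for a marker read in reverse orientation) and the function name `marker_sets` are assumptions made here, not notation from the paper.

```python
def marker_sets(genome_a, genome_b):
    """Return (common, only_in_a, only_in_b) for two genomes without duplicates.

    Each genome is a list of signed markers; '-x' encodes the reversed marker x
    (an encoding chosen for this sketch).
    """
    names_a = {m.lstrip("-") for m in genome_a}
    names_b = {m.lstrip("-") for m in genome_b}
    return names_a & names_b, names_a - names_b, names_b - names_a


# The two genomes of Figure 1.
A = "a w -d -c y b -z -e f x i j h g".split()
B = "a s b c d u v e f g h i t j r".split()

common, only_a, only_b = marker_sets(A, B)
print(sorted(common))
print(sorted(only_a))
print(sorted(only_b))
```

Orientation plays no role in this step, which is why the sign is stripped before the sets are compared.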

Indels

In order to sort genomes with unequal contents, we need to consider insertions and deletions of blocks of contiguous markers [3, 8]. We refer to insertions and deletions collectively as indels. Indels have two restrictions: (i) markers of $\mathcal{G}$ cannot be deleted; and (ii) an insertion cannot produce duplicated markers [8]. We illustrate an indel with the following example: the deletion of markers uv from genome B = (asbcduvefghitjr) results in B' = (asbcdefghitjr).

Observe that, if $|\mathcal{G}| \le 1$, the problem of sorting A into B becomes trivial: we simply delete at once the unique content of the chromosome of A and insert at once, in the proper orientation, the unique content of the chromosome of B. Due to this fact, we assume in this work that $|\mathcal{G}| \ge 2$.

Rearrangements modeled by DCJ

A double cut and join (DCJ) [4] is the operation that cuts a genome at two different positions, creating four open ends, and joins these open ends in a different way. Consider, for example, a DCJ applied to genome $A = (a\,w\,\bar{d}\,\bar{c}\,y\,b\,\bar{z}\,\bar{e}\,f\,x\,i\,j\,h\,g)$, that cuts before and after yb, creating the segments $\bullet\,\bar{z}\,\bar{e}\,f\,x\,i\,j\,h\,g\,a\,w\,\bar{d}\,\bar{c}\,\bullet$ and $\bullet\,y\,b\,\bullet$, where the symbol • represents the open ends. If we then join the first with the third and the second with the fourth open end, we obtain $A' = (a\,w\,\bar{d}\,\bar{c}\,\bar{b}\,\bar{y}\,\bar{z}\,\bar{e}\,f\,x\,i\,j\,h\,g)$. This DCJ corresponds to the inversion of contiguous markers yb. The alternative would be to join the first with the second and the third with the fourth open end, giving two circular chromosomes, representing an excision. Its inverse is called an integration, completing the set of DCJ operations for circular genomes [5].
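The inversion in this example can be reproduced with a small Python sketch. The list encoding with a leading `-` for reversed markers and the helper names `flip` and `invert` are assumptions of this illustration; the sketch models only the inversion outcome of a DCJ, not excisions or integrations.

```python
def flip(marker):
    """Reverse the orientation of one signed marker ('x' <-> '-x')."""
    return marker[1:] if marker.startswith("-") else "-" + marker


def invert(genome, start, length):
    """Invert `length` consecutive markers of a circular genome, from index `start`.

    The segment is reversed in order and every marker in it changes orientation.
    Indices wrap around because the chromosome is circular.
    """
    n = len(genome)
    positions = [(start + k) % n for k in range(length)]
    flipped = [flip(genome[p]) for p in reversed(positions)]
    result = list(genome)
    for p, marker in zip(positions, flipped):
        result[p] = marker
    return result


A = "a w -d -c y b -z -e f x i j h g".split()
A_prime = invert(A, A.index("y"), 2)   # invert the contiguous markers y b
print(" ".join(A_prime))               # a w -d -c -b -y -z -e f x i j h g
```

Applying the same inversion to the result restores the original genome, which reflects that an inversion is its own inverse.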

Methods

In order to find a parsimonious sequence of rearrangements (and indels) sorting one unichromosomal circular genome into the other, it is convenient to find some data structure to represent the relation between the organization of two genomes. This task can be accomplished with the help of the relational diagram, proposed in [10]. (Similarly to [11], we adopt here the term diagram, as not only the abstract graph structure, but also the linear representation of its nodes along the chromosome is used, as we will describe.) This diagram is a specific view of the master graph [12] and unifies in a single structure the breakpoint diagram, proposed in [13] to analyze the inversion distance [1] and also used for the inversion-indel distance [3], and the adjacency graph, proposed in [5] to analyze the DCJ distance, and then used for the DCJ-indel distance [8].

The relational diagram

Given two unichromosomal circular genomes A and B, their relational diagram, denoted by R(A, B), shows the elements of genome A in an upper horizontal line and the elements of genome B in a lower horizontal line. We denote the two extremities of each marker $g \in \mathcal{G}$ by $g^t$ (tail) and $g^h$ (head). For each extremity of g the diagram R(A, B) has an orange vertex in the upper line and a blue vertex in the lower line. Clearly, each line (that corresponds to the chromosome of one of the two genomes) has $2|\mathcal{G}|$ vertices, and its vertices are distributed following the same order of the corresponding chromosome. Since the chromosomes are circular, we have to choose one marker $a \in \mathcal{G}$ from which we start to read the chromosomes in both genomes, such that in both lines the leftmost vertex is $a^h$ and the rightmost is $a^t$. Then, for each marker $g \in \mathcal{G}$, we connect the orange and the blue vertices that represent $g^t$ by a dotted edge. Similarly, we connect the orange and the blue vertices that represent $g^h$ by a dotted edge.

Moreover, for each integer i from 1 to $|\mathcal{G}|$, let $\gamma_1$ and $\gamma_2$ be the orange vertices (analogously blue vertices) at positions 2i - 1 and 2i of the corresponding line of the diagram. We connect the orange vertices (analogously blue vertices) $\gamma_1$ and $\gamma_2$ by an orange edge (analogously blue edge) labeled by $\ell$, which is the substring composed of the markers of genome A (analogously genome B) that are between the extremities represented by $\gamma_1$ and $\gamma_2$. Observe that $\gamma_1$ and $\gamma_2$ are $\mathcal{G}$-adjacent, that is, they represent extremities of occurrences of markers from $\mathcal{G}$ in genome A (analogously B), so that in-between only markers from $\mathcal{A}$ (analogously $\mathcal{B}$) can appear. In other words, the label $\ell$ contains no marker of $\mathcal{G}$. When the label of an orange (or blue) edge is empty, the edge is said to be clean, otherwise it is said to be labeled. A similar notion was introduced in [3] as direct, resp. indirect edge.

Each vertex is now connected to one dotted edge and either to one orange or to one blue edge, thus the degree of all the vertices is two and the diagram is a simple collection of cycles. Each cycle alternates a pair of orange-dotted with a pair of blue-dotted edges, consequently the length of each cycle is a multiple of 4. By walking through each of these cycles, arbitrarily in one of the two possible directions, we assign an orientation to each colored edge (see Figure 2). The relative orientations of the colored edges within one cycle are useful for classifying different types of inversions, as we will see later.
Figure 2

Example of a relational diagram. For genomes $A = (a\,w\,\bar{d}\,\bar{c}\,y\,b\,\bar{z}\,\bar{e}\,f\,x\,i\,j\,h\,g)$ and B = (asbcduvefghitjr) the relational diagram contains five cycles. Only cycle $C_2$ is clean, while cycles $C_1$, $C_3$, $C_4$ and $C_5$ are labeled.

We represent the labels according to the assigned direction instead of taking a simple left-to-right orientation for each edge, in order to avoid any ambiguity. In other words, the orientations of the edges determine the orientations in which the labels are read. Note, however, that an edge $\gamma_1\,\ell\,\gamma_2$ with label $\ell$ could be equivalently represented as $\gamma_2\,\bar{\ell}\,\gamma_1$. A cycle that contains at least one labeled edge is said to be labeled, otherwise the cycle is said to be clean.

DCJ sorting and DCJ distance

The cycles of R(A, B) containing only two dotted edges (and one orange and one blue edge) are called 2-cycles and are said to be DCJ-sorted. Longer cycles are DCJ-unsorted and have to be reduced, by applying DCJ operations, to 2-cycles. This procedure is called DCJ-sorting of A into B. A DCJ can be of three types [8]: split DCJ when it increases the number of cycles by one; neutral DCJ when it does not affect the number of cycles; and joint DCJ when it decreases the number of cycles in R(A, B) by one. It has been shown that, given any pair of orange edges (or any pair of blue edges) belonging to the same cycle, a split DCJ can be applied to these edges [14]. (However, depending on the relative orientations of the edges, the number of chromosomes may stay the same, when the DCJ corresponds to an inversion, or increase, when the DCJ corresponds to the excision of a circular chromosome.) Due to this fact, the DCJ distance of A and B, denoted by dDCJ(A, B) and defined as the minimum number of steps required to do a DCJ-sorting of A into B, is given by the following theorem.

Theorem 1 (from [4]). Given two unichromosomal circular genomes A and B over the same set of markers $\mathcal{G}$, we have $d_{DCJ}(A, B) = |\mathcal{G}| - c$, where c is the number of cycles in R(A, B).
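As a rough illustration of how the cycles of R(A, B) and the value in Theorem 1 can be obtained, the following Python sketch pairs up $\mathcal{G}$-adjacent extremities in each genome and then walks alternately along orange and blue edges. It is a minimal sketch under assumptions made here: markers are encoded with a leading `-` when reversed, unique markers are simply skipped so that both genomes are compared over $\mathcal{G}$, and all function names are hypothetical.

```python
def extremities(marker):
    """Left and right extremity of a signed marker ('g' direct, '-g' reversed)."""
    name = marker.lstrip("-")
    tail, head = (name, "t"), (name, "h")
    return (head, tail) if marker.startswith("-") else (tail, head)


def adjacencies(genome, common):
    """Pair up G-adjacent extremities of a circular genome, skipping unique markers."""
    kept = [m for m in genome if m.lstrip("-") in common]
    adj = {}
    for cur, nxt in zip(kept, kept[1:] + kept[:1]):
        right, left = extremities(cur)[1], extremities(nxt)[0]
        adj[right], adj[left] = left, right
    return adj


def relational_cycles(genome_a, genome_b):
    """Return |G| and the cycles of R(A, B), each as the list of its orange edges."""
    common = {m.lstrip("-") for m in genome_a} & {m.lstrip("-") for m in genome_b}
    adj_a, adj_b = adjacencies(genome_a, common), adjacencies(genome_b, common)
    seen, cycles = set(), []
    for start in adj_a:
        if start in seen:
            continue
        cycle, v = [], start
        while v not in seen:
            w = adj_a[v]            # follow the orange edge v-w
            seen.update((v, w))
            cycle.append((v, w))
            v = adj_b[w]            # follow the blue edge; dotted edges join equal extremities
        cycles.append(cycle)
    return len(common), cycles


def dcj_distance(genome_a, genome_b):
    """|G| - c, with c the number of cycles (Theorem 1)."""
    n_common, cycles = relational_cycles(genome_a, genome_b)
    return n_common - len(cycles)


A = "a w -d -c y b -z -e f x i j h g".split()
B = "a s b c d u v e f g h i t j r".split()
n, cycles = relational_cycles(A, B)
print(n, len(cycles), dcj_distance(A, B))   # 10 5 5
```

For the genomes of Figure 2 the sketch finds the five cycles mentioned in the figure caption, so the DCJ distance over the ten common markers is 5.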

Inversion model

In the inversion model, circular excisions and reintegrations are not allowed, and a DCJ can only represent an inversion. In the following, without loss of generality, we will refer to operations applied to orange edges of R(A, B), but a symmetric analysis could be done using blue edges. Differently from a general DCJ operation, an inversion only increases the number of cycles in R(A, B) when it is applied to two orange edges that belong to the same cycle C and have opposite orientations according to the arbitrary direction assigned to C (see Figure 3) [1].
Figure 3

Effects of an inversion in the diagram (from [10]). Observe that the inverted segment is inside the horizontal square bracket, that shows γ2, γ3, ..., γ4, γ5 at the left side and γ5, γ4, ..., γ3, γ2 at the right side of both pictures. (i) If the edges are in the same cycle and with opposite orientations, the inversion splits the cycle. Inversely, if the edges are in different cycles, the inversion joins them (independently of the orientations of the original edges, that are omitted). (ii) If the edges are in the same cycle with the same orientation, the inversion is neutral and the number of cycles remains unchanged.

Two distinct cycles C and C' are said to be interleaving when in the relational diagram there is at least one orange edge of C between two orange edges of C' and at least one orange edge of C' between two orange edges of C. An interleaving path connecting two distinct cycles C and C' is defined as the smallest set of cycles $C_1, C_2, \ldots, C_k$ such that $C_1 = C$, $C_k = C'$ and $C_i$ and $C_{i+1}$ are interleaving for all i, $1 \le i < k$. An interleaving component or simply component is then a maximal set of cycles $\mathcal{C}$ where each $C \in \mathcal{C}$ is connected by an interleaving path to any other $C' \in \mathcal{C}$.

Components can be of three types. The first type is a 2-cycle, that can never interleave with any other cycle and is then called a trivial component. The other two types are components of DCJ-unsorted cycles. Let C be a DCJ-unsorted cycle in R(A, B). If C does not have a pair of orange edges with opposite orientations, C is called a bad cycle. Otherwise the cycle C is said to be good. A bad cycle C cannot be split by any inversion applied to its orange edges. However, if C is part of a component $\mathcal{C}$ that contains at least one good cycle, it is always possible to apply one or more inversions that split good cycles of $\mathcal{C}$, so that C becomes good and can then be also sorted with split inversions [1]. Therefore, if a non-trivial component contains at least one good cycle, it is called a good component, otherwise it is called a bad component.

The relational graph represented in Figure 2 has four components: one good (the cycle C1), two trivial (the cycles C2 and C4) and one bad (composed of the two interleaving bad cycles C3 and C5).
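The classification above can be sketched in Python once each cycle is described by its orange edges. The input data here are an assumption of this illustration: they were derived by reading genome A of Figure 2 from $a^h$ to $a^t$, numbering the ten orange edges from left to right, and recording whether a walk along the cycle traverses each edge left to right (+1) or right to left (-1). Which of the two interleaving bad cycles is named $C_3$ and which $C_5$ is also assumed, and all function names are hypothetical.

```python
from itertools import combinations

# Orange edges per cycle: (position in the upper line, orientation along the cycle walk).
CYCLES = {
    "C1": [(1, +1), (5, +1), (4, -1), (3, +1)],
    "C2": [(2, +1)],
    "C3": [(6, +1), (9, +1)],
    "C4": [(7, +1)],
    "C5": [(8, +1), (10, +1)],
}


def is_good(cycle):
    """A cycle is good if two of its orange edges have opposite orientations."""
    return len({orientation for _, orientation in cycle}) == 2


def interleave(c1, c2):
    """True if each cycle has an orange edge strictly between two edges of the other."""
    p1, p2 = [p for p, _ in c1], [p for p, _ in c2]

    def between(inner, outer):
        return any(min(outer) < p < max(outer) for p in inner)

    return between(p1, p2) and between(p2, p1)


def components(cycles):
    """Group cycles that are connected through interleaving pairs."""
    comps = [{name} for name in cycles]
    for x, y in combinations(cycles, 2):
        if interleave(cycles[x], cycles[y]):
            cx = next(c for c in comps if x in c)
            cy = next(c for c in comps if y in c)
            if cx is not cy:
                comps.remove(cy)
                cx |= cy
    return comps


def classify(component, cycles):
    """Return 'trivial', 'good' or 'bad' for one component."""
    if len(component) == 1 and len(cycles[next(iter(component))]) == 1:
        return "trivial"            # a 2-cycle has a single orange edge
    return "good" if any(is_good(cycles[name]) for name in component) else "bad"


comps = components(CYCLES)
for comp in comps:
    print(sorted(comp), classify(comp, CYCLES))
```

Under these assumptions the sketch reports the same four components as stated in the text: one good, two trivial and one bad.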

When R(A, B) has no bad components, it has been long known that the inversion distance is equal to the DCJ distance:

Lemma 1 (adapted from [2, 15]). For two unichromosomal circular genomes A and B, such that R(A, B) has no bad component, $d_{INV}(A, B) = d_{DCJ}(A, B) = |\mathcal{G}| - c$.

Cutting and merging bad components

While the DCJ distance is achieved with split inversions only, bad components require neutral and/or joint inversions to be sorted. Given an inversion ρ, we define the DCJ-cost of ρ, denoted by $\|\rho\|$, to be respectively 1 or 2 depending on whether ρ is a neutral or a joint inversion.

A neutral inversion, applied to any two orange edges of the same bad cycle C, turns it into a good cycle [1]. Consequently, if C is part of a bad component $\mathcal{C}$, then $\mathcal{C}$ also becomes a good component. This type of inversion is said to be a cut of a bad component. It decreases the number of bad components by one and, since it is a neutral inversion, its DCJ-cost is one.

A joint inversion, applied to two orange edges of two distinct cycles $C_1$ and $C_2$, turns them into a single good cycle C. If $C_1$ and $C_2$ belong to two distinct components $\mathcal{C}_1$ and $\mathcal{C}_2$, they are merged into a single good component $\mathcal{C}$ that contains the good cycle C [1]. This type of inversion is said to be a merging of bad components. It can decrease the number of bad components by at least two, and, since it is a joint inversion, its DCJ-cost is two.

The inversion distance between two unichromosomal genomes A and B with equal content, denoted by $d_{INV}(A, B)$, can then be represented by the following equation:
$$d_{INV}(A, B) = d_{DCJ}(A, B) + \tau_{INV}(A, B).$$

The value τINV(A, B) corresponds to the extra cost for cutting and merging bad components. It can be efficiently computed based on the direct analysis of R(A, B) [1]. In the last section of this paper we will recall an alternative approach [6, 16], based on a tree structure that represents the components of R(A, B).

Runs, indel-potential and the DCJ-indel distance

Now we go back to the general DCJ distance, in which we do not need to take care of bad components. We introduce some definitions and concepts that will help us to integrate indels into the general DCJ model. These concepts are useful to show how to use DCJ operations to minimize the number of indels to be performed. First observe that a set of labels of one genome can be accumulated with DCJs. For example, take the orange edges $c^t\,y\,b^t$ and $e^h\,\bar{z}\,b^h$ from genome A in Figure 2. A DCJ applied to these two edges could result in the new edges $c^t\,b^h$ and $e^h\,\bar{z}\,\bar{y}\,b^t$, in which the label $\bar{z}\,\bar{y}$ results from the accumulation of the labels of the two original edges.

With this notion we can then recall the concept of run, introduced in [8]. Given two genomes A and B and a cycle C of R(A, B), a run is a maximal subpath of C, in which the first and the last edges are labeled and all labeled edges have the same color (belong to the same genome). A run in genome A is also called an $\mathcal{A}$-run, and a run in genome B is called a $\mathcal{B}$-run. We denote by Λ(C) the number of runs in cycle C. A cycle has either 0, or 1, or an even number of runs. As an example, note that the cycle $C_1$ represented in Figure 2 has 4 runs ($\{a^h\,w\,d^h\}$ and $\{e^h\,\bar{z}\,b^h,\ b^h\,c^t,\ c^t\,y\,b^t\}$ are $\mathcal{A}$-runs, while $\{b^t\,\bar{s}\,a^h\}$ and $\{d^h\,uv\,e^t\}$ are $\mathcal{B}$-runs). When we apply split DCJs internal to a single cycle of the relational diagram, we can accumulate an entire run into a single edge [8].
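Counting runs only requires the colors of the labeled edges in the order in which they are met along the cycle. The Python sketch below assumes a simple encoding chosen for this illustration: `"A"` for a labeled orange edge, `"B"` for a labeled blue edge and `None` for a clean edge. The edge sequence used for $C_1$ is a reading of the runs listed above together with the genomes of Figure 2, so it should be taken as an assumption.

```python
def count_runs(edge_colors):
    """Number of runs of a cycle, given its edges in cyclic order.

    edge_colors holds 'A' (labeled orange edge), 'B' (labeled blue edge)
    or None (clean edge of either color).
    """
    labeled = [color for color in edge_colors if color is not None]
    if not labeled:
        return 0
    # Each change of color between cyclically consecutive labeled edges starts a new run.
    changes = sum(1 for i in range(len(labeled)) if labeled[i] != labeled[i - 1])
    return max(changes, 1)


# Cycle C1 of Figure 2, walked from a^h:
# a^h w d^h, d^h uv e^t, e^t f^t, f^t e^h, e^h z b^h, b^h c^t, c^t y b^t, b^t s a^h
c1 = ["A", "B", None, None, "A", None, "A", "B"]
print(count_runs(c1))   # 4
```

Because the edge list is cyclic, the number of color changes is always even, which matches the observation that a cycle has 0, 1 or an even number of runs.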

In addition to being accumulated, runs can also be merged by DCJ operations. Consequently, during the optimal DCJ-sorting of a cycle C, we can reduce its number of runs. The indel-potential of C, denoted by λ(C), is defined in [8] as the minimum number of runs that we can obtain by DCJ-sorting C with split DCJ operations. The indel-potential of a cycle depends only on its initial number of runs:

Proposition 1 (from [8]). Given two genomes A and B, the indel-potential of a cycle C of R(A, B) is given by $\lambda(C) = \left\lceil \frac{\Lambda(C) + 1}{2} \right\rceil$, if $\Lambda(C) \ge 1$. Otherwise, if Λ(C) = 0, then λ(C) = 0.
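Proposition 1 translates directly into a small function. The sketch below assumes the rounded-up form $\lceil (\Lambda(C)+1)/2 \rceil$, which agrees with the value λ(C) = 3 for a cycle with four runs used later in the text; the function name and the input check are choices made for this illustration.

```python
import math


def indel_potential(num_runs):
    """Indel-potential of a cycle with `num_runs` runs (Proposition 1)."""
    if num_runs < 0 or (num_runs > 1 and num_runs % 2 == 1):
        raise ValueError("a cycle has 0, 1 or an even number of runs")
    if num_runs == 0:
        return 0
    return math.ceil((num_runs + 1) / 2)


for runs in (0, 1, 2, 4, 6):
    print(runs, indel_potential(runs))
```

For an even number of runs the value is Λ(C)/2 + 1, so every additional pair of runs adds one indel to the potential.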

Given two unichromosomal circular genomes A and B, the DCJ distance of A and B and the indel-potential of the cycles in R(A, B) allow us to easily compute the DCJ-indel distance, that is, the minimum number of DCJ and indel operations required to sort A into B, denoted by $d_{DCJ}^{id}(A, B)$.

Theorem 2 (from [8]). Given two unichromosomal circular genomes A and B, we have
$$d_{DCJ}^{id}(A, B) = d_{DCJ}(A, B) + \sum_{C \in R(A, B)} \lambda(C).$$
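As a small worked instance of Theorem 2, consider the genomes $A = (a\,x\,\bar{c}\,y\,b\,\bar{z}\,\bar{d})$ and $B = (a\,u\,b\,c\,v\,d)$ that are used as an example in the Results. They share four markers and their relational diagram has a single cycle C with four runs, so the computation, combining Theorem 1, Proposition 1 and Theorem 2, would read:

```latex
\begin{align*}
d_{DCJ}(A, B) &= |\mathcal{G}| - c = 4 - 1 = 3, \\
\lambda(C) &= \left\lceil \frac{\Lambda(C) + 1}{2} \right\rceil
            = \left\lceil \frac{4 + 1}{2} \right\rceil = 3, \\
d_{DCJ}^{id}(A, B) &= d_{DCJ}(A, B) + \sum_{C \in R(A, B)} \lambda(C) = 3 + 3 = 6.
\end{align*}
```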

Results

The inversion-indel distance between two unichromosomal genomes A and B, denoted by $d_{INV}^{id}(A, B)$, is the minimum number of steps (inversions and indels) required to sort A into B. It is lower bounded by the DCJ-indel distance and can be represented by the equation
$$d_{INV}^{id}(A, B) = d_{DCJ}^{id}(A, B) + \tau_{INV}^{id}(A, B),$$

in which the value $\tau_{INV}^{id}(A, B)$ gives the extra cost to handle bad components of the relational graph.

In this section we present our results, assuming that in R(A, B) the label of each orange edge is composed of at most one marker from $\mathcal{A}$ and the label of each blue edge is composed of at most one marker from $\mathcal{B}$. We first show how to optimally perform indels directly on the original genomes. Then we prove that $\tau_{INV}^{id}(A, B) = 0$ when R(A, B) has no bad component, and finally we give a lower and an upper bound for $\tau_{INV}^{id}(A, B)$ when R(A, B) has bad components.

Finding optimal integrations

In a DCJ-indel sorting scenario there are DCJ operations, insertions of unique markers of $\mathcal{B}$ into A and deletions of unique markers of $\mathcal{A}$ from A. Although in an arbitrary scenario the order of these operations may vary, from [17] we know that insertions can always be moved ahead of the DCJ operations, so that they occur in the first steps, and analogously the deletions can be moved back to occur after the DCJ operations in the last steps. This separation of insertions, DCJs and deletions within the sorting scenario also appears in [18], where an alternative approach was presented to compute the DCJ-indel distance, based on the concept of optimal completion. In this approach, each indel is modeled as a circular chromosome, called circular singleton, composed only of the markers that are inserted or deleted by this indel. A completion of genomes A and B adds i new circular singletons to A and k new circular singletons to B, yielding two multichromosomal circular genomes that have the same content $\mathcal{G} \cup \mathcal{A} \cup \mathcal{B}$. A completion is optimal when $i + k = \sum_{C \in R(A, B)} \lambda(C)$.

Here we show how to build an optimal completion using the relational diagram and the concepts of run and indel-potential. Let r be a $\mathcal{B}$-run of a cycle C in R(A, B), composed of m labels (each label is composed of a single marker, as stated earlier). Then let s be the circular singleton obtained from R(A, B) by walking through the path that corresponds to r and concatenating its m labels. We close the circular chromosome concatenating also the last to the first label. Such a singleton s is called r-singleton. The addition of the r-singleton s to genome A, yielding genome A', produces m - 1 new clean cycles in the diagram, that is, the number of cycles in R(A', B) is c' = c + m - 1, where c is the number of cycles in R(A, B). Since the number of common markers between A' and B is $|\mathcal{G}'| = |\mathcal{G}| + m$, we have $d_{DCJ}(A', B) = d_{DCJ}(A, B) + 1$. Furthermore, the cycle C in R(A, B) is transformed into a cycle C' in R(A', B), containing the same labels of C except for the m labels of the run r.
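The increase of the DCJ distance by exactly one follows from applying Theorem 1 to A' and B with the two counts given above. Spelled out, with $\mathcal{G}'$ denoting the set of markers common to A' and B:

```latex
\begin{align*}
d_{DCJ}(A', B) &= |\mathcal{G}'| - c' \\
               &= (|\mathcal{G}| + m) - (c + m - 1) \\
               &= (|\mathcal{G}| - c) + 1 \\
               &= d_{DCJ}(A, B) + 1.
\end{align*}
```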

Proposition 2. If we add the r-singleton of a $\mathcal{B}$-run r to genome A yielding genome A', the overall indel-potential is achieved, that is, $\sum_{C \in R(A', B)} \lambda(C) = \sum_{C \in R(A, B)} \lambda(C) - 1$. (Analogous for the addition of the r'-singleton of an $\mathcal{A}$-run r' to genome B.)

Proof. Let C be the cycle that contains the $\mathcal{B}$-run r in R(A, B). We then add the r-singleton to genome A yielding genome A'. If C originally had only one or two runs, then it is clear that the sum of the indel-potentials in R(A', B) decreases by one with respect to R(A, B). If C originally had four or more runs, two $\mathcal{A}$-runs of C are merged into a single run in R(A', B), and this also guarantees that the sum of the indel-potentials decreases by one.    □

For describing the indels in our inversion-indel model, we still need to integrate the singletons so that we obtain a unichromosomal genome. Again, let r be a $\mathcal{B}$-run and let A' be the genome composed of A and the r-singleton. We know that $d_{DCJ}(A', B) = d_{DCJ}(A, B) + 1$ and, to integrate the singleton, we need to apply exactly one DCJ to two orange (or two blue) edges of a cycle of R(A', B), such that one is part of the chromosome of A and the other is part of the r-singleton [4, 19]. An optimal integration is then an integration that preserves the runs of the diagram.

Proposition 3. Any integration of the r-singleton of a $\mathcal{B}$-run r into the chromosome of A that creates a new clean cycle in the relational diagram is optimal. (Analogous for the integration of an $\mathcal{A}$-run into the chromosome of B.)

Proof. The integration only affects one cycle C of the diagram, by splitting it into two cycles. If one of these two cycles is clean, then we know that all runs of C remain together in the other cycle, that is, the runs of the diagram are preserved.    □

With the previous results we have a straightforward recipe for the construction of an optimal integrated completion of genomes A and B. At each step we can decide arbitrarily whether we optimally integrate the r-singleton of a $\mathcal{B}$-run to A, or the r'-singleton of an $\mathcal{A}$-run to B, until no more runs exist in the relational diagram. In the end we have two unichromosomal circular genomes A* and B* with the same content.

As an example, let us build one optimal integrated completion for genomes $A = (a\,x\,\bar{c}\,y\,b\,\bar{z}\,\bar{d})$ and B = (aubcvd), whose relational diagram has one cycle C with four runs, see Figure 4 (i). We have λ(C) = 3, thus we need to perform three optimal integrations. We first do an integration of the singleton (zy), composed of the labels of an $\mathcal{A}$-run, into the chromosome of genome B, creating B' = (aubcvdzy). After this step, R(A, B') has three cycles, one with two runs. In the second step, we do an integration of the singleton $(\bar{v}\,u)$, composed of the labels of the last $\mathcal{B}$-run, into the chromosome of genome A, creating $A^* = (a\,x\,\bar{c}\,y\,b\,\bar{z}\,\bar{d}\,\bar{v}\,u)$. Now R(A*, B') has five cycles, one with an $\mathcal{A}$-run. We finally do an integration of the singleton (x), composed of the labels of the last $\mathcal{A}$-run, into the chromosome of genome B', creating B* = (axubcvdzy), yielding R(A*, B*) composed of six clean cycles, see Figure 4 (ii). Indeed, $d_{DCJ}(A, B) = d_{DCJ}(A^*, B^*)$.
Figure 4

Optimal integrated completion of two genomes. (i) For genomes $A = (a\,x\,\bar{c}\,y\,b\,\bar{z}\,\bar{d})$ and B = (aubcvd) we show positions for optimally integrating the singletons in R(A, B). (ii) In the resulting genomes $A^* = (a\,x\,\bar{c}\,y\,b\,\bar{z}\,\bar{d}\,\bar{v}\,u)$ and B* = (axubcvdzy), there are five more common markers between A* and B*, but also five more cycles in R(A*, B*).
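The claim that the DCJ distance is preserved along this example can be checked step by step with Theorem 1. In the table sketched here, the cycle counts are the ones stated in the example, while the numbers of common markers are obtained by counting the markers shared by each pair of genomes:

```latex
\[
\begin{array}{lccc}
\text{genomes} & |\mathcal{G}| & c & d_{DCJ} = |\mathcal{G}| - c \\ \hline
(A, B)     & 4 & 1 & 3 \\
(A, B')    & 6 & 3 & 3 \\
(A^*, B')  & 8 & 5 & 3 \\
(A^*, B^*) & 9 & 6 & 3
\end{array}
\]
```

Each optimal integration adds as many new cycles as new common markers, so the difference between the two counts stays at 3 throughout.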

Finding safe integrations - the inversion-indel distance in the absence of bad components

Let A and B be two unichromosomal circular genomes with unequal contents such that R(A, B) has no bad component. A safe integration is an optimal integration in A yielding A' (respectively in B yielding B'), such that also R(A', B) (respectively R(A, B')) has no bad component.

In Figure 5 we perform an optimal but not safe integration, producing a bad component in the relational diagram. Even several bad components can be created by an optimal integration, but, fortunately, it is always possible to perform a safe integration, as shown in the following.
Figure 5

Optimal but not safe integration. For genomes $A = (a\,\bar{c}\,b\,e\,d)$ and B = (abxcydze), an optimal but not safe integration of the singleton (xyz) produces A'. In R(A', B) we have two clean 2-cycles ($C_3$ and $C_4$), one good component $\mathcal{C}_1 = \{C_1\}$ and one bad component $\mathcal{C}_2 = \{C_2\}$. The marker y is a link of $\mathcal{C}_1$ and $\mathcal{C}_2$ and is adjacent to d in genome B. This information is used to find an alternative optimal integration for the singleton (xyz), as we will show in Figure 6.

Let the size of a component $\mathcal{C}$ in R(A, B) be the total number of orange (or blue) edges in the cycles of $\mathcal{C}$. Furthermore, let $\mathcal{C}_1$ and $\mathcal{C}_2$ be two components in R(A, B). If each orange edge of $\mathcal{C}_1$ is between two orange edges of $\mathcal{C}_2$, the component $\mathcal{C}_1$ is said to be nested within $\mathcal{C}_2$. Otherwise, if $\mathcal{C}_1$ is not nested within $\mathcal{C}_2$ and $\mathcal{C}_2$ is not nested within $\mathcal{C}_1$, the components $\mathcal{C}_1$ and $\mathcal{C}_2$ are said to be independent. Two independent components $\mathcal{C}_1$ and $\mathcal{C}_2$ are said to be linked if the leftmost orange edge of $\mathcal{C}_2$ appears immediately after the rightmost orange edge of $\mathcal{C}_1$ in R(A, B). In this case the rightmost orange vertex of $\mathcal{C}_1$ and the leftmost orange vertex of $\mathcal{C}_2$ represent extremities of the same marker $g \in \mathcal{G}$. The marker g is said to be a link of $\mathcal{C}_1$ and $\mathcal{C}_2$. A sequence of k linked components is called a chain of size k.

Without loss of generality, let all markers in B have the same orientation and let R(A, B) have only one component $\mathcal{C}$, that is good. Assume that an optimal integration of a singleton s in A yielding A' creates, besides one or two trivial components, exactly one good component $\mathcal{C}_1$ and one bad component $\mathcal{C}_2$ in R(A', B). If necessary, we can flip genome A' so that the markers within $\mathcal{C}_2$ in A' have the same orientation as the markers in B. Furthermore, due to the circularity of the genomes, we can rotate the diagram so that R(A', B) is a chain of exactly two linked components $\mathcal{C}_1$ and $\mathcal{C}_2$. A link of $\mathcal{C}_1$ and $\mathcal{C}_2$ is within the optimal integration. If we then do an alternative optimal integration of s in the middle of the bad component $\mathcal{C}_2$ (see Figure 6), we obtain A". In R(A", B) we have either a single bad component smaller than $\mathcal{C}_2$, or no bad component.
Figure 6

Our approach to find an alternative to an optimal integration that creates a bad component. Observe that, from R(A', B) to R(A", B), only the orange edges marked with the symbol ≀ were transformed into the orange edges marked with the symbol $\backslash\backslash$. All the other edges of the diagram were preserved. While the distinct cycles C3 and C4 of R(A', B) are merged into a single cycle in R(A", B), the cycle C2 of R(A', B) is split into two cycles in R(A", B). The hat on markers b and x indicates that we make no assumptions about the orientation of these markers (but we know they have the same orientation in A' and A"). (i) After the first integration we have a good component $\mathcal{C}_1$ at the left side, and a bad component $\mathcal{C}_2$ at the right side (at the interval yz...wc...ed...a of A'). The marker y is a link of $\mathcal{C}_1$ and $\mathcal{C}_2$ and is adjacent to d in genome B. (ii) If we do the optimal integration inside $\mathcal{C}_2$, so that y is adjacent to d in genome A", we create the clean 2-cycle $C_2$. There can be a bad component in R(A", B) (at the interval c...ez...w of A"), but it is strictly smaller than C2.

(In general, there can be other components in R(A', B) nested within $\mathcal{C}_1$ and $\mathcal{C}_2$, but each one of these is either trivial or has at least one edge within and at least one edge outside the integrated cluster. In any case, since the component in R(A, B) was good, at least one component in R(A', B) has to be good. By extending the approach illustrated in Figure 6 we can show that all components but $\mathcal{C}_2$ are merged into a single good component and only one bad component, strictly smaller than $\mathcal{C}_2$, can exist in R(A", B).)

Proposition 4. Let r be a $\mathcal{B}$-run in R(A, B). At least one optimal integration of the r-singleton into the chromosome of A is safe. (The analogous statement holds for the integration of an $\mathcal{A}$-run in B.)

Proof. Assume that each optimal integration of the r-singleton in A, yielding A', creates at least one bad component in R(A', B). Then, among all possible optimal integrations of r, assume that we take one that produces a bad component $\mathcal{C}$ of the smallest size. It is always possible to perform another optimal integration of r, as described in Figure 6, in the middle of the bad component $\mathcal{C}$, transforming A' into A", so that we create a clean 2-cycle in R(A", B). Either R(A", B) does not have any bad component (then we have a contradiction to the assumption that all optimal integrations create bad components), or it has a bad component $\mathcal{C}'$ (then $\mathcal{C}'$ must be strictly smaller than $\mathcal{C}$, and we have a contradiction to the assumption that $\mathcal{C}$ was a bad component with the smallest size).    □

The results presented above give rise to the following theorem:

Theorem 3. For two unichromosomal circular genomes A and B, such that R(A, B) has no bad component, we have $d_{INV}^{id}(A, B) = d_{DCJ}^{id}(A, B)$.

Proof. We know that there is at least one safe integration for each run and that by integrating one run per step we perform exactly $\sum_{C \in R(A,B)} \lambda(C)$ integrations, yielding genomes A* and B* with the same content, such that R(A*, B*) has no bad component. Then we have dDCJ(A, B) = dDCJ(A*, B*) = dINV(A*, B*).    □
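The counting argument behind this proof can be spelled out as a chain of inequalities. The sketch below relies on two facts that are not derived in this section and should be read as assumptions: the formula $d_{DCJ}^{id}(A,B) = d_{DCJ}(A,B) + \sum_{C} \lambda(C)$ recalled from the DCJ-indel model [8], and the observation that every inversion is a DCJ operation, so the DCJ-indel distance is a lower bound for the inversion-indel distance.

```latex
\begin{align*}
d_{DCJ}^{id}(A,B)
  &\le d_{INV}^{id}(A,B)
  && \text{(every inversion is a DCJ operation)} \\
  &\le \sum_{C \in R(A,B)} \lambda(C) + d_{INV}(A^*,B^*)
  && \text{(one indel per safe integration, then inversions)} \\
  &= \sum_{C \in R(A,B)} \lambda(C) + d_{DCJ}(A^*,B^*)
  && \text{($R(A^*,B^*)$ has no bad component)} \\
  &= \sum_{C \in R(A,B)} \lambda(C) + d_{DCJ}(A,B)
  && \text{(optimal integrations preserve $d_{DCJ}$)} \\
  &= d_{DCJ}^{id}(A,B)
  && \text{(assumed formula from [8])}
\end{align*}
```

Since the first and the last term coincide, all inequalities are equalities, which gives the statement of Theorem 3.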

Since the DCJ-indel distance can be computed in linear time, the same is true for the inversion-indel distance in the absence of bad components.

Bounds for the inversion-indel distance in the presence of bad components

Now we will give bounds to the extra cost for handling bad components in R(A, B). Without loss of generality, let us assume that, if R(A, B) has at least two components, the first and the last orange edges of R(A, B) belong to two distinct components. Recall that R(A, B) represents the relation between two circular chromosomes, thus its first orange edge comes right after its last orange edge.

Let $\mathcal{C}_1$, $\mathcal{C}_2$ and $\mathcal{C}_3$ be three distinct components in R(A, B) such that if we take the rightmost orange edge of $\mathcal{C}_1$ and look at the following orange edges one by one, we always find an edge of $\mathcal{C}_3$ before finding an edge of $\mathcal{C}_2$. In the same way, if we take the rightmost orange edge of $\mathcal{C}_2$ and look at the following orange edges one by one, we always find an edge of $\mathcal{C}_3$ before finding an edge of $\mathcal{C}_1$. The component $\mathcal{C}_3$ is then said to separate $\mathcal{C}_1$ and $\mathcal{C}_2$. (In Figure 2 the good component {C1} separates the trivial component {C2} from both the trivial component {C4} and the bad component {C3, C5}. Similarly, {C3, C5} separates {C4} from both {C2} and {C1}.) By joining two cycles C1 and C2, that belong to two distinct components $\mathcal{C}_1$ and $\mathcal{C}_2$, we merge not only the components $\mathcal{C}_1$ and $\mathcal{C}_2$, but also all components that separate $\mathcal{C}_1$ and $\mathcal{C}_2$, into a single component $\mathcal{C}$. Even when all merged components are bad, the new component $\mathcal{C}$ is always good [1].
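The separation test described above can be sketched as a circular scan over the orange edges. The representation is an assumption made for illustration only: `labels[i]` names the component that owns the i-th orange edge of the diagram, read from left to right, and the scan wraps around because the chromosomes are circular. The function names and the toy labels are hypothetical and do not reproduce Figure 2.

```python
from typing import Optional, Sequence, Set


def first_hit(labels: Sequence[str], start: int, targets: Set[str]) -> Optional[str]:
    """Scan the edges after position `start` (circularly) and return the
    first label that belongs to `targets`, or None if there is none."""
    n = len(labels)
    for step in range(1, n):
        label = labels[(start + step) % n]
        if label in targets:
            return label
    return None


def rightmost(labels: Sequence[str], comp: str) -> int:
    """Position of the rightmost orange edge of component `comp`."""
    return max(i for i, label in enumerate(labels) if label == comp)


def separates(labels: Sequence[str], c1: str, c2: str, c3: str) -> bool:
    """True if c3 separates c1 and c2: scanning from the rightmost edge of
    c1 we meet c3 before c2, and scanning from the rightmost edge of c2 we
    meet c3 before c1."""
    after_c1 = first_hit(labels, rightmost(labels, c1), {c2, c3})
    after_c2 = first_hit(labels, rightmost(labels, c2), {c1, c3})
    return after_c1 == c3 and after_c2 == c3


# Hypothetical diagram: component "s" owns edges 1 and 3, "b" lies inside it.
labels = ["a", "s", "b", "s"]
print(separates(labels, "a", "b", "s"))  # s lies between a and b
print(separates(labels, "s", "b", "a"))  # a does not lie between s and b
```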

The extra cost for handling bad components can be computed using an approach from [6, 16], in which a tree structure is defined representing the linking and nesting relationship of the components of R(A, B).

The component tree

The component tree T (A, B) is a rooted tree with two types of nodes, defined as follows [16]:
  1. Each component is represented by a round node.
  2. Each maximal chain is represented by a square node whose children are the round nodes that represent the components of this chain.
  3. A square node is either the root, or the child of the smallest component in which this chain is nested.

A round node is called a bad node, drawn in white, if it represents a bad component. Otherwise it is called a good node, drawn in black. (A good node can be a trivial or a good component.) Figure 7 (i) shows an example of T (A, B).
Figure 7

Examples of component trees. (i) The tree T(A, B) for the relational diagram represented in Figure 2 has one bad (white) and three good (black) nodes, and (ii) the corresponding colored tree To(A, B). Here, the indel-type of each cycle is given. In both cases the trees T' and $T'_o$ are composed of a single bad node. (iii) An example of a $T'_o$ to show that a greedy strategy, of maximizing the merging of leaves with the same colored dot, does not work. If we merge the two leaves with blue dots the cost of the cover is 5. However, if we merge twice a leaf with a blue dot and a leaf with no dot (the longer paths), the cost is 4. (iv) Another example of a $T'_o$ to show that, on the other hand, if we merge the leaves of the longer path we have a cost of 3. But if instead we merge the two nodes with blue dots and the two nodes with orange dots, the cost is 2.

Reducing T to T'. Let T' be the unrooted tree that corresponds to the smallest subgraph of T (A, B) that contains all bad nodes. Let a long branch be a branch in T' that contains two or more bad nodes.
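One way to obtain the smallest subgraph that contains all bad nodes is to repeatedly remove leaves that are not bad. The sketch below assumes the tree is given as an undirected adjacency dictionary and the bad nodes as a set; the node names, the toy tree and the convention that a single isolated node counts as one leaf are all illustrative assumptions.

```python
from typing import Dict, Set


def reduce_to_bad_subtree(adj: Dict[str, Set[str]], bad: Set[str]) -> Dict[str, Set[str]]:
    """Return the smallest subtree containing all bad nodes, obtained by
    repeatedly pruning nodes of degree <= 1 that are not bad."""
    tree = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    prunable = [v for v in tree if v not in bad and len(tree[v]) <= 1]
    while prunable:
        v = prunable.pop()
        if v not in tree:          # already removed earlier
            continue
        for u in tree.pop(v):
            tree[u].discard(v)
            if u not in bad and len(tree[u]) <= 1:
                prunable.append(u)
    return tree


def count_leaves(tree: Dict[str, Set[str]]) -> int:
    """Number of nodes with degree <= 1 (a lone node counts as a leaf)."""
    return sum(1 for nb in tree.values() if len(nb) <= 1)


# Hypothetical component tree: S* are square nodes, c* are round nodes.
T = {
    "S0": {"c1", "c2", "c3"},
    "c1": {"S0", "S1"}, "c2": {"S0"}, "c3": {"S0", "S2"},
    "S1": {"c1", "c4", "c5"}, "c4": {"S1"}, "c5": {"S1"},
    "S2": {"c3", "c6"}, "c6": {"S2"},
}
T_reduced = reduce_to_bad_subtree(T, bad={"c2", "c4"})
print(sorted(T_reduced), count_leaves(T_reduced))
```

With bad nodes c2 and c4, the reduced tree is the path c4, S1, c1, S0, c2, which has two leaves.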

Covering the bad nodes. A path P in T' can be short, if P contains only one vertex, or long, if P contains at least two vertices. A cover of T' is defined as a set of paths that contain all bad nodes of T'. The cost of a cover is given by the sum of the costs of its paths and an optimal cover of T' is a cover with the minimum cost.

Computing τINV(A, B). For the inversion model, by assigning the cost of one to each short path and the cost of two to each long path, it has been shown in [6, 16] that the cost of an optimal cover of T' corresponds exactly to the value τINV(A, B) and can be computed as follows:

Theorem 4 (from [6, 16]). Let w be the number of leaves of T'. Then
$$\tau_{INV}(A, B) = \begin{cases} w + 1 & \text{if } w \text{ is odd and all leaves are on long branches,} \\ w & \text{otherwise.} \end{cases}$$
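The cover cost of the inversion model and the closed formula of Theorem 4 are simple enough to state as code. In this sketch a path is a list of node names, and whether all leaves lie on long branches is passed in as a boolean rather than computed from the tree; both choices, and the function names, are assumptions made for illustration.

```python
from typing import List, Sequence


def cover_cost_inv(paths: Sequence[List[str]]) -> int:
    """Cost of a cover in the inversion model: a short path (one vertex)
    costs 1, a long path (two or more vertices) costs 2."""
    return sum(1 if len(path) == 1 else 2 for path in paths)


def tau_inv(w: int, all_leaves_on_long_branches: bool) -> int:
    """Value of Theorem 4 for a reduced tree T' with w leaves."""
    if w % 2 == 1 and all_leaves_on_long_branches:
        return w + 1
    return w


# A hypothetical cover with one short and one long path.
print(cover_cost_inv([["c2"], ["c4", "S1", "c1"]]))
print(tau_inv(3, True), tau_inv(3, False), tau_inv(4, True))
```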

The costs of cutting and merging bad components in the inversion-indel model

Recall that the DCJ-cost of an inversion ρ is denoted by $\|\rho\|$ and corresponds respectively to 1 or 2 depending on whether ρ is a neutral or a joint inversion. Furthermore, let λ0 and λ1 be, respectively, the sum of the indel-potentials for the components of the relational diagram before and after the inversion ρ. We then have Δλ(ρ) = λ1 - λ0 and we also define the cost of ρ to be $\Delta d(\rho) = \|\rho\| + \Delta\lambda(\rho)$.

Each cut is a neutral inversion ρ that has $\|\rho\| = 1$. If ρ cuts a bad component $\mathcal{C}$ that contains only cycles with at most two runs, it is clear that ρ cannot save indels. In this case, Δd(ρ) = 1. However, if $\mathcal{C}$ contains a cycle C with at least four runs, it is possible to apply ρ such that two $\mathcal{A}$-runs and two $\mathcal{B}$-runs are merged. This reduces the number of runs by two, that is, ΔΛ(ρ) = -2, hence Δλ(ρ) = -1 and Δd(ρ) = 0.
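The two cut costs can be checked numerically. The sketch below assumes the indel-potential formula $\lambda = \lceil (\Lambda + 1)/2 \rceil$ for a cycle with $\Lambda \ge 1$ runs (and 0 for a cycle with no run), which is recalled from the DCJ-indel model [8] and is not restated in this section; the function names are hypothetical.

```python
from typing import Sequence


def indel_potential(num_runs: int) -> int:
    """Assumed indel-potential of a cycle with `num_runs` runs:
    0 if there is no run, ceil((num_runs + 1) / 2) otherwise."""
    if num_runs < 0:
        raise ValueError("number of runs must be non-negative")
    if num_runs == 0:
        return 0
    return (num_runs + 2) // 2  # integer form of ceil((num_runs + 1) / 2)


def inversion_cost(dcj_cost: int, runs_before: Sequence[int], runs_after: Sequence[int]) -> int:
    """Cost of an inversion: its DCJ-cost plus the change in the sum of
    indel-potentials of the affected cycles."""
    delta_lambda = (sum(indel_potential(r) for r in runs_after)
                    - sum(indel_potential(r) for r in runs_before))
    return dcj_cost + delta_lambda


# Cut of a cycle with two runs: no indel can be saved.
print(inversion_cost(1, [2], [2]))
# Cut of a cycle with four runs, merged down to two runs.
print(inversion_cost(1, [4], [2]))
```

Under the assumed formula, going from four runs to two lowers the indel-potential from 3 to 2, which matches Δλ(ρ) = -1 in the text.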

Each merging is a joint inversion ρ that has $\|\rho\| = 2$. The cost of each merging depends on the runs of the affected cycles. A cycle with no run is represented by $C_\varepsilon$. Let $C_{\mathcal{A}}$ (respectively $C_{\mathcal{B}}$) be a cycle with an $\mathcal{A}$-run (respectively a $\mathcal{B}$-run). Similarly, let $C_{\mathcal{AB}}$ be a cycle with two or more runs. In Table 1 we show the costs of the different types of joint inversions.
Table 1

Types of joint inversions ($C_*$ represents a cycle with any number of runs, Δd(ρ) = 2 + Δλ(ρ)).

| sources | resultant | Δλ(ρ) | Δd(ρ) |
|---|---|---|---|
| $C_\varepsilon + C_*$ | $C_*$ | 0 | 2 |
| $C_{\mathcal{A}} + C_{\mathcal{B}}$ | $C_{\mathcal{AB}}$ | 0 | 2 |
| $C_{\mathcal{AB}} + C_{\mathcal{AB}}$ | $C_{\mathcal{AB}}$ | -2 | 0 |
| $C_{\mathcal{A}} + C_{\mathcal{A}}$ | $C_{\mathcal{A}}$ | -1 | 1 |
| $C_{\mathcal{B}} + C_{\mathcal{B}}$ | $C_{\mathcal{B}}$ | -1 | 1 |
| $C_{\mathcal{A}} + C_{\mathcal{AB}}$ | $C_{\mathcal{AB}}$ | -1 | 1 |
| $C_{\mathcal{B}} + C_{\mathcal{AB}}$ | $C_{\mathcal{AB}}$ | -1 | 1 |
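The entries of Table 1 can be reproduced with a small model of cycle types. The sketch makes two simplifying assumptions that go beyond the text: every cycle has at most two runs, so its indel-potential equals the number of distinct run types it carries, and the joint inversion is chosen so that runs of the same type are merged. The set-based encoding and the function names are hypothetical.

```python
from typing import FrozenSet, Tuple

CycleType = FrozenSet[str]

EMPTY: CycleType = frozenset()            # cycle with no run
A: CycleType = frozenset({"A"})           # cycle with an A-run
B: CycleType = frozenset({"B"})           # cycle with a B-run
AB: CycleType = frozenset({"A", "B"})     # cycle with an A-run and a B-run


def potential(cycle: CycleType) -> int:
    """Indel-potential under the at-most-two-runs assumption."""
    return len(cycle)


def joint_inversion_cost(c1: CycleType, c2: CycleType) -> Tuple[int, int]:
    """Return (delta_lambda, delta_d) for merging cycles c1 and c2,
    where delta_d = 2 + delta_lambda for a joint inversion."""
    merged = c1 | c2
    delta_lambda = potential(merged) - potential(c1) - potential(c2)
    return delta_lambda, 2 + delta_lambda


for name, (x, y) in {"e+AB": (EMPTY, AB), "A+B": (A, B), "AB+AB": (AB, AB),
                     "A+A": (A, A), "A+AB": (A, AB)}.items():
    print(name, joint_inversion_cost(x, y))
```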

The colored component tree

All components that have a cycle of type $C_{\mathcal{AB}}$ can be merged together into a single (good) component with cost 0, thus we assume that R(A, B) has at most one component $\mathcal{C}$ of this type. Furthermore, if $\mathcal{C}$ is bad, we also assume that it has no cycle with four or more runs. (Otherwise it could be cut with cost 0.)

With these assumptions, we build the component tree T(A, B) as described previously. Then we transform T(A, B) into To(A, B), by adding at most two colored dots to each round node, as follows: we add an orange dot, if at least one cycle of the corresponding component has an $\mathcal{A}$-run; and a blue dot, if at least one cycle of the corresponding component has a $\mathcal{B}$-run. Figure 7 (ii) shows an example of To(A, B).

Reducing To to $T'_o$. Let $T'_o$ be the unrooted tree that corresponds to the smallest subgraph of To(A, B) that contains all bad nodes. The leaves of $T'_o$ are bad components. Let v be a leaf of $T'_o$ and let t be the subtree of To(A, B) rooted at v. In $T'_o$, the leaf v will then have the union of all colored dots from t.

Computing $\tau_{INV}^{id}(A, B)$. The cost of a short path here is also one. On the other hand, the cost of a long path is either one, if its endpoints share at least one colored dot, or two otherwise. The cost of an optimal cover of $T'_o$ corresponds to the value of $\tau_{INV}^{id}(A, B)$. However, the problem of computing this value is very intricate, even when each node has at most one colored dot, as we can see in Figure 7 (iii) and (iv).
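The path costs of the colored model can be sketched as follows. The code only evaluates the cost of a given cover; it does not search for an optimal one, which the text leaves open. The data layout (a path as a list of node names, colored dots as a dictionary from node name to a set of colors) and the toy nodes are illustrative assumptions, not taken from Figure 7.

```python
from typing import Dict, List, Sequence, Set


def path_cost(path: List[str], dots: Dict[str, Set[str]]) -> int:
    """Cost of one path in the colored model: a short path costs 1; a long
    path costs 1 if its endpoints share a colored dot, and 2 otherwise."""
    if len(path) == 1:
        return 1
    first = dots.get(path[0], set())
    last = dots.get(path[-1], set())
    return 1 if first & last else 2


def cover_cost(paths: Sequence[List[str]], dots: Dict[str, Set[str]]) -> int:
    """Total cost of a cover, given as a collection of paths."""
    return sum(path_cost(p, dots) for p in paths)


# Hypothetical reduced tree: three leaves u, v, x around an inner node h.
dots = {"u": {"blue"}, "v": {"blue"}, "x": set()}
print(cover_cost([["u", "h", "v"], ["x"]], dots))   # blue leaves paired
print(cover_cost([["u", "h", "x"], ["v"]], dots))   # mixed pairing
```

In this toy tree, pairing the two blue leaves gives a cover of cost 2, while pairing a blue leaf with the clean leaf gives cost 3.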

Below we give a lower and an upper bound for $\tau_{INV}^{id}(A, B)$, but finding an exact formula to compute this value is left as an open problem.

Proposition 5. Let $\tau_{INV}^{id}(A, B)$ be the cost of an optimal cover of $T'_o$. We then have:
$$\left\lceil \frac{w}{2} \right\rceil \le \tau_{INV}^{id}(A, B) \le w + 1,$$

where w is the number of leaves in $T'_o$.

Proof. The lower bound can be obtained when w ≤ 1 or when all leaves share at least one colored dot (in this case, all paths have cost 1). The upper bound occurs when w is odd, all leaves are clean (have no colored dot) and are on long branches (the greatest value of Theorem 4).    □
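The bounds of Proposition 5 are easy to tabulate for small trees. The sketch reads the lower bound as the ceiling of w/2, which is what the proof suggests (all leaves paired into paths of cost 1, with one short path left over when w is odd); the function name is hypothetical.

```python
from typing import Tuple


def tau_inv_id_bounds(w: int) -> Tuple[int, int]:
    """Lower and upper bound of Proposition 5 for a reduced colored tree
    with w leaves: (ceil(w / 2), w + 1)."""
    if w < 0:
        raise ValueError("number of leaves must be non-negative")
    return (w + 1) // 2, w + 1


for w in range(1, 6):
    low, high = tau_inv_id_bounds(w)
    print(f"w = {w}: {low} <= tau <= {high}")
```

The gap between the two bounds grows with w, which illustrates why an exact formula would be more informative than the bounds alone.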

Conclusions

In this work we have revisited the inversion-indel distance between two unichromosomal genomes A and B with unequal contents. We have shown that, when the relational diagram R(A, B) has no bad component, the inversion-indel distance is equal to the DCJ-indel distance of A and B and can be computed in linear time. We also gave a lower and an upper bound for the extra cost $\tau_{INV}^{id}(A, B)$ of handling bad components in R(A, B). However, finding an exact formula to compute this value is very intricate and was left as an open problem.

Notes

Acknowledgements

The authors would like to thank Paola Bonizzoni for suggestions on how to improve the presentation of the proof of Theorem 3.

Declaration

MDVB is funded by the Brazilian research agency CNPq grant PROMETRO 563087/10-2. The Deutsche Forschungsgemeinschaft and the Open Access Publication Funds of Bielefeld University Library supported the Article Processing Charge.

This article has been published as part of BMC Bioinformatics Volume 14 Supplement 15, 2013: Proceedings from the Eleventh Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S15.

References

  1. Hannenhalli S, Pevzner PA: Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. J ACM. 1999, 46: 1-27. 10.1145/300515.300516. [A preliminary version appeared in Proc. of STOC 1995]
  2. Meidanis J, Walter MEMT, Dias Z: Reversal distance of signed circular chromosomes. Relatório Técnico IC-00-23, Institute of Computing, University of Campinas, Brazil. 2000
  3. El-Mabrouk N: Sorting signed permutations by reversals and insertions/deletions of contiguous segments. Journal of Discrete Algorithms. 2001, 1: 105-122. [A preliminary version appeared in Proc. of CPM 2000, LNCS 1848]
  4. Yancopoulos S, Attie O, Friedberg R: Efficient sorting of genomic permutations by Translocation, Inversion and Block Interchange. Bioinformatics. 2005, 21 (16): 3340-3346. 10.1093/bioinformatics/bti535.
  5. Bergeron A, Mixtacki J, Stoye J: A Unifying View of Genome Rearrangements. In Proceedings of WABI 2006, Volume 4175 of LNBI. 2006, 163-173.
  6. Bergeron A, Mixtacki J, Stoye J: A new linear time algorithm to compute the genomic distance via the double cut and join distance. Theor Comput Sci. 2009, 410 (51): 5300-5316. 10.1016/j.tcs.2009.09.008.
  7. Yancopoulos S, Friedberg R: DCJ path formulation for genome transformations which include Insertions, Deletions, and Duplications. J Comput Biol. 2009, 16 (10): 1311-1338. 10.1089/cmb.2009.0092.
  8. Braga MDV, Willing E, Stoye J: Double cut and join with insertions and deletions. J Comput Biol. 2011, 18 (9): 1167-1184. 10.1089/cmb.2011.0118. [http://dx.doi.org/10.1089/cmb.2011.0118]
  9. da Silva PH, Machado R, Dantas S, Braga MDV: DCJ-indel and DCJ-substitution distances with distinct operation costs. Alg for Mol Biol. 2013, 8: 21. 10.1186/1748-7188-8-21.
  10. Braga MDV: An overview of genomic distances modeled with indels. Proceedings of Computation in Europe, Volume 7921 of LNCS. 2013, 22-31.
  11. Setubal JC, Meidanis J: Introduction to Computational Molecular Biology. PWS Publishing Company. 1997
  12. Friedberg R, Darling A, Yancopoulos S: Genome rearrangement by the double cut and join operation. Bioinformatics, Methods in Molecular Biology. 2008, 452: 385-416. 10.1007/978-1-60327-159-2_18.
  13. Bafna V, Pevzner P: Genome rearrangements and sorting by reversals. Proc of FOCS. 1993, 148-157.
  14. Braga MDV, Stoye J: The solution space of sorting by DCJ. J Comp Biol. 2010, 17 (9): 1145-1165. 10.1089/cmb.2010.0109.
  15. Hannenhalli S, Pevzner PA: Transforming Men Into Mice (Polynomial Algorithm for Genomic Distance Problem). Proc 36th Annu Symp Found Comput Sci, FOCS 1995. 1995, IEEE Press, 581-592.
  16. Bergeron A, Mixtacki J, Stoye J: The Inversion Distance Problem. Mathematics of Evolution and Phylogeny. Edited by: Gascuel O. 2005, Oxford, UK: Oxford University Press, 262-290.
  17. da Silva PH, Machado R, Dantas S, Braga MDV: Restricted DCJ-indel model: sorting linear genomes with DCJ and indels. BMC Bioinformatics. 2012, 13 (S19): S14.
  18. Compeau PEC: DCJ-Indel sorting revisited. Algorithms for Molecular Biology. 2013, 8 (6).
  19. Kovác J, Warren R, Braga MDV, Stoye J: Restricted DCJ Model (The Problem of Chromosome Reincorporation). Journal of Computational Biology. 2011, 18 (9): 1231-1241. 10.1089/cmb.2011.0116.

Copyright information

© Willing et al.; licensee BioMed Central Ltd. 2013

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Authors and Affiliations

  • Eyla Willing (1, 2)
  • Simone Zaccaria (3)
  • Marília DV Braga (4)
  • Jens Stoye (1, 2)
  1. Faculty of Technology, Bielefeld University, Bielefeld, Germany
  2. Institute for Bioinformatics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
  3. Dip. Informatica Sistemistica e Comunicazione (DISCo), Univ. Milano-Bicocca, Milan, Italy
  4. Inmetro - Instituto Nacional de Metrologia, Qualidade e Tecnologia, Duque de Caxias, Brazil
