# A novel min-cost flow method for estimating transcript expression with RNA-Seq

- 4.4k Downloads

## Abstract

### Background

Through transcription and alternative splicing, a gene can be transcribed into different RNA sequences (isoforms), depending on the individual, on the tissue the cell is in, or in response to some stimuli. Recent RNA-Seq technology allows for new high-throughput ways for isoform identification and quantification based on short reads, and various methods have been put forward for this non-trivial problem.

### Results

In this paper we propose a novel radically different method based on minimum-cost network flows. This has a two-fold advantage: on the one hand, it translates the problem as an established one in the field of network flows, which can be solved in polynomial time, with different existing solvers; on the other hand, it is general enough to encompass many of the previous proposals under the least sum of squares model. Our method works as follows: in order to find the transcripts which best explain, under a given fitness model, a splicing graph resulting from an RNA-Seq experiment, we find a min-cost flow in an offset flow network, under an equivalent cost model. Under very weak assumptions on the fitness model, the optimal flow can be computed in polynomial time. Parsimoniously splitting the flow back into few path transcripts can be done with any of the heuristics and approximations available from the theory of network flows. In the present implementation, we choose the simple strategy of repeatedly removing the heaviest path.

### Conclusions

We proposed a new very general method based on network flows for a multiassembly problem arising from isoform identification and quantification with RNA-Seq. Experimental results on prediction accuracy show that our method is very competitive with popular tools such as Cufflinks and IsoLasso. Our tool, called Traph (Transcrips in gRAPHs), is available at: http://www.cs.helsinki.fi/gsa/traph/.

## Keywords

Annotate Transcript Candidate Path Normalize Compression Distance Convex Cost Function Splice Graph## Background

Recent RNA-Seq technology [1, 2] opened a new high-throughput, low cost way for isoform identification and quantification, leading to new understanding of gene regulation in development and disease (e.g., [3]). In an RNA-Seq experiment a set of short reads is produced from mRNA transcripts. The difficulty in assembling these short reads into the transcripts from which they were sampled is non-trivial due to the fact that the transcripts (isoforms) may share exons. As a result, all methods for solving this problem rely on an explicit or implicit graph model. The nodes represent individual reads (overlap graph [4]), or contiguous stretches of DNA uninterrupted by spliced reads (splicing graph [5, 6, 7], connectivity graph [8, 9, 10]), while the edges are derived from overlaps or from spliced read alignments. Each node and edge has an associated observed coverage, and the problem of isoform identification and quantification is seen as separating the coverage of the graph into individual path components, under different models. Furthermore, this problem was also coined under the broad name 'Multiassembly Problem' [11], a hint that it can arise not only with RNA-Seq data, but also in other biological settings, such as assembling metagenomics reads [12].

Except for Cufflinks [4], all tools mentioned above rely on some optimization engine, whose solving is generally difficult. IsoInfer/IsoLasso [8, 9], SLIDE [7], Scripture [10], and CLIIQ [6] exhaustively enumerate all possible candidate paths. For efficiency reasons, each has some restrictions on what a valid candidate path might be, and for each candidate isoform, they define a fitness function. IsoInfer/IsoLasso and SLIDE use a least sum of squares fitness function; IsoLasso and SLIDE both add different shrinkage terms to the fitness function in order to favor isoforms with fewer transcripts, which is computed with a modified LASSO algorithm, or a quadratic program; CLIIQ uses a least sum of absolute differences fitness function, solved by a linear integer program. Cufflinks avoids the problem of exhaustively enumerating all possible paths by returning a minimum path cover, and then assigning expression levels to each path in this cover based on a statistical model. Incidentally, note that computing a minimum path cover (in an acyclic digraph) is done by computing a maximum matching, which can be easily reduced to a flow problem. However, such reduction solves a different (implicitly defined) optimization problem that can be considered as a consensus model in the literature [6, 7, 8, 9, 10], mostly because the fitting of expression levels is separated in the process.

### Our contribution

In this paper we propose a radically different and very general method relying on the established field of minimum-cost network flow problems [13]. This will not only provide a simple method and a fast polynomial time algorithm for solving it (as opposed to exhaustively enumerating all possible candidate paths, and then solving a quadratic/integer linear program for evaluating the fitness of each candidate isoform), but it can also lean on the ample literature on splitting a (min-cost) flow into paths, e.g., [14, 15, 16, 17].

As in the case of the other tools, our method assumes that a splicing graph has been built for each gene. Each node of the graph corresponds to a stretch of DNA uninterrupted by any spliced read alignment; such sequences are called *segments* in [9], but for simplicity we just call them *exons*. Each edge of the graph corresponds to two exons consecutive in some transcript, that is, to some spliced read whose prefix aligns to the suffix of one exon, and whose suffix aligns to the prefix of another exon. Observe that such a graph can be seen as a directed acyclic graph (DAG, for short), the direction of the edges being according to the absolute position of the exons in the genome. For each exon *v* we can deduct its coverage *cov*(*v*) as the total number of reads aligned to the exon divided by the exon length, and the coverage *cov*(*u, v*) of an edge (*u, v*) as the total number of reads split aligned to that junction between exons *u* and *v*. An mRNA transcript candidate thus becomes a path from some source node to some sink node. The requirement that the transcripts start in a source node and end in a sink node is no restriction, as we can add dummy source/sink nodes as in-/out-neighbors to the nodes where we have indication that some transcript might start/end. Indeed, our splicing graph creation tool uses splicing alignments and coverage information to discover such start/end nodes and accordingly indicates them to our tool.

In order to define a fitness function in the broadest possible terms, let us assume that for each node *v* and edge (*u, v*) of the graph we have convex cost functions ${f}_{v},{f}_{uv}:\mathbb{R}\to \mathbb{R}$ modeling how close that node and edge must be explained by the candidate isoform. Then, we can state the problem of isoform identification and quantification as following problem.

**Problem 1 (UTEC)**

*Given a splicing DAG G*= (

*V, E*)

*with coverage values cov*(

*v*)

*and cov*(

*u, v*),

*and cost functions f*

_{ v }(·)

*and f*

_{ uv }(

*·*),

*for all v*∈

*V and*(

*u, v*) ∈

*E, the*Unannotated Transcript Expression Cover

*problem is to find a tuple*$\phantom{\rule{0.5em}{0ex}}\mathcal{P}$

*of paths from the sources of G to the sinks of G, with an estimated expression level e*(

*P*)

*for each path*$P\in \mathcal{P}$,

*which minimize*

For example, if for all nodes *v* and edges (*u, v*), *f*_{ v }(*x*) = *x, f*_{ uv }(*x*) = *x*, then we have a least sum of absolute differences model as in CLIIQ. If *f*_{ u }(*x*) = *x*^{2}, *f*_{ uv }(*x*) = *x*^{2}, then we have a least sum of squares model as in IsoInfer/IsoLasso and SLIDE; this is the model which we also use in the implementation reported in this paper. Another cost function, suggested by [18], is ${f}_{v}\left(x\right)=x/\sqrt{cov\left(v\right)}$, ${f}_{uv}\left(x\right)=x/\sqrt{cov\left(u,v\right)}$ for all nodes *v* and edges (*u, v*). Observe that many of the other biological assumptions of the other tools can be incorporated in the model of Problem UTEC.

We will show that Problem UTEC can be solved in polynomial time, by a reduction to a min-cost flow problem with convex cost functions. We will argue that finding the optimal tuple of paths explaining the graph is equivalent to finding the optimal flow in an offset flow network. Moreover, any splitting of this optimal flow into paths attains the minimum of Problem UTEC. In the same way as some of the other tools try to limit the number of paths explaining a splicing graph by a LASSO approach, we can rely on established methods for splitting any flow into few paths (e.g., [14, 15, 16, 17]). In this paper, we employ only the simple linear-time heuristic of repeatedly removing the heaviest path, see e.g., [15].

We give experimental results to study how well the predictions match the ground-truth on simulated data, and how well it fares on real-data, compared to Cufflinks [4] and IsoLasso [9]; our method is very competitive, providing in many cases better precision and recall. We expect our lead to be even greater once we incorporate paired-end read information.

## Methods

*flow network*(or simply

*network*) is a tuple

*N*= (

*G, b, q*), where

*G*= (

*V, E*) is a directed graph,

*b*is a function assigning a

*capacity*${b}_{uv}\in \mathbb{N}$ to every arc (

*u, v*) ∈

*E*, and

*q*is a function assigning an

*exogenous flow*${q}_{v}\in \mathbb{N}$ to every node

*v*∈

*V*, such that ${\sum}_{v\in V}{q}_{v}=0$. We say that a function

*x*assigning to every arc (

*u, v*) ∈

*E*a number${x}_{uv}\in \mathbb{N}$ is a

*flow*over the network

*N*, if the following two conditions are satisfied:

- 1.
0 ≤

*x*_{ uv }≤*b*_{ uv }, for every (*u, v*) ∈*E*, - 2.
$\sum _{u\in V}}{x}_{vu}-{\displaystyle \sum _{u\in V}}{x}_{uv}={q}_{v$, for every

*v*∈*V*,

*c*

_{ uv }(·), for every arc (

*u, v*) ∈

*E*, and is required to find a flow which minimizes:

It is well-known that, under the assumption that all the flow cost functions *c*_{ uv }(·) are convex, a min-cost flow can be found in polynomial time [19] (see also [20] for the real-valued flow case).

### The reduction to a min-cost flow problem

We will model Problem UTEC as a min-cost flow problem, thus showing that it can be solved in polynomial time. First, we argue that it can be transformed into the following equivalent problem, where the input exon chaining graph has measured coverages only on arcs.

**Problem 2 (UTEJC)**

*Given a splicing DAG G*= (

*V, E*)

*with coverage values cov*(

*u, v*),

*and cost functions f*

_{ uv }(·),

*for all*(

*u, v*) ∈

*E, the*Unannotated Transcript Expression Junction Cover

*problem is to find a tuple*$\phantom{\rule{0.5em}{0ex}}\mathcal{P}$

*of paths from the sources of G to the sinks of G with an estimated expression level e*(

*P*)

*for each path*$P\in \mathcal{P}$,

*which minimize*

Given an input *G* = (*V, E*) for Problem UTEC, we construct an input for Problem UTEJC by replacing every node *v* ∈ *V* with two new nodes, *v*_{ in } and *v*_{ out }, and an arc (*v*_{ in }*, v*_{ out }), with *cov*(*v*_{ in }*, v*_{ out }) = *cov*(*v*), and ${f}_{{v}_{in}{v}_{out}}\left(x\right)={f}_{v}\left(x\right)$. Furthermore, for every arc (*u, v*) ∈ *E*, we replace arc (*u, v*) with the arc (*u*_{ out }*, v*_{ in }), with the same coverage as (*u, v*). It is immediate that optimal solutions for *G* to Problem UTEC are in bijection with the optimal solutions for the transformed graph to Problem UTEJC.

To solve Problem UTEJC, we build an auxiliary offset network with convex costs of the form *c*_{ uv }(*x*) = *f*_{ uv }(*x*). An optimal flow for this network will model the offsets (positive or negative) between the measured coverages of the exon chaining graph and their actual expression levels in an optimal solution. Then, we argue that a min-cost flow on this network naturally induces a solution for the UTEJC problem.

Onwards, we denote by ${N}_{G}^{+}\left(v\right)$ the set of *out-neighbors* of *v* in the directed graph *G*, that is, the set {*w* : (*v, w*) ∈ *E*(*G*)}. Similarly, we denote by ${N}_{G}^{-}\left(v\right)$ the set of *in-neighbors* of *v* in the directed graph *G*, that is, the set {*u* : (*u, v*) ∈ *E*(*G*)}. When *G* is clear from the context, we will skip it as subscript.

*G*with coverage values

*cov*(

*u, v*), and cost functions

*f*

_{ uv }, for all (

*u, v*) ∈

*E*, we construct the

*offset network N**= (

*G*, b, q*) with cost function

*c*, as follows (see Figure 1 for an example):

- 1.we add to
*G** all nodes and edges of*G*, together with- (a)
a new source

*s*_{0}and a new sink*t*_{0}with ${q}_{{s}_{0}}:={q}_{{t}_{0}}:=0$, - (b)
arcs (

*s*_{0},*s*), for every source*s*of*G*, and arcs (*t, t*_{0}) for every sink*t*of*G*, each with infinite capacity and null cost function, - (c)
arc (

*t*_{0},*s*_{0}) with infinite capacity and null cost function, - (d)
nodes

*s** and*t**, with initial exogenous flow ${q}_{{s}^{*}}:={q}_{{t}^{*}}:=0$;

- (a)
- 2.for every arc (
*u, v*) ∈*E*(*G*),- (a)
*b*_{ uv }:= ∞,*c*_{ uv }(*x*) : =*f*_{ uv }(*x*), - (b)
we add the reverse arc (

*v, u*) to*G** with*b*_{ vu }:=*cov*(*u, v*),*c*_{ vu }(*x*) : =*f*_{ uv }(*x*);

- (a)
- 3.for every
*v*∈*V*(*G*),- (a)
its exogenous flow

*q*_{ v }is zero, - (b)if ${\sum}_{u\in {N}^{+}\left(v\right)}cov\left(v,\phantom{\rule{2.77695pt}{0ex}}u\right)-{\sum}_{u\in {N}^{-}\left(v\right)}cov\left(u,\phantom{\rule{2.77695pt}{0ex}}v\right)>0$, we add arc (
*v, t**) to*G** where:- i.
$\phantom{\rule{2.77695pt}{0ex}}{b}_{vt*}:={\sum}_{u\in {N}^{+}\left(v\right)}cov\left(v,\phantom{\rule{2.77695pt}{0ex}}u\right)-{\sum}_{u\in {N}^{-}\left(v\right)}cov\left(u,\phantom{\rule{2.77695pt}{0ex}}v\right),\phantom{\rule{2.77695pt}{0ex}}{c}_{vt*}\left(x\right)\phantom{\rule{2.77695pt}{0ex}}:=0,$

- ii.
ii we update ${q}_{t*}:={q}_{t*}+{\sum}_{u\in {N}^{-}\left(v\right)}cov\left(u,v\right)-{\sum}_{u\in {N}^{+}\left(v\right)}cov\left(v,u\right)$;

- i.
- (c)if ${\sum}_{u\in {N}^{+}\left(v\right)}cov\left(v,\phantom{\rule{2.77695pt}{0ex}}u\right)-{\sum}_{u\in {N}^{-}\left(v\right)}cov\left(u,\phantom{\rule{2.77695pt}{0ex}}v\right)<0$, we add arc (
*s**,*v*) to*G** where:- i.
${b}_{s*v}\phantom{\rule{2.77695pt}{0ex}}:={\sum}_{u\in {N}^{-}\left(v\right)}cov\left(v,u\right)-{\sum}_{u\in {N}^{+}\left(v\right)}cov\left(u,v\right),\phantom{\rule{0.3em}{0ex}}{c}_{s*v}\left(x\right)\phantom{\rule{2.77695pt}{0ex}}:=0$,

- ii.
we update ${q}_{{s}^{*}}:={q}_{{s}^{*}}+{\sum}_{u\in {N}^{-}\left(v\right)}cov\left(v,\phantom{\rule{2.77695pt}{0ex}}u\right)-{\sum}_{u\in {N}^{+}\left(v\right)}cov\left(u,\phantom{\rule{2.77695pt}{0ex}}v\right)$.

- i.

- (a)

The next lemma shows that there exists a min-cost flow *x** on *N**.

**Lemma 1** *Given a digraph G with arc coverages cov*(·,·), *the offset network N** = (*G**, *b, q*) *constructed as* *above is a flow network, i.e*., ${\sum}_{v\in V\left(G*\right)}{q}_{v}=0$.

**Proof:**Since

*q*

_{ v }= 0, for all

*v*∈

*V*(

*G**) \ {

*s*, t**}, it remains to show that ${q}_{{s}^{*}}+{q}_{{t}^{*}}=0$. Indeed,

□

*x**, we construct the function

*x*on the edges

*G*as follows. First, observe that for every arc (

*u, v*) ∈

*E*(

*G*), at most one of ${x}_{uv}^{*}$ or ${x}_{vu}^{*}$is nonnull. Indeed, if this were not the case, then a flow

*y** which coincides with

*x**, except for ${y}_{uv}^{*}}:={x}_{uv}^{*}-\mathsf{\text{min}}\left({x}_{uv}^{*},{x}_{vu}^{*}\right)$ and ${y}_{vu}^{*}:={x}_{vu}^{*}-\mathsf{\text{min}}\left({x}_{uv}^{*},{x}_{vu}^{*}\right),$ is also a flow on

*N** and has a strictly smaller cost than

*x**, contradicting the fact that

*x** is of minimum cost. Then, for each arc (

*u, v*) ∈

*E*(

*G*) we set:

### From a flow to a set of paths

*x*is a flow on

*G*(points (1), (2)), whose arcs we consider to have unbounded capacities and whose nodes, apart from the sources and sinks, have exogenous flow 0. It is a well-known result from classical network flow theory that such a flow can be decomposed into paths, that is, there exist paths

*P*

_{1}, . . . ,

*P*

_{ t }from the sources of

*G*to the sinks of

*G*, having weights

*w*

_{1}, . . . ,

*w*

_{ t }, respectively, such that, for every (

*u, v*) ∈

*E*(

*G*) we have

Moreover, a decomposition of *x* into at most |*E*(*G*)| paths always exists and can be found in time |*V* (*G*)| · *E*(*G*). Theorem 1 also shows that the paths of any decomposition of *x* are an optimal solution for *G* to Problem UTEJC (point (3)).

**Theorem 1** *Given an optimal flow x** *on G***, the function × on G just constructed satisfies the properties, where S denotes the set of sources of G, and T denotes the set of sinks of G:*

*(1) for all v* ∈ *V* (*G*) \ (*S* ∪ *T*), ${\sum}_{u\in {N}_{G}^{-}\left(v\right)}{x}_{uv}={\sum}_{u\in {N}_{G}^{+}\left(v\right)}{x}_{vu}$;

*(2)* ${\sum}_{s\in S}{\sum}_{v\in {N}_{G}^{+}\left(s\right)}{x}_{sv}={\sum}_{t\in T}{\sum}_{u\in {N}_{G}^{-}\left(t\right)}{x}_{ut}$

*(3) any decomposition of × into paths attains the minimum of the objective function of Problem UTEJC, on input G*.

**Proof:**(1): Let

*v*∈

*V*(

*G*) \ (

*S*∪

*T*); by the definition of

*x*, we can write

Observe that for all edges entering *t** (exiting *s**), their flow equals their capacity, as we have adjusted the exogenous flow of *t** (of *s**) at point 3.(b)ii. (and 3.(c)ii.) in the construction of *G**. We distinguish three cases.

First, if ${\sum}_{u\in {N}_{G}^{-}\left(v\right)\cup {N}_{G}^{+}\left(v\right)}{x}_{uv}^{*}-{\sum}_{u\in {N}_{G}^{-}\left(v\right)\cup {N}_{G}^{+}\left(v\right)}{x}_{vu}^{*}>0$, then ${\sum}_{u\in {N}_{G}^{-}\left(v\right)\cup {N}_{G}^{+}\left(v\right)}{x}_{uv}^{*}-{\sum}_{u\in {N}_{G}^{-}\left(v\right)\cup {N}_{G}^{+}\left(v\right)}{x}_{vu}^{*}={x}_{v{t}^{*}}^{*}$. Since the flow *x** uses the arc (*v, t**) with its maximum capacity, we have that ${x}_{v{t}^{*}}^{*}={b}_{v{t}^{*}}={\sum}_{u\in {N}_{G}^{+}\left(v\right)}cov\left(v,u\right)-{\sum}_{u\in {N}_{G}^{-}\left(v\right)}cov\left(u,v\right)$, which shows that ${\sum}_{u\in {N}_{G}^{-}\left(v\right)}{x}_{uv}-{\sum}_{u\in {N}_{G}^{+}\left(v\right)}{x}_{vu}=0$, proving the claim.

Second, if ${\sum}_{u\in {N}_{G}^{-}\left(v\right)\cup {N}_{G}^{+}\left(v\right)}{x}_{uv}^{*}-{\sum}_{u\in {N}_{G}^{-}\left(v\right)\cup {N}_{G}^{+}\left(v\right)}{x}_{vu}^{*}<0$, then ${\sum}_{u\in {N}_{G}^{-}\left(v\right)\cup {N}_{G}^{+}\left(v\right)}{x}_{uv}^{*}-{\sum}_{u\in {N}_{G}^{-}\left(v\right)\cup {N}_{G}^{+}\left(v\right)}{x}_{vu}^{*}-=-{x}_{{s}^{*}v}^{*}$. Since the flow *x** uses the arc (*s**, *v*) with its maximum capacity, we have that ${x}_{s*v}^{*}={b}_{s*v}={\sum}_{u\in {N}_{G}^{-}\left(v\right)}cov\left(v,u\right)-{\sum}_{u\in {N}_{G}^{+}\left(v\right)}cov\left(u,v\right)$, which again proves the claim.

Finally, if ${\sum}_{u\in {N}_{G}^{-}\left(v\right)\cup {N}_{G}^{+}\left(v\right)}{x}_{uv}^{*}-{\sum}_{u\in {N}_{G}^{-}\left(v\right)\cup {N}_{G}^{+}\left(v\right)}{x}_{vu}^{*}=0,$ then, by construction there is no edge between *v* and *t** or *s**, implying, again by construction, that ${\sum}_{u\in {N}_{G}^{-}\left(v\right)}cov\left(u,v\right)={\sum}_{u\in {N}_{G}^{+}\left(v\right)}cov\left(v,u\right)$, from which the claim follows.

*x*, we have

*q*

_{ s }= 0 for all

*s*∈

*S*, we have ${x}_{s{t}^{*}}^{*}+{\sum}_{v\in {N}_{G}^{+}\left(s\right)}{x}_{sv}^{*}={\sum}_{v\in {N}_{G}^{+}\left(s\right)}{x}_{vs}^{*}+{x}_{{s}_{0}s}^{*}$. Therefore, ${\sum}_{v\in {N}_{G}^{+}\left(s\right)}{x}_{sv}^{*}-{\sum}_{v\in {N}_{G}^{+}\left(s\right)}{x}_{vs}^{*}={x}_{{s}_{0}s}^{*}-{x}_{s{t}^{*}}^{*}={x}_{{s}_{0}s}^{*}-{b}_{s{t}^{*}}={x}_{{s}_{0}s}^{*}-{\sum}_{v\in {N}_{G}^{+}\left(s\right)}cov\left(s,v\right)$. Plugging this into (2), we obtain

*q*

_{ t }= 0 for all

*t*∈

*T*, we have ${x}_{{s}^{*}t}^{*}+{\sum}_{u\in {N}_{G}^{-}\left(t\right)}{x}_{ut}^{*}={x}_{t{t}_{0}}^{*}-{\sum}_{u\in {N}_{G}^{-}\left(t\right)}{x}_{tu}^{*}$. Therefore, ${\sum}_{u\in {N}_{G}^{-}\left(t\right)}{x}_{ut}^{*}-{\sum}_{u\in {N}_{G}^{-}\left(t\right)}{x}_{tu}^{*}={x}_{t{t}_{0}}^{*}-{x}_{{s}^{*}t}^{*}={x}_{t{t}_{0}}^{*}-{b}_{{s}^{*}t}={x}_{t{t}_{0}}^{*}-{\sum}_{u\in {N}_{G}^{-}\left(t\right)}cov\left(u,t\right)$. Plugging this into (5), we prove the claim, since by (3) we get

*G*to sinks of

*G*, induces a flow on

*G*, where the exogenous flow of all nodes which are not sources nor sinks is zero, and any such flow can be split into paths from sources to sinks, the minimum value of

*k*, all

*k*-tuples of paths $\mathcal{P}=\left({P}_{1},{P}_{2},...,{P}_{k}\right)$ from a source of

*G*to a sink of

*G*, and over all expression levels

*e*

_{ i }for each

*P*

_{ i }, is equal to min

_{ y }is a flow on G${\sum}_{\left(u,v\right)\in E\left(G\right)}{f}_{uv}\left(|cov\left(u,v\right)-{y}_{{v}^{\prime}v}|\right)$. Since any flow on

*G*induces a flow on

*G**, and vice versa, the above is equal to

and from minimality, for all arcs (*u, v*) ∈ *E*(*G*), at most one of *z*_{ uv } or *z*_{ vu } is non null, we have that *x** also attains the minimum in (7), proving the theorem. □

In our implementation we use the min-cost flow engine available in the LEMON Graph Library [21]. If no engine for arbitrary convex cost functions is available, or, more generally, if the cost functions themselves happen not to be convex, one can approximate any cost function with piecewise constant or convex cost functions: e.g., one can replace an arc (*u, v*) of capacity *b*_{ uv }, with |*b*_{ uv }| arcs of capacity 1, such that first arc has cost *f*(1), and the *i* th arc, *i >* 1, has cost *f*(*i*) - *f*(*i* - 1) (this reduction is only pseudo-polynomial but reveals quite effective in practice), see [13] for further details.

### Decomposing the min-cost flow into few paths

As already shown by the other tools, we are generally interested in *parsimoniously* explaining an RNA-seq experiment, that is, in finding, among the optimal solutions to Problem UTEC, one with a low number of paths. At a closer analysis it can be seen that any flow on a graph *G* = (*V, E*) can be decomposed into at most |*E*| *-* |*V*| + 2 paths [14]. However, decomposing a flow into a minimum number of paths is an NP-hard problem in the strong sense, even when restricted to DAGs [14, 15]. To overcome this limitation, various heuristics and approximations have been put forth, see, e.g., [14, 15, 16, 17] and the references therein. The advantage of our method is that once we have obtained the optimal flow, we can apply any of these methods to split the flow into few paths. For simplicity, in this paper we employ the policy of repeatedly removing the heaviest path, see, e.g., [15]: until the network has null flow, we select a path from the sources to the sinks whose minimum flow on its edges is maximum, report it as transcript, and remove it from the flow network.

## Results and discussion

We call our tool Traph (*T* ranscripts in G*raph* s). We compared Traph to the most used isoform prediction tool Cufflinks [4] and with IsoLasso [9]. We also tried to include SLIDE [7] and CLIIQ [6], but we could not make the former work reliably, and for the latter the publicly available version was not yet available. Full experiment data is available at [22].

We should point out from the start that Traph is not yet employing paired-end read information. Nonetheless, the experiments we report (both simulated and real) are with paired-end reads, Cufflinks and IsoLasso having access to the paired-end information. Moreover, since Traph is not yet employing existing gene annotation information, we ran Cufflinks and IsoLasso without annotation. As already mentioned, we use a least sum of squares model. We experimented in the current implementation with other cost functions, as mentioned in the introduction, ${f}_{z}\left(x\right)=x,{f}_{z}\left(x\right)=x/\sqrt{cov\left(z\right)}$, or ${f}_{z}\left(x\right)=x/cov\left(z\right)$, respectively, for all nodes and edges *z*, but they currently give worse results.

### Matching criteria

In order to match the predicted transcripts with the true transcripts, we take into account the DNA sequences but also the expression levels. For each gene, we construct a bipartite graph with the true transcripts $\mathcal{T}=\left({T}_{1},{T}_{2},\dots \right)$ as nodes in one set of the bipartition, and the predicted transcripts $\mathcal{P}=\left({P}_{1},{P}_{2},\dots \right)$ as nodes in the other set of the bipartition. Empty sequences with 0 expression level were added so that both sets of the bipartition had an equal number of nodes.

*T*and its expression level

*e*(

*T*) with respect to a predicted transcript

*P*with expression level

*e*(

*P*)

where γ(*x*) = 0^{|bin(x)-1}1*bin*(*x*), *bin*(*x*) being the binary encoding of *x >* 0, *j* is the index of *P* in the list of predicted transcripts, *d* is the unit cost (Levenshtein) edit distance of *T* and *P*, editsencoded(*T, P*) lists the edits and gaps between edits using 2-bit fixed code for edit type, 2-bit fixed code for substituted/inserted symbol, and γ(*x*+1) for gap (run of identities) of length *x*, and *f*(*x*) is a bijection between {0, 1, -1, 2, -2, . . .} and {1, 2, 3, 4, 5, . . .} defined as *f*(*x*) = 2*x* for *x* > 0 and *f*(*x*) = 2(*-x*) + 1 otherwise.

Then, the edge cost between nodes ${T}_{i}\in \mathcal{T}$ and ${P}_{j}\in \mathcal{P}$ is defined as |code(*T*_{ i } | *P*_{ j }*, j*)| - |γ(*j*)|. The closer to zero this number is, the better the match between true transcript *T*_{ i }, with true expression level *e*(*T*_{ i }) and predicted transcript *P*_{ j } with predicted expression level *e*(*P*_{ j }). The minimum weight perfect matching was then computed; this gives a one-to-one mapping between true and predicted transcripts, therefore true transcripts can be ordered in the same order as they match predicted transcripts and code for the index, γ(*j*), is no longer required. Let *edit code length* for an edge between *T*_{ i } and *P*_{ j } be *|* γ(*d* + 1) editsencoded(*T*_{ i }*, P*_{ j })|, where *d* is the edit distance. Let *bitscore* be edit code length divided by |γ(|*T*_{ i }| + 1) editsencoded(*T*_{ i }*, ε*)|; bitscore is asymmetric, and possibly greater than 1 if *ε* would be a better match to *T*_{ i } than to *P*_{ j }, but minimum weight perfect matching chose otherwise for global minimality. Let us also call *relative expression level difference* the ratio |*e*(*P*_{ j })-*e*(*T*_{ i })|/*e*(*T*_{ i }). Each matched node pair with relative expression level difference and (edit) bitscore under some given thresholds define a true positive event (TP). The other kind of nodes define false positive (FP) and false negative (FN) events depending on which side of the bipartite graph they reside. Prediction efficiency based on precision, recall and F-measure is also employed in [6, 9].

### Simulated human data

As in the case of the other tools, we deem that validating against simulated data is a prerequisite, since, in general, on real data, we do not have available ground-truth. We designed the following validation experiment, closely following the approaches in [6, 9]. We chose a set of genes at random, and looked up the corresponding annotated transcripts from the Ensembl database. Out of these genes, we selected only those having between 2 and 5 transcripts. In all, we had 29 genes. For each transcript, we simulated reads with the RNASeqReadSimulator [24]. This simulator chooses an expression level at random from lognormal distribution with mean -4 and variance 1. For each gene, it simulated paired-end reads, with fragment length mean 300 and standard deviation 20, as follows: a transcript was chosen randomly using its expression level as distribution weight, while the position of the read within the transcript was chosen uniformly. As argued in the case of IsoLasso [9], various error models can be incorporated in these steps, but we chose to compare the performance of the methods in neutral conditions. We mapped the reads with TopHat [25]: these read mapping results were given as input to the tested prediction software, and to a Python program which we wrote to construct the splicing graphs needed for Traph. Cufflinks and IsoLasso were ran with the default parameters, because the parameters they offer relate to RNA-seq lab protocol, which was not simulated; we could not see changes to other parameters which could be relevant to the prediction. We use FPKM values as expression levels.

We devised two experiment setups. In the first one, which we call single genes, 300, 000 paired-end reads were generated independently from the transcripts of each of the 29 genes, with the already assigned expression levels. They were independently given to TopHat for alignment, and these independent alignment results were fed to each tool. In the second, more realistic experiment, which we call batch mode, the transcripts and their assigned expression levels were combined into one file, and from this file 29 * 300, 000 paired-end reads were simulated. All these reads were fed to TopHat for alignment, and these combined alignment results were fed to the tools. The fragment length mean and standard deviation were passed to the tools, except for Cufflinks in batch mode, when it was able to infer them automatically.

Performance of the three tools

Precision | Recall | F-measure | Avg. run time/gene | ||||
---|---|---|---|---|---|---|---|

(0.1, 0.2) | (0.9, 0.2) | (0.1, 0.2) | (0.9, 0.2) | (0.1, 0.2) | (0.9, 0.2) | ||

IsoLasso | 0.0183 | 0.2126 | 0.0258 | 0.3109 | 0.0214 | 0.2132 | 25 s |

Cufflinks | 0.0545 | 0.3623 | 0.0482 | 0.2931 | 0.0507 | 0.3060 | 40 s |

Traph | 0.0862 | 0.3729 | 0.0689 | 0.3988 | 0.0747 | 0.3665 | 72 s |

### Real human data

We used the same real dataset from the IsoLasso paper [9], Caltech RNA-Seq track from the ENCODE project [GenBank:SRR065504], consisting of 75bp paired-end reads. Out of these reads, we picked the 2,406,339 which mapped to human chromosome 2. We selected the 674 genes where all three tools made some prediction; these genes have 6075 annotated transcripts.

### Running times

On the real dataset, Cufflinks finished in 20 min, IsoLasso in 2 min, and Traph in 30 min. We should however stress that for solving the min-cost flow problem and for identifying the transcripts, Traph uses in fact 6 min, the rest of the time being spent by our graph creation tool, which is written in Python. We could not make such a detailed analysis in the case of the other two tools. The running time of our Python script is as well included in the last column of Table 1, where we listed the average running time per gene with simulated reads of each tool.

## Conclusions

All tools for isoform identification and quantification use an explicit or implicit graph model. Resorting to such a representation, the main contribution of this paper consists in a novel, radically different method based on minimum-cost flows, an established problem, for which there exist polynomial-time algorithms and solvers. We implemented this method into our tool Traph. Even though Traph is not using paired-end information at this moment, Traph is competitive with state-of-the-art tools.

This leads us to expect that once we incorporate paired-end read information, the performance of Traph will increase significantly. Note also that in the current implementation, each exon equally contributes to the fitness function, independently of its length; we plan to include exon lengths in the fitness function in a future implementation. We also plan to integrate existing gene annotation into a more refined construction of the splicing graph and into the fitness model. Our method is general enough to easily accommodate other biological assumptions. In order to evaluate the tools against real ground-truth data, we have started a process of acquiring long sequencing reads (PacBio) of the true isoforms of a gene.

## Notes

### Acknowledgements

We wish to thank Antti Honkela for many insightful discussions on transcript prediction. We would also like to thank Teemu Kivioja and the anonymous reviewers for hinting us related literature on refined cost functions for coverage counts, as well as for constructive comments that improved our experiment setup.

**Declarations**

Publication of this article was supported by the Academy of Finland under grant 250345 (CoECGR).

This article has been published as part of *BMC Bioinformatics* Volume 14 Supplement 5, 2013: Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S5.

## References

- 1.Mortazavi A, Williams BAA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods. 2008, 5: 621-628. 10.1038/nmeth.1226.CrossRefPubMedGoogle Scholar
- 2.Pepke S, Wold B, Mortazavi A: Computation for ChIP-seq and RNA-seq studies. Nature methods. 2009, 6: (11):s22-s32.CrossRefPubMedGoogle Scholar
- 3.Shah S: The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature. 2012, 486: (7403):395-399.PubMedGoogle Scholar
- 4.Trapnell C: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. 2010, 28: 511-515. 10.1038/nbt.1621.PubMedCentralCrossRefPubMedGoogle Scholar
- 5.Heber S, Alekseyev M, Sze SH, Tang H, Pevzner PA: Splicing graphs and EST assembly problem. Bioinformatics. 2002, 18 (suppl 1): S181-S188. 10.1093/bioinformatics/18.suppl_1.S181.CrossRefPubMedGoogle Scholar
- 6.Lin YY, Dao P, Hach F, Bakhshi M, Mo F, Lapuk A, Collins C, Sahinalp SC: CLIIQ: Accurate Comparative Detection and Quantification of Expressed Isoforms in a Population. Proc Algorithms in Bioinformatics - 12th International Workshop, WABI 2012, Volume 7534 of Lecture Notes in Computer Science. 2012, Springer, 178-189.Google Scholar
- 7.Li JJ, Jiang CR, Brown JB, Huang H, Bickel PJ: Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation. Proc Natl Acad Sci USA. 2011, 108 (50): 19867-19872. 10.1073/pnas.1113972108.PubMedCentralCrossRefPubMedGoogle Scholar
- 8.Feng J, Li W, Jiang T: Inference of Isoforms from Short Sequence Reads. RECOMB, Volume 6044 of Lecture Notes in Computer Science. Edited by: Berger B. 2010, Springer, 138-157.Google Scholar
- 9.Li W, Feng J, Jiang T: IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly. J Comput Biol. 2011, 18 (11): 1693-707. 10.1089/cmb.2011.0171.PubMedCentralCrossRefPubMedGoogle Scholar
- 10.Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, Fan L, Koziol MJ, Gnirke A, Nusbaum C, Rinn JL, Lander ES, Regev A: Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol. 2010, 28 (5): 503-510. 10.1038/nbt.1633.PubMedCentralCrossRefPubMedGoogle Scholar
- 11.Xing Y, Resch A, Lee C: The multiassembly problem: reconstructing multiple transcript isoforms from EST fragment mixtures. Genome Res. 2004, 14: (3):426-441.PubMedCentralCrossRefPubMedGoogle Scholar
- 12.Namiki T, Hachiya T, Tanaka H, Sakakibara Y: MetaVelvet: an extension of Velvet assembler to
*de novo*metagenome assembly from short sequence reads. Nucleic Acids Res. 2012, 40: e155-10.1093/nar/gks678.PubMedCentralCrossRefPubMedGoogle Scholar - 13.Ahuja RK, Magnanti TL, Orlin JB: Network Flows: Theory, Algorithms, and Applications. 1993, Prentice-Hall, IncGoogle Scholar
- 14.Vatinlen B, Chauvet F, Chrétienne P, Mahey P: Simple bounds and greedy algorithms for decomposing a flow into a minimal set of paths. European Journal of Operational Research. 2008, 185 (3): 1390-1401. 10.1016/j.ejor.2006.05.043.CrossRefGoogle Scholar
- 15.Hartman T, Hassidim A, Kaplan H, Raz D, Segalov M: How to split a flow?. INFOCOM. Edited by: Greenberg AG, Sohraby K, IEEE. 2012, 828-836.Google Scholar
- 16.Koch R, Skutella M, Spenke I: Maximum
*k*-Splittable*s, t*-Flows. Theory of Computing Systems. 2008, 43: 56-66. 10.1007/s00224-007-9068-8. [http://dx.doi.org/10.1007/s00224-007-9068-8]CrossRefGoogle Scholar - 17.Salazar F, Skutella M: Single-source
*k*-splittable min-cost flows. Oper Res Lett. 2009, 37 (2): 71-74. 10.1016/j.orl.2008.12.004.CrossRefGoogle Scholar - 18.Van Der Heijden PG, Cruyff M, Van Houwelingen HC: Estimating the Size of a Criminal Population from Police Records Using the Truncated Poisson Regression Model. Statistica Neerlandica. 2003, 57 (3): 289-304. 10.1111/1467-9574.00232.CrossRefGoogle Scholar
- 19.Minoux M: Solving integer minimum cost flows with separable convex cost objective polynomially. Netflow at Pisa, Volume 26 of Mathematical Programming Studies. Edited by: Gallo G, Sandi C. 1986, Springer Berlin Heidelberg, 237-239. [http://dx.doi.org/10.1007/BFb0121104]CrossRefGoogle Scholar
- 20.Weintraub A: A Primal Algorithm to Solve Network Flow Problems with Convex Costs. Management Science. 1974, 21: 87-97. 10.1287/mnsc.21.1.87. [http://EconPapers.repec.org/RePEc:inm:ormnsc:v:21:y:1974:i:1:p:87-97]CrossRefGoogle Scholar
- 21.Lemon Graph Library: [http://lemon.cs.elte.hu/trac/lemon/]
- 22.Traph source code and experiment data: [http://cs.helsinki.fi/gsa/traph/]
- 23.Cilibrasi R, Vitányi PMB: Clustering by compression. IEEE Transactions on Information Theory. 2005, 51 (4): 1523-1545. 10.1109/TIT.2005.844059.CrossRefGoogle Scholar
- 24.RNASeqReadSimulator: [http://www.cs.ucr.edu/~liw/rnaseqreadsimulator.html]
- 25.Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25 (9): 1105-1111. 10.1093/bioinformatics/btp120.PubMedCentralCrossRefPubMedGoogle Scholar

## Copyright information

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.