Integer Linear Programming in Computational Biology: Overview of ILP, and New Results for Traveling Salesman Problems in Biology

Part of the book series: Computational Biology (COBO, volume 29)

Abstract

Integer linear programming (ILP) is a powerful and versatile technique for framing and solving hard optimization problems of many types. In the last several years, ILP has become widely used in computational biology, although predominantly by computationally and mathematically trained researchers, such as Bernard Moret. In an effort to reach a broader set of researchers, this chapter begins with an introduction to ILP, illustrated by the phenomena of cliques and independent sets in biological graphs. Then, the focus shifts to new research results on the use of ILP to solve traveling salesman problems, using compact ILP formulations. Such formulations have been largely declared useless in the optimization literature. However, in this chapter, I argue that the correct compact formulation can be very effective for problems of the size and structure that arise in computational biology. These empirical results, and some additional arguments, then bring into question the relevance of the concept of the strength of an ILP formulation as a predictor of how quickly it will be solved.
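To make the clique and independent-set illustration concrete, here is a minimal sketch of the standard textbook ILP for a maximum independent set: one binary variable per node, and one inequality x[u] + x[v] <= 1 per edge. It is not taken from the chapter. The function names and the five-node example graph are hypothetical, and plain enumeration stands in for a real ILP solver, so it is only usable on tiny graphs. A maximum clique is obtained by applying the same model to the complement graph.

```python
from itertools import product

def max_independent_set(n, edges):
    """Enumerate binary vectors x and keep the best one that satisfies
    x[u] + x[v] <= 1 for every edge (u, v); the objective is sum(x)."""
    best_value, best_x = -1, None
    for x in product((0, 1), repeat=n):
        if all(x[u] + x[v] <= 1 for u, v in edges):
            if sum(x) > best_value:
                best_value, best_x = sum(x), x
    return best_value, best_x

def complement_edges(n, edges):
    """Return the node pairs that are NOT edges of the input graph."""
    present = {frozenset(e) for e in edges}
    return [(u, v) for u in range(n) for v in range(u + 1, n)
            if frozenset((u, v)) not in present]

def max_clique(n, edges):
    """A clique in G is an independent set in the complement of G."""
    return max_independent_set(n, complement_edges(n, edges))

# Hypothetical example: a triangle 0-1-2 with a tail 2-3-4.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
independent_size, independent_x = max_independent_set(5, EDGES)
clique_size, clique_x = max_clique(5, EDGES)
print(independent_size, clique_size)
```

In practice the same variables and inequalities would be handed to an ILP solver rather than enumerated.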

Notes

  1. And the other Bernard (the bookish nerd) in Death of a Salesman.

  2. The introduction is related to, and partly derived from, several sections in [20].

  3. Terminology note: We use “TS” as an abbreviation for “traveling salesman”, which is sometimes followed by “tour” or “path”, as appropriate.

  4. This is a practice I recommend for all empirical, computation-based papers.

  5. Another noncommercial ILP-solver (which is not an LP-solver) that has a good reputation is called SCIP, but I have not had much experience with it.

  6. A binary variable can only be set to value 0 or 1.

  7. The introductory material on TSP, and the descriptions of TSP formulations, are extracted from [20]. The research results and conclusions are new.

  8. Moreover, a freely available, highly engineered program called Concorde mixes many techniques and tricks to solve very large TS problems in practice.

  9. This remains true in 2018, according to people who teach courses on ILP.

  10. Certainly, if one states the TS problem to students without including that constraint, an alert student will ask about it.

  11. Including one that I came up with, which had nearly the worst performance of all.

  12. Certainly, even my laptop is a faster machine than the one used in 2003 by the author of [40]. But the increased machine speed does not account for the difference between the observation today and the understanding in 2003. It should be noted, however, that my experience with MTZ on br17 also contradicts the statement in [40], since the MTZ formulation for br17 solved in 0.54 s.

  13. On the instance where MTZ took over 1 h, the GG formulation took 2.38 s to solve (with Gurobi 8), and the DFJ formulation with separation took 0.02 s. So, DFJ with separation is unquestionably dominant, but the speed of GG here contributes to the new understanding that the right compact TSP formulation is practical, while the instability of MTZ makes it much less reliable.

  14. A later attempt with Gurobi 8 suffered the same fate.

  15. For example, in the benchmark data ch130 of 130 cities, the ILP optimal is 6110 and the LP-opt for GG is 5608, but the assignment optimal is only 4377.

  16. However, I have never seen this stated in the literature.

  17. In fact, I put that prohibition into my first ILP implementations without even thinking about it. Then, when I looked at the empirical results and saw cases where the LP results violated (correct) mathematical theory, I was perplexed until I realized that the theory is only established for the pure formulations.

  18. But remember that the path computation for the input graph G is actually a tour computation on the derived graph \(G'\). Hence, what we learn from these computations concerns TSP tours and ILP formulations for the TS tour problem.

  19. Note, however, what looks like a contradiction in the case of DFJ with separation. In the case of ch130, the LP-Opt reported is 5582. However, the LP-opt for the assignment ILP for ch130 is only 4377, and the two values should be the same if LP-opt was computed exactly as discussed in Sect. 15.7.1. A possible explanation is that the computation of DFJ with separation, implemented by a Gurobi program, added subtour-elimination constraints even before the LP-opt was reported. So, the Gurobi code implementing the separation approach seems not to follow exactly the description given in Sect. 15.7.1. However, in all experiments that did not use the DFJ formulation, the LP-opt value was identical to the value obtained by running the LP-relaxation of the ILP formulation.

References

  1. Agarwala, R., Applegate, D.L., Maglott, D., Schuler, G.D., Schäffer, A.A.: A fast and scalable radiation hybrid map construction and integration strategy. Genome Res. 10(3), 350–364 (2000)

  2. Ahuja, R.K., Magnanti, T.L., Orlin, J.B.: Network Flows: Theory, Algorithms, and Applications. Prentice Hall (1993)

  3. Alizadeh, F., Karp, R.M., Weisser, D., Zweig, G.: Physical mapping of chromosomes using unique probes. J. Comput. Biol. 2, 159–184 (1995)

  4. Althaus, E., Klau, G.W., Kohlbacher, O., Lenhof, H.P., Reinert, K.: Integer linear programming in computational biology. In: Festschrift Mehlhorn, LNCS 5760, pp. 199–218. Springer (2009)

  5. Álvarez-Miranda, E., Ljubić, I., Mutzel, P.: The maximum weight connected subgraph problem. In: Junger, M., Reinelt, G. (eds.) Facets of Combinatorial Optimization, pp. 245–270. Springer (2013)

  6. Bertsimas, D., Weismantel, R.: Optimization Over Integers, vol. 13. Dynamic Ideas, Belmont (MA) (2005)

  7. Blanchette, M., Bourque, G., Sankoff, D.: Breakpoint phylogenies. In: Miyano, S., Takagi, T. (eds.) Genome Informatics, pp. 25–34. University Academy Press (1997)

  8. Blum, C., Festa, P.: Metaheuristics for String Problems in Bio-informatics. Wiley (2016)

  9. Chimani, M., Rahmann, S., Bocker, S.: Exact ILP solutions for phylogenetic minimum flip problems. In: Proceedings of the First ACM-BCB Conference, pp. 147–153 (2010)

  10. Claus, A.: A new formulation for the travelling salesman problem. SIAM J. Algebr. Discr. Methods 5, 21–25 (1984)

  11. Conforti, M., Cornuejols, G., Zambelli, G.: Integer Programming. Springer (2014)

  12. Dantzig, G.B., Fulkerson, D.R., Johnson, S.M.: Solution of a large-scale travelling-salesman problem. Oper. Res. 2, 393–410 (1954)

  13. Felsenstein, J.: Inferring Phylogenies. Sinauer (2004)

  14. Forrester, R., Greenberg, H.J.: Quadratic binary programming models in computational biology. Alg. Oper. Res. 3, 110–129 (2008)

  15. Fox, K., Gavish, B., Graves, S.: An n-constraint formulation of the (time-dependent) traveling salesman problem. Oper. Res. 28, 1018–1021 (1980)

  16. Frumkin, J.P., Patra, B.N., Sevold, A., Ganguly, K., Patel, C., Yoon, S., Schmid, M.B., Ray, A.: The interplay between chromosome stability and cell cycle control explored through gene-gene interaction and computational simulation. Nucleic Acids Res. 44, 8073–8085 (2016)

  17. Gavish, B., Graves, S.: The travelling salesman problem and related problems. Working Paper OR 078-78. Technical Report. MIT, Operations Research Center (1978)

  18. Gouveia, L., Vos, S.: A classification of formulations for the (time-dependent) traveling salesman problem. Europ. J. Oper. Res. 83, 69–82 (1995)

  19. Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press (1997)

  20. Gusfield, D.: Integer linear programming in computational and systems biology: an entry-level text and course. Cambridge University Press (2019)

  21. Gusfield, D., Frid, Y., Brown, D.: Integer programming formulations and computations solving phylogenetic and population genetic problems with missing or genotypic data. In: Proceedings of 13th Annual International Conference on Combinatorics and Computing, pp. 51–64. LNCS 4598, Springer (2007)

  22. Huttlin, E.L., Ting, L., Bruckner, R.J., Gebreab, F., Gygi, M.P., Szpyt, J., Tam, S., Zarraga, G., Colby, G., Baltier, K., Dong, R., Guarani, V., Vaites, L.P., Ordureau, A., Rad, R., Erickson, B.K., Wühr, M., Chick, J., Zhai, B., Kolippakkam, D., Mintseris, J., Obar, R.A., Harris, T., Artavanis-Tsakonas, S., Sowa, M.E., Camilli, P.D., Paulo, J.A., Harper, J.W., Gygi, S.P.: The BioPlex network: a systematic exploration of the human interactome. Cell 162, 425–440 (2015)

  23. Johnson, M., Hummer, G.: Interface-resolved network of protein-protein interactions. PLoS Comput. Biol. 9, e1003065 (2013)

  24. Johnson, O., Liu, J.: A traveling salesman approach for predicting protein functions. Source Code Biol. Med. 1, (2006)

  25. Kingsford, C.L., Chazelle, B., Singh, M.: Solving and analyzing side-chain positioning problems using linear and integer programming. Bioinformatics 21, 1028–1036 (2005)

  26. Korostensky, C., Gonnet, G.: Near optimal multiple sequence alignments using a traveling salesman problem approach. In: Proceedings of String Processing and Information Retrieval Symposium, p. 105. IEEE (1999)

  27. Korostensky, C., Gonnet, G.: Using traveling salesman problem algorithms for evolutionary tree construction. Bioinformatics 16, 619–627 (2000)

  28. Lancia, G.: Integer programming models for computational biology problems. J. Comp. Sci. Tech. 19, 60–77 (2004)

  29. Lancia, G.: Mathematical programming in computational biology: an annotated bibliography. Algorithms 1, 100–129 (2008)

  30. Langevin, A., Soumis, F., Desrosiers, J.: Classification of travelling salesman problem formulations. Oper. Res. Lett. 9, 127–132 (1990)

  31. Lorenzo, E., Camacho-Caceres, K., Ropelewski, A.J., Rosas, J., Ortiz-Mojer, M., Perez-Marty, L., Irizarry, J., Gonzalez, V., Rodríguez, J.A., Cabrera-Rios, M., Isaza, C.: An optimization-driven analysis pipeline to uncover biomarkers and signaling paths: cervix cancer. Microarrays 4(2), 287–310 (2015)

  32. Mazza, A., Klockmeier, K., Wanker, E., Sharan, R.: An integer programming framework for inferring disease complexes from network data. Bioinformatics 32, i271–i277 (2016)

  33. Miller, C., Tucker, R., Zemlin, R.: Integer programming formulation of traveling salesman problems. J. Assoc. Comput. Mach. pp. 326–329 (1960)

  34. Moret, B., Bader, D.A., Warnow, T.: High-performance algorithm engineering for computational phylogenetics. J. Supercomput. 22, 99–111 (2002)

  35. Oncan, T., Altınel, I., Laporte, G.: A comparative analysis of several asymmetric traveling salesman problem formulations. Comp. Oper. Res. 36, 637–654 (2009)

  36. Orman, A., Williams, H.: A survey of different integer programming formulations of the travelling salesman problem. Technical Report, Department of Operational Research, London School of Economics and Political Science (2004)

  37. Orman, A., Williams, H.P.: A survey of different integer programming formulations of the travelling salesman problem. In: Kontoghiorghes, E., Gatu, C. (eds.) Optimisation, Econometric and Financial Analysis, vol. 9, pp. 91–104. Springer, Berlin, Heidelberg (2007)

  38. Padberg, M., Sung, T.Y.: An analytical comparison of different formulations of the travelling salesman problem. Math. Prog. 52, 315–357 (1991)

  39. Pataki, G.: The bad and the good-and-ugly. Technical Report, Columbia University, IEOR (2000). CORC 2000-1

  40. Pataki, G.: Teaching integer programming formulations using the traveling salesman problem. SIAM Rev. 45, 116–123 (2003)

  41. Reinelt, G.: TSPLIB-A traveling salesman problem library. ORSA J. Comp. 3, 376–384 (1991)

  42. Reiter, J., Makohon-Moore, A., Gerold, J., Bozic, I., Chatterjee, K., Iacobuzio-Donahue, C., Vogelstein, B., Nowak, M.: Reconstructing metastatic seeding patterns of human cancers. Nat. Commun. 8, (2017)

  43. Sankoff, D., Blanchette, M.: Multiple genome rearrangement and breakpoint phylogeny. J. Comp. Biol. 5, 555–570 (1998)

  44. Sawik, T.: A note on the Miller-Tucker-Zemlin model for the asymmetric traveling salesman problem. Bull. Polish Acad. Sci. Tech. Sci. 64, 517–520 (2016)

  45. Shao, M., Lin, Y., Moret, B.M.: An exact algorithm to compute the DCJ distance for genomes with duplicate genes. J. Comput. Biol. 22(5), 425–435 (2015)

  46. Shao, M., Moret, B.M.E.: Comparing genomes with rearrangements and segmental duplications. Bioinformatics 31(12), i329–i338 (2015)

  47. Shao, M., Moret, B.M.E.: A fast and exact algorithm for the exemplar breakpoint distance. J. Comput. Biol. 23(5), 337–346 (2016)

  48. Shao, M., Moret, B.M.E.: On computing breakpoint distances for genomes with duplicate genes. J. Comput. Biol. 24(6), 571–580 (2017)

  49. Wong, R.: Integer programming formulations of the traveling salesman problem. In: Rabbat, G. (ed.) Proceedings of ICCC 80, IEEE Conference on Circuits and Computing, pp. 149–152 (1980)

Acknowledgements

This research was supported by NSF grant 1528234. The research was done partly while on sabbatical at the Simons Institute for the Theory of Computing, UC Berkeley. I would also like to thank Thong Le for help in understanding proofs about strength, and Jim Orlin, T. L. Magnanti, and David Shmoys for helpful communications. Finally, I thank Tandy Warnow, Mohammed El-Kebir, and the anonymous reviewers, who provided many helpful suggestions.

Author information

Corresponding author

Correspondence to Dan Gusfield.

Appendices

Appendix 1: Data for Random Graphs

See Table 15.1 for results of experiments with different compact TSP formulations on a range of random graphs that differ in the number of nodes they contain, and their edge density.

Table 15.1 Results for random graphs. Each case is identified by its number of nodes and its edge density, d, together with the number of datasets (replicates), r, generated for that case. The three columns show the results for the GG, MTZ, and CLAUS formulations, respectively. All three formulations contain inequalities that explicitly prohibit an edge from being traversed in both directions. The times and standard deviations are all reported in seconds. Any number larger than ten has been rounded to one decimal place, and any number larger than 200 has been rounded to the nearest integer. The large inefficiency of the CLAUS formulation, compared to MTZ and GG, is clearly established with graphs containing only 100 nodes, so no computations with the CLAUS formulation were done for a larger number of nodes. Similarly, due to the inefficiency of MTZ with 400 nodes and an edge density of 0.25, more challenging computations were not done with the MTZ formulation. All the computations were done on a MacBook Pro laptop, except for the entry for 100, 0.5, marked with a ‘*’, which ran on a somewhat faster iMac desktop.

Appendix 2: Data from Benchmark Tests

Experiments on several well-known TSP benchmark test sets covering a range of sizes are shown in Table 15.2. All the formulations, except FGG4, have inequalities that prohibit an edge from being traversed in both directions. The numerals in each ID give the number of cities, from 17 to 229. The letter ‘A’ or ‘S’ indicates whether the problem is for a directed (asymmetric) or an undirected (symmetric) graph. Both the optimal tour cost and the optimal path cost (with no designated start or stop nodes) were computed and written next to the problem ID. Each of the ILP formulations is for an optimal TS path, unless “tour” is indicated (see Note 18).
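The inequalities that prohibit traversing an edge in both directions can be read as x[i][j] + x[j][i] <= 1 for every pair of nodes. The sketch below is an illustration under stated assumptions, not the chapter's code: it considers a complete directed graph, identifies each integer solution of the assignment constraints (one arc out of and one arc into every node, no self-loops) with a successor permutation, and counts how many such solutions survive the two-direction prohibition and how many are true tours. The function names are hypothetical.

```python
from itertools import permutations

def cycle_lengths(succ):
    """Return the cycle lengths of a successor map, where succ[i] is the
    node visited immediately after node i."""
    seen, lengths = set(), []
    for start in range(len(succ)):
        if start in seen:
            continue
        length, node = 0, start
        while node not in seen:
            seen.add(node)
            node = succ[node]
            length += 1
        lengths.append(length)
    return lengths

def count_solutions(n):
    """Count (assignment solutions, those with no 2-cycle, single tours)
    among all successor permutations on n nodes."""
    assignment = no_two_cycle = tours = 0
    for succ in permutations(range(n)):
        lengths = cycle_lengths(succ)
        if 1 in lengths:       # a self-loop: not an assignment solution here
            continue
        assignment += 1
        if 2 not in lengths:   # x[i][j] + x[j][i] <= 1 holds for all pairs
            no_two_cycle += 1
        if lengths == [n]:     # one cycle through every node: a tour
            tours += 1
    return assignment, no_two_cycle, tours

for n in (4, 6):
    print(n, count_solutions(n))
```

With four nodes the prohibition alone leaves only tours, but with six nodes solutions made of two 3-cycles remain, which is why a formulation still needs subtour elimination of some kind (as in MTZ, GG, or DFJ) on top of these inequalities.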

The entry in the column for “gap” is empty if the computation ran to completion, and otherwise is the gap when the computation was terminated. An entry for “Time” is the time at completion or termination of the ILP computation; and an entry for “LP-Opt” gives the optimal cost of the LP-relaxation of the TS problem, as reported by Gurobi. The LP-Opt cost can be compared to the cost indicated next to the problem ID, as a measure of the strength of the ILP formulation (see Note 19). These LP-opt values can also be compared to each other to validate known theory about the strength of ILP formulations, or to question the relevance of that theory. This is discussed in Sect. 15.11.
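As a worked instance of comparing LP-Opt to the optimal cost, the sketch below uses the ch130 figures quoted in Note 15 (ILP optimal 6110, LP-opt for GG 5608, assignment optimal 4377). The relative-difference measure and the name `relative_lp_gap` are assumptions chosen for illustration. This quantity is not the “gap” column of Table 15.2, which records the solver's gap at termination.

```python
def relative_lp_gap(ilp_opt, lp_opt):
    """Fraction of the ILP optimum that the LP relaxation fails to reach
    (for a minimization problem, so lp_opt <= ilp_opt)."""
    return (ilp_opt - lp_opt) / ilp_opt

# ch130 values quoted in Note 15.
ILP_OPT = 6110
LP_OPT = {"GG": 5608, "assignment": 4377}

gaps = {name: relative_lp_gap(ILP_OPT, value) for name, value in LP_OPT.items()}
for name, gap in gaps.items():
    print(f"{name}: {100 * gap:.1f}%")
```

On these numbers the GG relaxation falls roughly 8% short of the ILP optimum, while the assignment relaxation falls roughly 28% short. A smaller shortfall corresponds to a stronger formulation in the sense discussed in the text.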

Table 15.2 Data for benchmark TS problems. The ID gives the name of the test, and is followed by its optimal TSP cost, whether the test problem is for a path or a tour, and whether the test problem is symmetric (S) or asymmetric (A). The next entry is the ILP formulation used to solve the test. Each of these formulations is for the TS path problem, unless “tour” is stated. The next column, labeled GAP, is empty if the computation ran to completion, but shows the size of the gap if the computation was terminated before completion. The next column, Time, gives the time for the computation, either to its completion or to its termination. The final column, LP-Opt, gives the optimal value of the LP relaxation of the test problem.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Gusfield, D. (2019). Integer Linear Programming in Computational Biology: Overview of ILP, and New Results for Traveling Salesman Problems in Biology. In: Warnow, T. (ed.) Bioinformatics and Phylogenetics. Computational Biology, vol 29. Springer, Cham. https://doi.org/10.1007/978-3-030-10837-3_15

  • DOI: https://doi.org/10.1007/978-3-030-10837-3_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-10836-6

  • Online ISBN: 978-3-030-10837-3

  • eBook Packages: Computer Science, Computer Science (R0)
