Scalable Computing for Evolutionary Genomics

  • Protocol
Evolutionary Genomics

Part of the book series: Methods in Molecular Biology (MIMB, volume 856)

Abstract

Genomic data analysis in evolutionary biology is becoming so computationally intensive that analyzing multiple hypotheses and scenarios takes too long on a single desktop computer. In this chapter, after a quick overview of advanced programming techniques, we discuss techniques for scaling computations through parallelization. Unfortunately, parallel programming is difficult and requires special software design. The alternative, especially attractive for legacy software, is "poor man's parallelization": running whole programs in parallel as separate processes, using job schedulers. Such pipelines are often deployed on bioinformatics computer clusters. Recent advances in PC virtualization have made it possible to run a full computer operating system, with all of its installed software, on top of another operating system, inside a "box," or virtual machine (VM). Such a VM can be deployed flexibly on multiple computers in a local network (e.g., on existing desktop PCs) and even in the Cloud, to create a "virtual" computer cluster. Many bioinformatics applications in evolutionary biology can be run in parallel by running processes in one or more VMs. Here, we show how a ready-made bioinformatics VM image, named BioNode, creates a computing cluster and pipeline in a few steps. This allows researchers to scale up computations from their desktop, using available hardware, whenever required. BioNode is based on Debian Linux and can run on networked PCs and in the Cloud. Over 200 bioinformatics and statistical software packages of interest to evolutionary biology are included, such as PAML, Muscle, MAFFT, MrBayes, and BLAST. Most of these software packages are maintained through the Debian Med project. In addition, BioNode contains convenient configuration scripts for parallelizing bioinformatics software.
Where Debian Med encourages packaging free and open source bioinformatics software through one central project, BioNode encourages creating free and open source VM images, for multiple targets, through one central project. BioNode can be deployed on Windows, OS X, Linux, and in the Cloud. In addition to the downloadable BioNode images, we provide online tutorials that enable bioinformaticians to install and run BioNode in different environments, as well as information for future initiatives on creating and building such images.

Availability: The 32-bit and 64-bit BioNode desktop images for VirtualBox and the BioNode Cloud images are based on free and open source software and can be found at http://www.evolutionarygenomics.net/ and http://biobeat.org/bionode.
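
The "poor man's parallelization" described in the abstract, running whole programs in parallel as separate processes, can be illustrated with a minimal Python sketch. This is not the BioNode configuration and not a job scheduler such as those the chapter discusses; it only shows the general idea on a single machine. The helper names (`run_job`, `run_parallel`, `placeholder_command`) and the input file names are hypothetical, and the placeholder command merely prints a message where a real analysis program would be invoked.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(command):
    """Run one whole program as a separate OS process and return its output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def run_parallel(commands, max_workers=4):
    """Run independent commands concurrently; results keep the input order."""
    # Threads only wait on the child processes; the work itself happens
    # in the separate processes started by subprocess.run.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_job, commands))


def placeholder_command(name):
    # Hypothetical stand-in for a real tool invocation on one input file.
    return [sys.executable, "-c", f"print('processed {name}')"]


if __name__ == "__main__":
    # Hypothetical input names, one independent job per input.
    inputs = ["gene1.fasta", "gene2.fasta", "gene3.fasta"]
    for line in run_parallel([placeholder_command(n) for n in inputs]):
        print(line)
```

Because each job is an unmodified program in its own process, this style of parallelization needs no changes to legacy software; on a cluster, a job scheduler would take over the role played here by the thread pool.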

Acknowledgments

The European Commission’s Integrated Project BIOEXPLOIT (FOOD-2005-513959 to G.S. and P.P.); the Netherlands Organization for Scientific Research/TTI Green Genetics (1CC029RP to P.P.).

Author information

Corresponding author

Correspondence to Pjotr Prins.

Copyright information

© 2012 Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Prins, P., Belhachemi, D., Möller, S., Smant, G. (2012). Scalable Computing for Evolutionary Genomics. In: Anisimova, M. (eds) Evolutionary Genomics. Methods in Molecular Biology, vol 856. Humana Press. https://doi.org/10.1007/978-1-61779-585-5_22

  • DOI: https://doi.org/10.1007/978-1-61779-585-5_22

  • Publisher Name: Humana Press

  • Print ISBN: 978-1-61779-584-8

  • Online ISBN: 978-1-61779-585-5

  • eBook Packages: Springer Protocols
