Abstract
Continuous advances in DNA sequencing technologies are driving an accelerating accumulation of nucleotide sequence data at the whole-genome scale. As a consequence, evolutionary biology researchers have to rely on a growing number of increasingly complex software tools. All widely used tools in the field have grown considerably in the number of features and lines of code, and consequently in software complexity. Exploiting parallelism on multi-core and hardware accelerator architectures increases complexity further. Moreover, typical analysis pipelines now include substantially more components than they did 5–10 years ago. A topic that has received little attention in this context is the code quality and verification of widely used data analysis software. Unfortunately, the majority of users still tend to blindly trust the software and the results it produces. We therefore assessed the software quality of three highly cited tools in population genetics (Genepop, Migrate, Structure) that are routinely used in current data analysis pipelines and studies. We also review little-known problems associated with floating-point arithmetic in conjunction with parallel processing. Since the software quality of the tools we analyzed is rather mediocre, we provide a list of best practices for improving the quality of existing tools and list techniques that can be deployed for developing reliable, high-quality scientific software from scratch. Finally, we discuss some general policy issues that need to be addressed to improve software quality and to ensure support for developing new and maintaining existing software.
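The floating-point issue mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the chapter or from any of the tools assessed; the function names `naive_sum` and `chunked_sum` and the input values are hypothetical, chosen only to show that floating-point addition is not associative, so splitting a sum across workers can change the result.

```python
import math


def naive_sum(values):
    """Add values left to right with plain floating-point addition."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Sum each chunk separately, then add the partial sums.

    This mimics (sequentially) how a parallel reduction distributes
    the additions over n_chunks workers. Hypothetical helper.
    """
    size = math.ceil(len(values) / n_chunks)
    partials = [naive_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return naive_sum(partials)


# Assumed toy input: two large terms that cancel and two small terms.
values = [1e17, 1.0, -1e17, 1.0]

serial = chunked_sum(values, 1)       # one worker: 1.0
two_workers = chunked_sum(values, 2)  # two workers: 0.0
reference = math.fsum(values)         # correctly rounded sum: 2.0

print(serial, two_workers, reference)
```

Under these assumptions the same data yield 1.0 with one worker, 0.0 with two workers, and 2.0 with the correctly rounded `math.fsum`. This is one reason why results from a parallel run may not be bitwise reproducible across different numbers of cores.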
Acknowledgements
This work was financially supported by the Klaus Tschira Foundation.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Stamatakis, A. (2018). Population and Evolutionary Genetic Inferences in the Whole-Genome Era: Software Challenges. In: Rajora, O. (eds) Population Genomics. Population Genomics. Springer, Cham. https://doi.org/10.1007/13836_2018_42
DOI: https://doi.org/10.1007/13836_2018_42
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04587-6
Online ISBN: 978-3-030-04589-0
eBook Packages: Biomedical and Life Sciences (R0)