Population and Evolutionary Genetic Inferences in the Whole-Genome Era: Software Challenges
Continuous advances in DNA sequencing technologies are driving an ever faster accumulation of nucleotide sequence data at the whole-genome scale. As a consequence, researchers in evolutionary biology have to rely on a growing number of increasingly complex software tools. All widely used tools in the field have grown considerably in the number of features and lines of code, and hence in software complexity. Exploiting parallelism on multi-core and hardware accelerator architectures increases complexity further. Moreover, typical analysis pipelines now include substantially more components than they did 5–10 years ago. A topic that has received little attention in this context is the code quality and verification of widely used data analysis software. Unfortunately, the majority of users still tend to trust the software and the results it produces blindly. To address this gap, we assessed the software quality of three highly cited population genetics tools (Genepop, Migrate, Structure) that are routinely used in current data analysis pipelines and studies. We also review little-known problems associated with floating-point arithmetic in conjunction with parallel processing. Since the software quality of the tools we analyzed is rather mediocre, we provide a list of best practices for improving the quality of existing tools, and we list techniques that can be deployed for developing reliable, high-quality scientific software from scratch. Finally, we discuss some general policy issues that need to be addressed to improve software quality and to ensure support for developing new software and maintaining existing software.
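The floating-point issue mentioned above can be illustrated with a minimal sketch. Floating-point addition is not associative, so a parallel reduction that splits a sum into per-thread partial sums may return a different result than a serial loop over the same data. The Python code below only mimics such a reduction; the function names `naive_sum` and `chunked_sum` and the input values are hypothetical choices for illustration and are not taken from any of the tools assessed in the chapter.

```python
import math


def naive_sum(values):
    """Add the values left to right in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Mimic a parallel reduction: sum contiguous chunks, then add the partial sums."""
    size = -(-len(values) // n_chunks)  # ceiling division
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)


# Hypothetical input: two large values that cancel, plus two small ones.
values = [1e16, 1.0, -1e16, 1.0]

serial = chunked_sum(values, 1)       # one "thread": plain left-to-right sum
two_threads = chunked_sum(values, 2)  # two "threads": two partial sums
reference = math.fsum(values)         # correctly rounded sum of the same values

print(serial, two_threads, reference)
```

Under IEEE 754 double precision the three results differ from one another even though the input is identical, which is why the output of a parallel program can change with the number of threads or processes used.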
Keywords: Numerical stability; Parallel computing; Reproducibility; Software quality; Software verification
This work was financially supported by the Klaus Tschira Foundation.