A distributed framework for aligning short reads to genomes
Keywords: Feasible Solution, Combinatorial Library, Main Computation, Independent Task, Multiple Machine
Computational methods that employ next-generation sequencing technologies often depend on the alignment of short reads to genomes. In a typical workflow, such methods might require millions of independent alignment operations. Although using a high-performance cluster (HPC) to distribute these computationally independent tasks can speed up the process significantly, an HPC can be expensive, wasteful and sometimes not a feasible solution. We propose a distributed framework aimed specifically at distributing the task of aligning short reads to genomes across multiple machines efficiently and effectively. The framework is designed to be simple to set up and to scale.
Materials and methods
To accomplish this, we implemented the framework in the Go programming language, which has built-in support for concurrent computation, and used a high-performance networking library called ZeroMQ [2, 3] to distribute queries effectively. Specifically, we use the Pipeline pattern from ZeroMQ. This pattern has three main parts: (1) a ventilator, which distributes reads to workers; (2) workers, which perform the main computation and send results to a sink; and (3) a sink, which collects results from the workers. Our design has three stages. In the listening stage, the system sets up: the ventilator sends a REQ message, along with other important information, to the workers, and the workers load the index into RAM. In the query stage, the ventilator distributes the reads to the workers, which align them against the index loaded in the listening stage. In the last stage, the system closes: after distributing all the reads, the ventilator sends an END message to the workers so that they can close their sockets once they have processed all reads.
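The three stages can be illustrated with a minimal, single-process sketch. It is not the framework's code: Go channels stand in for the ZeroMQ sockets, exact substring search stands in for a real aligner, and the names message, hit, worker and runPipeline, as well as the "READ" message kind, are hypothetical. Only the REQ and END message names and the ventilator, worker and sink roles come from the description above. Reads are handed out round-robin, which is also how a ZeroMQ PUSH socket spreads messages over its connected peers.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// message is what the ventilator sends to a worker.
// Kind is "REQ" (set up), "READ" (align this read) or "END" (shut down).
// "READ" is an illustrative name, not taken from the framework.
type message struct {
	Kind    string
	Payload string
}

// hit is what a worker sends to the sink. Pos is -1 if the read was not found.
type hit struct {
	Read string
	Pos  int
}

// worker keeps the "index" received in the listening stage and answers
// reads until it sees END. Exact substring search is a stand-in for a
// real short-read aligner.
func worker(in <-chan message, out chan<- hit, wg *sync.WaitGroup) {
	defer wg.Done()
	var index string
	for msg := range in {
		switch msg.Kind {
		case "REQ": // listening stage: load the index into memory
			index = msg.Payload
		case "READ": // query stage: align one read and forward the result
			out <- hit{Read: msg.Payload, Pos: strings.Index(index, msg.Payload)}
		case "END": // closing stage: stop receiving
			return
		}
	}
}

// runPipeline plays the ventilator role and returns what the sink collected,
// sorted by read so the output is deterministic. nWorkers must be positive.
func runPipeline(reference string, reads []string, nWorkers int) []hit {
	inboxes := make([]chan message, nWorkers)
	results := make(chan hit)
	var wg sync.WaitGroup
	for i := range inboxes {
		inboxes[i] = make(chan message, 16)
		wg.Add(1)
		go worker(inboxes[i], results, &wg)
	}

	// Sink: collect every result until the results channel is closed.
	done := make(chan []hit)
	go func() {
		var hits []hit
		for h := range results {
			hits = append(hits, h)
		}
		done <- hits
	}()

	for _, in := range inboxes { // listening stage
		in <- message{Kind: "REQ", Payload: reference}
	}
	for i, r := range reads { // query stage, round-robin distribution
		inboxes[i%nWorkers] <- message{Kind: "READ", Payload: r}
	}
	for _, in := range inboxes { // closing stage
		in <- message{Kind: "END"}
	}

	wg.Wait()      // every worker has processed its reads and seen END
	close(results) // lets the sink finish
	hits := <-done
	sort.Slice(hits, func(a, b int) bool { return hits[a].Read < hits[b].Read })
	return hits
}

func main() {
	hits := runPipeline("GATTACAGATTACA", []string{"TTAC", "ACAG", "CCCC"}, 2)
	for _, h := range hits {
		fmt.Printf("%s -> %d\n", h.Read, h.Pos)
	}
}
```

In a deployment across several machines, each channel would be replaced by a network socket and the workers would run as separate processes; the ordering of the three stages stays the same.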
Simulation showed that the running time of alignment decreased linearly with the number of workers. The system is easy to use and deploy.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.