
1 Introduction

Computational methodology has become a significant component of drug design and discovery in the pharmaceutical industry [1,2,3,4]. Molecular docking and molecular structure similarity are two of the most frequently used computational approaches. High-Throughput Virtual Screening (HTVS) computationally screens large compound libraries that contain from tens of thousands to millions of small molecules, which produces a lots-of-small-files workload for a virtual screening campaign. With the development of high-performance computers, virtual drug screening has been accelerating. However, HTVS still faces challenges when a large-scale virtual screening application is executed on High-Performance Computing (HPC) resources, such as distributing massive numbers of tasks, analyzing lots of small molecular structure files, and implementing fault tolerance.

Tools have been developed to accelerate HTVS on HPC resources. Falkon [5] is a lightweight execution framework that enables loosely coupled programs to run on petascale systems; benchmarks [6] show that DOCK5 can scale to 116,000 cores with high efficiency under Falkon. VinaMPI is an MPI program built on the AutoDock Vina (ADV) package that uses a large number of cores to speed up individual docking tasks; it successfully ran on 84,672 cores of the Kraken supercomputer and efficiently reduced the total time-to-completion. However, these works focus on the performance and efficiency of task distribution and neglect the rest of the HTVS process, for instance robustness, recoverability, and result analysis. FireWorks (FWS) [7] is a workflow package for high-throughput calculations on supercomputers that effectively solves concurrent task distribution and fault tolerance management and provides an intuitive graphical interface, but it emphasizes versatility and usability. DVSDMS [8] is a distributed virtual screening data management system that addresses only the data management issues of the high-throughput docking process. Therefore, both the architecture of high-performance computers and the computational characteristics of the application need to be considered when designing a framework for HTVS on such machines.

In this work, we report a general framework, the High-performance High-throughput Virtual Screening Framework (HHVSF), which enables large-scale, multitasking applications with small-size input and output (IO) files to execute efficiently on HPC resources. The framework contains task management and data management systems that can handle thousands of tasks, manage a large volume of small files, and reduce the long processing time for analysis. The purpose of HHVSF is to provide high computational performance on the basis of portability, availability, serviceability and stability (PASS).

2 Experimental and Computational Details

HHVSF is comprised of two parts: task management and distributed data management (see Fig. 1). To access and store data efficiently and flexibly, program executions are loosely coupled through the MongoDB C driver, so the application codes do not need to be modified. The following subsections document the overall framework of HHVSF; the simulation parameters and the data sets of the experiments are introduced at the end of this section. ADV [9] and WEGA [10] were chosen as representative applications for the experiments, and other programs can be integrated into HHVSF in a similar way.

Fig. 1. The hardware and relevant operations in HHVSF.

2.1 Task Management

The task management system mainly addresses three concerns: two-level task scheduling, a preemptive scheduling algorithm for workers, and recovery of failed tasks.

2.1.1 Task Scheduling

HTVS employs massive computing resources to support a large number of independent computing tasks. Because most molecular docking and molecular structure similarity tools, for instance ADV, Gold [11], Glide [12], FlexX [13] and WEGA, are serial or threaded codes, these computing tasks are typical fine-grained Many-Task Computing (MTC) workloads [6]. Such MTC tasks cannot take full advantage of static scheduling with coarse granularity, yet most traditional large-scale HPC resources are configured with coarse scheduling granularity under the control of a batch queuing system such as the Simple Linux Utility for Resource Management (SLURM) [14], Portable Batch System (PBS)/Torque [15], or Sun Grid Engine (SGE) [16].

A multi-level scheduling method can effectively satisfy the different scheduling-granularity requirements of applications while maintaining unified management of the computing resources. The first-level scheduler requests a block of resources and hands it to the second level for task distribution; the second-level scheduler refines the computing resources and then distributes the tasks. HTCondor [17] is chosen as the second-level scheduler to dispatch tasks. HTCondor is a full-featured batch workload management system for coordinating a large number of independent serial or parallel jobs in a High-Throughput Computing (HTC) environment. We configure HTCondor with one core per slot to provide more flexible task scheduling.

2.1.2 Preemptive Scheduling Algorithm

Molecular docking and molecular structure similarity are typical MTC applications, but maintaining millions of tasks in HTCondor to screen a database with millions of ligands or conformers is still a tough job. Thus, we transform MTC into HTC by wrapping the ADV or WEGA program with the MongoDB C driver (version 1.4.2) as a worker. Each worker preemptively accesses the database to get input files until all data have been traversed. MongoDB provides an atomic "inc" operation to ensure data consistency when numerous workers start concurrently, so that each worker obtains a unique job. After the worker obtains data from the database, the data are written to a file stored on a local file system implemented in RAMDISK. The kernel function's computational procedure is shown in Fig. 2.
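Although the production wrappers are written with the MongoDB C driver, the claim-a-task step can be illustrated with an equivalent pymongo sketch; the connection string, collection names and fields (counters, ligands, next_index, total) are hypothetical:

# Minimal sketch of preemptive task claiming with an atomic "$inc" update
# (collection and field names are hypothetical).
from pymongo import MongoClient

db = MongoClient("mongodb://head-node:27017")["hhvsf"]

def claim_next_task():
    # Atomically bump the shared cursor; each worker receives a unique index
    # because find_one_and_update returns the pre-increment document.
    counter = db.counters.find_one_and_update(
        {"_id": "ligand_cursor"},
        {"$inc": {"next_index": 1}},
    )
    if counter is None or counter["next_index"] >= counter["total"]:
        return None                        # all data have been traversed
    return db.ligands.find_one({"index": counter["next_index"]})

A worker repeatedly calls claim_next_task(), writes the returned record to the RAMDISK and launches ADV or WEGA on it, until None is returned.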

Fig. 2. The pseudo code of vina_wrapper and wega_wrapper.

2.1.3 Fault Tolerance

Fault tolerance in this context can be simplified to the ability to automatically restart a task when the original run fails. When HTVS scales to millions of tasks or runs long tasks, failures easily occur because of bad input parameters or external issues such as compute node faults, IO blocking, and network latency. There are two ways to provide fault tolerance in this case: one is to monitor the job status during the run with the job management system, the other is to mark each task as successful or failed after it finishes. HTCondor provides a checkpoint mechanism in its standard universe by using condor_compile to relink the executable with the HTCondor libraries, but the coupled programs vina_wrapper and wega_wrapper contain system calls such as system() and therefore cannot use HTCondor's checkpointing service. As a result, we chose the second method. When a worker successfully calls the ADV or WEGA executable, a tag representing the task status is inserted into the corresponding document in the MongoDB database. After the job is finished, the documents carrying a failed tag are queried and the failed jobs are restarted.
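As a hedged illustration of the second method (the status values and collection name are assumptions), the post-run check reduces to a query on the tag followed by a re-queue:

# Minimal sketch of the failed-task check; the worker sets status to "done"
# only after ADV or WEGA returns successfully (names hypothetical).
from pymongo import MongoClient

db = MongoClient("mongodb://head-node:27017")["hhvsf"]

def failed_task_ids():
    # Any task whose tag never became "done" is treated as failed.
    return [doc["_id"] for doc in db.tasks.find({"status": {"$ne": "done"}}, {"_id": 1})]

def requeue(task_ids):
    # Reset the tag so that idle workers pick these tasks up again.
    db.tasks.update_many({"_id": {"$in": task_ids}}, {"$set": {"status": "pending"}})

requeue(failed_task_ids())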

2.2 Data Management

Data storage, data relocation and data analysis become bottlenecks when a virtual screening campaign scales to millions of tasks on thousands of cores. Such scattered lots-of-small files can overburden the shared file system with IO operations if the plain files are accessed and stored directly. A database offers an attractive solution to both storage and querying. In our framework, we avoid the shared file system by replacing it with the combination of MongoDB and local RAMDISK: the IO files are stored in MongoDB and cached in the local RAMDISK during computation. The following three subsections describe the details of data storage, data relocation and data analysis in HHVSF.

2.2.1 NoSQL Database for Storage

Chemical databases are critical components of HTVS, providing the basic information to build knowledge-based models for drug discovery and design. Databases such as PubChem [18], ZINC [19], ChEMBL [20], and ChemDB [21] contain millions of compounds and provide shape data, physical properties, biological activities, and other information for pharmaceutical evaluation. Currently, many molecular docking programs and molecular structure similarity algorithms read their input and store their output in plain text files, which become difficult to maintain and analyze as the data grow rapidly.

MongoDB [22] is used as the data repository engine; it is a high-performance, highly available, automatically scaling, open-source NoSQL (Not Only SQL) database. Its architecture is suitable for sparse, document-like data. By using MongoDB indexes, molecules can be queried and ranked easily. In addition, MongoDB uses sharding (a method for distributing data across multiple machines) to support deployments with large data sets at high throughput, sustaining performance by balancing query load as the database grows. Finally, a single MongoDB document can hold up to 16 MB of data, which is exploited by WEGA to access large conformation SDF input files.
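For example (a sketch with hypothetical collection and field names), a single index on the docking score is enough for the ranking queries in Sect. 2.2.3 to avoid full collection scans:

# Minimal sketch of index creation for ranked queries (names hypothetical).
from pymongo import MongoClient, ASCENDING

db = MongoClient("mongodb://head-node:27017")["hhvsf"]
db.results.create_index([("score", ASCENDING)])   # more negative ADV scores rank first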

2.2.2 Data Relocation

ADV and WEGA process large numbers of plain text files. Without modifying their source codes, the programs would have to handle a huge number of small molecular structure files on the shared file system when screening a large compound library. Hence, the RAMDISK of each computing node is used to temporarily store the IO files needed by the applications (see Fig. 3). The RAMDISK provides high-speed, low-latency IO for handling lots of small files, while the high-capacity shared file storage is still used to store the resulting data. By relocating data between MongoDB and the RAMDISK, the IO pressure on the shared file storage is effectively mitigated.

Fig. 3. The flowchart of the data relocation.
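The relocation logic lives inside vina_wrapper and wega_wrapper; the following pymongo/subprocess sketch only illustrates the idea, assuming a tmpfs mount at /dev/shm, a vina binary on PATH, a pre-staged conf.txt with the receptor and box parameters, and hypothetical collection and field names:

# Minimal sketch of one relocation cycle: database -> RAMDISK -> database.
import subprocess
from pathlib import Path
from pymongo import MongoClient

db = MongoClient("mongodb://head-node:27017")["hhvsf"]
ramdisk = Path("/dev/shm/hhvsf")
ramdisk.mkdir(parents=True, exist_ok=True)

def dock_one(ligand_doc):
    # 1. Relocate the input from MongoDB to the node-local RAMDISK.
    lig = ramdisk / f"{ligand_doc['zinc_id']}.pdbqt"
    out = ramdisk / f"{ligand_doc['zinc_id']}_out.pdbqt"
    lig.write_text(ligand_doc["pdbqt"])
    # 2. Run the docking engine against the RAMDISK copies only.
    subprocess.run(["vina", "--config", str(ramdisk / "conf.txt"),
                    "--ligand", str(lig), "--out", str(out)], check=True)
    # 3. Relocate the result back into the database and clean up.
    db.results.insert_one({"zinc_id": ligand_doc["zinc_id"], "pose": out.read_text()})
    lig.unlink()
    out.unlink()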

2.2.3 Data Analysis

For virtual screening with molecular docking or molecular structure similarity, scores or similarities have to be calculated before the molecules in a large compound library can be ranked. On high-performance computing systems with shared file storage, it is necessary to avoid the IO overload caused by a great number of small files, so it is unwise to analyze the output files on the shared storage disk. Instead, when the computations complete in the RAMDISK, the output files are analyzed there and the compounds in the library are ranked by score or similarity. This protocol minimizes the IO stress as the number of small files increases dramatically.
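Once the per-compound scores have been inserted into the database, the final ranking reduces to a sorted query; a hedged sketch using the same hypothetical names as above:

# Minimal sketch of ranking the screened library by docking score.
from pymongo import MongoClient, ASCENDING

db = MongoClient("mongodb://head-node:27017")["hhvsf"]
top_hits = db.results.find().sort("score", ASCENDING).limit(1000)
for doc in top_hits:
    print(doc["zinc_id"], doc["score"])   # best (most negative) scores come first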

2.3 Simulation Parameters and Data Sets

2.3.1 ADV

About twenty million ligands in mol2 format were obtained from the ZINC database FTP server (http://zinc.docking.org/db/bysubset/6/). The pymongo (version 3.2.1) Python library was used for database operations, and a Python script was developed to insert the mol2 files into MongoDB. MGLTools (version 1.5.6) was used to convert the mol2 files into pdbqt files for docking. We prepared five data sets of different sizes (zinc_ligand_1~5), as shown in Table 1. All data sets are sorted by heavy atom number in ascending order. After the molecular docking finished, the resulting pdbqt files were converted to mol format with the Open Babel package [23] (version 2.4.0).
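The insertion script itself is not listed here; the sketch below shows one plausible form, with hypothetical field names, storing a heavy-atom count alongside each ligand so that the data sets can be sorted in ascending order:

# Minimal sketch of loading mol2 files into MongoDB together with a
# heavy-atom count used for the ascending sort (names hypothetical).
import glob
import os
from pymongo import MongoClient

db = MongoClient("mongodb://head-node:27017")["hhvsf"]

def heavy_atoms(mol2_text):
    # Count non-hydrogen entries in the @<TRIPOS>ATOM block of a mol2 file.
    block = mol2_text.split("@<TRIPOS>ATOM")[1].split("@<TRIPOS>")[0]
    return sum(1 for line in block.strip().splitlines()
               if line.split()[5].split(".")[0] != "H")

for path in glob.glob("ligands/*.mol2"):
    with open(path) as f:
        text = f.read()
    db.ligands.insert_one({
        "zinc_id": os.path.basename(path)[:-len(".mol2")],
        "mol2": text,
        "heavy_atoms": heavy_atoms(text),
    })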

The protein target is the crystal structure of the alpha subunit of glycyl-tRNA synthetase (PDB code: 5F5W). The (x, y, z) coordinates (in Å) of the center of the docking site are (−94.666, 51.401, 8.991), and the dimensions of the search box are (14, 18, 12). The num_modes argument is set to 1.
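For reference, these parameters correspond to an ADV configuration file such as the hypothetical conf.txt consumed by the relocation sketch in Sect. 2.2.2 (the receptor file name is an assumption):

# Minimal sketch writing the docking parameters as an ADV configuration file.
conf = """\
receptor = 5F5W_alpha.pdbqt
center_x = -94.666
center_y = 51.401
center_z = 8.991
size_x = 14
size_y = 18
size_z = 12
num_modes = 1
"""
with open("/dev/shm/hhvsf/conf.txt", "w") as f:
    f.write(conf)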

2.3.2 WEGA

An SDF file containing about twenty million molecules was obtained from the ZINC database FTP server. Approximately 958 million conformers were generated from it with the CAESAR algorithm [24] in Discovery Studio (version 3.5) [25] for the shape-feature similarity calculations. To fit within MongoDB's 16 MB document limit, the conformer files were split into chunks of about 15 MB each and then inserted into the database. Table 1 gives the two data sets for WEGA (zinc_conformer_1 and zinc_conformer_2).
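The splitting step can be sketched as follows, streaming the SDF record by record on the "$$$$" delimiter and flushing a chunk to MongoDB whenever the next record would push it past roughly 15 MB (connection string and field names are hypothetical):

# Minimal sketch of splitting a large SDF file into ~15 MB MongoDB documents,
# staying under the 16 MB document limit.
from pymongo import MongoClient

db = MongoClient("mongodb://head-node:27017")["hhvsf"]
LIMIT = 15 * 1024 * 1024

chunk, size, part, record = [], 0, 0, []
with open("zinc_conformers.sdf") as sdf:
    for line in sdf:
        record.append(line)
        if line.startswith("$$$$"):            # end of one conformer record
            text = "".join(record)
            record = []
            if size + len(text) > LIMIT and chunk:
                db.conformers.insert_one({"part": part, "sdf": "".join(chunk)})
                chunk, size, part = [], 0, part + 1
            chunk.append(text)
            size += len(text)
if chunk:
    db.conformers.insert_one({"part": part, "sdf": "".join(chunk)})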

The query molecule is 4-amino-1-[4,5-dihydroxy-3-(2-hydroxyethyl)-1-cyclopent-2-enyl]-pyrimidin-2-one (ZINC ID: ZINC03834084). The molecular overlay method is set to 2 (combining shape similarity and pharmacophore similarity). Up to 100 similar molecules are reported for each SDF file. Table 1 shows the detailed information of the data sets used throughout this article.

Table 1. Data sets for testing. The zinc_ligand_1~5 data sets were prepared for ADV; zinc_ligand_2~5 were extracted from zinc_ligand_1 in certain proportions. The zinc_conformer_1~2 data sets were prepared for WEGA; zinc_conformer_2 was extracted randomly from zinc_conformer_1.

All tests run on Tianhe-2 (MilkyWay-2) supercomputer, which consists of 16,000 computing nodes connected via the TH Express-2 interconnect. Each computing node is equipped with two Intel Xeon E5-2692 CPUs (12-core, 2.2 GHz), and configured with 64 GB memory. The storage subsystem contains 64 storage servers with a total capacity of 12.4 PB. The LUSTRE storage architecture is used as a site-wide global file system.

3 Results and Discussion

3.1 Load Balance

The time for screening a compound ranges from minutes to hours depending on the complexity of the molecular structure. Reports [9, 26] indicate that compound complexity, for instance the number of active torsions or the number of heavy atoms, dominates the computing time of molecular docking. To determine the relation between the number of heavy atoms and the computing time, the zinc_ligand_5 data set was used to record the number of heavy atoms in each ligand together with its computing time, as depicted in Fig. 4. The number of heavy atoms shows a linear relationship with the computing time (in logarithmic form), which indicates that small molecules with more heavy atoms require longer computing times. Since the zinc_ligand_2~5 data sets are scaled down from zinc_ligand_1 by fixed percentages, the other data sets also benefit from this approach. Based on this information, the zinc_ligand_4 data set was tested on 8,000 cores. Figure 5a and b demonstrate that the average computing time per worker is reduced by 8.83 s when the load balancing protocol is used.

Fig. 4. The computing time of a molecular docking run (y-axis, logarithmic form) versus the number of heavy atoms in the compound (x-axis), based on the zinc_ligand_5 data set.

Fig. 5. (a) The computing time of each worker without load balancing. (b) The computing time of each worker with load balancing. The red lines mark the average computing time per task. (Color figure online)

3.2 Throughput with Data

MongoDB monitors the status of a running instance. When a worker starts, a "connection" operation is registered by the MongoDB server; after a computing task is accomplished, the resulting data (score, structural conformation, and running status) are inserted into a MongoDB collection. Figure 6 shows the per-second "connection" and "insert" operations of the MongoDB server during an entire vina_wrapper run; the inverted-triangle points clearly reveal the three stages of the run: startup, steady running and finish. The startup phase took 1,396 s to launch 16,000 workers, an average of roughly 11 worker start-ups per second. The rhombus points rise gradually as time progresses, reaching up to 1,324 molecules per second with an average of 145 molecules per second. Table 2 gives the results for the other data sets. As for WEGA, Fig. 7 shows that the data throughput reaches up to 9,448 molecules per second, averaging 6,430 molecules per second, indicating high performance and high data throughput.
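The per-second counters behind Figs. 6 and 7 can be collected in several ways; one hedged possibility is to poll MongoDB's serverStatus command, for example:

# Minimal sketch of sampling insert throughput and open connections once per
# second via the serverStatus command (output handling is illustrative only).
import time
from pymongo import MongoClient

client = MongoClient("mongodb://head-node:27017")
prev = client.admin.command("serverStatus")["opcounters"]["insert"]
while True:
    time.sleep(1)
    status = client.admin.command("serverStatus")
    inserts = status["opcounters"]["insert"]
    print(status["connections"]["current"], inserts - prev)  # connections, inserts/s
    prev = inserts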

Fig. 6. The number of "insert" and "connection" operations on the MongoDB server when running the ADV application. The zinc_ligand_3 data set was run on 16,000 cores.

Table 2. Data throughput for ADV and WEGA on different data sets.
Fig. 7. The number of "insert" and "connection" operations on the MongoDB server when running the WEGA application. The zinc_conformer_1 data set was run on 4,000 cores.

3.3 Scalability

To test scalability, we measured speedup and parallel efficiency with the zinc_ligand_4 and zinc_ligand_3 data sets. Figure 8a shows that the zinc_ligand_4 data set scales to 8,000 cores with a parallel efficiency of 0.84, and Fig. 8b shows that the zinc_ligand_3 data set scales to 16,000 cores with a parallel efficiency of 0.83. The parallel efficiency decreases sharply when the computing resource is scaled beyond 8,000 cores, because more cores mean more workers and thus more time spent by HTCondor to start them.
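For clarity, the reported parallel efficiencies are relative quantities; a small sketch of the arithmetic, assuming the smallest measured run is taken as the baseline (the numbers in the example are hypothetical):

# Minimal sketch of relative speedup and parallel efficiency.
def scaling_metrics(timings, base_cores):
    # timings maps core count -> wall time (s); metrics are relative to the baseline run.
    t_base = timings[base_cores]
    metrics = {}
    for cores, t in sorted(timings.items()):
        speedup = t_base / t                       # relative to the baseline run
        efficiency = speedup * base_cores / cores  # 1.0 means perfect scaling
        metrics[cores] = (speedup, efficiency)
    return metrics

# Hypothetical example: doubling the cores and halving the wall time
# gives speedup 2.0 and parallel efficiency 1.0.
print(scaling_metrics({1000: 3600, 2000: 1800}, base_cores=1000))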

Fig. 8. (a) Speedup (right triangles) and parallel efficiency (red dots) of the molecular docking experiment on the zinc_ligand_4 data set. (b) Speedup (black dots) and parallel efficiency (upward triangles) of the molecular docking experiment on the zinc_ligand_3 data set.

3.4 Fault Tolerance

Fault tolerance management handles task failures caused by external factors, for instance compute node faults, network blocking, and IO latency. Table 3 gives the information on failed tasks for the different data sets. The zinc_ligand_2 data set has a higher failure rate than the others because its ten million ligands include more ligands with high molecular weight, which do not fit the docking space of the protein (PDB code: 5W5F). In addition, longer calculations lead to more failures. The zinc_conformer_1 data set has fewer files (95,712 SDF files in total) and shorter computing times and consequently shows a low failure rate.

Table 3. The failure rate and computing time for ADV and WEGA on different data sets.

4 Conclusions

HHVSF includes task management and data relocation management, and supports high-throughput applications with large-scale, multitasking, small-sized IO workloads on HPC resources. There are two types of virtual drug screening applications: (1) computation-intensive applications, such as molecular docking, and (2) data-intensive applications, such as molecular structure similarity based virtual screening campaigns. With HHVSF, both types run on the Tianhe-2 supercomputer with high performance. The tests show that, using ADV for molecular docking against the protein target (PDB code: 5W5F), nearly half of the compounds in the ZINC database were screened within one day on 16,000 cores. For WEGA, 958 million conformations were screened in about half an hour on 4,000 cores. The ranked ligands or conformers can be retrieved in milliseconds from the database by specifying the "sort" method. Meanwhile, the IO pressure that lots of small files place on shared file storage in HPC resources is mitigated. Thus, HHVSF can significantly accelerate HTVS campaigns on HPC resources.