GCAC: a Galaxy workflow system for predictive model building in virtual screening
Traditional drug discovery approaches are time-consuming, tedious and expensive. Identifying a potential drug-like molecule with high confidence using high-throughput screening (HTS) remains a challenging task in drug discovery and cheminformatics. Only a small percentage of molecules pass the clinical trial phases and receive FDA approval; the whole process takes 10–12 years and an investment of millions of dollars. Inconsistency in HTS is a further obstacle to reproducible results. Reproducibility in computational research is highly desirable as a means of evaluating scientific claims and published findings. This paper describes the development and availability of a knowledge-based predictive model building system that uses the R statistical computing environment, with reproducibility ensured through the Galaxy workflow system.
We describe a web-enabled data mining analysis pipeline that employs reproducible research approaches to address the limited availability of tools for high-throughput virtual screening. The pipeline, named “Galaxy for Compound Activity Classification (GCAC)”, includes descriptor calculation, feature selection, model building and screening to extract potent candidates, leveraging the combined capabilities of R statistical packages and literate programming tools within a workflow system environment with automated configuration.
GCAC can serve as a standard for screening drug candidates using predictive model building within the Galaxy environment, allowing for easy installation and reproducibility. A demo site of the tool is available at http://ccbb.jnu.ac.in/gcac
Keywords: Predictive model building; Reproducible results; Galaxy workflow system; High throughput screening; Drug discovery; R statistical package; Cheminformatics
Abbreviations
GCAC: Galaxy for Compound Activity Classification
GLM: Generalised Linear Model
HCV: Hepatitis C Virus
HTS: High Throughput Screening
KNN: k-Nearest Neighbour
LBVS: Ligand-Based Virtual Screening
LOOCV: Leave-One-Out Cross-Validation
QSAR: Quantitative Structure–Activity Relationship
RFE: Recursive Feature Elimination
ROC: Receiver Operating Characteristic
SBVS: Structure-Based Virtual Screening
SVM: Support Vector Machine
Over the past few decades, the time and cost of drug development have increased. Today, it typically takes 10–15 years and costs $1300–$1500 million to convert a promising new compound into a marketed drug, reflecting the complexity of the drug discovery process. A standing challenge for the scientific community is to bring down the cost and time of drug development. The computational study of biological and chemical molecules for drug-like properties falls under a separate branch of science called cheminformatics. It includes high-throughput screening of chemical molecules, in which knowledge-based rules are used to narrow down the chemical space of large chemical libraries and identify promising drug-like molecules with certain physico-chemical properties. In cheminformatics, two major computational screening approaches are available in the early phase of drug discovery: structure-based virtual screening (VS) and ligand-based VS. Structure-based VS includes high-throughput docking of candidate molecules to target receptors in order to rank them by predicted binding affinity. This approach is relatively fast compared with conventional methods such as whole-cell bioassays and in-vivo screening of individual candidates. However, it is less accurate, owing to the multilevel preparation of ligands and insufficient information about the local and global environment for efficient binding prediction, and it becomes time-consuming when the compound library is large. Studies reveal that hits identified by ligand-based VS methods have higher potency than those from structure-based VS [5, 6]. Ligand-based VS includes 2D and 3D similarity search, pharmacophore mapping and Quantitative Structure–Activity Relationship (QSAR) modelling. The 2D similarity-based methods outperform 3D similarity search methods.
However, the accuracy of search results relies heavily on the number of available positive cases, because the fundamental idea of ligand-based VS is to correlate structural similarity with functional similarity. In the present study, we focus on the ligand-based VS method, especially QSAR-based modelling, and describe the development of an installable platform containing all the steps required for predictive model building and screening, with a web interface deployed using the Galaxy workflow system.
Predictive model building in drug discovery process
Ligand-based VS is an example of empirical research, in which a prediction is made for a new case based on patterns observed in the data under study. Empirical vHTS includes predictive model building, in which different machine learning (ML) methods are combined with data mining to extract hidden patterns and the physical factors important for drug-likeness. Predictive model building is a widely used term in the field of economics and has been used in cheminformatics for vHTS of drug-like molecules for various diseases [7, 8, 9, 10]. Several standalone packages and executables are available for many ML methods to perform data mining and predictive model building, such as Weka, RapidMiner and KNIME, but their applications in bioinformatics and cheminformatics are not comprehensive, leaving scope for alternatives. None of the above-mentioned tools provides a completely reproducible platform for descriptor calculation, model building and prediction tasks combined with a user-friendly interface.
Galaxy workflow system
List of Galaxy tools developed as part of GCAC: the GCAC suite comprises four major tasks. Each task contains one repository and at least one associated tool. The GCAC tools are available in the Galaxy main Tool Shed (https://toolshed.g2.bx.psu.edu/repository?repository_id=351af44ceb587e54).
Descriptor Calculation
- Calculates descriptors for active and inactive datasets.
- Merge Activity Files: assigns response values and merges positive and negative datasets.
- Removes redundant entries of molecules.
Feature Selection
- Selects the best feature subset.
Model Building and Prediction
- Prepare Input File: converts CSV files to RData format.
- R-Caret Classification Model-Builder: builds a classification model using the ‘caret’ R package.
- Predicts the activity of molecules using their descriptor file (prediction set) and a supplied model.
Candidate Compound Extraction
- Candidate Compound Selector: selects compound names or IDs of interesting molecules based on a certain cutoff range.
- Extracts compound IDs to be used in downstream compound extraction from SDF files.
- Provides an SDF file of the extracted compounds from the prediction set.
We developed a wrapper for the PaDEL-Descriptor package for descriptor calculation and for the R caret package for predictive model building. The power of the caret package is embedded within Sweave template scripts for the generation of dynamic reports. Sweave allows statistical analysis and documentation to proceed simultaneously, making the analysis reproducible on any similar or identical computing environment. To extract the qualifying compounds of the prediction set from the remaining molecules, we developed a MayaChemTools-based wrapper. To choose the optimal set of features, we developed a caret-based feature selection tool. Many additional intermediate helper tools connect these major tools. The GCAC tools are hosted on the Galaxy main Tool Shed (https://toolshed.g2.bx.psu.edu/repository?repository_id=351af44ceb587e54) as well as on a virtual machine (http://ccbb.jnu.ac.in/gcac). Providing an open-source resource for QSAR predictive analysis will improve the accessibility, transparency and reproducibility of the in-silico drug discovery process.
Results and discussion
GCAC tool repositories
The GCAC tools are organized into three main directories within one Git repository: descriptor_calculation, model_building_and_prediction and candidate_compound_extraction. Each comprises one or more subdirectories containing a tool for a particular job. The underlying idea of the directory layout is the separation of primary tasks and associated tools, namely 1) descriptor calculation, 2) feature selection and model building, and 3) screening to extract potent candidates.
In recent years, descriptor-based model building has been encouraged for faster virtual screening. Many commercial and open-source descriptor calculation packages, such as CDK, JOELib, Dragon and PowerMV [29, 30, 31, 32], are available to the user community. PaDEL is open-source software for descriptor calculation. It calculates 1785 different 1D, 2D and 3D descriptors and additionally computes 12 types of chemical fingerprints. The input file can be in SMILES, MOL or SDF format, and the output is a CSV file of calculated descriptors. We developed a Galaxy wrapper for PaDEL-Descriptor that exposes all of its essential parameters. Two helper tools were also designed: one to concatenate files after adding class information, and one to eliminate repeated entries, ultimately returning a merged descriptor file with labels (i.e., class information).
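The merge-and-deduplicate helper step can be illustrated with a minimal Python sketch. This is not the actual Galaxy tool (which operates on PaDEL CSV output); the function and column names here are hypothetical, chosen only to show the labelling and redundancy-removal logic:

```python
def merge_with_labels(active_rows, inactive_rows, id_col="Name"):
    """Label active (1) and inactive (0) descriptor rows, then drop
    duplicate molecules by identifier, keeping the first occurrence."""
    seen, merged = set(), []
    for label, rows in ((1, active_rows), (0, inactive_rows)):
        for row in rows:
            if row[id_col] in seen:
                continue  # redundant entry of a molecule
            seen.add(row[id_col])
            merged.append({**row, "Class": label})
    return merged

# Toy descriptor rows; "mol1" appears in both sets and is kept once.
actives = [{"Name": "mol1", "MW": "180.2"}, {"Name": "mol2", "MW": "210.7"}]
inactives = [{"Name": "mol3", "MW": "95.1"}, {"Name": "mol1", "MW": "180.2"}]
```

In the real pipeline this merged, labelled table is what flows into the RData conversion step.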
The objectives of feature (also known as predictor, attribute or variable) selection can be summed up in three points: first, improving the prediction performance of the predictors; second, providing faster and more cost-effective predictors for quick learning; and third, providing a better understanding of the underlying process that generated the data. Feature selection techniques fall into three categories, depending on their integration with the model building process: filter methods, wrapper methods and embedded methods. Filter methods are computationally fast and easy to implement, but most filter techniques are univariate and fail to identify interactions among features. Embedded methods are more computationally efficient than wrapper methods but rely on a specific learning algorithm. Wrapper methods outperform filter methods because they search for the optimal set of features sufficient to classify the data, at the expense of computational cost. Moreover, wrapper methods can identify dependencies among features and the relationship between the feature subset and model selection. Because filter methods are insufficient for finding the optimal feature set, and caret offers many classifiers with built-in feature selection, we developed a feature selection tool using Recursive Feature Elimination (RFE), a wrapper method provided within the caret package. After conversion of the CSV file into RData, the feature selection tool can be employed to identify the optimal feature subset for the model building step. Currently, a user can choose any of four functions (random forest, linear, treebag and naive Bayes) for model fitting, along with several options for cross-validation.
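The wrapper idea behind RFE can be sketched in a few lines of Python. This is a deliberately simplified stand-in: the wrapped model here is a toy nearest-centroid classifier scored on training accuracy, whereas GCAC's tool uses caret's `rfe` with random forest, linear, treebag or naive Bayes fits and proper cross-validation:

```python
def centroid_score(X, y, feats):
    """Training accuracy of a nearest-centroid classifier restricted
    to the given feature indices (toy stand-in for the wrapped model)."""
    def centroid(cls):
        rows = [x for x, lab in zip(X, y) if lab == cls]
        return [sum(r[f] for r in rows) / len(rows) for f in feats]
    c0, c1 = centroid(0), centroid(1)
    def predict(x):
        d0 = sum((x[f] - c) ** 2 for f, c in zip(feats, c0))
        d1 = sum((x[f] - c) ** 2 for f, c in zip(feats, c1))
        return 0 if d0 <= d1 else 1
    return sum(predict(x) == lab for x, lab in zip(X, y)) / len(y)

def rfe(X, y, n_keep):
    """Backward elimination: repeatedly drop the feature whose
    removal hurts the wrapped model's score the least."""
    feats = list(range(len(X[0])))
    while len(feats) > n_keep:
        worst = max(feats, key=lambda f: centroid_score(
            X, y, [g for g in feats if g != f]))
        feats.remove(worst)
    return feats

# Feature 0 separates the classes; feature 1 is noise and is eliminated.
X = [[0.0, 5], [0.1, 1], [0.2, 9], [1.0, 2], [1.1, 7], [0.9, 3]]
y = [0, 0, 0, 1, 1, 1]
```

The wrapper's cost is visible even here: each elimination round refits the model once per remaining feature.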
Model building is an important aspect of the GCAC pipeline. We focused on ensuring its reproducibility and added automated dynamic report creation, a feature unavailable in most predictive analysis pipelines. The generated report contains information about the computing environment, the data properties, and the statistical measures and their significance. The merged descriptor file obtained from the “Descriptor Calculation” step is converted into the required input format (i.e., RData) and may then optionally be passed to the feature selection tool, or used directly for building a model with the “Build-Classification-Model” tool. At the back end, the tool uses an R Sweave template that creates a runtime script capturing the applied classification method and the various other parameters set by the user. It produces three outputs: the model, the document, and the datasets used (i.e., the train and test sets).
For classification, GCAC provides 15 machine learning (ML) methods for model building, including the Generalised Linear Model (GLM), Random Forest (RF), Naive Bayes (NB), k-Nearest Neighbours (KNN), Support Vector Machine (SVM), C5.0 and Adaptive Boosting (AdaBoost). Additional file 1: Table S3 contains the list of methods tested, along with their tunable parameters. If an imbalanced dataset is used for modelling, GCAC provides sampling methods, “downsample” and “upsample”, to mitigate the class imbalance. GCAC also provides resampling options such as CV, LOOCV, repeated CV, bootstrap 632 and boot for cross-validation studies. A model is evaluated over several performance metrics, including accuracy, sensitivity, specificity, the kappa statistic and ROC curve analysis. The generated PDF document contains preliminary information and properties of the data under study, the applied pre-processing steps, performance measures, graph(s), table(s) and the confusion matrix. Well-formatted PDF generation is one of the major features of the GCAC pipeline. Additionally, the user has access to the train and test datasets used for model building. The generated model can be used to predict the activity of an unlabelled compound library and may also be employed in ensembles of various models to improve the predictive power of data mining applications. The prediction result consists of identifiers or molecule names along with the probability of being a positive or negative case; a high value indicates a higher chance of belonging to that class. Predicted molecules can be extracted from the prediction set at a later stage.
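The performance measures reported in the PDF document all derive from the confusion matrix. A self-contained Python sketch of their definitions (caret computes these in R; the function here is only illustrative):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and Cohen's kappa from
    binary labels (1 = active/positive, 0 = inactive/negative)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    n = tp + tn + fp + fn
    acc = (tp + tn) / n
    # Kappa corrects accuracy for agreement expected by chance.
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((fp + tn) / n) * ((fn + tn) / n)
    pe = p_yes + p_no
    kappa = (acc - pe) / (1 - pe) if pe < 1 else 1.0
    return {"accuracy": acc,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "kappa": kappa}
```

Kappa is especially relevant for the imbalanced datasets this section discusses, since plain accuracy can look high for a model that simply predicts the majority class.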
Extracting potential lead-like compounds
Once the prediction result is obtained, it is essential to extract the potential molecules from the prediction set for further analysis. We developed Galaxy wrapper tools for three important tasks: selecting interesting molecules using a probability score cutoff, input preparation, and extraction of the molecules into a distinct file. The required format for the prediction set is the structure data file (SDF). Based on the prediction score, a user may choose interesting molecules, which are extracted from the prediction set and written into a separate SDF file using the MayaChemTools-based Galaxy tool.
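The cutoff-based selection step amounts to filtering prediction records by their positive-class probability. A minimal sketch (the identifiers, tuple layout and cutoff value are illustrative; the real tool reads the prediction table produced upstream):

```python
def select_candidates(predictions, cutoff=0.8):
    """Keep molecule IDs whose predicted probability of the
    positive (active) class meets or exceeds the cutoff."""
    return [mol_id for mol_id, p_active in predictions if p_active >= cutoff]

# (molecule ID, predicted probability of being active)
preds = [("CID001", 0.93), ("CID002", 0.41), ("CID003", 0.85)]
```

The resulting ID list is exactly what the downstream MayaChemTools-based tool needs to pull the matching records out of the SDF prediction set.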
Cost and time are the greatest bottlenecks in the drug discovery process. It is essential that drug discovery stages remain as replicable, transparent, reviewable and accessible as possible; the GCAC platform in Galaxy helps facilitate all of these goals. In the present study, the PaDEL-Descriptor tools can be used to calculate molecular descriptors for publicly available chemical datasets (PubChem, ZINC, ChEMBL, etc.). The most influential feature subset can be obtained using the RFE-based feature selection tool. The model building module provides many commonly used state-of-the-art classifiers available in the caret package. The workflow uses R caret, in which the parameters specific to a model-building method are optimised within the model building process. Although the default model is PLS, the user may choose from a large range of model-building methods, depending on the available computational time and the expected accuracy. Our preliminary results with the protocol suggest that different models may perform better on different datasets. To address the problem of large class imbalance in datasets, we implemented downsampling and upsampling methods to optimise the ratio of positive and negative cases. Each model can be evaluated using widely accepted performance measures such as AUROC, accuracy, sensitivity, specificity and the kappa statistic. The best model selected can be used to predict the activity of an unknown compound library, and the predicted active or positive cases can be extracted using the MayaChemTools-based tool for further computational analysis.
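The two rebalancing strategies can be illustrated with a short Python sketch. This is conceptual only; GCAC performs the equivalent operation inside caret in R, and the seed/list layout here is an assumption for the example:

```python
import random

def downsample(majority, minority, seed=0):
    """Randomly reduce the majority class to the minority-class size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

def upsample(majority, minority, seed=0):
    """Randomly replicate minority-class rows (sampling with
    replacement) until both classes have the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra

inactives = list(range(10))   # majority class (e.g., 10 inactive molecules)
actives = ["a1", "a2"]        # minority class (2 actives)
```

Downsampling discards information but keeps training fast; upsampling keeps every majority example at the cost of duplicated minority rows, which can encourage overfitting to those rows.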
If the scientific community succeeds in lowering the cost and time of the initial drug discovery stages without losing confidence in the reproducibility of the results, millions of dollars and many lives will be saved. Applying QSAR-based virtual screening reduces the time taken for virtual screening. In-silico ADMET tests can also be automated and parallelized using the Galaxy workflow system, further reducing time. One limiting factor for QSAR-based model building is the availability of training data for a “global” model. This problem can be addressed by building “local” models exclusively for a given target or for chemical-series-specific data.
Future development of GCAC will comprise three major additions: a wider range of ML methods for classification, development of open-source ADMET tools, and provisioning of target-specific models via shared data. As improved and efficient open-source packages for descriptor calculation, ADMET prediction and model building are published, we will integrate them according to their utility. As more users participate in the GCAC user community, the sharing of data, tools and models will attract more attention from the scientific community. The Galaxy workflow system is well adapted to cloud-based resources, which makes Galaxy a reasonable choice for developing pipelines for drug discovery as well as other biological sciences.
Availability and requirements
Project name: GCAC.
Project home page: https://github.com/LynnLab-JNU/gcac-galaxy-tools
Demo Server: http://ccbb.jnu.ac.in/gcac
Operating system(s): Linux - developed, tested and distributed as a VM with CentOS 7.
Programming language: R, Python, Shell, Bash.
Other requirements: None.
License: MIT License.
Any restrictions to use by non-academics: None.
A list of required dependencies, more information and download links can be found in the documentation available on the demo site at http://ccbb.jnu.ac.in/gcac.
Provided via a VirtualBox VM: this is the easiest way to obtain the GCAC module in a standalone VM environment. Users need only download the VM and import it into their VirtualBox environment.
Provided via the Tool Shed: the GCAC module's Galaxy tools are made available via a publicly accessible Tool Shed repository, which can be installed through the admin interface of a running Galaxy server. Users are also required to install system-level dependencies on the Galaxy host machine.
We would like to thank Dr. Max Kuhn, the author of the R/caret package, for providing the initial template for model building. We would like to acknowledge Mr. M.M. Harris for his technical support. AJH and DRB gratefully acknowledge the CSIR and UGC for supporting their research fellowships respectively.
Publication costs were funded by DST-PURSE grant awarded to Jawaharlal Nehru University.
Availability of data and materials
GCAC is available as a Tool Shed repository (https://toolshed.g2.bx.psu.edu/repository?repository_id=351af44ceb587e54) and as a demo server (http://ccbb.jnu.ac.in/gcac), which contains a list of required dependencies and documentation with more information and download links. The dataset used for validating the workflow is also available as shared data within Galaxy. All data used here are publicly available.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 13, 2018: 17th International Conference on Bioinformatics (InCoB 2018): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-13.
AML conceived the general project and supervised the project. AJH and DRB planned and developed the GCAC. DRB built the source code package of GCAC and AJH prepared the VM. AJH and DRB jointly developed the galaxy wrappers and wrote the draft manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 1.DiMasi JA, Grabowski HG, Hansen RW. Innovation in the pharmaceutical industry: New estimates of R&D costs. J Health Econ. 2016. https://doi.org/10.1016/j.jhealeco.2016.01.012.
- 2.Muegge I, Oloff S. Advances in virtual screening. Drug Discov Today Technol. 2006. https://doi.org/10.1016/j.ddtec.2006.12.002.
- 3.Waszkowycz B. Towards improving compound selection in structure-based virtual screening. Drug Discov Today. 2008. https://doi.org/10.1016/j.drudis.2007.12.002.
- 4.Scior T, Bender A, Tresadern G, Medina-Franco JL, Martínez-Mayorga K, Langer T, Agrafiotis DK. Recognizing pitfalls in virtual screening: A critical review. J Chem Inf Model. 2012. https://doi.org/10.1021/ci200528d.
- 5.Cumming JG, Davis AM, Muresan S, Haeberlein M, Chen H. Chemical predictive modelling to improve compound quality. Nat Rev Drug Discov. 2013. https://doi.org/10.1038/nrd4128.
- 6.Ripphausen P, Nisius B, Peltason L, Bajorath J. Quo vadis, virtual screening? A comprehensive survey of prospective applications. J Med Chem. 2010. https://doi.org/10.1021/jm101020z.
- 7.Sundaramurthi JC, Brindha S, Reddy TBK, Hanna LE. Informatics resources for tuberculosis - Towards drug discovery. Tuberculosis. 2012. https://doi.org/10.1016/j.tube.2011.08.006.
- 8.Ekins S, Freundlich JS. Computational models for tuberculosis drug discovery. Methods Mol Biol. 2013. https://doi.org/10.1007/978-1-62703-342-8_16.
- 9.Ekins S, Reynolds RC, Kim H, Koo MS, Ekonomidis M, Talaue M, Freundlich JS. Bayesian models leveraging bioactivity and cytotoxicity information for drug discovery. Chem Biol. 2013. https://doi.org/10.1016/j.chembiol.2013.01.011.
- 10.Jamal S, Periwal V, Scaria V. Predictive modeling of anti-malarial molecules inhibiting apicoplast formation. BMC Bioinf. 2013:2013. https://doi.org/10.1186/1471-2105-14-55.
- 11.Holmes G, Donkin A, Witten IH (1994). Weka: A machine learning workbench. In Intelligent Information Systems, 1994. Proceedings of the 1994 Second Australian and New Zealand Conference (pp. 357–361). https://doi.org/10.1109/ANZIIS.1994.396988.
- 12.Hofmann M, Klinkenberg R. RapidMiner: Data Mining Use Cases and Business Analytics Applications; 2013. https://isbnsearch.org/isbn/9781482205497.
- 13.Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Wiswedel B. KNIME - The Konstanz Information Miner. SIGKDD Explorations. 2009. https://doi.org/10.1145/1656274.1656280.
- 14.Reynolds CR, Amini AC, Muggleton SH, Sternberg MJE. Assessment of a rule-based virtual screening technology (INDDEx) on a benchmark data set. J Phys Chem B. 2012. https://doi.org/10.1021/jp212084f.
- 15.Coma I, Clark L, Diez E, Harper G, Herranz J, Hofmann G, Macarron R. Process validation and screen reproducibility in high-throughput screening. J Biomol Screen. 2009. https://doi.org/10.1177/1087057108326664.
- 16.Kohlbacher O. CADDSuite – a workflow-enabled suite of open-source tools for drug discovery. J Cheminf. 2012. https://doi.org/10.1186/1758-2946-4-S1-O2.
- 17.Hughes-Oliver JM, Brooks AD, Welch WJ, Khaledi MG, Hawkins D, Young SS, Chu MT. ChemModLab: A web-based cheminformatics modeling laboratory. In Silico Biol. 2011. https://doi.org/10.3233/CI-2008-0016.
- 18.Goecks J, Nekrutenko A, Taylor J. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010. https://doi.org/10.1186/gb-2010-11-8-r86.
- 19.Kutner MH, Nachtsheim CJ, Neter J, Li W. Applied Linear Statistical Models. Journal Of The Royal Statistical Society Series A General (Vol. Fifth). 1996; https://doi.org/10.2307/2984653.
- 20.Leisch F. Sweave: Dynamic generation of statistical reports using literate data analysis. In: Compstat 2002 - Proceedings in Computational Statistics; 2002.
- 21.Kuhn M. Building Predictive Models in R Using the caret Package. Journal Of Statistical Software. 2008. https://doi.org/10.1053/j.sodo.2009.03.002.
- 22.Sud M. MayaChemTools: An Open Source Package for Computational Drug Discovery. J Chem Inf Model. 2016. https://doi.org/10.1021/acs.jcim.6b00505.
- 23.Bolton EE, Wang Y, Thiessen PA, Bryant SH. PubChem: Integrated Platform of Small Molecules and Biological Activities. Annual Reports in Computational Chemistry. 2008; https://doi.org/10.1016/S1574-1400(08)00012-1.
- 24.Irwin JJ, Shoichet BK. ZINC - A free database of commercially available compounds for virtual screening. J Chem Inf Model. 2005. https://doi.org/10.1021/ci049714.
- 25.De Matos P, Alcántara R, Dekker A, Ennis M, Hastings J, Haug K, Steinbeck C. Chemical entities of biological interest: An update. Nucleic Acids Res. 2009;38(SUPPL.1). https://doi.org/10.1093/nar/gkp886.
- 26.Fontaine F, Pastor M, Zamora I, Sanz F. Anchor-GRIND: Filling the gap between standard 3D QSAR and the GRid-INdependent descriptors. J Med Chem 2005; https://doi.org/10.1021/jm049113+.
- 27.Poupart MA, Cameron DR, Chabot C, Ghiro E, Goudreau N, Goulet S, Tsantrizos YS. Solid-phase synthesis of peptidomimetic inhibitors for the hepatitis C virus NS3 protease. J Org Chem. 2001. https://doi.org/10.1021/jo010164d.
- 28.Carbonell T, Masip I, Sánchez-Baeza F, Delgado M. Identification of selective inhibitors of acetylcholinesterase from a combinatorial library of 2, 5-piperazinediones. Mol Divers; 2000. https://link.springer.com/article/10.1023%2FA%3A1016230600162?LI=true.
- 29.Guha R. The CDK descriptor calculator.
- 31.Mauri A, Consonni V, Pavan M, Todeschini R. Dragon software: An easy approach to molecular descriptor calculations. MATCH Communications in Mathematical and in Computer Chemistry. 2006;56(2):237–48.
- 32.Liu K, Feng J, Young SS. PowerMV: A software environment for molecular viewing, descriptor generation, data analysis and hit evaluation. J Chem Inf Model. 2005. https://doi.org/10.1021/ci049847v.
- 33.Yap CW. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J Comput Chem. 2011. https://doi.org/10.1002/jcc.21707.
- 34.Guyon I, Elisseeff A. An Introduction to Variable and Feature Selection. J Mach Learn Res. 2003. https://doi.org/10.1016/j.aca.2011.07.027.
- 35.Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007. https://doi.org/10.1093/bioinformatics/btm344.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.