Connecting genetics and gene expression data for target prioritisation and drug repositioning
Developing new drugs continues to be a highly inefficient and costly business. By repurposing an existing compound for a different indication, drug repositioning offers an attractive alternative to traditional drug discovery. Most of these approaches work by matching transcriptional disease signatures to anti-correlated gene expression profiles of drug perturbations. Genome-wide association studies (GWASs) are of great interest to researchers in the pharmaceutical industry because drug programmes with supporting genetic evidence are more likely to successfully progress through the drug discovery pipeline.
Here, we present a systematic approach to generate drug repositioning hypotheses based on disease genetics by mining public repositories of GWAS data and drug transcriptomic profiles. We find that genes genetically associated with a given disease are more likely to be differentially expressed in that disease (p-value = 1.54e-17 and AUC = 0.75) and that, in existing drug–disease combinations, genes significantly up- or down-regulated after drug treatment are enriched for genes genetically associated with that disease (p-value = 1.1e-79 and AUC = 0.64). Finally, we use this framework to generate and rank novel GWAS-driven drug repositioning predictions.
Keywords: Drug discovery, Drug repositioning, Genomics, Transcriptomics
Abbreviations
AUC: Area under the curve
DEG: Differentially expressed gene
EFO: Experimental Factor Ontology
FDR: False discovery rate
GWAS: Genome-wide association study
LINCS: Library of Integrated Network-based Cellular Signatures
PheWAS: Phenome-wide association study
ROC: Receiver operating characteristic
The discovery, development and commercialisation of a new drug is a long, expensive and often failure-prone process [1, 2, 3]. Drug repositioning can be a time- and cost-effective alternative where existing compounds are repurposed for diseases different from the original indication [4, 5]. These approaches can be subdivided into multiple classes, though a majority of recent computational work has focussed on two: drug-based, relying on chemical structure similarity and predictions of drug–target interactions, and disease-based, where transcriptomic readouts of disease samples and drug perturbations are combined.
The latter was popularised by the Connectivity Map [7, 8], an in silico pipeline to reverse-match transcriptional disease signatures with gene expression profiles obtained by perturbing cellular systems with a large panel of compounds. The Library of Integrated Network-based Cellular Signatures (LINCS) project greatly expanded the pool of compound profiles, triggering further development of computational methods for drug repositioning [10, 11] as well as approaches for the validation of these in silico predictions.
Selecting the right targets is a key decision early in the drug discovery pipeline: a large proportion of the efficacy failures in clinical programmes are due to the lack of a clear link between the therapeutic target and the disease under investigation. There is growing recognition that supporting genetic evidence from genome-wide association studies (GWASs) or phenome-wide association studies (PheWASs) linking target and disease can significantly increase the chances of success of drug discovery programmes [15, 16]. The large number of GWASs conducted over the past decade has delivered insights into the causal links of several diseases, and a growing number of genes are expected to be implicated in disease as the size of these studies grows, even though not all associations might be as meaningful as previously thought.
GWASs, PheWASs, Connectivity Map approaches [21, 22, 23] and Open Targets have all been used to repurpose drugs. Here, we combine disease data from GWASs with drug perturbation transcriptional profiles and a Connectivity Map-inspired method to generate repositioning hypotheses that, unlike those in standard expression-based repurposing workflows, are supported by genetic evidence.
Software and code
STOPGAP is a database containing associations between DNA mutations occurring in diseases and their likely target genes. It includes rare disease associations as well as data from GWASs. Single nucleotide polymorphisms (SNPs) in regulatory regions are assigned to target genes on the basis of supporting evidence, including eQTL and regulatory genomics data. The complete dataset (294,505 associations between 20,015 genes and 1746 medical terms) was downloaded from https://github.com/StatGenPRD/STOPGAP/blob/master/STOPGAP_data/stopgap.gene.mesh.RData.
Open Targets maps diseases to relevant genes using a number of evidence types, including genes differentially expressed in the disease, germline and somatic mutations, curated pathway, animal model and literature data as well as known drugs approved for the treatment of the disease. The Open Targets API at http://api.opentargets.io/v3/platform/docs was accessed on 20th June 2017 and used to download lists of genes differentially expressed in disease (216,942 associations between 22,190 genes and 148 diseases) and links between 595 diseases and 1555 approved drugs (7351 associations).
The LINCS L1000 dataset consists of gene expression profiles obtained by perturbing different cell lines with a large collection of compounds. To obtain the complete genome-wide dataset in a convenient format, we used the Harmonizome, a large collection of uniformly processed biological datasets. The file downloaded from http://amp.pharm.mssm.edu/static/hdfs/harmonizome/data/lincscmapchemical/gene_attribute_edges.txt.gz contained 4,189,677 associations between 3924 compounds and 8347 genes differentially expressed after treatment, with a median of 257 genes changing for each compound.
STOPGAP data: gene–disease associations from rare disease sources (OMIM and Orphanet) were excluded. To ensure compatibility with other resources, gene symbols were mapped to Ensembl gene IDs with the EnsDb.Hsapiens.v75 package. MeSH terms were mapped to terms in the Experimental Factor Ontology (EFO) using Zooma.
LINCS L1000 data from Harmonizome: Entrez gene IDs were mapped to Ensembl gene IDs. Compound IDs were mapped to ChEMBL IDs with UniChem, using PubChem IDs as an intermediate.
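The two-step compound mapping described above (source compound ID to PubChem ID, then PubChem ID to ChEMBL ID) can be illustrated with a minimal Python sketch. This is not the code used in the study: the function names are hypothetical, every identifier in the toy data is a made-up placeholder, and dropping records that cannot be mapped is an assumption, since the text does not state how unmapped IDs were handled.

```python
def chain_mapping(first, second):
    """Compose two ID mappings (source -> intermediate -> target).

    Source IDs whose intermediate ID has no target are dropped
    (an assumption made for this sketch).
    """
    return {src: second[mid] for src, mid in first.items() if mid in second}


def remap_edges(edges, compound_map, gene_map):
    """Translate (compound, gene) pairs into the target ID spaces.

    Pairs with an unmapped compound or gene are skipped.
    """
    remapped = set()
    for compound, gene in edges:
        if compound in compound_map and gene in gene_map:
            remapped.add((compound_map[compound], gene_map[gene]))
    return remapped


if __name__ == "__main__":
    # Placeholder identifiers only; none of these are real database IDs.
    compound_to_pubchem = {"cpd_A": "pubchem_1", "cpd_B": "pubchem_2", "cpd_C": "pubchem_3"}
    pubchem_to_chembl = {"pubchem_1": "chembl_X", "pubchem_3": "chembl_Z"}
    entrez_to_ensembl = {"entrez_1": "ens_1", "entrez_2": "ens_2"}

    compound_to_chembl = chain_mapping(compound_to_pubchem, pubchem_to_chembl)
    edges = [("cpd_A", "entrez_1"), ("cpd_B", "entrez_2"), ("cpd_C", "entrez_9")]
    print(remap_edges(edges, compound_to_chembl, entrez_to_ensembl))
```

Using a shared identifier space on both sides (Ensembl for genes, ChEMBL for compounds) is what allows gene sets from different resources to be intersected directly.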
EFO IDs and ChEMBL IDs were used to match diseases and drugs across different resources, respectively. A Fisher’s exact test was used to perform enrichment tests between gene sets and to generate repositioning hypotheses. Results were corrected for multiple hypothesis testing using the Benjamini–Hochberg correction, and only results passing a false discovery rate (FDR) threshold of 5% or lower were considered significant. The Mann–Whitney U test was used to assess whether distributions were significantly different. Receiver operating characteristic (ROC) curves and confidence intervals were calculated with the pROC package, using a bootstrap procedure with 1000 iterations. The riverplot package was used to draw the Sankey diagram, while all other plots were generated with ggplot2.
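The enrichment and multiple-testing steps can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the analysis code behind the reported results: the function names are hypothetical, the test is assumed to be one-sided (over-representation), the gene universe must be supplied by the caller, and the gene and drug labels in the demo are placeholders.

```python
from math import comb


def fisher_enrichment_p(set_a, set_b, universe):
    """One-sided Fisher's exact test for over-representation.

    Returns P(overlap >= observed) under the hypergeometric null,
    with both gene sets restricted to the given universe.
    """
    a = set_a & universe
    b = set_b & universe
    n = len(universe)
    observed = len(a & b)
    total = comb(n, len(b))
    upper = min(len(a), len(b))
    tail = sum(
        comb(len(a), i) * comb(n - len(a), len(b) - i)
        for i in range(observed, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy example with placeholder gene and drug labels.
    universe = {f"g{i}" for i in range(200)}
    disease_genes = {f"g{i}" for i in range(20)}
    drug_signatures = {
        "drug_1": {f"g{i}" for i in range(10, 40)},
        "drug_2": {f"g{i}" for i in range(100, 130)},
    }
    names = list(drug_signatures)
    pvals = [fisher_enrichment_p(disease_genes, drug_signatures[d], universe) for d in names]
    for name, p, q in zip(names, pvals, benjamini_hochberg(pvals)):
        print(name, p, q, "significant" if q <= 0.05 else "not significant")
```

The choice of universe matters: restricting it to genes measurable in both resources avoids inflating significance with genes that could never appear in the overlap.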
We showed that genes genetically associated with a disease often significantly overlap with genes differentially expressed in the same disease, as well as with genes induced or repressed by drugs used to treat that disease. To our knowledge, this is the first report to test and validate these hypotheses. We presented a simple approach to generate target prioritisation and drug repositioning hypotheses that are driven by the genetic background of the disease. Unlike more conventional repurposing approaches that rely on reverse matching of drug and disease transcriptomic signatures, we have taken advantage of the notion that genetic evidence is crucial to maximise the chances of success of drug discovery programmes [15, 16].
Our in silico framework returns a large number of statistically significant results, and validating these hypotheses would require extensive experimental work. We believe this is a major limitation of our work: we are acutely aware of the many challenges and low success rates of drug discovery programmes and recognise that a considerable proportion of our hits could be false positives. Diseases with many associated genes and drugs eliciting large transcriptional responses are more likely to yield significant results simply because of the size of these gene sets and the methodology used to compute significance.
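The dependence on gene set size can be illustrated numerically. The sketch below uses a hypothetical universe of 1000 genes and made-up set sizes, and assumes a one-sided hypergeometric (Fisher) test; it is not derived from the study data. Both scenarios have an observed overlap twice as large as expected by chance, yet only the larger gene set yields a small p-value.

```python
from math import comb


def enrichment_p(overlap, size_a, size_b, universe_size):
    """P(overlap >= observed) under the hypergeometric null (one-sided)."""
    total = comb(universe_size, size_b)
    upper = min(size_a, size_b)
    tail = sum(
        comb(size_a, i) * comb(universe_size - size_a, size_b - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    universe_size = 1000   # hypothetical number of testable genes
    drug_genes = 100       # hypothetical size of a drug signature

    # Same two-fold enrichment in both cases (expected overlap = size_a * 0.1).
    for disease_genes, overlap in [(10, 2), (100, 20)]:
        p = enrichment_p(overlap, disease_genes, drug_genes, universe_size)
        print(f"disease genes={disease_genes:4d} overlap={overlap:3d} p={p:.3g}")
```

Reporting an effect size such as the odds ratio alongside the p-value is one way to make hits from small and large gene sets more comparable.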
Another issue is the lack of directionality in the genetics data we use to represent the disease space. While other Connectivity Map-inspired methods exploit up- or down-regulated genes in the transcriptomic data to identify compound profiles reversing a disease signature [7, 10, 11], our method does not take directionality into account. This could result in false positives, including predictions which could actually worsen the disease state.
In conclusion, our work represents a proof of concept that combining disease genetic and drug transcriptomic data is a valuable approach for GWAS-based drug repositioning. However, we recognise that much work remains to be done to improve its real-world applicability and would like to encourage further research in this area.
We would like to thank Gautier Koscielny and Andrew Rouillard for enabling and facilitating access to data.
The authors received no specific funding for this work.
Availability of data and materials
The datasets analysed during the current study are available at the following repositories: STOPGAP (https://github.com/StatGenPRD/STOPGAP/blob/master/STOPGAP_data/stopgap.gene.mesh.RData); Harmonizome (http://amp.pharm.mssm.edu/static/hdfs/harmonizome/data/lincscmapchemical/gene_attribute_edges.txt.gz) and Open Targets (http://api.opentargets.io/v3/platform/docs).
The datasets generated during the current study are included in this published article and its additional files.
EF conceived the initial idea. EF and PA refined the proposed workflow. EF performed the data analysis. EF and PA worked on the interpretation of the results. EF wrote the first draft of the manuscript. EF and PA made changes and additions to the manuscript. Both authors read and approved the final manuscript.
Ethics approval and consent to participate
EF and PA are full-time employees of GSK (GlaxoSmithKline).
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 7. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet J-P, Subramanian A, Ross KN, Reich M, Hieronymus H, Wei G, Armstrong SA, Haggarty SJ, Clemons PA, Wei R, Carr SA, Lander ES, Golub TR. The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science. 2006;313:1929–35.
- 8. Qu XA, Rajpal DK. Applications of connectivity map in drug discovery and development. Drug Discov Today. 2012;17(23–24):1289–98.
- 9. Vidović D, Koleti A, Schürer SC. Large-scale integration of small molecule-induced genome-wide transcriptional responses, Kinome-wide binding affinities and cell-growth inhibition profiles reveal global trends characterizing systems-level drug action. Front Genet. 2014;5(SEP):1–14.
- 12. Brown AS, Patel CJ. A review of validation strategies for computational drug repositioning. Brief Bioinform. 2016:bbw110.
- 24. Khaladkar M, Koscielny G, Hasan S, Agarwal P, Dunham I, Rajpal D, Sanseau P. Uncovering novel repositioning opportunities using the open targets platform. Drug Discov Today. 2017;22(12):1800–1807.
- 25. R Core Team. R: a language and environment for statistical computing. 2017. https://www.r-project.org/.
- 27. Koscielny G, An P, Carvalho-Silva D, Cham JA, Fumis L, Gasparyan R, Hasan S, Karamanis N, Maguire M, Papa E, Pierleoni A, Pignatelli M, Platt T, Rowland F, Wankar P, Bento AP, Burdett T, Fabregat A, Forbes S, Gaulton A, Gonzalez CY, Hermjakob H, Hersey A, Jupe S, Kafkas Ş, Keays M, Leroy C, Lopez F-J, Magarinos MP, Malone J, et al. Open targets: a platform for therapeutic target identification and validation. Nucleic Acids Res. 2017;45:D985–94.
- 29. Rainer J. EnsDb.Hsapiens.v75. 2016. https://doi.org/10.18129/B9.bioc.EnsDb.Hsapiens.v75.
- 31. EMBL-EBI. Zooma. 2017. http://www.ebi.ac.uk/spot/zooma/.
- 34. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc. 1995;57:289–300.
- 37. Weiner J. riverplot. 2017. https://CRAN.R-project.org/package=riverplot.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.