
1 Introduction

Systems medicine finds its roots in systems biology, the scientific discipline that aims at a systems-level understanding of, for example, biological networks, cells, organs, organisms, and populations. It generally involves a combination of wet-lab experiments and computational (bioinformatics) approaches. Systems medicine extends systems biology by focusing on the application of systems-based approaches to clinically relevant applications in order to improve patient health or the overall well-being of (healthy) individuals [1]. Systems medicine is expected to change health care practice in the coming years. It will contribute to new therapeutics through the identification of novel disease genes that provide drug candidates less likely to fail in clinical studies [2, 3]. It is also expected to contribute to fundamental insights into networks perturbed by disease, improved prediction of disease progression, stratification of disease subtypes, personalized treatment selection, and prevention of disease. To enable systems medicine it is necessary to characterize the patient at various levels and, consequently, to collect, integrate, and analyze various types of data including not only clinical (phenotype) and molecular data, but also information about cells (e.g., disease-related alterations in organelle morphology), organs (e.g., lung impedance when studying respiratory disorders such as asthma or chronic obstructive pulmonary disease), and even social networks. The full realization of systems medicine therefore requires the integration and analysis of environmental, genetic, physiological, and molecular factors at different temporal and spatial scales, which currently is very challenging. It will require large efforts from various research communities to overcome current experimental, computational, and information management related barriers. In this chapter we show how bioinformatics is an essential part of systems medicine and discuss some of the future challenges that need to be solved.

2 Bioinformatics and High-Throughput Experimental Technologies

2.1 Bioinformatics in Biomedical Research

To understand the contribution of bioinformatics to systems medicine, it is helpful to consider the traditional role of bioinformatics in biomedical research, which involves basic and applied (translational) research to augment our understanding of (molecular) processes in health and disease. The term “bioinformatics” was first coined by the Dutch theoretical biologist Paulien Hogeweg in 1970 to refer to the study of information processes in biotic systems [4]. The field soon expanded, and bioinformatics efforts accelerated and matured as the first (whole) genome and protein sequences became available. The significance of bioinformatics further increased with the development of high-throughput experimental technologies that allowed wet-lab researchers to perform large-scale measurements. These include determining whole-genome sequences (and gene variants) and genome-wide gene expression with next-generation sequencing technologies (NGS; see Table 1 for abbreviations and web links) [5], measuring gene expression with DNA microarrays [6], identifying and quantifying proteins and metabolites with NMR or (LC/GC-)MS [7], measuring epigenetic changes such as methylation and histone modifications [8], and so on. These “omics” technologies are capable of measuring the many molecular building blocks that determine our (patho)physiology. Genome-wide measurements have not only significantly advanced our fundamental understanding of the molecular biology of health and disease but have also contributed to new (commercial) diagnostic and prognostic tests [9, 10] and the selection and development of (personalized) treatment [11]. Nowadays, bioinformatics is therefore defined as “Advancing the scientific understanding of living systems through computation” (ISCB), or more inclusively as “Conceptualizing biology in terms of molecules and applying ‘informatics techniques’ (derived from disciplines such as applied mathematics, computer science and statistics) to understand and organize the information associated with these molecules, on a large scale” [12].

Table 1 Abbreviations and websites

It is worth noting that solely measuring many molecular components of a biological system does not necessarily result in a deeper understanding of such a system. Understanding biological function does indeed require detailed insight into the precise function of these components but, more importantly, it requires a thorough understanding of their static, temporal, and spatial interactions. These interaction networks underlie all (patho)physiological processes, and elucidation of these networks is a major task for bioinformatics and systems medicine.

2.2 New Dimensions in Biomedical Research

The developments in experimental technologies have led to challenges that require additional expertise and new skills for biomedical researchers:

  • Information management. Modern biomedical research projects typically produce large and complex omics data sets, sometimes in the order of hundreds of gigabytes to terabytes, a large part of which has become available through public databases [13, 14], sometimes even prior to publication (e.g., GTEx, ICGC, TCGA). This not only contributes to knowledge dissemination but also facilitates reanalysis and meta-analysis of data, evaluation of hypotheses that were not considered by the original research group, and development and evaluation of new bioinformatics methods. The use of existing data can in some cases even make new (expensive) experiments superfluous. Alternatively, one can integrate publicly available data with data generated in-house for more comprehensive analyses, or to validate results [15]. In addition, the obligation to make raw data available may prevent fraud and selective reporting. The management (transfer, storage, annotation, and integration) of data and associated meta-data is one of the main and growing challenges in bioinformatics that needs attention to safeguard the progression of systems medicine.

  • Data analysis and interpretation. Bioinformatics data analysis and interpretation of omics data have become increasingly complex, not only due to the vast volumes and complexity of the data but also as a result of more challenging research questions. Bioinformatics covers many types of analyses including nucleotide and protein sequence analysis, elucidation of tertiary protein structures, quality control, pre-processing and statistical analysis of omics data, determination of genotype-phenotype relationships, biomarker identification, evolutionary analysis, analysis of gene regulation, reconstruction of biological networks, text mining of literature and electronic patient records, and analysis of imaging data. In addition, bioinformatics has developed approaches to improve experimental design of omics experiments to ensure that the maximum amount of information can be extracted from the data. Many of the methods developed in these areas are of direct relevance for systems medicine as exemplified in this chapter.

Clearly, new experimental technologies have to a large extent turned biomedical research into a data- and compute-intensive endeavor. It has been argued that the production of omics data has nowadays become the “easy” part of biomedical research, whereas the real challenges currently comprise information management and bioinformatics analysis. Consequently, next to the wet lab, the computer has become one of the main tools of the biomedical researcher.

3 Bioinformatics and Systems Medicine

Bioinformatics enables and advances the management and analysis of large omics-based datasets, thereby directly and indirectly contributing to systems medicine in several ways (Fig. 1):

Fig. 1 The contribution of bioinformatics (dark grey boxes) to systems medicine (black box). (Omics) experiments, patients, and public repositories provide a wide range of data that is used in bioinformatics and systems medicine studies

  1. Design of new omics experiments [16–18].

  2. Information management of omics and clinical data (Subheading 4).

  3. Quality control and pre-processing of omics data. Pre-processing typically involves data cleaning (e.g., removal of failed assays) and other steps to obtain quantitative measurements that can be used in downstream data analysis.

  4. (Statistical) data analysis methods for large and complex omics-based datasets. This includes methods for the integrative analysis of multiple omics data types (Subheading 5), and for the elucidation and analysis of biological networks (top-down systems medicine; Subheading 6).

Systems medicine comprises top-down and bottom-up approaches. The former represents a specific branch of bioinformatics, which distinguishes itself from bottom-up approaches in several ways [3, 19, 20]. Top-down approaches use omics data to obtain a holistic view of the components of a biological system and, in general, aim to construct system-wide static functional or physical interaction networks such as gene co-expression networks and protein-protein interaction networks. In contrast, bottom-up approaches aim to develop detailed mechanistic and quantitative mathematical models for sub-systems. These models describe the dynamic and nonlinear behavior of interactions between known components to understand and predict their behavior upon perturbation. However, in contrast to omics-based top-down approaches, these mechanistic models require information about chemical/physical parameters and reaction stoichiometry, which may not be available and require further (experimental) efforts. Both the top-down and bottom-up approaches result in testable hypotheses and new wet-lab or in silico experiments that may lead to clinically relevant findings.

4 Information Management for Systems Medicine

4.1 Public Databases in Systems Medicine

Biomedical research and, consequently, systems medicine are increasingly confronted with the management of continuously growing volumes of molecular and clinical data, results of data analyses and in silico experiments, and mathematical models. Due to policies of scientific journals and funding agencies, omics data is often made available to the research community via public databases. In addition, a wide range of databases have been developed, of which more than 1550 are currently listed in the Molecular Biology Database Collection [14], providing a rich source of biomedical information. Biological repositories do not merely archive data and models but also serve a range of purposes in systems medicine, as illustrated below with a few selected examples. The main repositories are hosted and maintained by the major bioinformatics institutes, including EBI, NCBI, and SIB, which make a major part of the raw experimental omics data available through a number of primary databases including GenBank [21], GEO [22], PRIDE [23], and MetaboLights [24] for sequence, gene expression, MS-based proteomics, and MS-based metabolomics data, respectively. In addition, many secondary databases provide information derived from the processing of primary data, for example pathway databases (e.g., Reactome [25], KEGG [26]), protein sequence databases (e.g., UniProtKB [27]), and many others. Pathway databases provide an important resource to construct mathematical models used to study and further refine biological systems [28, 29]. Other efforts focus on establishing repositories integrating information from multiple public databases. The integration of pathway databases [30–32], and genome browsers that integrate genetic, omics, and other data with whole-genome sequences [33, 34], are two examples of this. Joint initiatives of the bioinformatics and systems biology communities resulted in repositories such as BioModels, which contains mathematical models of biochemical and cellular systems [35], Recon 2, which provides a community-driven, consensus “metabolic reconstruction” of human metabolism suitable for computational modelling [36], and SEEK, which provides a platform designed for the management and exchange of systems biology data and models [37]. Another example of a database that may prove to be of value for systems medicine studies is MalaCards, an integrated and annotated compendium of about 17,000 human diseases [38]. MalaCards integrates 44 disease sources into disease cards and establishes gene-disease associations through integration with the well-known GeneCards database [39, 40]. Integration with GeneCards and cross-references within MalaCards enable the construction of networks of related diseases, revealing previously unknown interconnections among diseases, which may be used to identify drugs for off-label use. Another class of repositories are (expert-curated) knowledge bases containing domain knowledge and data, which aim to provide a single point of entry for a specific domain. Contents of these knowledge bases are often based on information extracted (either manually or by text mining) from literature or provided by domain experts [41–43]. Finally, databases are used routinely in the analysis, interpretation, and validation of experimental data.
For example, the Gene Ontology (GO) provides a controlled vocabulary of terms for describing gene products, and is often used in gene set analysis to evaluate expression patterns of groups of genes instead of those of individual genes [44]; it has, for example, been applied to investigate HIV-related cognitive disorders [45] and polycystic kidney disease [46].
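
To make the idea of gene set analysis concrete, the sketch below shows one common form of over-representation analysis: a one-sided hypergeometric test asking whether a GO term's genes are enriched among a list of differentially expressed genes. This is only an illustration of the general technique, not the specific method used in [44]; the gene identifiers, set sizes, and the `go_enrichment` helper are hypothetical.

```python
# Minimal sketch of gene set over-representation analysis (illustrative inputs):
# test whether the overlap between a differentially expressed (DE) gene list and a
# GO term's gene set is larger than expected by chance.
from scipy.stats import hypergeom

def go_enrichment(de_genes, go_term_genes, background_genes):
    """One-sided hypergeometric test for over-representation of a gene set."""
    background = set(background_genes)
    de = set(de_genes) & background
    term = set(go_term_genes) & background
    overlap = len(de & term)
    # P(X >= overlap), X ~ Hypergeom(M=|background|, n=|term|, N=|de|)
    p_value = hypergeom.sf(overlap - 1, len(background), len(term), len(de))
    return overlap, p_value

# Example with made-up gene identifiers
background = [f"gene{i}" for i in range(10000)]
de_genes = background[:200]                          # 200 "differentially expressed" genes
go_term = background[:50] + background[5000:5050]    # hypothetical GO term with 100 members
print(go_enrichment(de_genes, go_term, background))  # strong overlap, tiny p-value
```

In practice one would repeat this test for thousands of GO terms and correct for multiple testing.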

4.2 Phenotype Databases

Several repositories such as miR2Disease [47], PeroxisomeDB [41], and Mouse Genome Informatics (MGI) [43] include associations between genes and disorders, but only provide very limited phenotypic information. Phenotype databases are of particular interest to systems medicine. One well-known phenotype repository is the OMIM database, which primarily describes single-gene (Mendelian) disorders [48]. ClinVar is another example and provides an archive of reports and evidence of the relationships among medically important human variations found in patient samples and phenotypes [49]. ClinVar complements dbSNP (for single-nucleotide polymorphisms) [50] and dbVar (for structural variations) [51], which both provide only minimal phenotypic information. The integration of these phenotype repositories with genetic and other molecular information will be a major aim for bioinformatics in the coming decade enabling, for example, the identification of comorbidities, determination of associations between gene (mutations) and disease, and improvement of disease classifications [52]. It will also advance the definition of the “human phenome,” i.e., the set of phenotypes resulting from genetic variation in the human genome. To increase the quality and (clinical) utility of the phenotype and variant databases as an essential step towards reducing the burden of human genetic disease, the Human Variome Project coordinates efforts in standardization, system development, and (training) infrastructure for the worldwide collection and sharing of genetic variations that affect human health [53, 54].

4.3 Clinical Data

To implement and advance systems medicine to the benefit of patients’ health, it is crucial to integrate and analyze molecular data together with de-identified individual-level clinical data complementing general phenotype descriptions. Patient clinical data refers to a wide variety of data including basic patient information (e.g., age, sex, ethnicity), outcomes of physical examinations, patient history, medical diagnoses, treatments, laboratory tests, pathology reports, medical images, and other clinical outcomes. Inclusion of clinical data allows the stratification of patient groups into more homogeneous clinical subgroups. Availability of clinical data will increase the power of downstream data analysis and modeling to elucidate molecular mechanisms, and to identify molecular biomarkers that predict disease onset or progression, or which guide treatment selection. In biomedical studies clinical information is generally used as part of patient and sample selection, but some omics studies also use clinical data as part of the bioinformatics analysis (e.g., [9, 55]). However, in general, clinical data is unavailable from public resources or only provided on an aggregated level. Although good reasons exist for making clinical data available (Subheading 2.2), ethical and legal issues comprising patient and commercial confidentiality, and technical issues are the most immediate challenges [56, 57]. This potentially hampers the development of systems medicine approaches in a clinical setting since sharing and integration of clinical and nonclinical data is considered a basic requirement [1]. Biobanks [58] such as BBMRI [59] provide a potential source of biological material and associated (clinical) data but these are, generally, not publicly accessible, although permission to access data may be requested from the biobank provider. Clinical trials provide another source of clinical data for systems medicine studies, but these are generally owned by a research group or sponsor and not freely available [60] although ongoing discussions may change this in the future ([61] and references therein).

Although clinical data is not yet available on a large scale, the bioinformatics and medical informatics communities have been very active in establishing repositories that provide clinical data. One example is the Database of Genotypes and Phenotypes (dbGaP) [62] developed by the NCBI. Study metadata, summary-level (phenotype) data, and documents related to studies are publicly available. Access to de-identified individual-level (clinical) data is only granted after approval by an NIH data access committee. Another example is The Cancer Genome Atlas (TCGA) , which also provides individual-level molecular and clinical data through its own portal and the Cancer Genomics Hub (CGHub). Clinical data from TCGA is available without any restrictions but part of the lower level sequencing and microarray data can only be obtained through a formal request managed by dbGaP.

Medical patient records provide an even richer source of phenotypic information and have already been used to stratify patient groups, discover disease relations and comorbidities, and, in combination with molecular data, obtain a systems-level view of phenotypes (for a review see [63]). On the one hand, this integration facilitates refinement and analysis of the human phenome to, for example, identify diseases that are clinically uniform but have different underlying molecular mechanisms, or that share a pathogenetic mechanism but have different genetic causes [64]. On the other hand, using the same data, a phenome-wide association study (PheWAS) [65] allows the identification of unrelated phenotypes associated with specific shared genetic variant(s), an effect referred to as pleiotropy. Moreover, it makes use of information from medical records generated in routine clinical practice and, consequently, has the potential to strengthen the link between biomedical research and clinical practice [66]. The power of phenome analysis was demonstrated in a study involving 1.5 million patient records, not including genotype information, comprising 161 disorders. In this study it was shown that disease phenotypes form a highly connected network, suggesting a shared genetic basis [67]. Indeed, later studies that incorporated genetic data resulted in similar findings and confirmed a shared genetic basis for a number of different phenotypes. For example, a recent study identified 63 potentially pleiotropic associations through the analysis of 3144 SNPs that had previously been implicated by genome-wide association studies (GWAS) as mediators of human traits, and 1358 phenotypes derived from patient records of 13,835 individuals [68]. This demonstrates that phenotypic information extracted manually or through text mining from patient records can help to more precisely define (relations between) diseases. Another example comprises the text mining of psychiatric patient records to discover disease correlations [52]. Here, mapping of disease genes from the OMIM database to information from medical records resulted in protein networks suspected to be involved in psychiatric diseases.
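
The core computation behind a PheWAS can be sketched in a few lines: one genetic variant is tested against many record-derived phenotypes, typically with logistic regression adjusted for covariates and a multiple-testing correction. The code below is a minimal, hypothetical illustration on simulated data, not the pipeline used in the studies cited above; the covariate choice and Bonferroni threshold are assumptions.

```python
# Minimal PheWAS sketch (simulated data): for one SNP, test association with many
# binary phenotypes derived from patient records using logistic regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_patients, n_phenotypes = 2000, 50
genotype = rng.integers(0, 3, size=n_patients)                    # 0/1/2 copies of risk allele
age = rng.normal(55, 10, size=n_patients)                         # example covariate
phenotypes = rng.integers(0, 2, size=(n_patients, n_phenotypes))  # case/control phenotype codes

X = sm.add_constant(np.column_stack([genotype, age]))
results = []
for j in range(n_phenotypes):
    model = sm.Logit(phenotypes[:, j], X).fit(disp=0)
    results.append((j, model.params[1], model.pvalues[1]))  # genotype effect and p-value

bonferroni = 0.05 / n_phenotypes
hits = [r for r in results if r[2] < bonferroni]
print(f"{len(hits)} phenotype(s) pass the Bonferroni threshold {bonferroni:.1e}")
```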

5 Integrative Bioinformatics

Integrative bioinformatics comprises the integrative (statistical) analysis of multiple omics data types. Many studies demonstrated that using a single omics technology to measure a specific molecular level (e.g., DNA variation, expression of genes and proteins, metabolite concentrations, epigenetic modifications) already provides a wealth of information that can be used for unraveling molecular mechanisms underlying disease. Moreover, single-omics disease signatures which combine multiple (e.g., gene expression) markers have been constructed to differentiate between disease subtypes to support diagnosis and prognosis. However, no single technology can reveal the full complexity and details of molecular networks observed in health and disease due to the many interactions across these levels. A systems medicine strategy should ideally aim to understand the functioning of the different levels as a whole by integrating different types of omics data. This is expected to lead to biomarkers with higher predictive value, and novel disease insights that may help to prevent disease and to develop new therapeutic approaches. Integrative bioinformatics can also facilitate the prioritization and characterization of genetic variants associated with complex human diseases and traits identified by GWAS in which hundreds of thousands to over a million SNPs are assayed in a large number of individuals. Although such studies lack the statistical power to identify all disease-associated loci [69], they have been instrumental in identifying loci for many common diseases. However, it remains difficult to prioritize the identified variants and to elucidate their effect on downstream pathways ultimately leading to disease [70]. Consequently, methods have been developed to prioritize candidate SNPs based on integration with other (omics) data such as gene expression, DNase hypersensitive sites, histone modifications, and transcription factor-binding sites [71].
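
As a toy illustration of the SNP prioritization idea mentioned at the end of the previous paragraph, the sketch below scores candidate GWAS SNPs by (weighted) overlap with functional annotations. Real prioritization methods are far more sophisticated; the annotation categories, weights, and boolean overlaps here are entirely hypothetical.

```python
# Toy annotation-based SNP prioritization: each candidate SNP is scored by counting
# weighted overlaps with functional annotations (eQTL evidence, DNase hypersensitive
# sites, histone marks, TF binding sites). Overlaps are simulated booleans.
import numpy as np

rng = np.random.default_rng(9)
snps = [f"rs{1000 + i}" for i in range(15)]
annotations = ["eqtl", "dnase", "histone_mark", "tf_binding"]
weights = {"eqtl": 2.0, "dnase": 1.0, "histone_mark": 1.0, "tf_binding": 1.5}  # assumed weights

overlap = {snp: {a: bool(rng.integers(0, 2)) for a in annotations} for snp in snps}
scores = {snp: sum(weights[a] for a in annotations if overlap[snp][a]) for snp in snps}

# Report the top-ranked candidates
for snp, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(snp, score)
```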

5.1 Data Integration

The integration of multiple omics data types is far from trivial and various approaches have been proposed [7274]. One approach is to link different types of omics measurements through common database identifiers. Although this may seem straightforward, in practice this is complicated as a result of technical and standardization issues as well as a lack of biological consensus [32, 7577]. Moreover, the integration of data at the level of the central dogma of molecular biology and, for example, metabolite data is even more challenging due to the indirect relationships between genes, transcripts, and proteins on the one hand and metabolites on the other hand, precluding direct links between the database identifiers of these molecules.

Statistical data integration [72] is a second commonly applied strategy, and various approaches have been applied for the joint analysis of multiple data types (e.g., [78, 79]). One example of statistical data integration is provided by a TCGA study that measured various types of omics data to characterize breast cancer [80]. In this study 466 breast cancer samples were subjected to whole-genome and -exome sequencing, and SNP arrays to obtain information about somatic mutations, copy number variations, and chromosomal rearrangements. Microarrays and RNA-Seq were used to determine mRNA and microRNA expression levels, respectively. Reverse-phase protein arrays (RPPA) and DNA methylation arrays were used to obtain data on protein expression levels and DNA methylation, respectively. Simultaneous statistical analysis of different data types via a “cluster-of-clusters” approach using consensus clustering on a multi-omics data matrix revealed that four major breast cancer subtypes could be identified. This showed that the intrinsic subtypes (basal, luminal A and B, HER2) that had previously been determined using gene expression data only could be largely confirmed in an integrated analysis of a large number of breast tumors.
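
The "cluster-of-clusters" idea can be illustrated with a short sketch: each omics matrix is clustered separately, the per-platform cluster assignments are encoded as binary indicators, and the samples are then re-clustered on that indicator matrix to obtain integrated subtypes. This is a loose, simulated-data illustration of the concept, not the exact TCGA implementation (which used consensus clustering); the cluster numbers and k-means choice are assumptions.

```python
# Illustrative "cluster-of-clusters" sketch on simulated multi-omics data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n_samples = 100
omics = {
    "mrna": rng.normal(size=(n_samples, 500)),
    "mirna": rng.normal(size=(n_samples, 100)),
    "methylation": rng.normal(size=(n_samples, 300)),
}

# Step 1: platform-specific clustering of the samples
per_platform_labels = []
for name, X in omics.items():
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    per_platform_labels.append(labels)

# Step 2: build a binary indicator matrix of per-platform cluster memberships
indicator_cols = []
for labels in per_platform_labels:
    for k in range(4):
        indicator_cols.append((labels == k).astype(float))
indicator = np.column_stack(indicator_cols)

# Step 3: cluster the samples again on the combined indicator matrix
integrated_subtype = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(indicator)
print(np.bincount(integrated_subtype))  # sizes of the integrated subtypes
```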

5.2 Multi-omics Disease Signatures

Single-level omics data has been used extensively to identify disease-associated biomarkers such as genes, proteins, and metabolites. In fact, these studies led to more than 150,000 papers documenting thousands of claimed biomarkers. However, it is estimated that fewer than 100 of these are currently used in routine clinical practice [81]. Integration of multiple omics data types is expected to result in more robust and predictive disease profiles since these better reflect disease biology [82]. Further improvement of these profiles may be obtained through the explicit incorporation of interrelationships between various types of measurements, such as microRNA–mRNA target relationships or gene methylation–microRNA relationships (based on a common target gene). This was demonstrated for the prediction of short-term and long-term survival from serous cystadenocarcinoma TCGA data [83].

6 Biological Networks

According to the recent CASyM roadmap: “Human disease can be perceived as perturbations of complex, integrated genetic, molecular and cellular networks and such complexity necessitates a new approach” [84]. In this section we discuss how (approximations of) these networks can be constructed from omics data and how these networks can be decomposed into smaller modules. Then we discuss how the resulting modules can be used to generate experimentally testable hypotheses, provide insight into disease mechanisms, lead to predictive diagnostic and prognostic models, and help to further subclassify diseases [55, 85] (Fig. 2). Such top-down network-based approaches will provide medical doctors with molecular-level support to make personalized treatment decisions.

Fig. 2 Overview of network-based approaches for systems medicine (Subheading 6)

6.1 Top-Down Network Reconstruction

In a top-down approach the aim of network reconstruction is to infer the connections between the molecules that constitute a biological network. Network models can be created using a variety of mathematical and statistical techniques and data types. Early approaches for network inference (also called reverse engineering) used only gene expression data to reconstruct gene networks. Here, we discern three types of gene network inference algorithms: (1) correlation-based approaches, (2) information-theoretic approaches, and (3) Bayesian networks [86].

Co-expression networks are an extension of commonly used clustering techniques, in which genes are connected by edges in a network if the correlation of their gene expression profiles exceeds a certain value. Co-expression networks have been shown to connect functionally related genes [87]. Note that connections in a co-expression network correspond to either direct (e.g., transcription factor-gene and protein-protein) or indirect (e.g., proteins participating in the same pathway) interactions. In one of the earliest examples of this approach, pair-wise correlations were calculated between gene expression profiles and the level of growth inhibition caused by thousands of tested anticancer agents, for 60 cancer cell lines [88]. Removal of associations weaker than a certain threshold value resulted in networks consisting of highly correlated genes and agents, called relevance networks, which led to targeted hypotheses for potential single-gene determinants of chemotherapeutic susceptibility.
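
The basic construction of such a correlation-based (relevance) network is simple to sketch: compute all pairwise correlations and keep the gene pairs whose absolute correlation exceeds a threshold. The data and the threshold below are purely illustrative, and real analyses would also assess the significance of each correlation.

```python
# Minimal correlation-based co-expression ("relevance") network on simulated data:
# connect two genes if the absolute Pearson correlation of their expression profiles
# across samples exceeds a threshold.
import numpy as np

rng = np.random.default_rng(2)
n_genes, n_samples = 200, 60
expr = rng.normal(size=(n_genes, n_samples))   # rows: genes, columns: samples

corr = np.corrcoef(expr)                       # gene-by-gene Pearson correlation matrix
threshold = 0.7                                # illustrative cutoff
edges = [
    (i, j, corr[i, j])
    for i in range(n_genes)
    for j in range(i + 1, n_genes)
    if abs(corr[i, j]) >= threshold
]
print(f"{len(edges)} edges with |r| >= {threshold}")
```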

Information-theoretic approaches have been proposed in order to capture nonlinear dependencies assumed to be present in most biological systems and that cannot be captured by correlation-based distance measures. These approaches often use the concept of mutual information, a generalization of the correlation coefficient that quantifies the degree of statistical (in)dependence. An example of a network inference method that is based on mutual information is ARACNe, which has been used to reconstruct the human B-cell gene network from a large compendium of human B-cell gene expression profiles [89]. In order to discover regulatory interactions, ARACNe removes the majority of putative indirect interactions from the initial mutual information-based gene network using a theorem from information theory, the data processing inequality. This led to the identification of MYC as a major hub in the B-cell gene network and a number of novel MYC target genes, which were experimentally validated. Whether information-theoretic approaches are in general more powerful than correlation-based approaches is still a subject of debate [90].
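
The two ingredients of this approach, mutual information estimation and pruning by the data processing inequality, can be sketched as follows. This is a strongly simplified, simulated-data illustration of the ARACNe idea, not the published implementation; the binning scheme, the mutual information threshold, and the absence of a DPI tolerance parameter are all assumptions.

```python
# ARACNe-style sketch (simplified): estimate mutual information (MI) between gene pairs
# from binned expression data, then apply the data processing inequality (DPI) to drop,
# within every fully connected gene triplet, the edge with the lowest MI as indirect.
import numpy as np
from itertools import combinations
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(3)
n_genes, n_samples = 30, 200
expr = rng.normal(size=(n_genes, n_samples))

def mi(x, y, bins=8):
    """Mutual information between two expression profiles after equal-width binning."""
    return mutual_info_score(np.digitize(x, np.histogram_bin_edges(x, bins)),
                             np.digitize(y, np.histogram_bin_edges(y, bins)))

mi_matrix = np.zeros((n_genes, n_genes))
for i, j in combinations(range(n_genes), 2):
    mi_matrix[i, j] = mi_matrix[j, i] = mi(expr[i], expr[j])

edges = {(i, j) for i, j in combinations(range(n_genes), 2) if mi_matrix[i, j] > 0.05}

# DPI: for every triangle, remove the weakest edge as a putative indirect interaction
for i, j, k in combinations(range(n_genes), 3):
    triangle = [(i, j), (i, k), (j, k)]
    if all(e in edges for e in triangle):
        weakest = min(triangle, key=lambda e: mi_matrix[e[0], e[1]])
        edges.discard(weakest)

print(f"{len(edges)} edges retained after DPI pruning")
```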

Bayesian networks allow the description of statistical dependencies between variables in a generic way [91, 92]. Bayesian networks are directed acyclic networks in which the edges of the network represent conditional dependencies; that is, nodes that are not connected represent variables that are conditionally independent of each other. A major bottleneck in the reconstruction of Bayesian networks is their computational complexity. Moreover, Bayesian networks are acyclic and cannot capture feedback loops that characterize many biological networks. When time-series rather than steady-state data is available, dynamic Bayesian networks provide a richer framework in which cyclic networks can be reconstructed [93].

Gene (co-)expression data only offers a partial view on the full complexity of cellular networks. Consequently, networks have also been constructed from other types of high-throughput data. For example, physical protein-protein interactions have been measured on a large scale in different organisms including human, using affinity capture-mass spectrometry or yeast two-hybrid screens, and have been made available in public databases such as BioGRID [94]. Regulatory interactions have been probed using chromatin immunoprecipitation sequencing (ChIP-Seq) experiments, for example by the ENCODE consortium [95].

Using probabilistic techniques, heterogeneous types of experimental evidence and prior knowledge have been integrated to construct functional association networks for human [96], mouse [97], and, most comprehensively, more than 1100 organisms in the STRING database [98]. Functional association networks can help predict novel pathway components, generate hypotheses for biological functions for a protein of interest, or identify disease-related genes [97]. Prior knowledge required for these approaches is, for example, available in curated biological pathway databases, and via protein associations predicted using text mining based on their co-occurrence in abstracts or even full-text articles. Many more integrative network inference methods have been proposed; for a review see [99]. The integration of gene expression data with ChIP data [100] or transcription factor-binding motif data [101] has been shown to be particularly fruitful for inferring transcriptional regulatory networks. Recently, Li et al. [102] described the results from a regression-based model that predicts gene expression using ENCODE (ChIP-Seq) and TCGA data (mRNA expression data complemented with copy number variation, DNA methylation, and microRNA expression data). This model infers the regulatory activities of expression regulators and their target genes in acute myeloid leukemia samples. Eighteen key regulators were identified, whose activities clustered consistently with cytogenetic risk groups.
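
The general flavor of such regression-based regulator-activity inference can be illustrated with a small sketch: per sample, gene expression is regressed on a gene-by-regulator binding matrix (e.g., derived from ChIP-Seq peaks), and the fitted coefficients are read as sample-specific regulator activities. This is inspired by, but does not reproduce, the model of Li et al.; the ridge penalty, matrix sizes, and simulated data are assumptions.

```python
# Simplified regression-based inference of regulator activities on simulated data.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
n_genes, n_regulators, n_samples = 1000, 20, 50
binding = rng.integers(0, 2, size=(n_genes, n_regulators)).astype(float)  # binary binding matrix
true_activity = rng.normal(size=(n_regulators, n_samples))
expression = binding @ true_activity + rng.normal(scale=0.5, size=(n_genes, n_samples))

# Fit one ridge regression per sample: expression_g = sum_r binding_{g,r} * activity_{r,s}
activities = np.zeros((n_regulators, n_samples))
for s in range(n_samples):
    model = Ridge(alpha=1.0).fit(binding, expression[:, s])
    activities[:, s] = model.coef_

# Recovered activities should correlate with the simulated ground truth
print(round(float(np.corrcoef(activities[0], true_activity[0])[0, 1]), 2))
```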

Bayesian networks have also been used to integrate multi-omics data. The combination of genotypic and gene expression data is particularly powerful, since DNA variations represent naturally occurring perturbations that affect gene expression, detected as expression quantitative trait loci (eQTLs). Cis-acting eQTLs can then be used as constraints in the construction of directed Bayesian networks to infer causal relationships between nodes in the network [103].

6.2 Module-Based Approaches

Large multi-omics datasets consisting of hundreds or sometimes even thousands of samples are available for many commonly occurring human diseases, such as most tumor types (TCGA), Alzheimer’s disease [104], and obesity [105]. However, a major bottleneck for the construction of accurate gene networks is that, even with such large datasets, the number of gene networks compatible with the experimental data remains enormous. In other words, top-down network inference is an underdetermined problem with many possible solutions that explain the data equally well, and individual gene-gene interactions are characterized by a high false-positive rate [99]. Most network inference methods therefore try to constrain the number of possible solutions by making certain assumptions about the structure of the network. Perhaps the most commonly used strategy to harness the complexity of the gene network inference problem is to analyze experimental data in terms of biological modules, that is, sets of genes that have strong interactions and a common function [106]. There is considerable evidence that many biological networks are modular [107]. Module-based approaches effectively constrain the number of parameters to estimate and are in general also more robust to the noise that characterizes high-throughput omics measurements. A detailed review of module-based techniques is outside the scope of this chapter (see, for example, [108]), but we would like to mention a few examples of successful and commonly used modular approaches.

Weighted gene co-expression network analysis (WGCNA) decomposes a co-expression network into modules using clustering techniques [109]. Modules can be summarized by their module eigengene, a weighted average expression profile of all genes in a given module. Eigengenes can then be correlated with external sample traits to identify modules that are related to these traits. Parikshak et al. [110] used WGCNA to extract modules from a co-expression network constructed using fetal and early postnatal brain development expression data. Next, they established that several of these modules were enriched for genes and rare de novo variants implicated in autism spectrum disorder (ASD). Moreover, the ASD-associated modules are also linked at the transcriptional level, and 17 transcription factors were found acting as putative co-regulators of ASD-associated gene modules during neocortical development. WGCNA can also be used when multiple omics data types are available. One example of such an approach involved the integration of transcriptomic and proteomic data from a study investigating the response to SARS-CoV infection in mice [111]. In this study WGCNA-based gene and protein co-expression modules were constructed and integrated to obtain module-based disease signatures. Interestingly, the authors found several cases of identifier-matched transcripts and proteins that correlated well with the phenotype, but which showed poor correlation or anticorrelation across these two data types. Moreover, the highest correlating transcripts and peptides were not the most central ones in the co-expression modules. Vice versa, the transcripts and proteins that defined the modules were not those with the highest correlation to the phenotype. At the very least this shows that integration of omics data affects the nature of the disease signatures.
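
The eigengene-and-trait step can be illustrated compactly: a module eigengene is essentially the first principal component of the (standardized) expression of the module's genes, and it is then correlated with an external sample trait. The sketch below is WGCNA-style on simulated data; module membership, the trait, and the use of a plain SVD (rather than the full WGCNA pipeline) are assumptions.

```python
# Minimal WGCNA-style module eigengene on simulated data: first principal component of the
# standardized module expression matrix, correlated with a clinical trait.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(5)
n_samples = 80
hidden_signal = rng.normal(size=n_samples)                     # shared module behaviour
module_expr = np.array([hidden_signal * rng.uniform(0.5, 1.5) +
                        rng.normal(scale=0.5, size=n_samples)
                        for _ in range(30)])                   # 30 genes x 80 samples

# Standardize genes, then take the first right singular vector as the eigengene
z = (module_expr - module_expr.mean(axis=1, keepdims=True)) / module_expr.std(axis=1, keepdims=True)
_, _, vt = np.linalg.svd(z, full_matrices=False)
eigengene = vt[0]                                              # one value per sample (sign arbitrary)

trait = hidden_signal + rng.normal(scale=1.0, size=n_samples)  # e.g., a clinical measurement
r, p = pearsonr(eigengene, trait)
print(f"module-trait correlation r={r:.2f}, p={p:.1e}")
```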

Identification of active modules is another important integrative modular technique. Here, experimental data in the form of molecular profiles is projected onto a biological network, for example a protein-protein interaction network. Active modules are those subnetworks that show the largest change in expression for a subset of conditions and are likely to contain key drivers or regulators of the processes perturbed in the experiment. Active modules have, for example, been used to find a subnetwork that is overexpressed in a particularly aggressive lymphoma subtype [112] and to detect significantly mutated pathways [113]. Some active module approaches integrate various types of omics data. One example of such an approach is PARADIGM [114], which translates pathways into factor graphs, a class of models that belongs to the same family as Bayesian networks, and determines sample-specific pathway activity from multiple functional genomic datasets. PARADIGM has been used in several TCGA projects, for example in the integrated analysis of 131 urothelial bladder carcinomas [55]. PARADIGM-based analysis of copy number variations and RNA-Seq gene expression in combination with a propagation-based network analysis algorithm revealed novel associations between mutations and gene expression levels, which subsequently resulted in the identification of pathways altered in bladder cancer. The identification of activating or inhibiting gene mutations in these pathways suggested new targets for treatment. Moreover, this effort clearly showed the benefits of screening patients for the presence of specific mutations to enable personalized treatment strategies.
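
A minimal active-module search can be sketched in the spirit of the early subnetwork-scoring methods (a greedy search over a node-scored network), which is much simpler than PARADIGM but conveys the idea of projecting per-gene statistics onto a network and growing high-scoring subnetworks. The network, z-scores, seed choice, and aggregate score Z = sum(z)/sqrt(k) are illustrative assumptions; the `networkx` package is required.

```python
# Illustrative greedy "active module" search on a simulated network with per-gene z-scores.
import math
import networkx as nx
import numpy as np

rng = np.random.default_rng(6)
graph = nx.erdos_renyi_graph(200, 0.03, seed=6)             # stand-in for a PPI network
z_scores = {n: rng.normal() for n in graph.nodes}
for n in list(graph.nodes)[:10]:                            # plant a "perturbed" region
    z_scores[n] += 3.0

def aggregate(nodes):
    """Aggregate module score: sum of member z-scores, normalized by sqrt(module size)."""
    return sum(z_scores[n] for n in nodes) / math.sqrt(len(nodes))

seed = max(graph.nodes, key=lambda n: z_scores[n])          # start from the top-scoring node
module = {seed}
improved = True
while improved:
    improved = False
    neighbours = set().union(*(set(graph.neighbors(n)) for n in module)) - module
    for cand in sorted(neighbours, key=lambda n: z_scores[n], reverse=True):
        if aggregate(module | {cand}) > aggregate(module):  # grow only if the score improves
            module.add(cand)
            improved = True
            break

print(f"active module of {len(module)} nodes, score {aggregate(module):.2f}")
```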

6.3 Network-Based Disease Signatures

Often, published disease signatures cannot be replicated [81] or provide hardly any additional biological insight. Here too, (modular) network-based approaches have been proposed to alleviate these problems. A common characteristic of most methods is that the molecular activity of a set of genes is summarized on a per-sample basis. Summarized gene set scores are then used as features in prognostic and predictive models. Relevant gene sets can be based on prior knowledge and correspond to canonical pathways, gene ontology categories, or sets of genes sharing common motifs in their promoter regions [115]. Gene set scores can also be determined by projecting molecular data onto a biological network and summarizing scores at the level of subnetworks for each individual sample [116]. While promising in principle, it is still a subject of debate whether gene set-based models outperform gene-based ones [117].
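
One simple way to build such a gene set-based model is sketched below: each gene set is summarized per sample as the mean z-scored expression of its members, and the resulting set-by-sample matrix feeds a standard classifier. The gene sets, expression data, labels, and the choice of mean z-score and logistic regression are illustrative assumptions, not the specific methods of [115, 116].

```python
# Minimal sketch: per-sample gene set scores as classifier features instead of single genes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n_genes, n_samples = 2000, 120
expr = rng.normal(size=(n_genes, n_samples))
labels = rng.integers(0, 2, size=n_samples)                 # e.g., good vs. poor prognosis
gene_sets = [rng.choice(n_genes, size=40, replace=False) for _ in range(100)]

# Per-gene z-scores across samples, then the mean within each gene set per sample
z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
set_scores = np.vstack([z[genes].mean(axis=0) for genes in gene_sets])  # sets x samples

clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, set_scores.T, labels, cv=5).mean()
print(f"cross-validated accuracy on simulated data: {acc:.2f}")  # ~0.5 for random labels
```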

6.4 Crossing the Species Boundary

The comparative analysis of networks across different species is another commonly used approach to constrain the solution space. Patterns conserved across species have been shown to be more likely to represent true functional interactions [107] and to harbor useful candidates for human disease genes [118]. Many network alignment methods have been developed in the past decade to identify commonalities between networks. These methods in general combine sequence-based and topological constraints to determine the optimal alignment of two (or more) biological networks. Network alignment has, for example, been applied to detect conserved patterns of protein interaction in multiple species [107, 119] and to analyze the evolution of co-expression networks between humans and mice [120, 121]. Network alignment can also be applied to detect diverged patterns [120] and may thus lead to a better understanding of similarities and differences between animal models and humans in health and disease. Information from model organisms has also been fruitfully used to identify more robust disease signatures [122–125]. Sweet-Cordero and co-workers [122] used a gene signature identified in a mouse model of lung adenocarcinoma to uncover an orthologous signature in human lung adenocarcinoma that was not otherwise apparent. Bild et al. [123] defined gene expression signatures characterizing several oncogenic pathways of human mammary epithelial cells. They showed that these signatures predicted pathway activity in mouse and human tumors. Predictions of pathway activity correlated well with the sensitivity to drugs targeting those pathways and could thus serve as a guide to targeted therapies. A generic approach, Pathprint, for the integration of gene expression data across different platforms and species at the level of pathways, networks, and transcriptionally regulated targets was recently described [126]. The authors used their method to identify four stem cell-related pathways conserved between human and mouse in acute myeloid leukemia, with good prognostic value in four independent clinical studies.

6.5 From Networks to Medicine

We reviewed a wide array of different approaches showing how networks can be used to elucidate integrated genetic, molecular, and cellular networks. However, in general no single approach will be sufficient, and combining different approaches in more complex analysis pipelines will be required. This is fittingly illustrated by the DIGGIT (Driver-gene Inference by Genetical-Genomics and Information Theory) algorithm [127]. In brief, DIGGIT identifies candidate master regulators from an ARACNe gene co-expression network integrated with copy number variations that affect gene expression. This method combines several previously developed computational approaches and was used to identify causal genetic drivers of human disease in general, and of glioblastoma, breast cancer, and Alzheimer’s disease in particular. This enabled the identification of KLHL9 deletions as upstream activators of two previously established master regulators in a specific subtype of glioblastoma.

7 Discussion

Systems medicine is one of the steps necessary to make improvements in the prevention and treatment of disease through systems approaches that will (a) elucidate (patho)physiologic mechanisms in much greater detail than currently possible, (b) produce more robust and predictive disease signatures, and (c) enable personalized treatment. In this context, we have shown that bioinformatics has a major role to play.

Bioinformatics will continue its role in the development, curation, integration, and maintenance of (public) biological and clinical databases to support biomedical research and systems medicine. The bioinformatics community will strengthen its activities in various standardization and curation efforts that have already resulted in minimum reporting guidelines [128], data capture approaches [75], data exchange formats [129], and terminology standards for annotation [130]. One challenge for the future is to remove errors and inconsistencies in data and annotation from databases and to prevent new ones from being introduced [32, 76, 131–135]. An equally important challenge is to establish, improve, and integrate resources containing phenotype and clinical information. To achieve this objective it seems reasonable that bioinformatics and health informatics professionals team up [136–138]. Traditionally, health informatics professionals have focused on hospital information systems (e.g., patient records, pathology reports, medical images), data exchange standards (e.g., HL7), medical terminology standards (e.g., the International Classification of Disease (ICD), SNOMED), medical image analysis, analysis of clinical data, clinical decision support systems, and so on. Bioinformatics, on the other hand, has mainly focused on molecular data, but it shares many approaches and methods with health informatics. Integration of these disciplines is therefore expected to benefit systems medicine in various ways [139].

Integrative bioinformatics approaches clearly have added value for systems medicine as they provide a better understanding of biological systems, result in more robust disease markers, and prevent (biological) bias that would possibly occur from using single-omics measurements. However, such studies, and the scientific community in general, would benefit from improved strategies to disseminate and share data which typically will be produced at multiple research centers (e.g., https://www.synapse.org; [140]). Integrative studies are expected to increasingly facilitate personalized medicine approaches such as demonstrated by Chen and co-workers [141]. In their study they presented a 14-month “integrative personal omics profile” (iPOP) for a single individual comprising genomic, transcriptomic, proteomic, metabolomic, and autoantibody data. From the whole-genome sequence data an elevated risk for type 2 diabetes (T2D) was detected, and subsequent monitoring of HbA1c and glucose levels revealed the onset of T2D, despite the fact that the individual lacked many of the known non-genetic risk factors. Subsequent treatment resulted in a gradual return to the normal phenotype. This shows that the genome sequence can be used to determine disease risk in a healthy individual and allows selecting and monitoring specific markers that provide information about the actual disease status.

Network-based approaches will increasingly be used to determine the genetic causes of human diseases. Since the effect of a genetic variation is often tissue- or cell-type-specific, a large effort is needed to construct cell-type-specific networks both in health and disease. This can be done using data already available, an approach taken by Guan et al. [142]. The authors proposed 107 tissue-specific networks in mouse via their generic approach for constructing functional association networks, using low-throughput, highly reliable tissue-specific gene expression information as a constraint. One could also generate new datasets to facilitate the construction of tissue-specific networks. Examples of such approaches are TCGA and the Genotype-Tissue Expression (GTEx) project. The aim of GTEx is to create a data resource for the systematic study of genetic variation and its effect on gene expression in more than 40 human tissues [143]. Regardless of how networks are constructed, it will become more and more important to offer a centralized repository where networks from different cell types and diseases can be stored and accessed. Nowadays, these networks are difficult to retrieve and are scattered across supplementary files of the original papers and accompanying web pages, or are not available at all. A resource similar to what the systems biology community has created with the BioModels database would be a great leap forward. There have been some initial attempts at building databases of network models, for example the CellCircuits database [123] (http://www.cellcircuits.org) and the causal biological networks (CBN) database of networks related to lung disease [144] (http://causalbionet.com). However, these are only small-scale initiatives and a much larger and coordinated effort is required.

Another main bottleneck for the successful application of network inference methods is their validation. Most network inference methods to date have been applied to one or a few isolated datasets and were validated using limited follow-up experiments, for example via gene knockdowns, using prior knowledge from databases and literature as a gold standard, or by generating simulated data from a mathematical model of the underlying network [145, 146]. However, the strengths and weaknesses of network inference methods across cell types, diseases, and species have hardly been assessed. Notable exceptions are collaborative competitions such as the Dialogue on Reverse Engineering Assessment and Methods (DREAM) [147] and Industrial Methodology for Process Verification (IMPROVER) [146]. These centralized initiatives propose challenges in which individual research groups can participate and to which they can submit their predictions, which can then be independently validated by the challenge organizers. Several DREAM challenges in the area of network inference have been organized, leading to better insight into the strengths and weaknesses of individual methods [148]. Another important contribution of DREAM is that a crowd-based approach integrating predictions from multiple network inference methods was shown to give good and robust performance across diverse data sets [149]. In the area of systems medicine too, challenge-based competitions may offer a framework for independent verification of model predictions.
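
The crowd-based integration mentioned above can be sketched very simply: each method ranks the same list of candidate edges by confidence, and the community prediction orders edges by their average rank across methods. This is a simplified illustration of the "wisdom of crowds" idea from the DREAM challenges, not the exact published procedure; the edge scores below are random stand-ins for real method outputs.

```python
# Illustrative rank-average ("wisdom of crowds") aggregation of edge predictions
# from multiple network inference methods, on simulated confidence scores.
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(8)
n_edges, n_methods = 1000, 5
# Confidence scores per method for the same candidate edges (higher = more confident)
scores = rng.random(size=(n_methods, n_edges))

# Convert each method's scores to ranks (1 = most confident), then average across methods
ranks = np.vstack([rankdata(-scores[m]) for m in range(n_methods)])
community_rank = ranks.mean(axis=0)
top_edges = np.argsort(community_rank)[:20]                  # indices of the top-ranked edges
print(top_edges)
```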

Systems medicine promises a more personalized medicine that effectively exploits the growing amount of molecular and clinical data available for individual patients. Solid bioinformatics approaches are of crucial importance for the success of systems medicine. However, truly delivering on the promises of systems medicine will require an overall change of research approach that transcends the current reductionist approach and results in a tighter integration of clinical, wet-lab, and computational groups adopting a systems-based view. Past, current, and future success of systems medicine will accelerate this change.