Clinical bioinformatics for complex disorders: a schizophrenia case study
In the diagnosis of complex diseases such as neurological pathologies, a wealth of clinical and molecular information is often available to help the interpretation. Yet, the pieces of information are usually considered in isolation and rarely integrated due to the lack of a sound statistical framework. This lack of integration results in the loss of valuable information about how disease associated factors act synergistically to cause the complex phenotype.
Here, we investigated complex psychiatric diseases as networks. The networks were used to integrate data originating from different profiling platforms. The weighted links in these networks capture the association between the analyzed factors and allow the quantification of their relevance for the pathology. The heterogeneity of the patient population was analyzed by clustering and graph theoretical procedures. We provided an estimate of the heterogeneity of the schizophrenia patient population and detected a subgroup of patients featuring remarkable abnormalities in a network of serum primary fatty acid amides. We compared the stability of this molecular network in an extended dataset between schizophrenia and affective disorder patients and found more stable structures in the latter.
We quantified robust associations between analytes measured with different profiling platforms as networks. The methodology allows the quantitative evaluation of the complexity of the disease. The identified disease patterns can then be further investigated with regard to their diagnostic utility or used to help predict novel therapeutic targets. The applied framework is able to enhance the understanding of complex psychiatric diseases, and may give novel insights into drug development and personalized medicine approaches.
Keywords: Schizophrenia; Schizophrenia Patient; Bipartite Network; Graph Theoretical Approach; Standard Laboratory Test
Clinical bioinformatics is concerned with the analysis and visualization of complex medical datasets. In contrast to the classical 'mainstream' bioinformatics field, which focuses on the analysis of biological information (see [2, 3, 4, 1] for an introduction to clinical bioinformatics), the main focus here is to collate heterogeneous data sets from disparate sources (e.g. patient clinical records, proteomics and transcriptomics data) and to develop novel algorithms for the analysis of such heterogeneous data sets. Thus, the key goal is the simultaneous evaluation of clinical and basic research data with the aim of improving medical care and care provision (See  for data exploitation methods in cancer therapy development). For complex diseases such as psychiatric disorders, a wealth of information about patients is usually available. This includes clinical data, standard laboratory evaluations, genetic data, brain imaging data and data obtained from molecular profiling experiments. Recent technological advances allow high-throughput experiments to be performed, resulting in an enormous increase in the amount of biomedical data generated. Yet, the different sources of data are commonly kept separate, which means that valuable information is lost or neglected. Due to this lack of integrated analysis, the importance of, and the relationships between, clinical observations and the underlying molecular mechanisms are not understood. In clinical bioinformatics, a major aim is to combine these different sources of information and identify emerging features of the diseases under investigation. These features may reveal links to other pathologies and uncover networks of relationships between different diseases.
Novel clinical bioinformatics approaches could thus provide a better understanding and definition of complex diseases, resulting in more accurate diagnosis and better therapies. In recent years, the need for personalized medicine has been increasingly appreciated, as it has become apparent that standard treatment approaches are rarely effective across the entire patient population. Schizophrenia is a good example of a disorder that presents with a broad spectrum of clinical manifestations, which is almost certainly due to the existence of diverse underlying etiologies that happen to present clinically with similar symptoms. Difficult diagnosis and the low success of current drug regimes are an inevitable consequence. It would be highly desirable to identify patient subgroups corresponding exactly to the underlying disease pathology, thus facilitating the choice of the most appropriate treatment.
Here, we present a clinical bioinformatics approach to improve diagnosis and understanding of complex psychiatric diseases, which entails the application of a graph theoretical approach that captures information about patients and all disease associated data in a network. We investigate the relationships between patient specific variables and the disease and show how dependencies between these variables can be used to obtain important insights into disease pathologies and are directly related to improved diagnostic approaches. We use this methodology for an integrated assessment of data derived from different profiling platforms and standard laboratory tests and show how it can improve the understanding of highly heterogeneous disorders.
Results and discussion
Schizophrenia – a complex disease
The clinical data used in this study was derived from two different profiling platforms and standard laboratory tests. Metabolites in the CSF of 77 individuals (33 first episode drug naive patients suffering from acute schizophrenia (DSM-IV: 295.30 or 298.8) and 44 demographically matched healthy volunteers) were profiled by ¹H-NMR. Serum proteins of the same subjects were investigated by LC mass spectrometry. For the NMR dataset, signals corresponding to the same molecules were averaged. In the mass spectrometric dataset, each variable referred to a mass spectrometric peak including adduct formation and isotopes. The study also includes measurements of CSF and serum glucose levels derived from a standard laboratory test on all patients.
Importantly, these associations may reveal disease relevant features such as dependencies between symptom severity and genetic background or levels of blood proteins. For psychiatric disorders, dependency structures such as the association between clinical features such as hallucinations or delusions and respective molecular abnormalities are not well understood.
Interestingly, the abnormalities found for primary fatty acid amides were not related to the significant differences we observed in CSF glucose and glutamate levels, which are a known feature of the schizophrenia pathology.
Mapping patients on the "diseasome" network
The concept of encoding relationships between individuals, diseases or molecules into networks is called network medicine and has been shown to produce biologically interesting results in the investigation of disease gene associations. As noted by Loscalzo et al., the application of network concepts in complex diseases can result in novel approaches for diagnosis and treatment. Here, we applied this concept to achieve an integrated representation of complex disorders encoding all relevant information simultaneously. Graph theoretical approaches are well suited to capture the complexity of human diseases and provide a theoretical framework to easily incorporate molecular readouts and patient information to give a comprehensive description of a disease state.
Using the graph theoretical approach, the similarity of patients can be readily determined from the integrated patient information, enabling the assessment of disease similarity and, possibly, the subclassification of patients. This would be particularly desirable for psychiatric disorders, for which the highly heterogeneous symptoms may result from different etiologies and possibly contribute to the low efficacy of current drug regimes. Extending the concept of subclassifying patient cohorts to the single patient level leads to a conceptual framework often referred to as personalized medicine. Patient specific information can be incorporated into the network approach and may allow for an individualized assessment of a given patient's disease state. Besides facilitating more efficient treatment approaches, a system of robust yet patient specific hallmarks of a complex disorder would be invaluable in the design of clinical trials, the development of new drug candidates or the identification of novel drug targets. In the context of psychiatric disorders, a personalized diagnosis and treatment approach would be of particular value, as patients' responsiveness to treatment cannot currently be predicted, impeding appropriate and successful therapy.
In this study, we analyzed complex psychiatric diseases in the form of disease networks. We quantified robust associations between analytes measured with different profiling platforms and standard laboratory tests and were able to determine a subgroup of patients that featured remarkable abnormality in a molecular system of primary fatty acid amides. The results were validated in an extended dataset of schizophrenia patients and the network properties were compared to those present in affective disorder. We found that in affective disorder, the molecular networks were more profoundly altered when compared to schizophrenia. The methodology helps to statistically assess the complexity of a given disease, and disease associated patterns can then be further investigated with regard to their diagnostic utility or used to help predict novel therapeutic targets. The applied framework is able to enhance the understanding of complex psychiatric diseases and may give valuable insights into drug development and personalized medicine approaches.
State-of-the-art methods in clinical bioinformatics
There is currently tremendous growth in the amount of high-throughput life science data, paralleled by similar growth in the electronic storage of clinical data. Bioinformatics is in a period of great capability, providing methodologies that complement experimental life science research and keep pace with the growing availability of a variety of high-throughput molecular biology data. Physicians and biologists are now pressing with more challenging requests. The most important issue is the integration of the different types of high-throughput (omics) data; the second is the integration of molecular biology data with clinical data. The integration of such large amounts of heterogeneous data marks the start of a new golden age for artificial intelligence, and in particular for machine learning techniques related to clinical bioinformatics.
In clinical bioinformatics for complex diseases, data from multiple sources are integrated, constituting a multiscale challenge. Data integration on large-scale datasets has been successfully applied for gene function prediction (see  and references therein). A different approach tried to derive more informative data from heterogeneous datasets by means of consensus clustering [12, 13]. In the clinical context, a kernel method based on Support Vector Machines was introduced to combine microarray data with clinical data for diagnosing breast cancer patients. The method has recently been applied to the combination of proteomics and microarray datasets derived from rectal cancer patients. So far, methods used in clinical bioinformatics approaches have focussed on the improvement of predictive power by integrating additional information.
Here, we follow a different approach by setting up a comprehensive analysis framework reaching from the initial stage of consistent data collection to integrated disease investigation. The basic procedure is as follows: First, we combine data from disparate sources such as molecular, clinical or phenotypological data into a compound dataset. Then, exploratory data analysis is performed to determine the relevance of single variables for a given pathology. If significant dependencies are found, we proceed towards an investigation of these relationships by means of graph theory and clustering methods. These methods were applied to determine robust, disease associated patterns or molecular/clinical abnormalities. Such patterns can then be used to determine the disease state of a given individual, e.g. to assist diagnosis or evaluate treatment success.
Statistics for complex diseases
This powerful methodology is quite general and would be ideal for preliminary data exploration of the meaningful variables for the different disease phenotypes. In our case, the F statistic gives us information about the link between measurement response and disease phenotype.
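As a rough illustration of how an F statistic links a measured variable to the disease phenotype, the sketch below computes the ordinary one-way ANOVA F statistic for a single variable. This is a simplification and not the functional ANOVA (FANOVA) used in the study; the function name `one_way_f` and the example values are hypothetical.

```python
from statistics import mean


def one_way_f(groups):
    """Ordinary one-way ANOVA F statistic for a list of groups of values.

    NOTE: plain one-way ANOVA, used here as a stand-in for illustration;
    it is not the functional ANOVA described in the text.
    """
    k = len(groups)                        # number of phenotype groups
    n = sum(len(g) for g in groups)        # total number of subjects
    grand = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical levels of one analyte in controls and patients.
controls = [1.0, 2.0, 3.0]
patients = [4.0, 5.0, 6.0]
f_value = one_way_f([controls, patients])
```

A large F value indicates that the between-group differences dominate the within-group scatter for that variable.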
Care has to be taken to account for multiplicity, as it becomes increasingly likely to obtain significant F statistics by chance as the number of investigated variables increases. Several methods exist to adjust for multiple hypothesis testing. A widely accepted method is the control of the False Discovery Rate (FDR), the expected proportion of false positive findings among all rejected hypotheses. This procedure is more powerful than more classical approaches such as the Bonferroni correction, as it is based on the rationale that a few false positive findings are not too problematic if the number of positive findings is high.
Furthermore, FDR procedures do not assume that variables are independent; in fact, it has been proven that the FDR procedure is applicable to datasets containing certain forms of dependency between the variables. Here we adjust p-values resulting from FANOVA using the FDR procedure suggested by Benjamini and Hochberg [19]. If single features were significant after the multiplicity adjustment, dependencies between the features were investigated. The dependency structure contains valuable information about the relationships between the investigated variables that is not apparent from exploratory analysis.
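The Benjamini and Hochberg adjustment can be sketched in a few lines. The example below is a minimal illustration using the standard step-up formulation; the function name `benjamini_hochberg`, the raw p-values and the 0.05 threshold are assumptions made for the example and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per measured variable.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.6]
q_vals = benjamini_hochberg(p_vals)
significant = [q <= 0.05 for q in q_vals]
```

Only the variables flagged in `significant` would be carried forward to the dependency analysis.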
To analyze the dependencies between the different variables, we first encoded the data matrix into a bipartite graph of N patients and M molecules. Here, every patient n_i is connected to a variable m_j if this patient has an abnormal state of the variable. The 'abnormality' of the state of a variable is defined with respect to the distribution of the same variable in the control population: a link between a patient and a variable was only built if the value of the variable lay more than three standard deviations from the control mean. This procedure generated a bipartite graph in which one partition contained all variables and the other partition all patients.
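The construction of the patient-variable graph described above can be sketched as follows. This is a minimal illustration that assumes the sample standard deviation of the controls is used; the function name `abnormality_edges`, the subject identifiers and all values are hypothetical.

```python
from statistics import mean, stdev


def abnormality_edges(controls, patients, n_sd=3.0):
    """Link a patient to a variable when the patient's value lies more
    than n_sd control standard deviations away from the control mean.

    controls, patients: dicts mapping a subject id to {variable: value}.
    Returns a sorted list of (patient_id, variable) edges.
    """
    variables = list(next(iter(controls.values())).keys())
    edges = set()
    for var in variables:
        reference = [row[var] for row in controls.values()]
        mu, sd = mean(reference), stdev(reference)  # sample SD (assumption)
        for pid, row in patients.items():
            if abs(row[var] - mu) > n_sd * sd:
                edges.add((pid, var))
    return sorted(edges)


# Hypothetical measurements for five controls and two patients.
controls = {
    "C1": {"glucose": 5.0, "amide": 1.0},
    "C2": {"glucose": 5.2, "amide": 1.1},
    "C3": {"glucose": 4.8, "amide": 0.9},
    "C4": {"glucose": 5.1, "amide": 1.0},
    "C5": {"glucose": 4.9, "amide": 1.0},
}
patients = {
    "P1": {"glucose": 6.0, "amide": 1.05},
    "P2": {"glucose": 5.1, "amide": 0.5},
}
edges = abnormality_edges(controls, patients)
```

Each returned pair is one link between the patient partition and the variable partition of the graph.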
In the present study, we use clustering procedures on the bipartite graph to investigate the dependencies between molecular compounds. Based on the bipartite graph, a graph with M nodes can be constructed that reflects which variables are altered in patients simultaneously. This procedure is performed for all pairs of variables, setting the weight of the respective link equal to the number of patients in which both variables have abnormal levels. The resulting graph contains information about the joint relevance of the variables for the disease state.
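The projection onto the variables can be illustrated with a short sketch: every pair of variables is weighted by the number of patients linked to both. The function name `project_onto_variables` and the example edges are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def project_onto_variables(edges):
    """Weight each pair of variables by the number of patients in which
    both variables are abnormal.

    edges: iterable of (patient_id, variable) links.
    Returns a dict mapping (variable_a, variable_b) to the patient count.
    """
    by_patient = defaultdict(set)
    for patient, variable in edges:
        by_patient[patient].add(variable)
    weights = defaultdict(int)
    for variables in by_patient.values():
        # Every pair of variables abnormal in this patient gains one count.
        for a, b in combinations(sorted(variables), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical patient-variable links.
example_edges = [
    ("P1", "A"), ("P1", "B"),
    ("P2", "A"), ("P2", "B"), ("P2", "C"),
    ("P3", "C"),
]
weights = project_onto_variables(example_edges)
```

Pairs that never co-occur in any patient are simply absent from the result, which corresponds to a missing link in the variable graph.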
We use a clustering algorithm that does not need information on the number of clusters, which is often unknown in large-scale comparisons. Although there is now a wide range of clustering algorithms, only a restricted number can successfully handle a complete, weighted graph. Among them, we cite the recent method proposed by  that is based on simulated annealing and obtains a clustering by direct maximization of the modularity. The modularity was introduced by  and measures the difference between the number of links inside a given module and the expected value for a randomized graph of the same size and degree distribution. The modularity Q of a partition of a network is defined as

$$Q = \sum_{s} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^{2} \right]$$

where the sum is over all modules of the partition, l_s and d_s denote the number of links and the total degree of the nodes inside module s, and L is the total number of links of the network. In a recent work on resolution limits in community detection, the authors give evidence that modularity optimization may fail to identify modules smaller than a certain scale, depending on the total number of links in the network and on the number of connections between the clusters. Because of its properties, we ultimately implemented the Markov Clustering Algorithm (MCL, [24]). Its input is a stochastic matrix where each element is the probability of a transition between adjacent nodes. The weights between m_i and m_j were given by the frequency with which variables m_i and m_j were altered together, i.e. how often abnormalities of both molecules were recorded in the same patient.
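To make the modularity concrete, the sketch below evaluates Q for a given partition, assuming the standard form Q = sum over modules s of [l_s/L - (d_s/(2L))^2] that matches the symbols defined in the text. It handles only undirected, unweighted links; the function name `modularity` and the toy graph are hypothetical.

```python
def modularity(edges, membership):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) links; membership: dict node -> module label.
    Assumes Q = sum_s [ l_s / L - (d_s / (2 L))**2 ].
    """
    L = len(edges)
    links_inside = {}   # l_s: links with both ends in module s
    degree_sum = {}     # d_s: total degree of the nodes in module s
    for u, v in edges:
        for node in (u, v):
            s = membership[node]
            degree_sum[s] = degree_sum.get(s, 0) + 1
        if membership[u] == membership[v]:
            s = membership[u]
            links_inside[s] = links_inside.get(s, 0) + 1
    return sum(links_inside.get(s, 0) / L - (d / (2 * L)) ** 2
               for s, d in degree_sum.items())


# Toy graph: two triangles joined by a single link.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_modules = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
one_module = {node: 1 for node in two_modules}
q_split = modularity(toy_edges, two_modules)
q_whole = modularity(toy_edges, one_module)
```

Splitting the toy graph into its two triangles gives a higher Q than keeping all nodes in one module, which is the behaviour the measure is designed to reward.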
For the clustering of the bipartite network, we incorporated the modularity measure into the MCL algorithm. The result of the clustering procedure is largely dominated by the choice of the contraction parameter r (called the inflation parameter in the MCL literature): low values of r result in large clusters, whereas the network is decomposed into single nodes at high values of r. For each arising cluster, we increased r until the cluster was split into at least two sub-clusters; we then used the modularity of a bipartite network [25] to determine whether the split increased the modularity across all clusters. If the modularity was improved, the clustering procedure was continued at the respective community; otherwise it was continued at the next community until no cluster remained.
where the sum is over all edges and the entropy is normalized by the total number of edges, L. This might be used to detect the best clustering obtained after a long series of clusterings with different granularity parameters each time.
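The role of the parameter r is easiest to see in a bare-bones version of Markov clustering. The sketch below implements only the basic expansion and inflation loop for a small dense matrix; it does not include the modularity-guided refinement described above, and the function names, the self-loop weight, the iteration count and the toy graph are assumptions made for illustration.

```python
def normalise_columns(matrix):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(adjacency, r=2.0, iterations=100, self_loop=1.0):
    """Basic Markov clustering of a small weighted adjacency matrix.

    r is the parameter applied in the inflation step; self_loop and
    iterations are illustrative defaults, not values from the study.
    """
    n = len(adjacency)
    # Add self-loops and turn the weights into transition probabilities.
    m = [[adjacency[i][j] + (self_loop if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    m = normalise_columns(m)
    for _ in range(iterations):
        # Expansion: two-step random walk (matrix squared).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to the power r, then renormalise.
        m = normalise_columns([[x ** r for x in row] for row in m])
    # Rows that keep non-zero mass belong to attractor nodes; the columns
    # they cover form one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy graph: two triangles (0, 1, 2) and (3, 4, 5) joined by the link 2-3.
links = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = [[0.0] * 6 for _ in range(6)]
for u, v in links:
    adjacency[u][v] = adjacency[v][u] = 1.0
clusters = mcl(adjacency, r=2.0)
```

Re-running `mcl` with a larger r on a single cluster's sub-matrix is one way to attempt the successive splitting, with a modularity check deciding whether a split is kept.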
The entropy can also be used to study the stability of communities obtained from the clustering procedures. Due to the repeated noisy realizations of the original matrix, nodes may be attached to different communities after the clustering procedure. However, if the investigated system is very stable, nodes tend to cluster with the same communities regardless of the added noise. The stability of the different communities can be investigated by analyzing the entropy as a function of the clustering parameter r as the network breaks down into increasingly separated clusters as r increases.
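A possible way to compute such an entropy is sketched below. The exact formula is not reproduced in the text, so the sketch assumes a binary entropy per edge: p is the fraction of noisy runs in which the two end nodes of an edge fall into the same cluster, and the per-edge entropies are summed over all edges and divided by L. The function name `clustering_entropy` and the example runs are hypothetical.

```python
from math import log2


def clustering_entropy(edges, clusterings):
    """Average per-edge binary entropy over repeated noisy clusterings.

    edges: list of (u, v) links of the original network.
    clusterings: list of dicts node -> cluster label, one per noisy run.
    ASSUMPTION: p is the fraction of runs in which u and v share a
    cluster; the entropy form is inferred, not quoted from the source.
    """
    total = 0.0
    for u, v in edges:
        together = sum(1 for c in clusterings if c[u] == c[v])
        p = together / len(clusterings)
        if 0.0 < p < 1.0:  # p = 0 or p = 1 contributes no entropy
            total -= p * log2(p) + (1 - p) * log2(1 - p)
    return total / len(edges)  # normalise by the number of edges L


# Hypothetical result of four noisy runs on a three-node path a-b-c.
path_edges = [("a", "b"), ("b", "c")]
runs = [
    {"a": 1, "b": 1, "c": 1},
    {"a": 1, "b": 1, "c": 2},
    {"a": 1, "b": 1, "c": 1},
    {"a": 1, "b": 1, "c": 2},
]
entropy = clustering_entropy(path_edges, runs)
```

Under this assumed definition, a value near zero means that nodes keep the same partners across noisy runs, while values near one indicate unstable communities.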
Schizophrenia data used in this study
In this study, cerebrospinal fluid (CSF) and serum samples from a large cohort of 77 individuals were used (for a detailed characterization of the patient population see ). CSF surrounds the brain and, besides providing mechanical protection, is a transport medium for important molecules. Due to its close proximity to the brain, it is likely that pathological abnormalities of the brain are reflected in the CSF. Serum effectively samples most body tissues and fluids and is also an important carrier of signalling molecules. In both serum and CSF samples, a global metabolic profiling was conducted. CSF samples were profiled using proton NMR spectroscopy, serum samples using Liquid Chromatography Mass Spectrometry. The profiled samples (CSF and serum) included 33 samples from drug naive first onset schizophrenia patients and 44 samples taken from healthy volunteers. In an extended analysis we profiled serum samples from 91 additional samples comprising 33 antipsychotic naive first onset schizophrenia patients, 39 samples obtained from patients suffering from affective disorder and 15 controls. We assessed glucose concentrations in the serum and CSF of all individuals and integrated the information with mass spectrometric and NMR data. The ethical committees of the Medical Faculty of the University of Cologne approved the protocols of this study. Informed consent was given in writing by all participants and clinical investigations were conducted according to the principles expressed in the Declaration of Helsinki.
This research was kindly supported by the Stanley Medical Research Institute (SMRI). We want to thank all members of the Bahn Laboratory for discussions, help and encouragement. We also want to thank Dr. Elaine Holmes for performing the NMR experiments and Dr. Hilary Major for her help during the acquisition of the LC-MS data that are the basis for the investigations in this manuscript. E.S. holds a Cambridge European Trust scholarship. S. B. holds a NARSAD Essel Independent Investigator Fellowship.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 12, 2009: Bioinformatics Methods for Biomedical Complex System Applications. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S12.
- 1. Trent RJ: Clinical Bioinformatics. In Methods in Molecular Medicine. 1st edition. Humana Press Inc., U.S; 2007.
- 2. Atanu Biswas, JPF, MRS, Sujay Datta (Eds): Statistical Advances in the Biomedical Sciences: Clinical Trials, Epidemiology, Survival Analysis, and Bioinformatics. In Probability and Statistics. Wiley; 2008.
- 6. Schwarz E, Bahn S: The utility of biomarker discovery approaches for the detection of disease mechanisms in psychiatric disorders. Br J Pharmacol. 2008, 155(Suppl 6):795–6.
- 8. Holmes E, Tsang T, Huang J, Leweke F, Koethe D, Gerth C, Nolden B, Gross S, Schreiber D, Nicholson J, Bahn S: Metabolic profiling of CSF: evidence that early intervention may impact on disease progression and outcome in schizophrenia. PLoS Med 2006, 3(8):e327.
- 13. Filkov V, Skiena S: Heterogeneous data integration with the consensus clustering formalism. Lecture notes in computer science 2004.
- 14. Daemen A, Gevaert O, De Moor B: Integration of clinical and microarray data with kernel methods. Proceedings of the 29th Annual International Conference of the IEEE EMBS, Eng Med Biol Soc 2007, 5411–5.
- 15. Daemen A, Gevaert O, De Bie T, Debucquoy A, Machiels J, De Moor B, Haustermans K: Integrating microarray and proteomics data to predict the response on cetuximab in patients with rectal cancer. Pac Symp Biocomput 2008, 166–77.
- 16. Ramsay JO, Dalzell CJ: Some tools for functional data analysis. Journal of the Royal Statistical Society, Series B 1991, 53: 539–572.
- 19. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc Ser B 1995, 57: 289–300.
- 24. van Dongen S: Graph Clustering by Flow Simulation. PhD thesis. University of Utrecht; 2000.
- 25. Barber MJ: Modularity and community detection in bipartite networks. Phys Rev E Stat Nonlin Soft Matter Phys 2007, 76(6 Pt 2):066102.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.