Distance-based classifiers as potential diagnostic and prediction tools for human diseases
Typically, gene expression biomarkers are discovered in the course of high-throughput experiments, for example, RNAseq or microarray profiling. Analytic pipelines that extract so-called signatures suffer from the "curse of dimensionality": the number of genes expressed exceeds the number of patients we can enroll in the study and use to train the discriminator algorithm. Hence, problems with the reproducibility of gene signatures are more common than not; when the algorithm is executed using a different training set, the resulting diagnostic signature may turn out to be completely different.
In this paper we propose an alternative novel approach which takes into account quantifiable expression levels of all genes assayed. In our analysis, the cumulative gene expression pattern of an individual patient is represented as a point in the multidimensional space formed by all gene expression profiles assayed in the given system, where the clusters of "normal samples" and "affected samples" are defined. The degree of separation of a given sample from the space occupied by "normal samples" reflects the drift of the sample away from homeostasis in the course of development of the pathophysiological processes that underlie the disease. The outlined approach was validated using the publicly available glioma dataset deposited in Rembrandt and associated with survival data. Additionally, the applicability of the distance analysis to the classification of non-malignant samples was tested using psoriatic lesions and non-lesional matched controls as a model.
Keywords: biomarkers; clustering; human diseases; RNA
Keywords: Normal Sample; Oligodendrogliomas; Homeostatic State; Dimensionality Curse; Affymetrix Human Genome U133A
The typical application of gene expression signatures for diagnosis and prediction of the course of disease is based upon an oversimplified understanding of the pathology. According to this model, there is a gene or a set of genes (say, a "gene expression program") that is "responsible" for a pathophysiological process within a certain tissue or cell type, and this process manifests itself at the organismal level as a disease. If we see this gene over- or underexpressed, or observe concerted changes in the expression of a set of genes, we can diagnose the disease. We can further use the respective levels of over- or underexpression to predict the course of the disease.
While many diseases are well described by this model, some, like many cancers, are not. Often we deal with system-wide changes of the entire gene expression profile that involve many cellular pathways and networks; some changes are related to the pathogenesis, and some are of a compensatory nature. In the case of system-wide changes, the sheer number of genes to be examined prevents unambiguous determination of a group of genes (in more technical language, a linear combination of their expression levels) suitable as a diagnostic signature for the given disease. The problem is not with the procedure per se, but with the typically limited number of already diagnosed patients we can biopsy and enter into the analysis as a training data set. As shown by simulations in , the development of a robust gene signature requires the enrollment of thousands of patients, which is not feasible (see also  for a detailed discussion).
In a nutshell, the problem is that we are trying to profile many different genes simultaneously. In a typical high-throughput experiment assessing the transcriptome, proteome or metabolome, the number of available tissue samples is much smaller than the number of variables. This leads to a high probability of spurious correlations. Indeed, even if the probability that the expression level of one gene shows a spurious correlation with the disease within the given data set is as small as 10⁻³, an analysis of the data from an experiment with 4 × 10⁴ genes will almost certainly produce many false positives (about 40 are expected by chance alone).
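The arithmetic behind this estimate can be checked with a short sketch. It assumes, purely for illustration, that each gene independently has the same probability of a spurious association; the function names are hypothetical and not part of the original analysis.

```python
import math


def expected_false_positives(n_genes: int, p_spurious: float) -> float:
    """Expected number of spuriously associated genes (n * p)."""
    return n_genes * p_spurious


def prob_at_least_one(n_genes: int, p_spurious: float) -> float:
    """P(at least one spurious hit) = 1 - (1 - p)^n, assuming independent genes.

    Computed as -expm1(n * log1p(-p)) to stay accurate for small p.
    """
    return -math.expm1(n_genes * math.log1p(-p_spurious))


n_genes = 40_000   # 4 x 10^4 genes, as in the text
p_spurious = 1e-3  # per-gene probability of a spurious correlation

expected = expected_false_positives(n_genes, p_spurious)
p_any = prob_at_least_one(n_genes, p_spurious)

print(f"expected false positives: {expected:.1f}")
print(f"probability of at least one: {p_any:.6f}")
```

Under the independence assumption the probability of at least one false positive is indistinguishable from 1, which is what "almost certainly" refers to.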
This problem is especially prominent in the analysis of mRNA microarray or RNAseq experiments. When different groups extract diagnostic signatures for the same disease, the resultant sets of genes often have negligible overlap. For example, the tests [4] and [5] for breast carcinoma, with 76 genes and 70 genes respectively, have only 3 genes in common. Even starting from the same data set, one can get different "predictive" panels with minimal overlap. The difference in gene sets extracted using different training sets is not limited to individual genes. The pathways and networks that can be built using independently obtained gene expression signatures are also quite different. These observations cast substantial doubt on the biological relevance of the diagnostic approach that relies on gene signatures.
Expression levels for individual genes and other variables quantified in high-throughput biological experiments are commonly thought of as dimensions of the space on which we are collecting information. Thus the problem outlined above is known as the "curse of dimensionality" [8]. Briefly, in high-dimensional models, the number of parameters (dimensions) p is substantially larger than the sample size n. This property of biological datasets makes the task of distinguishing noise from the true biological signal quite challenging, and it becomes close to impossible to obtain consistent estimation procedures [9, 10]. Hence there is a need to develop either integrative approaches capable of combining data from multiple high-throughput experiments to increase the sample size [9, 10], or statistically sound and robust techniques that reduce the data to the most informative features. As an example of the latter approach, we can try to transform the entire dataset into a limited set of clusters using hierarchical clustering: starting from the definition of a distance between two tissue samples, we proceed by regrouping individual expression profiles to obtain a branched cluster tree. Unfortunately, hierarchical clustering produces plausible-looking trees even when random data points are entered. Hence, extensive data perturbation by resampling is required for the validation of the obtained clustering. Moreover, unsupervised classification techniques are far from robust, as the inclusion of a new patient typically modifies the original clustering.
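The remark about random data can be illustrated with a minimal sketch. The choice of single-linkage agglomeration, Euclidean distance, and the sizes of the toy data set are assumptions made for illustration only; the text does not specify a particular linkage. The procedure returns a complete merge tree for pure noise, just as it would for structured data.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def single_linkage(points):
    """Naive single-linkage agglomerative clustering.

    Returns the merge history as a list of (cluster_a, cluster_b, distance),
    where each cluster is a tuple of indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest minimum pairwise distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical "samples": 12 profiles of 50 uniformly random values each.
rng = random.Random(0)
noise = [[rng.random() for _ in range(50)] for _ in range(12)]

tree = single_linkage(noise)
for left, right, height in tree:
    print(f"merge {left} + {right} at distance {height:.3f}")
```

The algorithm always yields n − 1 merges and a single root, so the existence of a tree says nothing about whether the data contain real groups; this is why resampling-based validation is needed.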
Another popular solution to the dimensionality curse is to use a supervised approach that relies either on the pre-selection of the feature-limiting steps or on pre-filtering the data by the strength of an association of each variable with clinical outcome, or by associations between variables [14, 15]. Unfortunately, a majority of biological data analysts try a variety of data processing techniques before arriving at the final one that seems to be suitable for the dataset in question. Therefore this kind of supervision is inherently biased.
In this paper we propose an alternative novel approach based on the "distances" in the multidimensional space of gene expression values. As a proof-of-principle, we show that this approach produces surprisingly good results in the separation of normal and affected samples, both for the analysis of human malignancies and for chronic progressive conditions like psoriasis.
Multidimensional distances and clustering
The result of an expression experiment for a given sample is a (very long) vector that may be represented as a point in a multidimensional space. If we introduce a distance in this space, we can use the standard clustering techniques [16] to classify the points.
This distance takes into account all correlations in the data. The problem with it is that many data points, i.e. many patients, are required to calculate the correlation matrix and its inverse.
Therefore, in the calculations below we use, as a rule, the Pearson distance (4). Note that since all components of |X| and |Y| are non-negative, this distance is always between 0 and 1.
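Equation (4) itself is not reproduced in this text. The sketch below therefore assumes the uncentered form of the Pearson distance, one minus the cosine of the angle between the two expression vectors, because that is the form for which non-negative components guarantee a value between 0 and 1. The function name is hypothetical.

```python
import math
from typing import Sequence


def uncentered_pearson_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Assumed form of the distance: 1 - (x . y) / (|x| * |y|).

    For vectors with non-negative components the cosine lies in [0, 1],
    so the distance also lies in [0, 1].
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("distance is undefined for an all-zero vector")
    cosine = dot / (norm_x * norm_y)
    # Guard against rounding pushing the cosine slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return 1.0 - cosine


# Proportional profiles are at distance ~0; non-overlapping ones at distance 1.
print(uncentered_pearson_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(uncentered_pearson_distance([1.0, 0.0], [0.0, 1.0]))
```

A consequence of this form is that the distance ignores overall scaling of a profile: two samples whose expression values differ by a constant factor are treated as identical.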
To test for the practical usefulness of the distance-based expression metrics we deployed the following strategy:
1. The datasets were selected from the public MIAME-compliant GEO repository http://www.ncbi.nlm.nih.gov/geo/.
2. For each subset of the samples within the data set (normal tissues, affected tissues, etc.) we calculated the coordinates of the center of the subset as the simple arithmetic mean of the vectors of all the samples in the subset.
3. For each point we calculated the distance to the centers of all subsets.
4. The distance to one center, r1, was plotted against the distance to another center, r2.
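Steps 2 to 4 of this strategy can be sketched as follows. The three-gene toy profiles, the group names, and the helper functions are hypothetical, and the distance is the assumed uncentered Pearson form (one minus the cosine between vectors); real data would come from the repositories named in the text. Plotting is left out: the printed (r1, r2) pairs are the coordinates that would be plotted.

```python
import math


def uncentered_pearson_distance(x, y):
    """Assumed distance: 1 - cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - max(-1.0, min(1.0, dot / norm))


def centroid(samples):
    """Step 2: arithmetic mean of all sample vectors in a subset."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def distances_to_centers(sample, centers):
    """Step 3: distance from one sample to the center of every subset."""
    return {
        name: uncentered_pearson_distance(sample, center)
        for name, center in centers.items()
    }


# Hypothetical toy expression profiles (3 genes per sample).
groups = {
    "normal": [[10.0, 5.0, 1.0], [11.0, 4.0, 1.5], [9.0, 6.0, 0.5]],
    "affected": [[2.0, 5.0, 9.0], [1.0, 6.0, 10.0], [3.0, 4.0, 8.0]],
}

centers = {name: centroid(samples) for name, samples in groups.items()}

# Step 4: (r1, r2) coordinates for every sample.
coords = [
    (label, distances_to_centers(sample, centers))
    for label, samples in groups.items()
    for sample in samples
]
for label, d in coords:
    print(f"{label:8s} r1={d['normal']:.3f} r2={d['affected']:.3f}")
```

In this toy example every sample lies closer to the center of its own group than to the center of the other group, so the two groups fall on opposite sides of the diagonal r1 = r2.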
Expression profiles of primary tumors and their metastases drift away from the homeostatic state
To test the hypothesis that the expression profiles of primary tumors and their metastases drift away from the healthy, homeostatic state, the RNAseq dataset with GEO Series accession number GSE46622 described in [19] was downloaded and reanalyzed. This dataset was generated by RNAseq profiling of matching normal, tumor and metastasis tissues from eight colorectal cancer patients. In that study, adaptor-clipped Illumina Genome Analyser IIx reads were mapped with TopHat to the human genome version GRCh37 (hg19) using transcript models taken from Ensembl v64, followed by determination of differential expression using the Cufflinks software bundle and cuffdiff with upper quartile normalization.
The drifting distances of tumor samples reflect the degree of their relative malignancy
To prove this point, we downloaded data from the publicly available Repository for Molecular Brain Neoplasia Data (Rembrandt) http://caintegrator.nci.nih.gov/rembrandt/, which included data on 21 normal samples, 221 glioblastomas multiforme (GBMs), 145 astrocytomas, 66 oligodendrogliomas and 11 tumors of mixed origin. The raw gene expression CEL files from Affymetrix HGU133 Plus 2.0 arrays were normalized using the robust multi-array average (RMA) method with default parameters.
Distance analysis is applicable to classification of samples collected from patients with non-malignant chronic disease
To date, the quantification of diagnostic and prognostic biomarker molecules in human serum and tissues, including cancer specimens, remains the primary means of enhancing the clinician's ability to diagnose a chronic condition. Importantly, with innumerable molecular markers in development, the discovery of a novel standalone biomarker with acceptable sensitivity and specificity is an extremely rare event.
Here we challenge the biomarker paradigm by developing a distance measure that places each tissue sample, by its entire tissue-wide transcriptome profile, within the space occupied by similarly obtained profiles of samples collected from the same individual or from individuals who do not have the given chronic condition. We hypothesize that the farther an individual sample drifts from its homeostatic state, defined as the center of the space occupied by a large number of reference (normal) samples, the farther the respective tissue is from the well-maintained, healthy state. In our study, we used publicly available datasets to develop an easily interpretable composite measure that is capable of integrating high-throughput transcriptome profiles into a comprehensive, holistic metric describing the molecular homeostasis within a given sample.
The comprehensive distance measures account for the intrinsic heterogeneity of human tumors that plagues high-throughput studies involving this type of biological material, and even for the heterogeneity of the cell types that comprise a given tissue. In particular, the composite biomarker metric, which we call a distance metric, was validated using the well-known Rembrandt glioma dataset associated with survival outcomes.
Importantly, the proposed composite biomarker may be suitable for a dynamic description of a patient's condition. This novel concept allows one to depart from the classical two-bin prediction model (e.g. "bad prognosis/good prognosis"), as it produces a continuous prognosis model in which each sample is located in the neighborhood of other samples analyzed post-hoc and associated with known survival. For each sample, this concept quantitatively describes the degree of "the drift" from the standardized phenotype, which reflects the departure of the body from homeostasis. Under this concept, the effects of each personalized intervention could be evaluated by comparing the distance metrics for samples collected before the treatment and at multiple time-points within the interventional treatment course.
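The proposed monitoring of an intervention can be sketched as follows. The reference center, the time points, and the three-gene profiles are entirely hypothetical, and the distance is the assumed uncentered Pearson form (one minus the cosine between vectors); the sketch only shows how the drift would be tabulated over a treatment course, not a validated clinical procedure.

```python
import math


def uncentered_pearson_distance(x, y):
    """Assumed distance: 1 - cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - max(-1.0, min(1.0, dot / norm))


# Hypothetical center of the reference (normal) samples.
normal_center = [10.0, 5.0, 1.0]

# Hypothetical profiles of one patient before and during treatment.
time_course = {
    "baseline": [3.0, 5.0, 8.0],
    "week 4": [6.0, 5.0, 5.0],
    "week 12": [9.0, 5.0, 2.0],
}

# Drift from homeostasis at each time point.
drift = {
    label: uncentered_pearson_distance(profile, normal_center)
    for label, profile in time_course.items()
}

# Change relative to the pre-treatment sample (negative = moving toward normal).
change = {label: d - drift["baseline"] for label, d in drift.items()}

for label in time_course:
    print(f"{label:9s} drift={drift[label]:.3f} change={change[label]:+.3f}")
```

In this toy course the drift shrinks at each time point, which is the pattern one would expect from a treatment that returns the tissue toward the reference state.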
If proven valid, this concept might be developed into a novel type of integrative test for monitoring disease progress and predicting disease outcomes. The proposed distance analysis has the potential to be versatile in its application, as it is equally applicable to gene expression profiles collected by microarrays and by RNA-seq platforms, as well as, possibly, to proteome and metabolome profiles.
There is no doubt that the proposed computational approach requires further development and optimization. In particular, other types of correlation-based metrics have to be tested for various kinds of multiparametric datasets that comprise simultaneously measured analytes. Future studies should include an analysis of longitudinal experiments that involve either various time points in the course of a therapeutic treatment that ultimately results in the normalization of the pathological condition, or gradual processes detrimental to the experimental system, for example, the development of insulin resistance or ageing.
The distance analysis of molecular portraits is robust and versatile in its application, as it is equally applicable to gene expression profiles collected by microarrays and by RNA-seq. The distance-based continuous predictive models depart from the classical two-bin prediction model (e.g. "bad prognosis/good prognosis") by placing each sample in the neighborhood of other samples analyzed post-hoc and associated with known survival.
We are grateful to Dr. Ganiraju Manyam (MD Anderson Cancer Center, TX, USA) and Prof. Alessandro Giuliani (Istituto Superiore di Sanità, Italy) for the discussions that greatly contributed to the initial stages of the development of the holistic analysis of gene expression and to the concept of the distance metric.
Some calculations were run on ARGO, a research computing cluster provided by the Office of Research Computing at George Mason University, VA, USA.
This project was partially supported by the "Human Proteome" program of the Ministry of Education and Science of the Russian Federation.
The publication charges were covered by Vavilov Institute of General Genetics RA.
This article has been published as part of BMC Genomics Volume 15 Supplement 12, 2014: Selected articles from the IX International Conference on the Bioinformatics of Genome Regulation and Structure\Systems Biology (BGRS\SB-2014): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S12.
- 3. Veytsman B, Baranova A: High-throughput approaches to biomarker discovery and the challenges of subsequent validation. Biomarkers in Disease: Methods, Discoveries and Applications. Edited by: Preedy V. 2014, Springer, New York.
- 4. Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005, 365 (9460): 671-679. 10.1016/S0140-6736(05)17947-1.
- 5. van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002, 415 (6871): 530-536. 10.1038/415530a.
- 8. Bellman RE: Adaptive Control Processes. A Guided Tour. 1961, Princeton University Press, Princeton, NJ.
- 15. Ma S, Dai Y: Principal component analysis based methods in bioinformatics studies. Brief Bioinform. 2011, 12 (6). 10.1093/bib/bbq090.
- 16. Xu R, Wunsch DC II: Clustering. IEEE Series on Computational Intelligence. 2009, IEEE Press; John Wiley & Sons, Hoboken, New Jersey.
- 17. Ungar AA: Barycentric Calculus in Euclidean and Hyperbolic Geometry: a Comparative Introduction. 2010, World Scientific, Singapore; Hackensack, NJ.
- 19. Röhr C, Kerick M, Fischer A, Kuhn A, Kashofer K, Timmermann B, Daskalaki A, Meinel T, Drichel D, Börno ST, Nowka A, Krobitsch S, McHardy AC, Kratsch C, Becker T, Wunderlich A, Barmeyer C, Viertler C, Zatloukal K, Wierling C, Lehrach H, Schweiger MR: High-throughput miRNA and mRNA sequencing of paired colorectal normal, tumor and metastasis tissues and bioinformatic modeling of miRNA-1 therapeutic applications. PLoS One. 2013, 8 (7): e67461. 10.1371/journal.pone.0067461.
- 23. Kim Y-H, Nobusawa S, Mittelbronn M, Paulus W, Brokinkel B, Keyvani K, Sure U, Wrede K, Nakazato Y, Tanaka Y, Vital A, Mariani L, Stawski R, Watanabe T, De Girolami U, Kleihues P, Ohgaki H: Molecular classification of low-grade diffuse gliomas. Am J Pathol. 2010, 177 (6): 2708-2714. 10.2353/ajpath.2010.100680.
- 26. Zaba LC, Suarez-Farinas M, Fuentes-Duculan J, Nograles KE, Guttman-Yassky E, Cardinale I, Lowes MA, Krueger JG: Effective treatment of psoriasis with etanercept is linked to suppression of IL-17 signaling, not immediate response TNF genes. J Allergy & Clinical Immunol. 2009, 124 (5): 1022-1030395. 10.1016/j.jaci.2009.08.046.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.