Sample phenotype clusters in high-density oligonucleotide microarray data sets are revealed using Isomap, a nonlinear algorithm
- 7.6k Downloads
Life processes are determined by the organism's genetic profile and multiple environmental variables. However the interaction between these factors is inherently non-linear . Microarray data is one representation of the nonlinear interactions among genes and genes and environmental factors. Still most microarray studies use linear methods for the interpretation of nonlinear data. In this study, we apply Isomap, a nonlinear method of dimensionality reduction, to analyze three independent large Affymetrix high-density oligonucleotide microarray data sets.
Isomap discovered low-dimensional structures embedded in the Affymetrix microarray data sets. These structures correspond to and help to interpret biological phenomena present in the data. This analysis provides examples of temporal, spatial, and functional processes revealed by the Isomap algorithm. In a spinal cord injury data set, Isomap discovers the three main modalities of the experiment – location and severity of the injury and the time elapsed after the injury. In a multiple tissue data set, Isomap discovers a low-dimensional structure that corresponds to anatomical locations of the source tissues. This model is capable of describing low- and high-resolution differences in the same model, such as kidney-vs.-brain and differences between the nuclei of the amygdala, respectively. In a high-throughput drug screening data set, Isomap discovers the monocytic and granulocytic differentiation of myeloid cells and maps several chemical compounds on the two-dimensional model.
Visualization of Isomap models provides useful tools for exploratory analysis of microarray data sets. In most instances, Isomap models explain more of the variance present in the microarray data than PCA or MDS. Finally, Isomap is a promising new algorithm for class discovery and class prediction in high-density oligonucleotide data sets.
KeywordsSpinal Cord Injury Class Discovery Robust Multichip Average Acute Myeloid Leukemia Sample Monocytic Differentiation
The gene expression microarray is an assay that measures expression levels of tens of thousands of genes in parallel on a single chip. Microarrays can be performed from a very small amount of a biological sample, thus allowing for an experimental design involving many sample groups, repeats, dense time series, and samples collected at high-granularity from various anatomic locations. Today, the cost of microarrays is the principal factor limiting the number of samples that can be examined in a particular experiment. In spite of the high cost of microarrays, two thirds of those surveyed by GenomeWeb said they performed more than 200 microarrays and 57% spent more than $100,000 on microarrays in 2003 . Sixty eight percent of these chips were oligonucleotide arrays, mostly Affymetrix chips. With the widespread use of microarrays in basic research and their increasing use in medical diagnostics, biomedical researchers can anticipate lower costs for chips that will lead to more studies utilizing hundreds, if not thousands, of samples. This expansion in sample size will provide researchers with higher resolution insights into biological processes as they are reflected in temporal, spatial, and functional patterns in microarray data sets. To reveal these patterns, several types of pattern recognition and clustering techniques have been developed and applied to microarray data.
A common task in the analysis of large microarray data sets is sample classification based on gene expression patterns. This process can be divided into two steps: class prediction and class discovery. During class prediction samples are assigned to predefined sample classes; whereas class discovery is the process of establishing new sample classes. For example, when gene expression arrays are used for cancer classification, class prediction assigns tumor samples into pre-existing groups of malignancies, while class discovery reveals previously unknown cancer subtypes . The newly discovered tumor subtypes may have different clinical patterns, respond differently to certain drugs, and require more or less aggressive surgical and radiological treatment. Class discovery may also reveal previously unknown processes in cancer biology and define more specific indications for certain drugs. Specific drugs may be used to target newly discovered tumor subtypes, thus facilitating pharmacogenomic drug design and development. These goals will soon become achievable with the results from microarray studies using large samples. Class prediction and class discovery using large data sets will require the evaluation, adaptation, and development of robust mathematical, statistical, and computational tools.
Several mathematical algorithms and computational methods have been applied to class prediction and class discovery in large gene expression data sets. The methods most frequently used are based on clustering techniques such as hierarchical clustering (HC) . HC was used for temporal classification in conjunction with Fourier analysis to detect genes that correlate with periodic changes in synchronized S. cerevisiae cells . HC was also applied to cancer classification, for example breast cancer classification . Other clustering techniques applied to microarray data are the unstructured k-means clustering , cluster affinity search (CAST) , fuzzy c-means clustering , and two-way clustering  that was used for the analysis of drug-tumor interactions . Self-organizing maps (SOM) is another technique that is particularly well suited for exploratory data analysis. Unlike HC, SOM does not impose a rigid structure to the data . The utility of SOM was demonstrated in leukemia classification using a weighted voting procedure . Weighted voting classification was also used for predicting chemosensitivity of the NCI-60 tumor collection  and human breast cancers . Supervised methods, such as Fisher's linear discriminant analysis, artificial neural networks (ANN), support vector machines (SVM), and boosting, have the advantage that the sample classes are usually defined by another "gold standard" method (e.g., histology, clinical outcome, length of survival, etc). In supervised classification, the choice of classifiers is frequently based on other considerations, e.g., genes that play a role in the pathomechanism of a certain disease or are expressed in a particular tissue. These "enrichment" methods can improve the prediction strength of a classification and decrease the sample number necessary for developing a prediction model. However they also introduce bias that may lead to overtraining or lack of discovery of unexpected sample classes. One of the supervised methods, support vector machines (SVM), has the advantage that it does not make assumptions about the distribution of the data . SVM was tested on ovarian cancer, leukemia, and colon tumor data sets  and was also demonstrated to be useful for multi-class cancer classification [16, 17, 18]. Artificial neural networks (ANN), another machine learning method, was shown to classify small, round blue-cell tumors (SRBCT) . Tree harvesting, a new method of supervised learning was recently applied for gene expression data . In contrast to these complex procedures, much simpler classifiers may also perform equally well on some data sets. For example, nearest shrunken centroids were applied to SRBCT cells and leukemias .
Dimensionality reduction methods are useful in predicting the underlying true dimensionality of a microarray data set and reduce the number of variables applied as inputs to any of the classification procedures. Multi-dimensional scaling (MDS), a linear method, was used for classifying alveolar rhabdomyosarcomas , cutaneous malignant melanomas , and breast cancers . Principal component analysis (PCA), another dimensionality reduction method, was used for visualizing gene expression maps of central nervous system (CNS) development  and classifying embryonal CNS tumors . Probabilistic PCA, a method incorporating biological assumptions in linear factor models, was recently applied to two yeast microarray datasets . Another method, singular value decomposition (SVD), was applied to soft tissue tumors , S. cerevisiae cell cycle, and serum-induced fibroblast data sets [29, 30]. Generalized SVD (GSVD) was developed to extract conserved gene expression patterns comparable between two different organisms . All these linear methods are inherently sensitive to outliers, missing values, and non-normal distribution. A variant of SVD, robust SVD (rSVD), was recently developed to minimize the effect of these corruptions in the data set . Sammon mapping, a nonlinear mapping algorithm  was incorporated in the R multiv package as well as gene expression data processing and exploratory analysis software, such as ENGENE . To address similar issues, Isomap  a nonlinear technique of dimensionality reduction originally designed for solving classical problems of pattern recognition, such as visual perception and handwriting recognition, was applied recently to discovery biologically relevant structures in cDNA microarrays [36, 37].
We have applied Isomap to the analysis of both breast cancer microarray data sets and prostate cancer proteomics spectra  and showed that it consistently outperformed PCA in revealing biologically relevant low-dimensional structures in high-dimensional data sets. Nilsson et al. independently demonstrated the utility of Isomap using a lymphoma and a lung adenocarcinoma cDNA microarray data set . We report here the application of Isomap to three independent Affymetrix GeneChip® data sets. We show that Isomap is capable of discovering temporally, spatially, and functionally relevant structures in gene expression data. To avoid any kind of bias that may be introduced with gene selection or feature enrichment methods, we did not apply any data scrubbing, filtering, or feature enrichment techniques. We also show that Isomap can successfully detect biologically relevant structures even in the background noise of tens of thousands of genes present.
Spinal cord injury data set
The Isomap algorithm was first evaluated on a large data set consisting of 170 rat U34A high density oligonucleotide arrays with 8,799 genes on each array . These data were originally collected for a study on spinal cord injury that illustrated the role of cell cycle in trauma-induced neuronal death . Compared to earlier microarray studies on experimental spinal cord damage, this study applied a lower level spinal cord injury, used individual rat samples rather than pooled spinal cord tissues, evaluated several time points, employed larger Affymetrix arrays, and used both sham-injured and naïve controls.
Unlike this original study, we do not make the assumption that only those genes "consistently expressed above background" are important for further consideration. The DiGiovanni et al. study used a stringent inclusion threshold that included only those genes present in at least 40% of all the samples. In addition to this first filter, a second filter was also applied that eliminated genes that did not have a change in expression level of at least two-fold compared to that of the sham controls, which was determined with Welch ANOVA t-test. Unlike the DiGiovanni et al. study, our analysis considers all 8,799 genes, thus eliminating an important source of potential selection bias. We do not use stringent selection criteria at the level of data scrubbing, because these filters may introduce confounding into the data set, which can lead to separation between sample classes based on subjective filtering criteria rather than existing biological phenomena.
Figure 2 shows the 170 samples in a three-dimensional model that was generated by Isomap. Samples are classified and colored according to one of three major attributes – time, location, and severity of the injury. The sphere diameters express the compactness of the sample classes; the more compact the class, the smaller the diameter of the sphere. If members of a class spread out in a larger space then larger size spheres are displayed. The sphere diameter is determined as the 95 percentile of the distance distribution from the nearest neighbor within the class. The Isomap model successfully clusters similar samples into groups that are at well-defined locations of the three-dimensional model. The least affected sham-operated samples are shown at the very central core location of the model. By contrast, the most affected samples after moderate to severe injury at 24 h (that is, the most active phase of spinal cord injury) are shown at the peripheries of the model.
The location panel (Fig. 2A) shows that samples from regions below and above the injury are at the central core of the model. At the same time, samples from the injury itself are displayed in the peripheral lobes of the model. There is no visible separation between the unaffected samples and those originating from locations other than the injury itself.
The time panel (Fig. 2B) shows a clear-cut time-dependent separation of the samples along the x-axis. The right-side lobe of the model shows samples at 30 min, 4 h, and 24 h after injury. The earlier samples are at the core of the model. Conversely, samples from later time points are found at more distal locations in the right-side lobe of the model. This time-pattern is in agreement with Di Giovanni et al.  who found with temporal clustering that the immediate early genes are over-expressed at 30 min after spinal cord injury. This is followed by genes associated with inflammatory and oxidative stress plateauing at 4 h after injury, as well as cell cycle and neuronal apoptosis-regulating genes at 4–24 h. Isomap effectively separates this early active phase of post-injury damage from the later phase of regeneration that takes place at 48 h – 28 days. Samples from this later regenerative phase, at time points of 2, 3, 7, 14, and 28 days, are displayed in the left-side lobe of the Isomap model. During regeneration, the earlier samples are shown in the more peripheral locations and the later samples are closer to the central core of the model. It is noteworthy that some samples are at the peripheral locations of the model even 28 days after the injury, which is explained by incomplete regeneration and permanent damage caused by a more severe spinal cord injury as will be demonstrated later. In the time panel, Isomap shows the most striking separation between the early and later phases of post-injury events. The right-side lobe of the model contains samples from the early 30 min – 24 h phase of the post-injury damage. In these samples, the dominant events are the spinal cord injury, the secondary biochemical changes, and the endogenous autodestructive events. On the other hand, the left-side lobe of the Isomap model contains samples from the later 48 h-28 day phase when the neuroprotective and recovery promoting phenomena overcome the earlier autodestructive events.
The severity panel (Fig. 2C) shows that samples taken after incremental levels of spinal cord injury appear at increasingly distal locations in the Isomap model. The sham-operated samples are located in the central core of the model surrounded by a shell of samples from rats with mild spinal cord injury. The most peripheral samples in both lobes of the model are those representing moderate to severe injury. In the right lobe of the model, which represents the 30 min – 24 h time points, there is no clear separation between moderate injury samples and those with severe injury. Unexpectedly, in the left lobe that represents the later time points, the samples with moderate injury are more distal than those with severe injury. Although this separation between the moderate and severe injury was not anticipated, the separation between mild and moderate to severe injury is very clear. This separation also answers a question we raised on the time panel. In the left lobe of the model, six samples are visible at 28 days after injury. Three of these samples are in a cluster that is closer to the periphery and the other three samples are closer to the core of the model. The severity panel clearly shows that the three distal samples are those that underwent moderate injury and the more central three samples are those with only mild injury. This finding is consistent with our interpretation that the left lobe of the model represents the regeneration process. While three rats were able to partially recover from a mild injury after 28 days, the other three animals with moderate spinal cord injury did not recover by this time.
Rat multiple tissue gene expression data set
Isomap was applied to another data set containing 122 samples from 11 peripheral tissues and 15 brain regions from three common outbred rat strains, such as Wistar, Wistar Kyoto, and Sprague Dawley. The original study was performed to characterize physiological expression levels in these anatomical locations, to find genes that are important in certain brain regions, and to explain the phenotypic variations between rat strains . This data set consists of 122 Affymetrix rat U34A arrays with 8,799 genes on each array.
Using Affymetrix MAS5, Walker et al. applied several filters on the genes, such as t-test p-value less than 0.05, at least two-fold change of expression levels, and minimum expression thresholds. Additionally, they used Rosetta Resolver with two filters such as Resolver ANOVA p-value less than 0.05 and at least two-fold change of expression levels. Our Isomap analysis of this data set used all genes to compute sample-to-sample differences without the use of input filters. Although our approach may increase the noise in the data set, it also eliminates any potential source of selection bias. We also avoided using any tissue-enrichment techniques based on a priori knowledge. Isomap analysis was carried out in a completely unsupervised fashion.
We were surprised by the ability of Isomap to sort several rat tissue samples in a fashion that reflects the topological anatomy of the source tissues (Fig. 2A). It is apparent in this figure that duplicate samples from the same tissue are displayed at almost identical locations on the Isomap map. For example, on the top left side of the figure, duplicates from small intestine, large intestine, and endothelial samples are presented as three dots. These three tissues are displayed close to each other, which is expected since the primary components of all three are epithelial cells. Nearby is displayed the kidney, a large part of which is also derived from epithelial cells. Below these samples on the mid left part of the figure close to each other are the spleen and thymus samples, two organs of the immune system. On their right is the bone marrow which is the primary site of blood cell generation and therefore functionally related to the spleen and thymus. On the bottom left part of the figure, close to each other are the two major muscle tissues, such as the skeletal muscle and the heart muscle. At the bottom of the figure is the cornea that is an ectodermal tissue close to the brain samples that are also ectodermal in origin and farther away from the endodermal and mesodermal tissues mentioned above. The two endocrine glands of the brain, such as the pineal and pituitary glands, are somewhat separate from the other parts of the brain.
Figure 3E is a two-dimensional Isomap model of another enlarged region of panel A. This panel shows the neighborhood graphs used for building the Isomap model. On the left side of this panel is represented the prefrontal cortex (PFC) that is connected to the temporal/parietal/occipital cortex (cortex minus frontal cortex, CMFC). At the bottom of this panel are the samples from the frontal cortex (FC) below two hypothalamus (H) samples. Above these are samples from the amygdala (A), the amygdala central nucleus (ACN), which is a part of the amygdala, the ventral striatum (VS), and the dorsal striatum (DS). Not circled are two samples from the Fisher dorsal root ganglia (DRGF). Figure 3E also shows that most of the amygdala samples are close to the ACN. Unexpectedly, two other amygdala samples are in a very different location showed on panel D close to the VTA. Similar to the amygdala, the hypothalamus (H) is also shown at two different locations; two hypothalamus samples are shown in panel E below the amygdala, and two other samples are in panel D close to the dorsal raphe (DR).
Isomap analysis projects the hypothalamic samples onto two different well-defined locations in the Isomap model. This separation of the hypothalamus samples may reflect an unequal contribution of the different hypothalamic regions to the four hypothalamic samples and provides an example of class discovery. The hypothalamus has several anatomic regions, one of which is the periventricular hypothalamus that is functionally related to the pituitary as well as to the autonomic areas in the brain stem and the spinal cord. Another hypothalamic region, the medial hypothalamus has several connections with the medial division of the amygdala. The third hypothalamic region, the lateral preoptic hypothalamus, has a very complex anatomy with many fibers passing through this region. These subclasses may be explained by spatial or temporal differences. The hypothalamus samples on panel D close to the pituitary samples may originate from samples that are rich in periventricular hypothalamus. On the other hand, samples on panel E close to the amygdala samples may be richer in medial hypothalamus. The separation of the hypothalamus samples into two well defined classes may be explained by not only anatomical but temporal differences. Many of the hypothalamic nuclei synthesize several neurotransmitters and hormones, such as corticotropin-releasing hormone (CRH) and other related releasing hormones. The levels of these molecules and the activity of particular hypothalamic neurons may vary by time depending on the circadian rhythm, stress, food intake, emotional state, estrus cycle, and other factors. The two hypothalamic subclasses discovered by Isomap may be different because of these temporal variations.
Most of the amygdala (A) samples are projected on the Isomap map at a location between the cortical regions and the ventral striatum (Fig. 3E). Two samples, on the other hand, are projected at the ventral tegmental area (VTA) and the nucleus accumbens (panel E). These two amygdala samples are clearly different from the rest of the amygdala samples. This separation is consistent with anatomical dissimilarities between different parts of the amygdala. The largest portion of the amygdala, the basolateral nuclear complex, primarily consists of pyramidal and stellate neurons similar to those in the cerebral cortex. In fact, Isomap found most of the amygdala samples in the proximity of the cerebral cortex. Another part of the amygdala, the centromedial group, including the central nucleus (ACN), is connected to the bed nucleus through fibers called the stria terminalis. Although the bed nucleus is anatomically closer to the hypothalamus, it is histologically very similar to the amygdala. This separation demonstrates a potential class discovery by Isomap, where the discovered classes are in agreement with anatomically, histologically, and functionally well defined structures within the hypothalamus and the amygdala. Some of the hypothalamus and amygdala samples potentially contain more or less tissue from different parts of these anatomically heterogeneous brain structures, which leads to different classifications based on the major constituent.
As a comparison, we analyzed the multiple tissue dataset also with Principal Component Analysis (PCA) (Fig. 3C). Unlike Isomap, PCA was unable to separate the skeletal muscle, kidney, and bone marrow samples from each other (arrow 1). With PCA, the thymus and spleen samples overlap the nucleus accumbens samples (arrow 2); and the small intestine (arrow 3) and large intestine (arrow4) samples are mapped at a distance from each other. This comparison demonstrates that Isomap outperforms PCA in mapping the different tissue samples on a 2-dimensional space. This difference in performance of the two methods is also supported by the higher residual variance of the PCA model compared to that of the Isomap model (Fig. 3B).
High throughput drug screening data set
Chemical compounds and their abbreviations used in the high throughput screening dataset
all trans retinoic acid
(-) scopolamine methyl bromide
Cells and their abbreviations included in the high throughput screening dataset
primary patient acute myeloid leukemia cells
primary acute promyelocytic leukemia cells
HL-60 cell line
normal human monocytes
normal human neutrophils
Figure 4B, an enlarged portion of panel A, focuses on the monocytic end of the Y distribution showing HL-60 cells treated with several chemical compounds. These compounds, Apo, Keto, Cyc7p5, Sulma, Erythro, DMSO, Methyl, Phen, EGFR, and PMA (Table 1) change the gene expression pattern of HL-60 cells to become more or less similar to normal monocytes; therefore these compounds are potential inducers of monocytic differentiation. ATRA is displayed on the left side of this panel and represents an effect that is more similar to granulocytic than monocytic differentiation.
Figure 4C, another enlarged portion of panel A, shows an area of the Y-distribution around the normal granulocytes (poly). In addition to the granulocytes, this area also contains duplicates of APL samples before and after several drug treatments. The effect of drugs on APL, as measured by the change of gene expression levels, is less compared to the effect of the same drugs on the HL-60 cells. The result of Keto- and Phen-treatment on APL cells is different from that of ATRA; while EGFR and Erythro causes only a minor change in the expression pattern of APL cells.
Figure 4D zooms in the area of the Y-distribution that is surrounding the undifferentiated HL-60 cells (undiff) at the third end of the Y. Two compounds, 5-FU, and 5-FUD, cause a change in the gene expression pattern that is different from monocytic and granulocytic differentiation. The Scop- and Caff-treated samples are very similar to each other and represent only a small change from the untreated HL-60 cells. The effect of two other compounds, Apo and Perg on HL-60 cells is unique and cannot be classified as only monocytic or granulocytic. At the same time, VitD is displayed close to the ATRA-treated samples in the left upper corner of panel D, which suggests that calcitriol's effect on HL-60 cells is similar to that of ATRA. This functional similarity is supported by the known similarities of action between these two compounds. Both chemicals bind to nuclear receptors, VitD binds to vitamin D receptors (VDR) and ATRA binds to retinoic acid receptors (RAR). Both of these receptors heterodimerize with retinoid X receptor (RXR) and act as hormone-dependent transcription factors.
Stegmaier et al. computed "monocyte and neutrophil scores" based on four parameters: interleukin 1 receptor antagonist (IL1RN) and secreted phosphoprotein 1 (SPP1) for the monocyte signature genes and autosomal chronic granulomatous disease protein (NCF1) and orosomucoid 1 (ORM1) for the neutrophil signature genes. In a supplementary table, these four values were presented for undifferentiated HL-60 cells and cells treated with one of 16 compounds. In Figure 4E, we present a two dimensional Isomap model built only from these four parameters. Arrows point to the directions of the granulocytic (left) and monocytic (right) differentiation. It is noteworthy that our unbiased model, using the expression levels of all the 18,400 transcripts and variants as input, offers a very similar result to the model using only four signature genes. In both models, the monocytic and granulocytic differentiations are revealed and the various chemical treatments cause similar changes relative to the two main dimensions. In both maps, Caff and Scop are proximal and represent only a small change compared to the undifferentiated HL-60 cells (undiff). Perg, 5-FU, and 5-FUD cause more change and Erythro, Apo, EGFR, Phen, ATRA, and VitD are all mapped in similar relative locations in panel A and E. The only major difference between the two models is in the location of Sulma and Methyl. However, in both cases these two compounds are mapped close to each other. All these similarities underscore Isomap's ability to recognize treatment-related changes in gene expression patterns even when no a priori information is available about which genes are good predictors of a biological process or difference. Isomap can detect changes in gene expression signals in the background noise of tens of thousands of other genes.
In this study, we use three Affymetrix high density oligonucleotide microarray data sets to demonstrate Isomap's ability to discover relevant structures in an unbiased fashion. In the rat multiple tissue gene expression data set, the structures revealed correspond to the anatomical topology of the source organs and tissues . Similarly, in the high-throughput drug screening data set, the structures match the monocytic and granulocytic differentiation patterns of myeloid leukemia cells treated with various chemical compounds . We also demonstrate with a complex spinal cord injury data set that Isomap reveals the three main modalities (e.g., location, time, and severity of the injury) of the experimental design . This Isomap model (Fig. 2) is superior to hierarchical clustering (Fig. 1) because Isomap can model structures of higher dimensionality while hierarchical clustering is limited to the discovery of only one-dimensional structures. Clustering is limited in two significant ways; it requires well-separated data and linear correlation. The only relationship clustering can detect is a one-on-one relationship when pair-wise linear comparisons are made . Isomap is a nonlinear algorithm capable of analyzing microarray data sets that are nonlinear in nature. The sources of these nonlinearities may originate from outliers, missing values, and non-normal distribution; all common events for measurements in biological systems.
In all of our examples, Isomap is applied to the expression level values of all of the genes in the microarray data set. To avoid any kind of selection bias, no filtering or data scrubbing procedures were applied. Even under these conditions, we demonstrate that Isomap is capable of finding the biologically relevant structures within noisy data sets. The experimental noise may come from the measurement error and the variations of expression levels of tens thousands of genes. When Isomap is used in production mode, the experimental noise may be decreased by eliminating the genes that are not expressed in the studied tissues and those which have very stable and unchanging expression levels in all of the samples. These procedures will further improve the resolution of the Isomap models and facilitate class discovery by Isomap. The classification power of Isomap may be improved when it is used in combination with enrichment procedures that weight the gene expression levels of different genes dependent on other surrogate information coming from the knowledge of chemical pathways in which the gene plays a role, the tissues in which the gene is expressed , and the gene expression patterns that are evolutionarily conserved between different organisms [39, 43].
The structures discovered by Isomap help to understand biological phenomena underlying these structures. Visualization tools of Isomap models may be applied as data mining tools for microarray data sets. Similar tools were used for modeling co-regulated genes in C. elegans with VxInsight  and generating high-resolution temporal models of CNS development . Moreover, Isomap is a promising tool of unbiased class discovery because it is performed in a completely unsupervised manner. This is a key advantage of the Isomap algorithm considering the limited number of class discovery tools [23, 45, 46] compared to the abundance of class prediction methods. The Isomap algorithm is capable of revealing structures in microarray data sets, which may provide insight into underlying biological networks [4, 5, 7, 22, 25, 29, 30, 42, 43, 47].
Another application of the Isomap algorithm is predicting class membership. We previously demonstrated the use of Isomap for class prediction in a breast cancer outcome microarray data set . The selection of good classifiers that are robust, insensitive to outliers, medically interpretable, having high generalization power, and based on the smallest possible subset with the maximal discriminatory features  is important and should be evaluated for Isomap in the future. Aclass prediction model using the Isomap algorithm does not need to use all the genes in the microarray. Somorjai et al. argue that no matter what feature selection approach is used to the microarray data, generally 50 or more features are needed for classification . In this paper, we show that only four genes are used as input for an Isomap model (Fig. 4E). Isomap is a type of dimensionality reduction method and may be evaluated as input to supervised machine learning techniques, such as ANN and SVM. Besides gene expression microarrays, Isomap can be used on other types of biological data, such as genomic, proteomic, and metabolomic data sets to reveal low dimensional structures related to diet-genome interactions, genotype-disease associations, and drug-gene-disease relationships.
Isomap, a nonlinear dimensionality reduction algorithm, discovers low-dimensional structures embedded in high-dimensional Affymetrix high-density oligonucleotide microarray data sets. These structures correspond to and help to interpret underlying biological phenomena present in these data. Our work provides examples of three experiments with temporal, spatial, and functional processes revealed by the Isomap algorithm. Visualization of Isomap models helps to understand these processes and provides means of data mining from gene expression data sets. The low-dimensional models generated by Isomap help to reveal new sample classes and are potentially useful for class prediction. In summary, Isomap is a promising new algorithm for the unbiased analysis of high-density oligonucleotide microarray data sets.
Spinal cord injury data set: A large data set consisting of 170 Affymetrix rat U34A high density oligonucleotide arrays with 8,799 genes on each array was accessed at the Gene Expression Omnibus (GEO) GSE 464. Rat multiple tissue data set: Another data set consisting of 122 Affymetrix rat U34A arrays with 8,799 genes on each array was accessed at the GEO GSE 952. High throughput screening data set: A gene expression data set based on high throughput screening (GE-HST) was accessed at the GEO GSE 995. This series is a combination of three other series: GSE 976, 982, and 985. The accessed data set contains 86 human genomic U133A Affymetrix arrays with 18,400 transcripts and variants on each array and 30 human 6800 arrays with 7,129 genes on each array.
Probe-level microarray data analysis
All microarray data sets were downloaded to a local computer in Affymetrix CEL file format. Some of these files were text files, others were binary CEL files. Probe-level data analysis was carried out with Bioconductor 1.3.28 libraries  in the R 1.8.0 environment on a dual processor PC with RedHat 8.0 operating system installed. Raw data were first normalized with quantile normalization; background correction and gene expression levels were then computed with Robust Multichip Average (RMA) .
Gene expression levels were expressed as RMA values and imported into Gene Cluster 3.0 . Genes were centered to the mean and filtered with the criteria of genes with expression levels ≥ 1.0 in at least 10% of the samples. Since the RMA values are base-two logarithm-converted values, this statement is equivalent to a requirement that the expression levels of the selected genes be ≥ 2-fold or ≤ 1/2 of the mean expression level of that gene in at least 10% of the samples. In case of the spinal cord injury data set, 397 of 8,799 genes passed these filtering criteria. The selected genes and all the samples were clustered using centroid linkage hierarchical clustering based on uncentered correlation similarity metric. Branches of the clustering trees were then folded using SOM and visualized using Java TreeView 1.0.1. The result of this analysis with the spinal cord injury data set is presented in Figure 1.
Isomap analysis of the microarray data sets
Gene expression levels of all genes were expressed as RMA values and the result table was imported into the Matlab environment. Sample-to-sample differences between all pairs of genes were expressed as Euclidean distances of RMA values. Isomap was carried out in Matlab with the algorithm provided by Joshua Tenenbaum et al. . We used the nearest neighbor method with k = 2, 3, or 4 dependent on which value of k generated the minimal residual variance. Other data processing and visualization steps with the Isomap models were carried out with custom written Matlab functions. Source code is presented in Additional Files 5, 6, 7, 8.
Isomap analysis of the monocytic and neutrophilic differentiation markers
Stegmaier et al. computed "monocyte and neutrophil scores" based on four parameters: interleukin 1 receptor antagonist (IL1RN) and secreted phosphoprotein 1 (SPP1) for the monocyte signature genes and autosomal chronic granulomatous disease protein (NCF1) and orosomucoid 1 (ORM1) for the neutrophil signature genes . In a supplementary table of their paper, these four variables were presented for undifferentiated HL-60 cells as well as for cells treated with one of 16 chemical compounds. In this study, we expressed sample-to-sample differences as Euclidean distances of the base-two logarithm-converted mean values of the four marker genes. An Isomap model using the nearest neighbor method with k = 2 was performed with the algorithm provided by Joshua Tenenbaum . All other data processing and visualization steps were carried out with custom written Matlab functions. The result of this analysis is presented in Figure 4E.
This work was funded by grant (P60MD00222) from the NIH to the NCMHD Center of Excellence in Nutritional Genomics.
- 2.Arrays Come of Age Genome Technology 2004, 42: 38–39.Google Scholar
- 3.Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286: 531–537. 10.1126/science.286.5439.531CrossRefPubMedGoogle Scholar
- 6.Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Zhu SX, Lonning PE, Borresen-Dale AL, Brown PO, Botstein D: Molecular portraits of human breast tumours. Nature 2000, 406: 747–752. 10.1038/35021093CrossRefPubMedGoogle Scholar
- 10.Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A 1999, 96: 6745–6750. 10.1073/pnas.96.12.6745PubMedCentralCrossRefPubMedGoogle Scholar
- 11.Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Tanabe L, Kohn KW, Reinhold WC, Myers TG, Andrews DT, Scudiero DA, Eisen MB, Sausville EA, Pommier Y, Botstein D, Brown PO, Weinstein JN: A gene expression database for the molecular pharmacology of cancer. Nat Genet 2000, 24: 236–244. 10.1038/73439CrossRefPubMedGoogle Scholar
- 12.Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 1999, 96: 2907–2912. 10.1073/pnas.96.6.2907PubMedCentralCrossRefPubMedGoogle Scholar
- 13.Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J, Scherf U, Lee JK, Reinhold WO, Weinstein JN, Mesirov JP, Lander ES, Golub TR: Chemosensitivity prediction by transcriptional profiling. Proc Natl Acad Sci U S A 2001, 98: 10787–10792. 10.1073/pnas.191368598PubMedCentralCrossRefPubMedGoogle Scholar
- 16.Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, Poggio T, Gerald W, Loda M, Lander ES, Golub TR: Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci U S A 2001, 98: 15149–15154. 10.1073/pnas.211566398PubMedCentralCrossRefPubMedGoogle Scholar
- 19.Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001, 7: 673–679. 10.1038/89044PubMedCentralCrossRefPubMedGoogle Scholar
- 23.Bittner M, Meltzer P, Chen Y, Jiang Y, Seftor E, Hendrix M, Radmacher M, Simon R, Yakhini Z, Ben-Dor A, Sampas N, Dougherty E, Wang E, Marincola F, Gooden C, Lueders J, Glatfelter A, Pollock P, Carpten J, Gillanders E, Leja D, Dietrich K, Beaudry C, Berens M, Alberts D, Sondak V: Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 2000, 406: 536–540. 10.1038/35020115CrossRefPubMedGoogle Scholar
- 26.Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, Kim JY, Goumnerova LC, Black PM, Lau C, Allen JC, Zagzag D, Olson JM, Curran T, Wetmore C, Biegel JA, Poggio T, Mukherjee S, Rifkin R, Califano A, Stolovitzky G, Louis DN, Mesirov JP, Lander ES, Golub TR: Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 2002, 415: 436–442. 10.1038/415436aCrossRefPubMedGoogle Scholar
- 28.Nielsen TO, West RB, Linn SC, Alter O, Knowling MA, O'Connell JX, Zhu S, Fero M, Sherlock G, Pollack JR, Brown PO, Botstein D, van de Rijn M: Molecular characterisation of soft tissue tumours: a gene expression study. Lancet 2002, 359: 1301–1307. 10.1016/S0140-6736(02)08270-3CrossRefPubMedGoogle Scholar
- 36.Dawson K, Rodriguez RL, Malyj WV: Gene Expression and Proteomics Data Analysis in Cancer Diagnosis and Prognosis Using Isomap, a Non-Linear Algorithm. Breast Cancer Research 2003., 5:Google Scholar
- 46.Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson JJ, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000, 403: 503–511. 10.1038/35000501CrossRefPubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.