Introduction

Breast cancer kills nearly half a million people worldwide each year, and the vast majority of these deaths are due to distant metastases in the lung, liver, and brain [1]. Numerous studies have focused on breast cancer metastases and how they might differ from primary breast tumors; however, controversy remains regarding (A) the predisposition of specific classes of breast tumors to spread to distant sites and (B) the degree of similarity between primary breast tumors and their associated metastases.

Estrogen receptor (ER) status is known to be associated with breast cancer relapse in specific organs [2]. In 2008, this organ selectivity was refined by contrasting relapse patterns in 344 patients who had their tumors genomically subtyped as luminal A or B, HER2-enriched, basal-like, or normal-like [3]. In general, bone metastases were associated with the luminal subtypes, whereas basal-like and HER2-enriched tumors were significantly associated with brain and lung relapse. Similar results were also observed in an immunohistochemistry-based study on 3,726 patients [4]. Recently, a new breast cancer subtype was identified, named claudin-low [5–7]. This subtype exhibits aggressive characteristics including expression of mesenchymal markers and low expression of genes involved in tight junctions and cell–cell adhesion. The lack of epithelial cell features and the expression of mesenchymal traits are reminiscent of features associated with breast stem cells [8]. Since breast cancer stem cells are relatively resistant to both chemotherapy and radiation [9, 10], and because metastases frequently progress despite treatment, it is important to determine if these claudin-low/mesenchymal cells are associated with metastatic potential.

To better understand the biology driving breast cancer metastases, 1,319 human gene expression microarrays from primary tumors, metastases, and cancer cell lines were analyzed here. Tumors and their associated metastases, on average, were much more similar to each other than they were different. By including the recently defined claudin-low subtype we extend previous findings [3, 4] and better define the metastatic predilections of each intrinsic subtype. Increasingly “undifferentiated” breast cancer cells [as quantitatively measured by a Differentiation Score predictor (DS)] tend to express stem cell signatures and preferentially metastasize to the brain and lung. These results identify that breast cancer intrinsic subtype is maintained throughout disease progression, and that a combination of several genomic signatures can add prognostic value and therefore direct where disease monitoring should be focused.

Results

Genetic similarity among tumors and metastases

Previously, we examined the genome-wide gene expression profiles of five primary breast tumor/matched metastatic pairs and noted an overall high degree of similarity within a pair [11]. To further examine the degree of relatedness of breast tumors and their metastases, we performed correlation analysis using thousands of genes, and hundreds of pre-defined gene expression signatures/modules [12] incorporating a large set of tumors and paired metastases. Intra-class correlation (ICC) values were determined between pairs of samples using multiple classification/grouping methods: (1) different pieces of the same primary tumor (“intrinsic pairs”), (2) tumors and their matched metastases [all metastases, or further separated into either lymph node (LN) or distant], (3) tumors and their matched metachronous metastases, (4) sets of synchronous metastases from the same patient, (5) tumors from different patients grouped by intrinsic subtype, and (6) metastases from different patients (Fig. 1a). On average when using all expressed genes, there was high concordance between two pieces of the same primary tumor (ICC = 0.9 [0.89–0.91]), while pairs of tumors and their metastases exhibit lower concordance values (0.82 [0.8–0.83]). As observed by the metachronously paired tumor-metastasis samples, gene expression did not change substantially over time. The autopsy patient data (0.72 [0.68–0.75]) suggest that normal organ RNA may be the variable most responsible for the decreased similarity between tumor and metastasis pairs. This hypothesis was supported by increased ICC values of 20 matched pairs of laser-captured tumors and LN metastases [13] (0.9 [0.85–0.94]).

Fig. 1

Genomic similarity of breast tumors and metastases. Microarrays were performed on 265 primary tumors and 85 metastases and the overall similarity was measured by intra-class correlation (ICC), with estimates plotted showing 95% confidence intervals. a Using all variably expressed genes, gene expression concordance values were measured in matched samples from the same patient; primary tumors split in 2 (n = 40), tumor-metastasis pairs (n = 34), tumor–LN metastasis pairs (n = 24), tumor-distant metastasis (n = 10), autopsy patient metastases from multiple organs within the same patient (n = 33), metachronous tumor–metastasis pairs (n = 10), or from independent patient samples; normal breast (n = 17), luminal A tumors (n = 86), luminal B tumors (n = 50), HER2-enriched tumors (n = 25), basal-like tumors (n = 44), claudin-low tumors (n = 45), LN metastases (n = 21), and distant metastases (n = 45). b ICC of 298 gene expression signatures/modules [12] using the same samples and pairing used in (a)

Individual gene measurements can be fraught with “noise.” Thus, to further test the relationship between tumors and metastases, ICC values were identified using a compendium of 298 different gene expression signatures/modules [12], where each module is a summary measure of tens to hundreds of genes. The overall ICC values were higher than individual genes (thus showing greater robustness for gene signatures) and the breast tumor–metastasis pairs showed high conservation of pathways (Fig. 1b). The signatures with the most variability between tumors and matched metastases were associated with extracellular matrix (ECM) proteins. These genes may be microenvironment-induced or may be due to different amounts of fibroblasts found in tumors as compared to metastases (Supplemental Table 1).

Association of subtypes and sites of metastasis

Since the majority of genes maintain their RNA expression levels when growing as either primary tumors in the breast or as metastases, we sought to determine if the different intrinsic subtypes showed a predilection for metastasis to specific organs using genomic data arising from primary tumors only. Therefore, we combined four public microarray datasets with Distance Weighted Discrimination [14], providing 855 tumors with documented first site of relapse (Supplemental Table 2) [15–18]. Principal components analysis found that the overall variation of gene expression was due to the biology of the tumors, and not to cohort/source or microarray platform (Supplemental Fig. 1). Status for ER, progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) was recorded for 852, 537, and 499 tumors, respectively, and of the 482 tumors with defined status for all three markers, 110 were triple negative (TN); Kaplan–Meier analyses for site of relapse with these markers are shown in Supplemental Fig. 2. For all sites of relapse, ER/PR negativity was associated with increased metastases, except for bone, in which both ER+ and ER− tumors recurred. Clinical HER2+ and TN status were associated with liver and brain/lung relapse, respectively.

Next, each tumor’s intrinsic subtype was calculated for this combined data set using the PAM50 [19] and the claudin-low subtype predictors [6] (Supplemental Table 3). Of the 855 tumors, 76 were identified as normal breast-like, and since this tumor classification is reflective of mostly normal breast tissue [19], these tumors/samples were excluded from further analyses, leaving a dataset of 779 tumors. Based on the site of first relapse data for liver, lung, brain, and bone, Kaplan–Meier plots were generated, and we determined that intrinsic subtype was correlated with site of relapse (Fig. 2, Supplemental Fig. 3). Compared to luminal A, basal-like and HER2-enriched tumors showed the highest hazard ratio (HR) of relapse to any site (basal-like vs. luminal A HR 2.1, P < 0.0001; HER2-enriched vs. luminal A HR 2.0, P < 0.0001) followed by luminal B (HR 1.69, P < 0.001) and claudin-low (HR 1.47, P = 0.051) tumors. Important findings included: (1) bone metastasis was the most common, regardless of subtype (Table 1), (2) brain relapse occurred most frequently in non-luminal samples, (3) liver relapse was associated with HER2-enriched tumors, and (4) lung relapse occurred often within the claudin-low and basal-like subtypes. In all analyses, luminal B tumors were more metastatic than luminal A tumors, thus providing a useful stratification within ER+ tumors.

Fig. 2

Association of breast cancer subtype with site of first relapse. Shown are Kaplan–Meier plots and log-rank tests of first site of relapse in each breast tumor subtype in the 779 tumor dataset. If a patient showed two or more simultaneous sites of relapse, the patient was counted as a first relapse at each of those sites. Organ of first relapse: a any, b brain, c lung, d bone, e liver

Table 1 Site of first relapse of the 779 tumors from each cohort according to intrinsic subtype

Undifferentiated tumors and brain metastases

In 2009, Bos et al. [16] utilized two human breast cancer cell lines, CN34 and variants of the MDA-MB-231 human breast cancer cell line (a claudin-low cell line [6]), along with gene expression data from human breast tumors, to identify 17 genes whose expression correlated with brain relapse (the brain metastasis signature, BrMS). Given the clear associations observed for the intrinsic subtypes and sites of metastases, we hypothesized that the BrMS would correlate with basal-like and/or claudin-low subtypes. ANOVA from two different datasets supported this hypothesis (Fig. 3a, b). A lung metastasis signature (LMS) [20] is also associated with intrinsic subtype (Fig. 3c, d).

Fig. 3

Association of the brain (BrMS) and lung (LMS) cell line-based metastasis signatures with intrinsic subtype. Box-and-whisker plots are shown for each signature on multiple breast tumor microarray data sets according to intrinsic subtype. P values were calculated with ANOVA. Shown are the same data sets used for the testing of the BrMS (a) or LMS (c) signatures, as well as an independent UNC dataset (b, d)

Recently, a genomic method to quantify breast epithelial cell differentiation status, known as the Differentiation Score (DS) predictor [6], was developed. This predictor is based on the genomic signatures of FACS purified populations of mammary stem cells, luminal progenitors, and mature luminal cells of the normal human breast [8]. The scoring of the DS predictor is based on the premise that mammary stem cells are the least differentiated cells in the breast and they give rise to luminal progenitors, which then produce mature luminal cells; for the DS, higher scores represent greater differentiation along this axis that starts with the mammary stem cell signature and culminates in mature ER+ luminal cells. In this spectrum, claudin-low tumors are the least differentiated, followed by basal-like, HER2-enriched, and ending with luminal B and A tumors [6]. Since claudin-low and basal-like tumors were associated with brain relapse, we postulated that the more undifferentiated a tumor is on this axis, the more likely it would be to metastasize to the brain. To test this hypothesis, gene expression data from parental and organ-tropic (brain, lung, and bone) MDA-MB-231 cell lines were obtained from the Gene Expression Omnibus, and their DS calculated and plotted on the DS axis (Fig. 4a). Shown on the same scale are the 779 breast tumor dataset (Fig. 4b), cancer cell lines of various tissue origins (NCI60) [21] (Fig. 4c), and the MDA-MB-231 series [16, 20, 22] (Fig. 4d). Overall, claudin-low and luminal breast cancer cell lines show the same relative differences in differentiation status as are seen in primary tumors. Importantly, the MDA-MB-231 cells from the NCI60 and Massagué studies showed nearly identical DS, and the brain-tropic MDA-MB-231 cells were significantly less differentiated than the parental cell line.

Fig. 4

Differentiation Score analysis of the 779 human breast tumors, NCI60 cell lines, and MDA-MB-231 cell lines. a Differentiation axis diagram based on the FACS fractions of Lim et al. [8], as described in Prat et al. [6]. b Box-and-whisker plots of the distributions of scores from the 779 tumor dataset according to intrinsic subtype. c DS values of the NCI60 cancer cell lines [21], with the breast cancer cell lines divided into claudin-low (dashed circle value for MDA-MB-231) or luminal cell lines. d MDA-MB-231 parental, lung-tropic, brain-tropic, and bone-tropic cell lines from the studies of Massagué and colleagues. The asterisk indicates a statistically significant difference in DS between parental and brain-tropic lines (t test P = 0.002)

To identify other features shared between low DS tumors and brain metastasis, we analyzed the NCI60 [21] cell line series. Interestingly, DS were found to be similar in claudin-low breast cancer cell lines, central nervous system (CNS), and melanoma cell lines, a tumor type known to aggressively spread to the brain [23] (Fig. 4c). To identify genes that mediate cerebral colonization, significance analysis of microarrays (SAM) was performed on the NCI60 data set by comparing these three cancer cell line types versus the rest. Two-hundred and sixty-five genes were identified as being highly expressed (FDR = 0%) in claudin-low, CNS and melanoma cell lines; Ingenuity Systems Pathway Analysis found that “cellular movement” was the top biological function associated with these genes (Supplemental Fig. 4).

The triple-negative SUM149PT breast tumor-derived cell line contains two distinct populations of breast cancer cells [24], which can be separated by FACS to yield one population with basal-like and another with claudin-low-like features and a lower DS [6]. To test if lower DS correlates with increased migration, we used FACS to sort the SUM149PT cell line into CD49f+/Epcam−/low and CD49f+/high/Epcam+ subpopulations, performed Boyden chamber migration assays, and determined that the less differentiated (i.e., lower DS) SUM149PT CD49f+/Epcam−/low cells were significantly (P < 0.001) more migratory than the more differentiated Epcam+ population (Supplemental Fig. 5).

Differentiation Scores and metastasis

We next sought to better understand the information that DS provides for predicting site of metastasis. Since there is a range of differentiation within each intrinsic subtype (Fig. 4b), we tested if the least differentiated basal-like/claudin-low tumors were more metastatic than the more differentiated basal-like/claudin-low tumors. Kaplan–Meier analysis and log-rank tests determined that the least differentiated half of these tumor subtypes were associated with significantly more relapse to brain (P = 2E−03, log rank-test) and lung (P = 2.4E−02). This same approach applied within luminal and HER2-enriched tumors found no association of DS with bone or liver relapse, thus this association appears specific for brain and lung relapses, although it should be noted that the least differentiated luminal and HER2-enriched tumors do not have low overall DS.

To visualize the information that DS and intrinsic subtypes provide for predicting site of metastasis, we plotted the DS of the 779 tumors versus the HR for each site of metastasis (Fig. 5a). The tumors were then ordered based on DS and all genes (11,068) were hierarchically clustered (Fig. 5b). Interestingly, tumors with the lowest DS have a much higher HR for brain and lung metastases, and this risk drops off quickly as differentiation increases. Importantly, this analysis identified a subset of tumors within the largely ER− claudin-low and basal-like tumors that aggressively metastasize.

Fig. 5

Relationship of Differentiation Score, breast cancer subtype, and likelihood of site of metastasis. 779 tumors with known first site of relapse were ordered from low to high DS. a Hazard ratios for each site of metastasis were estimated by grouping a sliding window of 50 samples with consecutive DS and contrasting against those outside the window. Estimates were then smoothed with Lowess prior to plotting. b Hierarchical clustering of all genes. Below the dendrogram is a colored bar identifying the intrinsic subtype of each tumor (yellow claudin-low, red basal-like, pink HER2-enriched, dark blue luminal A, light blue luminal B)
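The windowing step described in this legend can be illustrated with a minimal Python sketch. It only builds the in-window/out-of-window indicator for each window of consecutive DS values; the hazard ratio itself would come from fitting a Cox model to that indicator with a survival analysis package, and the Lowess smoothing is likewise omitted. The function name, the step size of one sample, and the toy scores are assumptions made for illustration and are not taken from the study.

```python
def sliding_windows(scores, width=50):
    """Return a list of (mean_score_in_window, in_window_flags) tuples.

    Samples are ordered by score; each window holds `width` consecutive
    samples and advances by one sample (step size is an assumption).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    windows = []
    for start in range(len(order) - width + 1):
        members = set(order[start:start + width])
        center = sum(scores[i] for i in members) / width
        # True for samples inside the window, False for those outside it.
        flags = [i in members for i in range(len(scores))]
        windows.append((center, flags))
    return windows


# Toy differentiation scores for five hypothetical tumors, window of 2.
toy_scores = [0.3, -1.0, 0.9, 0.1, -0.4]
toy_windows = sliding_windows(toy_scores, width=2)
```

Each flag vector would serve as the binary covariate contrasting samples inside the window against all others.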

Stem cell signatures correlate with brain and lung metastases

Several studies have shown an association of stem cell characteristics and metastatic proclivity [25–27]. Therefore, the 855 tumor dataset was used to test if several previously published stem cell signatures contained within our set of 298 gene modules [12] were associated with site of relapse. Univariate Cox proportional hazards models identified that many of the signatures with the strongest associations for brain (and lung) relapse were either expressed in normal brain and/or have been identified as essential components of embryonic stem cells and tumor initiating cells [26, 27] (Supplemental Table 4). Of the 13 embryonic stem cell signatures analyzed in Ben-Porath et al. [27], all were significantly associated with relapse to brain/lung, 11 with LN metastasis, 10 with liver, and 5 with bone. Nearly all the signatures that predicted for brain relapse correlated with low DS, and those not strongly correlated with DS were correlated with proliferation. Some of these signatures further identified subsets of basal-like and claudin-low tumors most likely to metastasize to the brain (log-rank test: PRC2_targets; P = 0.0090, MM_WapINT3; P = 0.0001). Thus, ES cell signatures, DS, and proliferation appear to be strong predictors of CNS and lung metastases, and in general, the signatures most predominant for brain/lung relapse were weakly expressed in tumors that spread to the bone.

Univariate and multivariable survival analyses

The ability to predict the presence and/or location of a tumor recurrence could influence the location and frequency of radiographic surveillance for patients with a history of breast cancer. Therefore, we sought to identify the most informative signature, or combination of signatures, that predicts metastasis to specific sites. First, we performed univariate survival analyses for multiple signatures, including the many described above and our previously published VEGF/hypoxia signature [28]. As shown in Table 2A, all signatures tested were highly prognostic overall and, interestingly, both BrMS and LMS signatures predicted lung and brain relapse, providing evidence that metastases to these two organs utilize similar genetic mechanisms. Second, we performed multivariable analysis (MVA) using the backward stepwise procedure and observed that subtype information (i.e., subtype calls or risk of relapse categories based on subtype [ROR-S]) was selected in each evaluation (Table 2B). For liver relapse, specifically, knowing the subtype call instead of the ROR-S risk category was found to be particularly informative; indeed, the risk of liver relapse of the HER2-enriched subtype was 4.0 times higher than that of the luminal A subtype, even though HER2 status (as determined by gene expression) was also included in the model. In addition to intrinsic subtype information, other signatures were found to be statistically significant in the various MVA final models, such as the upregulated genes of the BrMS in brain relapse, or the VEGF/hypoxia signature and the downregulated genes of the LMS in lung relapse. Interestingly, the BrMS and VEGF/hypoxia signature were found to be highly correlated with DS (Pearson = −0.68), and correspondingly, the BrMS, DS, and VEGF/hypoxia signature identify a subset of basal-like/claudin-low tumors that spread to the brain (P < 0.05). Thus, when each metastatic site is individually examined, a unique combination of signatures is chosen that includes intrinsic subtype (individual subtype or ROR-S) as well as another signature or two, ultimately resulting in the optimal set of variables for predicting relapse to that organ.

Table 2 Breast cancer metastasis-free survival (A) univariate and (B) multivariable analyses among all patients

Discussion

Metastases are the main cause of death for breast cancer patients, and predicting a tumor’s likelihood to spread, and its organ of relapse, is clinically important. Analysis of 265 breast tumors and 85 metastases found that a breast tumor’s overall gene expression phenotype is largely maintained in its metastases. The gene expression differences that do occur may be due to a combination of different amounts of epithelial/stromal cells (Fig. 1, Supplemental Table 1), and/or clonal expansion of a more aggressive subclone of a tumor [4, 29]. The microenvironment also affects gene expression and response to therapeutics [30]; therefore, targeting the host organ cells, vascular cells, as well as tumor cell specific targets may be the best approach to inhibit disease progression [31]. This overall similarity, however, does suggest that important information about metastatic potential can be revealed by studying primary tumors.

Basal-like and claudin-low breast cancers both exhibit a high probability to metastasize to the brain and lung while HER2-enriched subtype tumors preferentially colonize the liver (Fig. 2; Table 1). The basal-like and claudin-low tumor types are genomically related [6], exhibit similar treatment response characteristics, and as shown here, have similar metastasis patterns. The CD49f+/Epcam−/low fraction of the SUM149PT cell line (which is enriched for claudin-low tumor features) was significantly more migratory than the more differentiated basal-like component cells. Interestingly, over time, the SUM149PT cells with claudin-low characteristics asymmetrically divide into two distinct populations of more (i.e., basal-like) and less-differentiated cells, whereas the more differentiated fraction produces similarly differentiated cells [6]. Since the less-differentiated claudin-low-like cells contain higher levels of genes that facilitate cellular movement (Supplemental Figs. 4, 5), we hypothesize that these cells may initiate the metastatic cascade; after seeding a host organ, they asymmetrically divide, spawning both more and less differentiated cells. Precisely why these cells show predilection for the brain and lung requires further investigation; however, the cell line studies of Massagué and colleagues using the claudin-low MDA-MB-231 cells are providing some initial candidates. These studies have shown that the cells that are relatively more capable of spreading to the CNS express genes that function to increase cellular extravasation and blood–brain barrier penetration [16], while also upregulating glycolytic pathways and increasing vascularization [28].

Our re-analyses of the data presented by Bos et al. [16] find that the DS of brain-tropic breast cancer cells is significantly lower than that of the parental cell line (Fig. 4); correspondingly, low DS was also found to associate with brain relapse in patients (Fig. 5). While basal-like and claudin-low breast tumors can relapse in bone, recurrence in vital organs, such as the brain and lung, is more symptomatic. Thus, the first site of recorded relapse for basal-like and claudin-low tumors is typically not bone. DS, however, is not the only factor that determines metastagenicity. For example, luminal A and B tumors have similar DS, yet luminal B tumors are much more likely to relapse. Perhaps all luminal tumors can effectively seed certain organs; however, the faster proliferation rate inherent to luminal B tumors accounts for the differential relapse frequency. Correspondingly, 58% of luminal B tumors present with multiple organs as first relapse events, compared to only 21% of luminal A tumors.

After observing the metastasis patterns of the less-differentiated basal-like and claudin-low breast tumors, it was not surprising that the BrMS and LMS signatures associate with subtype and DS. The bone metastasis signature (BoMS) was not strongly expressed in any subtype, a finding which may reflect the fact that bone was the most common site of metastasis in our study. These findings complement analyses by Culhane and Quackenbush [32], who found that a different lung metastasis signature [33] was a surrogate for the basal-like subtype. This does not argue, however, that these signatures are not biologically important. In fact, the BrMS identifies some of the least differentiated tumors within the claudin-low and basal-like subtypes, and these data support continued investigation of select genes within the BrMS, as targeting these genes, along with others that function to increase cellular differentiation, may serve to slow metastatic progression.

To gain a mechanistic understanding for site-specific tumor colonization, we tested a compendium of 298 expression signatures as individual predictors of site of relapse. These analyses showed enrichment for stem cell signatures in brain/lung relapse (Supplemental Table 4). The majority of these signatures provide information that is encoded within DS; however, some of the signatures further divide ER-negative tumors into two distinct groups that are more or less likely to metastasize to the brain/lung. As an example, one such signature is the MM_WapINT3, which is a signature derived from a transgenic mouse mammary tumor model that over-expresses Notch4 and aggressively spreads to the lung [34]. This is a clinically relevant finding in that half of patients with advanced triple negative breast cancer relapse within the brain [35], and survival following CNS relapse is less than 4 months [36], regardless of receipt of systemic therapy.

Overall, the results from Table 2 reveal shared and unique features predicting relapse to distinct sites. For example, intrinsic subtype (as represented by individual subtypes or the ROR-S score) makes every final MVA model, but then each site of relapse shows individual characteristics. For brain, the BrMS signature and HER2 status add important information, while for lung the VEGF/hypoxia and LMS signatures add information, for bone the DS was valuable, and for liver, most information was carried by the HER2-enriched subtype; thus for the most accurate site of metastasis predictions, multiple signatures and/or clinical variables are needed. Our ability to predict patients at the highest risk for CNS relapse may impact the manner in which we approach CNS screening and future prevention strategies. The data presented herein provide clinically useful information that could be used to identify patients most likely to experience site-specific breast cancer relapse.

Materials and methods

Human breast tumor microarray datasets

Two distinct microarray data sets were studied here. The first was based upon Agilent Technologies DNA microarrays taken from Prat et al., with 42 new additional metastasis samples profiled here using identical protocols as previously described [6, 19, 37]. All human tumor and normal tissue samples were collected using IRB-approved protocols and all microarray and patient clinical data are available at UNC Microarray Database (https://genome.unc.edu) and have been deposited in the Gene Expression Omnibus (GEO) under the accession number GSE26338. The probes/genes for these analyses were filtered by requiring the Lowess normalized intensity values in both sample and control to be >10. The normalized log2 ratios (Cy5 sample/Cy3 control) of all probes mapping to the same gene (Entrez ID as defined by the manufacturer) were averaged to generate independent expression estimates.
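The filtering and averaging steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the processing code used for the study: the function name, the tuple layout of the input, and the toy intensities are hypothetical, and the Lowess normalization is assumed to have been applied already.

```python
import math
from collections import defaultdict


def gene_log2_ratios(probes, min_intensity=10.0):
    """Average log2(Cy5 sample / Cy3 control) over probes of the same gene.

    probes: iterable of (gene_id, cy5_sample, cy3_control) normalized
    intensities. Probes are kept only if both channels exceed min_intensity.
    """
    per_gene = defaultdict(list)
    for gene_id, cy5, cy3 in probes:
        if cy5 > min_intensity and cy3 > min_intensity:
            per_gene[gene_id].append(math.log2(cy5 / cy3))
    # One expression estimate per gene: the mean of its surviving probes.
    return {gene: sum(vals) / len(vals) for gene, vals in per_gene.items()}


# Toy probes for two hypothetical genes.
toy_probes = [
    ("geneA", 400.0, 100.0),  # log2 ratio 2
    ("geneA", 100.0, 100.0),  # log2 ratio 0
    ("geneB", 8.0, 50.0),     # dropped: sample intensity not above 10
    ("geneB", 50.0, 200.0),   # log2 ratio -2
]
toy_ratios = gene_log2_ratios(toy_probes)
```

A gene whose probes all fail the intensity filter is simply absent from the result in this sketch.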

The second data set was a combined microarray data set of four studies taken from the public domain. We utilized the microarray data as presented in the following breast cancer datasets: GSE2034, GSE12276, GSE2603, and the NKI295 (microarray-pubs.stanford.edu/wound_NKI/Clinical_Data_Supplement.xls). The clinical data from these patients were obtained from previous studies [16, 38]. NCI60 cell line microarray data were obtained from http://genome-www.stanford.edu/nci60/. Additional microarrays from the GEO for the MDA-MB-231 cells were downloaded from GSE12237 and GSE2603. Probes in these external sets were assigned to Entrez Gene identifiers and replicate gene names were collapsed to the median. The data from the four tumor datasets were then combined using Distance Weighted Discrimination [14] to remove the systematic biases present in different microarray datasets. In all datasets, samples were standardized to zero mean and unit variance before other analyses were performed.
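Two of the steps above, collapsing replicate gene entries to the median and standardizing each sample, can be illustrated as follows. This is a minimal sketch with hypothetical function names and toy values; it assumes that standardization is applied per sample across genes and uses the population standard deviation, neither of which is specified in the text. The Distance Weighted Discrimination step is not shown.

```python
from collections import defaultdict
from statistics import fmean, median, pstdev


def collapse_to_median(rows):
    """rows: list of (gene_id, [value per sample]).

    Returns {gene_id: [median across replicate rows, per sample]}.
    """
    grouped = defaultdict(list)
    for gene_id, values in rows:
        grouped[gene_id].append(values)
    return {g: [median(col) for col in zip(*reps)] for g, reps in grouped.items()}


def standardize_samples(matrix):
    """matrix: {gene_id: [value per sample]}.

    Scales each sample (column) to zero mean and unit variance across genes.
    A sample with zero variance would raise ZeroDivisionError.
    """
    genes = list(matrix)
    columns = list(zip(*(matrix[g] for g in genes)))
    scaled = []
    for col in columns:
        mu, sd = fmean(col), pstdev(col)
        scaled.append([(x - mu) / sd for x in col])
    return {g: [scaled[j][i] for j in range(len(columns))]
            for i, g in enumerate(genes)}


# Toy data: geneA appears twice (replicate rows), two samples per row.
toy_rows = [
    ("geneA", [1.0, 4.0]),
    ("geneA", [3.0, 6.0]),
    ("geneB", [0.0, 1.0]),
    ("geneC", [4.0, 9.0]),
]
toy_collapsed = collapse_to_median(toy_rows)
toy_scaled = standardize_samples(toy_collapsed)
```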

Microarray data processing

Samples in the final normalized data were assigned to the five subtypes (luminal A, luminal B, HER2-enriched, basal-like, and normal-like) using the PAM50 classifier [19]. Assignment of claudin-low and DS was performed according to the protocol described in Prat et al. [6]. The 298 gene expression modules first characterized in Fan et al. [12] were applied to both data sets and expression estimates obtained for each tumor in each data set; the gene list corresponding to each module was summarized as the mean expression within each sample, as the principal component, or according to a predetermined algorithm. Testing for differential expression of the modules between primary and metastatic pairs from the same individual was performed with the SAM [39] two class paired test.
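The simplest of the three summaries mentioned above, the per-sample mean of a module's genes, can be sketched as follows. The function name, the handling of genes missing from the matrix, and the toy values are assumptions for illustration; the principal component and algorithm-specific summaries are not shown.

```python
from statistics import fmean


def module_scores(expression, module_genes):
    """Mean expression of a module's genes within each sample.

    expression: {gene_id: [value per sample]}; genes absent from the
    matrix are skipped (an assumption of this sketch).
    """
    present = [g for g in module_genes if g in expression]
    if not present:
        raise ValueError("no module genes found in the expression matrix")
    return [fmean(col) for col in zip(*(expression[g] for g in present))]


# Toy matrix with three hypothetical genes and two samples.
toy_expression = {"g1": [1.0, 2.0], "g2": [3.0, 6.0], "g3": [10.0, 10.0]}
toy_module = module_scores(toy_expression, ["g1", "g2", "not_on_array"])
```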

Statistics and survival analyses

The intra-class correlation (ICC) [40] was utilized to estimate concordance within specific groups of samples. For groups of paired samples, the ICC was calculated for each pair and then summarized by the mean ICC for each group of interest. ICC values for groups of unpaired samples were estimated from all samples in the group. ICC estimates were performed identically for the set of modules or the set of all genes. All ICC estimates were generated using the R package “irr.” Principal components analyses were performed in R. Categorical survival analyses were performed using a log-rank test and visualized with Kaplan–Meier plots. Box-and-whisker plots were used to observe the relationship of the intrinsic subtypes with the organ-specific metastasis signatures and were generated in R. Univariate and multivariable Cox proportional hazard analyses were used to estimate HR and determine the significance of the intrinsic subtypes and gene signatures. Subtypes and DS were compared along with ER status and published signatures using time to first relapse (for each site) as the end point. To visualize the association of DS with subtype and site of metastasis, HR for each site of metastasis were identified by using a sliding window of 50 samples with consecutive DS, and the HR was calculated by contrasting the samples in the window versus those outside the window. HR estimates were smoothed across DS with Lowess.
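The per-pair ICC and its group mean can be illustrated with a short Python sketch. The analyses here were run with the R package “irr”; the code below is not that package, and the choice of the one-way, single-measure form of the ICC is an assumption, since the text does not state which form was used. Genes (or modules) are treated as the measured subjects and the two samples of a pair as the two measurements; function names and toy values are hypothetical.

```python
from statistics import fmean


def icc_oneway(x, y):
    """One-way, single-measure ICC between two equally long profiles."""
    n, k = len(x), 2
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    row_means = [(a + b) / 2 for a, b in zip(x, y)]
    grand = fmean(row_means)
    # Between-gene and within-gene mean squares from a one-way ANOVA.
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((a - m) ** 2 + (b - m) ** 2
              for a, b, m in zip(x, y, row_means)) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


def mean_pair_icc(pairs):
    """Summarize a group of paired samples by the mean of per-pair ICCs."""
    return fmean([icc_oneway(x, y) for x, y in pairs])


# Toy profiles: an identical pair and a perfectly discordant pair.
icc_same = icc_oneway([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
icc_opposite = icc_oneway([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
group_icc = mean_pair_icc([
    ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
    ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]),
])
```

With real data each profile would hold thousands of gene values or 298 module scores, and the confidence intervals reported in Fig. 1 would need an additional step that is not shown.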

Functional analysis of gene sets

261 genes that were differentially expressed in the three undifferentiated NCI60 cell line types (as compared to the rest) were uploaded into Ingenuity Systems Pathway Analysis (www.ingenuity.com) based on their Entrez gene identification number.

Boyden chamber migration assays

SUM149PT cells were sorted by fluorescence-activated cell sorting (FACS) after immunolabeling with CD49f and Epcam as previously described [6]. CD49f+/high/Epcam+ and CD49f+/Epcam−/low cells were plated in 0% FBS Boyden chambers with 8.0 μm pores and chemoattracted to 0.5% FBS for 24 h. Migrated cells were stained with crystal violet, and then solubilized and read at 470 nm.