1 Introduction

Histology sections provide a wealth of information about tissue architecture, which contains multiple cell types at different stages of the cell cycle. These sections are often stained with hematoxylin and eosin (H&E), which label DNA (e.g., nuclei) and protein contents, respectively, in various shades of color. Morphometric aberrations in tumor architecture often lead to disease progression, and it is desirable to quantify indices associated with these aberrations since they can be tested against clinical outcomes, e.g., survival and response to therapy.

Several excellent reviews of the quantitative analysis of H&E stained sections can be found in [7, 8]. Fundamentally, the trend has been based either on nuclear segmentation and the corresponding morphometric representation, or on patch-based representation of the histology sections that aids in clinical association. The major challenge for tissue morphometric representation is the large amount of technical and biological variation in the data. To overcome this problem, recent studies have focused on either fine-tuning human-engineered features [1, 4, 11, 12] or applying automatic feature learning [5, 9, 15, 16, 19, 20] for robust representation and characterization.

Even though there are inter- and intra-observer variations [6], a trained pathologist always uses rich content (e.g., various cell types, cellular organization, cell state and health), in context, to characterize tumor architecture and heterogeneity for the assessment of disease state. Motivated by the works of [13, 18], we encode cellular morphometric signatures within the spatial pyramid matching (SPM) framework for robust representation (i.e., cellular morphometric context) of whole slide images (WSIs) in a large cohort, with an emphasis on tumor architecture and tumor heterogeneity. Based on this representation, an integrative analysis pipeline is constructed for the association of cellular morphometric context with clinical outcomes and molecular data, with potential for hypothesis generation regarding imaging biomarkers for personalized diagnosis or treatment. The proposed approach is applied to the TCGA LGG cohort, where experimental results (i) reveal several clinically relevant cellular morphometric types, which enable both perceptual interpretation/validation and further investigation through gene set enrichment analysis; and (ii) indicate significantly increased survival rates in one of the subtypes derived from the cellular morphometric context.

2 Approaches

The proposed approach starts with the construction of cellular morphometric types and cellular morphometric context, followed by integrative analysis with both clinical and molecular data. Specifically, the nuclear segmentation method in [4] was adopted given its demonstrated robustness in the presence of biological and technical variations; the corresponding nuclear morphometric descriptors are described in [3], and the constructed cellular morphometric context representations are released on our website (see footnote 1).

2.1 Construction of Cellular Morphometric Types and Cellular Morphometric Context

For a set of WSIs and the corresponding nuclear segmentation results, let M be the total number of segmented nuclei; N be the number of morphometric descriptors extracted from each segmented nucleus, e.g., nuclear size and nuclear intensity; and \(\mathbf {X}\) be the set of morphometric descriptors for all segmented nuclei, where \(\mathbf {X}=[\mathbf {x}_1, ..., \mathbf {x}_M]^{\top } \in \mathbb {R}^{M \times N}\). The construction of cellular morphometric types and cellular morphometric context is described as follows:

  1.

    Construct cellular morphometric types (\(\mathbf {D}\)), where \(\mathbf {D}=[\mathbf {d}_1,...,\mathbf {d}_K]^{\top }\) are the K cellular morphometric types to be learned by the following optimization:

    $$\begin{aligned}&\min _{\mathbf {D},\mathbf {Z}} \sum _{m=1}^{M}||\mathbf {x}_m-\mathbf {z}_m\mathbf {D}||^2\\&\text {subject to } card(\mathbf {z}_m) = 1,\ |\mathbf {z}_m| = 1,\ \mathbf {z}_m \succeq 0,\ \forall m \end{aligned} \tag{1}$$

    where \(\mathbf {Z}=[\mathbf {z}_1,...,\mathbf {z}_M]^{\top }\) indicates the assignment of the cellular morphometric type, \(card(\mathbf {z}_m)\) is a cardinality constraint enforcing only one nonzero element of \(\mathbf {z}_m\), \(\mathbf {z}_m \succeq 0\) is a non-negative constraint on the elements of \(\mathbf {z}_m\), and \(|\mathbf {z}_m|\) is the L1-norm of \(\mathbf {z}_m\). Together, these constraints make each \(\mathbf {z}_m\) a one-hot indicator vector, so Eq. 1 is the K-means (vector quantization) objective. During training, Eq. 1 is optimized with respect to both \(\mathbf {Z}\) and \(\mathbf {D}\); in the coding phase, for a new set of \(\mathbf {X}\), the learned \(\mathbf {D}\) is applied, and Eq. 1 is optimized with respect to \(\mathbf {Z}\) only.

  2.

    Construct cellular morphometric context via SPM. This is done by repeatedly subdividing an image and computing the histograms of different cellular morphometric types over the resulting subregions. As a result, the spatial histogram, H, is formed by concatenating the appropriately weighted histograms of all cellular morphometric types at all resolutions. For more details about SPM, please refer to [13].
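
As an illustration of step 1, the following minimal sketch alternates between the assignment update (\(\mathbf {Z}\)) and the centroid update (\(\mathbf {D}\)), which is the usual Lloyd iteration for an objective of the form of Eq. 1. It is not the implementation used in this work: the function names, the deterministic initialization, the fixed iteration count, and the toy data are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign_types(X, D):
    """Coding phase: with D fixed, the best one-hot z_m picks the nearest row of D."""
    return [min(range(len(D)), key=lambda k: squared_distance(x, D[k])) for x in X]


def learn_types(X, K, n_iter=50):
    """Training phase: alternate between updating Z (labels) and D (centroids)."""
    # Assumption for reproducibility: initialise D with K evenly spaced rows of X.
    D = [list(X[i * len(X) // K]) for i in range(K)]
    for _ in range(n_iter):
        labels = assign_types(X, D)
        for k in range(K):
            members = [x for x, label in zip(X, labels) if label == k]
            if members:  # an empty type keeps its previous centroid
                D[k] = [sum(col) / len(members) for col in zip(*members)]
    return D


# Toy data: six "nuclei" with N = 2 descriptors, forming two obvious groups.
X = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]]
D = learn_types(X, K=2)
labels = assign_types(X, D)   # type index per nucleus
```

Calling `assign_types` on new descriptors with the learned `D` corresponds to the coding phase, in which only \(\mathbf {Z}\) is optimized.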

In our experiments, K is fixed at 64. Meanwhile, given that each patient may have multiple WSIs, SPM is applied at a single scale for convenient construction of the cellular morphometric context and for integrative analysis at the patient level, where both cellular morphometric types and the subtypes of cellular morphometric context are associated with clinical outcomes and molecular information.
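
At a single scale, the spatial histogram reduces to one histogram of type frequencies. The sketch below shows one plausible way to form a patient-level representation, assuming that nuclei from all WSIs of a patient are pooled before normalization; the function name and the toy labels are hypothetical, and K is set to 4 only to keep the example small.

```python
from collections import Counter


def morphometric_context(type_labels, K):
    """Normalised frequency of each of the K types among a patient's nuclei."""
    if not type_labels:
        raise ValueError("at least one segmented nucleus is required")
    counts = Counter(type_labels)
    total = len(type_labels)
    return [counts.get(k, 0) / total for k in range(K)]


# Hypothetical type labels for nuclei from two WSIs of the same patient.
wsi_a = [0, 0, 1, 3]
wsi_b = [1, 1, 3, 3]
context = morphometric_context(wsi_a + wsi_b, K=4)   # one K-dimensional vector per patient
```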

2.2 Integrative Analysis

The construction of cellular morphometric context at the patient level in a large cohort enables integrative analysis with both clinical and molecular information, which consists of the following components:

  1.

    Identification of cellular morphometric subtypes/clusters: consensus clustering [14] is performed to identify subtypes/clusters across patients. The input to consensus clustering is the cellular morphometric context at the patient level. Consensus clustering aggregates consensus across multiple runs of a base clustering algorithm. Moreover, it provides a visualization tool for exploring the number of clusters in the data and for assessing the stability of the discovered clusters.

  2.

    Survival analysis: a Cox proportional hazards (PH) regression model is used for survival analysis.

  3.

    Enrichment analysis: Fisher’s exact test is used for the enrichment analysis between cellular morphometric context subtypes and genomic subtypes.

  4.

    Genomic association: linear models are used for assessing differential expression of genes between subtypes of cellular morphometric context, and the correlation between genes and cellular morphometric types.
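
To make the enrichment step concrete, the sketch below computes a two-sided Fisher's exact test for a \(2\times 2\) table, for example membership in one cellular morphometric context subtype versus membership in one genomic subtype. The restriction to a \(2\times 2\) table, the function name, and the counts are assumptions for illustration; tables with more rows or columns need the general form of the test.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided p-value for the table [[a, b], [c, d]] with fixed margins."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are no more probable than the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))


# Hypothetical counts: rows = in / not in a morphometric subtype,
# columns = in / not in a genomic subtype.
p_value = fisher_exact_2x2(8, 2, 1, 5)
```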

3 Experiments and Discussion

The proposed approach has been applied to the TCGA LGG cohort, including 215 WSIs from 209 patients, where clinical annotations are available for 203 patients. For quality control, background and border portions of each whole slide image were detected and removed from the analysis.

3.1 Phenotypic Visualization and Integrative Analysis of Cellular Morphometric Types

The TCGA LGG cohort consists of \(\sim 80\) million segmented nuclear regions, from which 2 million were randomly selected for construction of cellular morphometric types. As described in Sect. 2, the cellular morphometric context representation for each patient is a 64-dimensional vector, where each dimension represents the normalized frequency of a specific cellular morphometric type appearing in the WSIs of the patient. Initial integrative analysis is performed by linking individual cellular morphometric types to clinical outcomes and molecular data. Each cellular morphometric type is used, together with the age of the patient, as a predictor variable in a Cox proportional hazards (PH) regression model (implemented through the R survival package). For each cellular morphometric type, the frequencies are further correlated with the gene expression values across all patients. The top-ranked positively correlated and negatively correlated genes are each imported into MSigDB [17] for gene set enrichment analysis. Table 1 summarizes the cellular morphometric types that best predict the survival distribution, and the corresponding enriched gene sets. Figure 1 shows the top-ranked examples for these cellular morphometric types.
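
The gene-ranking step can be sketched as follows. The text does not state which correlation measure was used, so Pearson correlation is an assumption here, as are the function names, the gene names, and the toy values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient (assumes neither input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_genes(type_frequency, expression):
    """Order gene names from most positively to most negatively correlated."""
    scores = {gene: pearson(type_frequency, values) for gene, values in expression.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Frequency of one morphometric type in four hypothetical patients.
freq = [0.10, 0.20, 0.30, 0.40]
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],  # rises with the frequency
    "GENE_B": [4.0, 3.0, 2.0, 1.0],  # falls with the frequency
    "GENE_C": [2.0, 1.0, 1.0, 2.0],  # no linear trend
}
ranking, scores = rank_genes(freq, expression)
```

The head and the tail of `ranking` would then give the positively and negatively correlated gene lists, respectively.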

Table 1. Top cellular morphometric types for predicting the survival distribution based on the Cox proportional hazards (PH) regression model, and the corresponding enriched gene sets with respect to genes that best correlate, positively or negatively, with the frequency of the cellular morphometric type appearing in the WSIs of the patient. The hazard ratio (HR) is the ratio of the hazard rates corresponding to conditions that differ by one unit of an explanatory variable; a higher HR indicates a higher hazard of death.

Fig. 1. Top-ranked examples for cellular morphometric types that best predict the survival distribution, as shown in Table 1. Each example is an image patch of \(101\times 101\) pixels centered on the retrieved cell, which is marked with a green dot. The first four cellular morphometric types (hazard ratio \(>1\)) indicate a worse prognosis and the last four cellular morphometric types (hazard ratio \(<1\)) indicate a protective effect. Note that this figure is best viewed in color at 400% zoom.

As shown in Table 1, 8 out of 64 cellular morphometric types are significantly associated with survival (FDR-adjusted p-value \(<0.01\)). The first four cellular morphometric types in Fig. 1 all have a hazard ratio \(>1\), indicating that a higher frequency of these cellular morphometric types is associated with a worse prognosis. A common phenotypic property of these cellular morphometric types is the loss of chromatin content in the nuclear regions, which may be associated with poor prognosis of lower grade glioma. The last four cellular morphometric types in Fig. 1 all have a hazard ratio \(<1\), indicating that a higher frequency of these cellular morphometric types is associated with a better prognosis.
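
Since 64 cellular morphometric types are tested, the p-values are adjusted for multiple comparisons. The text does not name the FDR procedure; the sketch below assumes the widely used Benjamini-Hochberg adjustment, with a hypothetical function name and made-up p-values.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for four tests.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
significant = [q < 0.05 for q in adjusted]
```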

Table 1 also indicates the enrichment of genes up-regulated in response to IFNG in cellular morphometric types \(\#28\), \(\#29\) and \(\#52\). In the glioma microenvironment, tumor cells and local T cells produce abnormally low levels of IFNG. IFNG acts on cell-surface receptors and activates transcription of genes that offer potential in the treatment of brain tumors by increasing tumor immunogenicity, disrupting proliferative mechanisms, and inhibiting tumor angiogenesis [10]. The role of IFNG as a positive survival factor supports the prognostic effect of these cellular morphometric types: \(\#28\) (negative correlation and worse prognosis); \(\#29\) and \(\#52\) (positive correlation and better prognosis). Other interesting observations are that three cellular morphometric types of better prognosis are enriched with genes up-regulated by IL6 via STAT3, and two cellular morphometric types of better prognosis are enriched with genes regulated by NF-kB in response to TNF and genes up-regulated in response to TGFB1, respectively.

3.2 Subtyping and Integrative Analysis of Cellular Morphometric Context

Hierarchical clustering was adopted as the base clustering algorithm for consensus clustering, implemented via the R/Bioconductor ConsensusClusterPlus package with the \(\chi ^2\) distance as the distance function. The procedure was run for 500 iterations with a sampling rate of 0.8 on 203 patients, and the corresponding consensus clustering matrices with 2 to 9 clusters are shown in Fig. 2, where the matrices with 2 to 5 clusters reveal different levels of similarity among patients and the matrices with 6 to 9 clusters provide little further information. Thus, we use the five-cluster result for integrative analysis with clinical outcomes and genomic signatures, where, due to insufficient patients in subtypes \(\#1\) (1 patient) and \(\#2\) (2 patients), we focus on the remaining three subtypes.
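
The consensus matrix that underlies Fig. 2 can be illustrated with a short sketch: for each pair of patients, it records the fraction of resampling runs in which the two were placed in the same cluster, among the runs in which both were sampled. The base clustering itself is not shown, and the function name and toy runs are assumptions; this is not the ConsensusClusterPlus implementation.

```python
def consensus_matrix(runs, n_items):
    """runs: list of dicts mapping a sampled item index to its cluster label."""
    together = [[0] * n_items for _ in range(n_items)]
    sampled = [[0] * n_items for _ in range(n_items)]
    for labels in runs:
        items = list(labels)
        for i in items:
            for j in items:
                sampled[i][j] += 1          # i and j were both drawn in this run
                if labels[i] == labels[j]:
                    together[i][j] += 1     # and ended up in the same cluster
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n_items)]
        for i in range(n_items)
    ]


# Three hypothetical resampling runs over four patients; each run omits one patient.
runs = [
    {0: "a", 1: "a", 2: "b"},
    {0: "a", 1: "a", 3: "b"},
    {1: "x", 2: "y", 3: "y"},
]
consensus = consensus_matrix(runs, n_items=4)
```

Entries close to 0 or 1 indicate stable pairwise behavior across runs, which is what makes the block structure in a consensus matrix readable.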

Fig. 2. Consensus clustering matrices and corresponding consensus CDFs of 203 TCGA patients with LGG for cluster numbers \(N=2\) to \(N=9\), based on cellular morphometric context.

Fig. 3. (a) Kaplan-Meier plot for the three major subtypes associated with patient survival, where subtypes \(\#3\) (53 patients), \(\#4\) (65 patients) and \(\#5\) (82 patients) correspond to the three major subtypes from top-left to bottom-right, respectively, in Fig. 2 (\(N=5\)). (b) Top genes that are differentially expressed between subtype \(\#5\) and subtypes \( \#3 \& \#4\).

Figure 3(a) shows the Kaplan-Meier survival plot for the three major subtypes of the five-cluster consensus clustering result. The log-rank test p-value of \(2.82\times 10^{-5}\) indicates that the difference between the survival times of subtype \(\#5\) patients and subtypes \( \#3 \& \#4\) patients is statistically significant. The integration of genome-wide data from multiple platforms uncovered three molecular classes of lower-grade gliomas that were best represented by IDH and 1p/19q status: wild-type IDH, IDH mutation with 1p/19q codeletion, and IDH mutation without 1p/19q codeletion [2]. Fisher’s exact test reveals no enrichment between the cellular morphometric subtypes and these molecular subtypes. On the other hand, differentially expressed genes between subtype \(\#5\) and subtypes \( \#3 \& \#4\) (Fig. 3(b)) indicate enrichment of genes that mediate programmed cell death (apoptosis) by activation of caspases, and of genes defining epithelial-mesenchymal transition, as in wound healing, fibrosis and metastasis (via MSigDB).
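
For reference, a two-group log-rank test of the kind reported above can be sketched as follows. It compares observed and expected deaths in one group at every distinct death time and refers the resulting statistic to a chi-square distribution with one degree of freedom. The function name and the toy survival times are assumptions, and the sketch is not the code used to produce the reported p-value.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; events are 1 for death and 0 for censoring."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({ti for ti, e, _ in data if e == 1}):
        groups_at_risk = [g for ti, _, g in data if ti >= t]
        n = len(groups_at_risk)
        n_a = groups_at_risk.count(0)
        deaths = [g for ti, e, g in data if ti == t and e == 1]
        d = len(deaths)
        observed_a += deaths.count(0)
        expected_a += d * n_a / n
        if n > 1:  # hypergeometric variance of the deaths in group A
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("the statistic is undefined for these data")
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 degree of freedom
    return chi2, p_value


# Toy example: group A dies early, group B dies late, with no censoring.
chi2_diff, p_diff = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
# Identical groups should show no difference.
chi2_same, p_same = logrank_test([1, 2, 3, 4], [1] * 4, [1, 2, 3, 4], [1] * 4)
```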

4 Conclusion and Future Work

In this paper, we encode cellular morphometric signatures within the SPM framework for robust representation (i.e., cellular morphometric context) of WSIs in a large cohort at the patient level, based on which an integrative analysis pipeline is constructed for the association of cellular morphometric context with clinical outcomes and molecular data. The integrative analysis, performed on the TCGA LGG cohort, reveals clinically relevant cellular morphometric types and morphometric context subtypes, and the corresponding enriched gene sets. We believe that the proposed approach has the potential to contribute to hypothesis generation regarding imaging biomarkers for personalized diagnosis or treatment, which will be further validated on an independent cohort.