1 Introduction

In the past few years, Convolutional Neural Networks (CNNs) have attained immense success in medical imaging problems such as detection and classification [2,3,4,5,6,12,15]. For example, in [3], a magnification-independent framework and CNN model are presented to detect H&E-stained breast cancer cells. In [2], a simple 3-layer CNN architecture is presented with data augmentation to classify immunofluorescence images of HEp-2 cells. In [4, 5], deep neural networks are investigated for mitosis detection. Apart from classification and detection, CNNs have also been used in medical image segmentation [13].

CNNs used in these problems are typically trained in the RGB color space. However, the ideal discriminating features in medical microscopic images may not be the pixel intensities in the RGB color space, but the stain quantities that are absorbed and are characteristic of the tissue. Previous works have shown that the stain quantities can be estimated in the Optical Density (OD) space through the application of the Beer-Lambert law [8,9,10,14]. This transformation from the RGB color space to the stain quantity space is commonly termed stain deconvolution. Motivated by the above, we propose the Stain Deconvolution Layer (hereafter referred to as SD-Layer), a biomedically relevant CNN layer that can be prefixed to any CNN model and performs the following functions:

  (i) It transforms the input RGB images to the Optical Density (OD) space.

  (ii) Initialized with the stain basis vectors of one of the cell images, this layer learns the optimal stain basis vectors of the cell/tissue of interest for the class labels through backpropagation.

  (iii) It deconvolves the OD image with the learned stain color basis and provides tissue-specific stain absorption quantities that are used as input to the following CNN architecture.

To the best of our knowledge, this is the first work where deep learning based classification of medical images has been employed in the OD space using the Beer-Lambert law based stain deconvolution. We evaluate the performance of the proposed SD-Layer by prefixing it to two standard CNNs (AlexNet and T-CNN) on the challenging problem of differentiating malignant immature White Blood Cells (WBCs) from hematogones (benign immature WBCs) for cancer detection in Acute Lymphoblastic Leukemia (ALL). ALL detection has been carried out in the past using machine learning algorithms on hand-crafted features [11, 15].

However, the datasets considered in these studies are typically small, and hence the generalization error of these methods on real-world unseen data may be higher. In this paper, we apply the proposed deep learning based architecture to nearly 9000 immature WBCs (in total) for malignant versus normal WBC blast classification and obtain a 5-fold cross-validation accuracy of 93.2%. Note that the novelty of the paper lies in the proposed deep learning based architecture, which can be applied to other classification problems in medical imaging. The remainder of the paper is organized as follows. In Sect. 2, we review some of the relevant theory. In Sect. 3, we propose the SD-Layer formulation. In Sect. 4, we evaluate the performance of the proposed SD-Layer. Section 5 presents a short discussion, followed by conclusions in Sect. 6.

2 Background

This section presents a brief review of the theory required to understand the proposed work. Assume that a given stained slide is illuminated by light of intensity \(I_{o,c}\) in color channel c (red, green, or blue) and that \(I(p,c)\) denotes the intensity captured by the camera at pixel location p in channel c. The Beer-Lambert law is defined as:

$$\begin{aligned} I(p,c) = I_{o,c} e^{-\sum _{i=1}^{N}Q(p,i)S(i,c)}, \end{aligned}$$
(1)

where \(Q(p,i)\) is the quantity of the \(i^{\text {th}}\) stain constituent absorbed at pixel location p, \(S(i,c)\) is the characteristic absorbance of the \(i^{\text {th}}\) stain constituent in channel c, \(I_{o,c}\) can also be viewed as the maximum pixel intensity in channel c where no staining chemical is absorbed, and N is the number of stain constituents. From (1), it is noted that the observed pixel intensity \(I(p,c)\) varies non-linearly with the quantity of staining chemical Q in the RGB color space. However, the optical density \(O(p,c)\), defined as the negative log of (1), varies linearly with Q as below:

$$\begin{aligned} O(p,c) = -\log _{10}\frac{I(p,c)}{I_{o,c}}= \sum _{i=1}^{3}Q(p,i)S(i,c). \end{aligned}$$
(2)

In the matrix notation, this can be written as

$$\begin{aligned} \mathbf O = \mathbf Q \mathbf S , \end{aligned}$$
(3)

where \(\mathbf O \) and \(\mathbf Q \) are matrices of dimension \(MN \times 3\) (one row per pixel) and \(\mathbf S \) is the stain color matrix of dimension \(3 \times 3\). Each row of \(\mathbf S \) constitutes one stain basis vector, while each column of \(\mathbf Q \) holds the quantity of the corresponding stain basis vector present at the different pixel positions.
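As an illustration of (2) and (3), the RGB-to-OD conversion can be sketched in a few lines of NumPy. This is a minimal sketch rather than the implementation used in this work: the function name `rgb_to_od`, the common background intensity `i0 = 255` for all channels, and the intensity floor `eps` (used to avoid taking the log of zero) are assumptions made for the example.

```python
import numpy as np

def rgb_to_od(image, i0=255.0, eps=1.0):
    """Convert an H x W x 3 RGB image to an (H*W) x 3 optical-density matrix O.

    i0  : assumed background intensity I_{o,c}, taken equal for all channels.
    eps : assumed floor on pixel intensity, so that log10(0) never occurs.
    """
    img = np.asarray(image, dtype=np.float64)
    img = np.clip(img, eps, i0)        # keep the ratio I / I_o inside (0, 1]
    od = -np.log10(img / i0)           # Eq. (2), applied element-wise
    return od.reshape(-1, 3)           # one row per pixel, as in Eq. (3)

# A 1 x 2 image: an unstained (white) pixel and a stained pixel.
image = np.array([[[255.0, 255.0, 255.0], [25.5, 255.0, 2.55]]])
O = rgb_to_od(image)
```

The white pixel maps to zero optical density in every channel, while an intensity that is one tenth (one hundredth) of the background maps to an optical density of 1 (2).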

Generally, Beer-Lambert law based deconvolution proceeds as follows. A given input image in RGB space is first converted to OD space via (2) to obtain \(\mathbf O \). Next, \(\mathbf Q \) and \(\mathbf S \) are estimated through different matrix factorization strategies such as singular value decomposition (SVD) [9], non-negative matrix factorization (NMF) [8], and sparse NMF (S-NMF) [14]. In this paper, we use the widely popular SVD-based method to achieve stain deconvolution.

3 Proposed Stain Deconvolution Layer (SD-Layer)

In this section, we present the proposed SD-Layer that is built on the understanding of the staining based imaging of biological tissues.

Firstly, as discussed earlier, the absorption of stain quantities at different positions corresponds to the tissue properties. Since variation in stain absorption by tissues/cells leads to the formation of the corresponding medical image, it is more appropriate to design a classifier using stain quantities. Thus, we propose to train the CNN on the absorbed stain quantities \(\mathbf Q \), obtained via deconvolution of the OD space image \(\mathbf O \) with the stain basis vectors in \(\mathbf S \) as below:

$$\begin{aligned} \mathbf Q = \mathbf O \mathbf S ^ {-1}. \end{aligned}$$
(4)

Here, \(\mathbf S \) can be obtained via SVD of \(\mathbf O \) [9]. In practice, the stain matrix \(\mathbf S \) determined using SVD would vary from image to image due to several factors such as illumination variation, over/under staining, ageing of the staining chemicals, etc. [8]. Thus, full microscopic images are stain normalized prior to cell segmentation and classification. However, stain normalization carried out on the full slide (containing a large number of cells) may still leave stain variations at the individual cell level. Since classification is required at the cell level, training a CNN on \(\mathbf Q \), obtained using (4) with the stain matrix \(\mathbf S \) estimated on the full slide reference image, may not yield the desired classification accuracy. Thus, we would like to fine-tune the stain matrix \(\mathbf S \) at the cell level via the proposed SD-Layer.
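The deconvolution in (4) can be illustrated with a small NumPy sketch. The numerical values of the stain matrix `S` below are hypothetical and chosen only so that the matrix is invertible; they are not the Jenner-Giemsa stain vectors estimated in this work. The sketch builds \(\mathbf O \) from known quantities via (3) and then recovers them via (4).

```python
import numpy as np

# Hypothetical 3 x 3 stain matrix; each row is one stain basis vector in OD space.
S = np.array([[0.65, 0.70, 0.29],
              [0.07, 0.99, 0.11],
              [0.27, 0.57, 0.78]])

rng = np.random.default_rng(0)
Q_true = rng.uniform(0.0, 2.0, size=(5, 3))   # stain quantities at 5 pixels
O = Q_true @ S                                 # Eq. (3): O = Q S

# Eq. (4): Q = O S^{-1}. Solving S^T Q^T = O^T gives the same result
# without forming the inverse explicitly.
Q = np.linalg.solve(S.T, O.T).T
```

Solving the linear system is numerically preferable to an explicit inverse, although for a well-conditioned \(3 \times 3\) matrix the two agree to machine precision.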

Fig. 1. Interpreting matrix multiplication in (4) as convolution. Each element of \(\mathbf Q \) can be viewed as convolution of rows of \(\mathbf O \) with columns of \(\mathbf S ^{-1}\).

Fig. 2. Illustration of SD-Layer. \(\phi _i\)’s are learnable \(1\times 1\times 3\) sized convolutional filters.

In order to realize this, we interpret the matrix multiplication between \(\mathbf O \) and \(\mathbf S ^{-1}\) in (4) as convolution between the rows of \(\mathbf O \) and the columns of \(\mathbf S ^{-1}\), as shown in Fig. 1. Thus, each column of \(\mathbf S ^{-1}\) is equivalent to a convolution filter of dimension \(1\times 1\times 3\) and stride 1. This interpretation allows \(\mathbf S ^ {-1}\) to be learned at the cell level through backpropagation. It is important to note that accurate learning of \(\mathbf S ^ {-1}\) is heavily dependent on its initialization. We found that initializing the convolution filters with the columns of \(\mathbf S ^ {-1}\), determined through SVD on the reference image, led to good results. We experiment with other initializations in Sect. 4.3.
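The equivalence between the matrix product in (4) and a bank of three \(1\times 1\times 3\) filters can be checked numerically. The following NumPy sketch uses a random OD image and a random stand-in for \(\mathbf S ^{-1}\) (both hypothetical), and writes the convolution as explicit loops for clarity rather than speed.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 4, 5
od_image = rng.uniform(0.0, 1.5, size=(H, W, 3))   # hypothetical OD image
S_inv = rng.normal(size=(3, 3))                      # stand-in for S^{-1}

# Matrix form, Eq. (4): flatten pixels into rows and multiply by S^{-1}.
Q_matrix = od_image.reshape(-1, 3) @ S_inv

# Convolution form: filter k is column k of S^{-1}, applied at every
# pixel location with stride 1.
Q_conv = np.zeros((H, W, 3))
for k in range(3):
    phi_k = S_inv[:, k]                              # one 1 x 1 x 3 filter
    for y in range(H):
        for x in range(W):
            Q_conv[y, x, k] = np.dot(od_image[y, x, :], phi_k)
```

Both forms produce the same stain quantities, and the three filters together hold exactly the 9 entries of \(\mathbf S ^{-1}\).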

To sum up the discussion, the SD-Layer (shown in Fig. 2) performs two functions. Firstly, it transforms the input image from RGB to OD space using (2). Secondly, it determines the stain quantities present at each pixel using (4). The stain matrix, initialized through stain deconvolution of the reference image, is learned at the cell level through backpropagation. This introduces only 9 additional learnable parameters, which is insignificant compared to the total number of weights in the model. Thus, the gain in classification accuracy presented in the next section is due to the more biologically relevant input image representation rather than enhanced model capacity.

4 Experiments

In this section, we evaluate the performance of the SD-Layer prefixed to the front end of two CNN architectures: AlexNet [7] and Texture-CNN [1]. AlexNet is a widely studied standard CNN model. It consists of 5 convolution layers followed by 2 fully connected layers, followed by a softmax layer. For the input image dimensions of our dataset, AlexNet contains \(\approx \)146 million weights.

Texture-CNN (T-CNN) was recently proposed in [1] and was shown to achieve superior results on texture datasets. It modifies AlexNet by computing features from the 3-D activation map of the last convolutional layer instead of simply flattening it. These features act as order-less texture descriptors. So, for a 3-D map of dimension \(H \times W \times D\), computing the channel-wise mean results in D features, whereas flattening would give HWD features. With the reduced number of features fed to the subsequent fully connected layers, T-CNN contains \(\approx \)20 million learnable parameters.
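The difference between flattening and channel-wise averaging can be seen in a short NumPy sketch. The activation-map size used here (\(6 \times 6 \times 256\)) is a hypothetical example, not the size arising in our models.

```python
import numpy as np

H, W, D = 6, 6, 256                       # hypothetical activation-map size
activation = np.random.default_rng(0).normal(size=(H, W, D))

flattened = activation.reshape(-1)        # H*W*D features
pooled = activation.mean(axis=(0, 1))     # D order-less features

# Shuffling the spatial positions leaves the pooled descriptor unchanged,
# which is what makes it order-less.
perm = np.random.default_rng(1).permutation(H * W)
shuffled = activation.reshape(H * W, D)[perm].reshape(H, W, D)
pooled_shuffled = shuffled.mean(axis=(0, 1))
```

Here flattening yields 9216 features while the channel-wise mean yields 256, which is why the fully connected layers that follow need far fewer weights.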

Fig. 3. Example images from our dataset. (a) Nucleus of normal WBC immature cell (b) Nucleus of malignant WBC blast.

4.1 Dataset

Our dataset consists of microscopic image slides prepared from the bone marrow aspirate of normal and ALL subjects. These slides are stained with Jenner-Giemsa stain. A trained oncologist hand-labeled the normal and malignant WBC immature cells. All the images were normalized for stain variation using [9]. The nuclei of the labeled cells were then segmented. In total, our dataset consists of 8938 cell nuclei, 4469 nuclei of each class. We used random rotations through 180 degrees and vertical flipping in each epoch as two data augmentation strategies during the training phase. To account for the varying sizes of the segmented nuclei, we embed the nuclei in a 400\(\,\times \,\)400 black colored patch. Re-sizing of cell images provides poor results since texture is an important feature that gets altered with re-sizing. Example images from our dataset are shown in Fig. 3.
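The embedding step can be sketched as follows. The helper name `embed_in_patch` is hypothetical, and centering the nucleus in the patch is an assumption made for this example; the essential point is that the nucleus pixels are copied unchanged rather than re-sized.

```python
import numpy as np

def embed_in_patch(nucleus, size=400):
    """Place an h x w x 3 nucleus crop on a size x size black patch.

    Centering is an assumption of this sketch; pixel values are copied
    as-is, so the texture of the nucleus is not altered.
    """
    h, w = nucleus.shape[:2]
    if h > size or w > size:
        raise ValueError("nucleus is larger than the patch")
    patch = np.zeros((size, size, 3), dtype=nucleus.dtype)
    top, left = (size - h) // 2, (size - w) // 2
    patch[top:top + h, left:left + w] = nucleus
    return patch

# A hypothetical 120 x 90 segmented nucleus with constant intensity.
nucleus = np.full((120, 90, 3), 200, dtype=np.uint8)
patch = embed_in_patch(nucleus)
```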

Table 1. 5-fold cross-validation accuracy of AlexNet and T-CNN, with and without SD-Layer

Fig. 4. Plots of test accuracy v/s epochs for T-CNN model without and with SD-Layer, evaluated on a single fold.

4.2 Experiment 1: AlexNet vs T-CNN with and Without SD-Layer

To establish a baseline performance, we evaluate the performance of two models, AlexNet [7] and Texture-CNN [1], on our dataset. All models were trained using stochastic gradient descent (SGD) for 400 epochs. The initial learning rate for AlexNet was set to 0.01 and for AlexNet with SD-Layer to 0.001. For T-CNN, with and without SD-Layer, the initial learning rate was set to 0.01, which was reduced by a factor of 10 at epochs 300 and 350. The momentum and decay were set to 0.9 and \(10^{-6}\) for all models. The 5-fold cross-validation accuracy and F-score are shown in the first two rows of Table 1. Since texture is an important discriminative feature that AlexNet is unable to tap despite its larger model capacity, we note that T-CNN outperforms AlexNet by a large margin.
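The step schedule described above for T-CNN can be written as a small function. This is a sketch under the assumption that the reduction takes effect from the stated epoch onward; the function and parameter names are hypothetical.

```python
def tcnn_learning_rate(epoch, initial_lr=0.01, milestones=(300, 350), factor=10.0):
    """Step schedule: divide the rate by `factor` at each milestone epoch.

    Assumes the reduced rate applies from the milestone epoch onward.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return initial_lr / (factor ** drops)

# Learning rate at a few representative epochs of the 400-epoch run.
rates = {epoch: tcnn_learning_rate(epoch) for epoch in (0, 299, 300, 349, 350, 399)}
```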

Next, we prefix the SD-Layer at the front end of both models. Two settings are considered for the SD-Layer: (1) frozen, where the convolution filters are not allowed to train after initialization, and (2) trainable, where the filters are allowed to train to the best possible representation. In the first case, the CNN performs poorly, as is evident from Fig. 4a. This is because the stain vectors initialized using SVD on the full slide reference image cannot fully overcome cell-level stain variations. On the other hand, significantly higher test accuracy is obtained in the second, trainable setting, wherein the filters are allowed to fine-tune to cell-level stain normalization. We report 5-fold cross-validation results for AlexNet and T-CNN prefixed with the trainable SD-Layer in the bottom two rows of Table 1. We note a gain in accuracy, from 87.9% to 88.5% for AlexNet and from 92.48% to 93.2% for T-CNN, with the proposed SD-Layer.

4.3 Experiment 2: Results with Different Initializations of SD-Layer

As stated earlier, the initialization of the filters in the SD-Layer plays a significant role. In this sub-section, we evaluate the performance of T-CNN with the SD-Layer initialized using three different strategies: (1) the identity matrix, (2) a uniform random distribution over \([-0.05,0.05]\), and (3) the columns of \(\mathbf S ^{-1}\) determined using SVD on the reference image. Table 2 summarizes the maximum test accuracy achieved using each initialization on a single fold. The corresponding test accuracy v/s epochs plots are shown in Fig. 4b. From this table and figure, we note that the randomly initialized model fails to train. This is expected since, in this case, the input to the CNN is a random image. For the identity initialization, the model trains towards some intermediate representation starting from the OD image. However, this representation neither improves accuracy nor lends itself to an understandable interpretation. The best test accuracy is achieved through the SVD-based initialization.
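The three initialization strategies can be sketched as below. The function name `init_sd_filters` and the numerical values of the reference stain matrix `S_ref` are hypothetical and serve only to make the example runnable; in this sketch the reference matrix is simply inverted, whereas in our experiments it is estimated through SVD on the reference image.

```python
import numpy as np

def init_sd_filters(strategy, S_ref=None, seed=0):
    """Return a 3 x 3 matrix whose columns initialize the three 1x1x3 filters."""
    if strategy == "identity":
        return np.eye(3)
    if strategy == "uniform":
        rng = np.random.default_rng(seed)
        return rng.uniform(-0.05, 0.05, size=(3, 3))
    if strategy == "svd":
        # S_ref: stain matrix of the reference image (rows = stain vectors);
        # the filters are the columns of its inverse.
        return np.linalg.inv(S_ref)
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical reference stain matrix, for illustration only.
S_ref = np.array([[0.65, 0.70, 0.29],
                  [0.07, 0.99, 0.11],
                  [0.27, 0.57, 0.78]])
filters = {name: init_sd_filters(name, S_ref) for name in ("identity", "uniform", "svd")}
```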

Table 2. Classification accuracy of T-CNN+SD-Layer with different \(\mathbf S \) initializations on a single fold.

5 Discussion

We claim that the SD-Layer trains the stain color matrix to a representation better suited to classification. This can be verified by generating RGB images using (3), preserving only a single column of \(\mathbf S \) at a time and setting the other two to zero. This scheme is equivalent to generating images containing only a single stain. These visualizations, for \(\mathbf S \) obtained (1) through SVD and (2) after training the SD-Layer with T-CNN, are shown in Fig. 5. It is observed that the initial images (b)-(d) of the malignant blast are modified to (e)-(g), wherein (e) seems to capture shape, (f) seems to capture texture, and (g) carries no information. A similar observation holds for the normal cell shown in the bottom row of Fig. 5.

Fig. 5. Top row is an example image of a lymphoblast nucleus (malignant). (a) Original image, (b)-(d) Stain deconvolved images using stain vectors obtained via SVD, (e)-(g) Stain deconvolved images using stain vectors learned via T-CNN+SD-Layer over 400 epochs. Bottom row shows corresponding images for the normal WBC nucleus.

6 Conclusion

In this paper, we have proposed a deep CNN architecture relevant to biomedical microscopic imaging in which the staining of tissues/cells is involved. We have proposed the stain deconvolution layer (SD-Layer), which operates in the Optical Density space and offers a more fundamental view of the tissue and stain interactions to the following CNN architecture. The concept of initializing and tuning the stain matrix has been incorporated into the SD-Layer to deal with stain variations present at the cell level. With only 9 additional learnable parameters, we are able to achieve a gain in classification accuracy on two standard models, AlexNet and T-CNN, fitted with the SD-Layer. This suggests that the SD-Layer leads to a better representation of the input image.