1 Introduction

The prevalence of lung cancer in India is evident from the figures for 2018: a total of 67,795 new cases were registered across both sexes and 63,475 deaths related to lung cancer were reported [1]. Non-small cell lung cancer (NSCLC) accounts for more than 80% of the lung cancer spectrum. NSCLC can be further divided into two major subtypes, viz. adenocarcinoma (ADC) and squamous cell carcinoma (SCC).

It is imperative to differentiate NSCLC into its subtypes, because the prognosis of ADC has been found to be poor [2] compared with the other subtype and because disease management differs between subtypes. The outcome of SCC has been reported as the worst [3] of the two NSCLC subtypes. A tissue biopsy can reveal whether a nodule is malignant or benign and, if it is malignant, its subtype. Tissue samples can be extracted by an invasive surgical procedure, core needle biopsy or fine needle aspiration biopsy; needle biopsy is often favoured over surgical biopsy [4]. Classifying lung tissue as ADC or SCC is not straightforward, since the architecture that distinguishes them is complex and depends on the grade of the tumour [5].

Histopathological images are captured in RGB colour space. The literature reports that converting them to another colour space can improve results. Conversion from RGB to L*a*b* has been used for breast [6], lung [7] and prostate [8] images, and HSV (Hue, Saturation and Value) for head and neck cancer detection [9] and lung tissue type classification [7]. Other colour spaces have also been used, such as YCbCr for prostate [10] and CMY for breast cancer [11].

The method described in this work is based on texture characterization of lung tissue. Texture characterization of tissue can confirm the malignancy of a tumour [12]; for example, filter banks have been used for detecting breast cancer [11] and for grading prostate cancer [13]. The fractal dimension, along with the variance of the wavelet coefficients, has been used for prostate cancer grading [10], and wavelet coefficients have served as one of the features of a support vector machine (SVM) for classifying prostate lesions [14]. Wavelets analyse texture effectively because they provide a multiple-scale partition of the image spectrum [15]. We selected the Continuous Wavelet Transform (CWT) rather than discrete wavelets because the CWT gives a high degree of frequency selectivity [16] for a texture.

To the best of our knowledge, classification of ADC and SCC has been done only with Raman scattering microscopy, using domain-specific clinical diagnostic knowledge as feature sets [17]. Because classifying these two subtypes is very complex, coding domain-specific rules such as morphological features into an algorithm may not always produce a higher classification rate [17]. This paper focuses on classifying the two subtypes of NSCLC by automatically and quantitatively extracting features from each subtype. The two-dimensional Marr and isotropic Morlet wavelets are used to transform the image into the wavelet domain so as to characterize its texture. The wavelet coefficients are modelled with a Generalised Gaussian Distribution (GGD). An SVM is used as the classifier, and its features are selected by Recursive Feature Elimination (RFE) [18].

2 Methods and Materials

2.1 Data Set

The slides containing lung tissue were collected from an NABL (National Accreditation Board for Testing and Calibration Laboratories, India) accredited laboratory. The slides were prepared from core needle biopsy samples, which were stained with Hematoxylin and Eosin (H&E) and sectioned at a thickness of 5 μm. The images were captured with a Leica ICC50 HD digital microscope. Digitisation was done at 20x magnification, which gives a pixel resolution of 0.32 μm. The images are 2048 × 1536 pixels, and each pixel has 24 bits covering three channels, i.e. Red (R), Green (G) and Blue (B), of 8 bits each. The histopathological images of the lung contain various cellular tissue types, including but not limited to tumour, red blood cells, fibrosis, necrosis, carbon particles and normal cells.

A surgical pathologist delineates the malignant tumour portion on the digital image, and this Region of Interest (ROI) is stored as a binary mask. The NSCLC subtype is also recorded and stored as a label for the binary mask. The remaining portion of the image that contains tissue structures is labelled as Unspecified Lung Region (ULR). As the aim of the study is to classify the subclasses of NSCLC, the background, i.e. the portion of the image not containing any lung tissue structure, is also labelled as ULR. The ROI and the ULR are divided into overlapping blocks of 256 × 256 pixels; this size was found empirically to give the highest classification accuracy for both classifiers. The numbers of ROI and ULR blocks are kept almost equal to avoid bias in classification [19]. All images were inspected visually and only in-focus images were retained (out-of-focus images are sometimes produced through human error). In total, slides from 72 core needle biopsies are used, of which 34 are SCC and 38 are ADC. Table 1 summarises the data set.
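The block extraction described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the stride of 128 pixels, the 0.9 coverage threshold for ROI blocks, the rule that mixed blocks are discarded, and the helper names `tile_origins`, `roi_fraction` and `label_tiles` are all assumptions, since the text specifies only the block size and that blocks overlap.

```python
def tile_origins(height, width, block=256, stride=128):
    """Top-left (row, col) corners of overlapping block x block tiles."""
    if height < block or width < block:
        return []
    rows = list(range(0, height - block + 1, stride))
    cols = list(range(0, width - block + 1, stride))
    # Add a last row/column flush with the border so the edge is covered.
    if rows[-1] != height - block:
        rows.append(height - block)
    if cols[-1] != width - block:
        cols.append(width - block)
    return [(r, c) for r in rows for c in cols]


def roi_fraction(mask, row, col, block=256):
    """Fraction of the pixels of one tile that lie inside the binary ROI mask."""
    inside = sum(sum(mask[r][col:col + block]) for r in range(row, row + block))
    return inside / float(block * block)


def label_tiles(mask, block=256, stride=128, min_fraction=0.9):
    """Split tile corners into ROI tiles and ULR tiles.

    Assumed rule: a tile is ROI if at least min_fraction of it is inside the
    mask, ULR if it contains no ROI pixel, and is discarded otherwise.
    """
    roi, ulr = [], []
    for row, col in tile_origins(len(mask), len(mask[0]), block, stride):
        fraction = roi_fraction(mask, row, col, block)
        if fraction >= min_fraction:
            roi.append((row, col))
        elif fraction == 0.0:
            ulr.append((row, col))
    return roi, ulr
```

With the assumed stride, a 2048 × 1536 image yields 15 × 11 candidate block positions; a smaller stride increases the overlap and the number of blocks.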

Table 1. The spread of data for ADC and SCC of the lung.

2.2 Colour Space Transform and Normalization

In H&E staining the nucleus of the cell is stained blue, whereas the cytoplasm and extracellular material are stained in varying degrees of pink. The colour appearance of tissue under light microscopy varies with a wide range of factors, including the sensor type of the digital camera, H&E reagents from different manufacturers or batches, the concentration of the stain and the time for which the stains are applied. Even if most of the staining procedure is standardised, the colour still fades with time. Colour normalization is therefore a necessary pre-processing step for histopathological images. In this work we selected a method that nonlinearly maps a source image (an image that needs to be normalized) to a target image (an image that is used to train the system) [20]. This method gives stable representations, as it is not sensitive to the imaging conditions or the digital sensor used.

The captured digital images of lung histology are in RGB colour space. The three colour channels of RGB are not independent: a change in one channel changes the others. To avoid this pitfall and to resemble human colour perception more closely, the L*a*b* colour space is chosen. The L* channel corresponds to luminance, while the a* and b* channels represent the variation from red to green and from yellow to blue respectively; together a* and b* define the chrominance. Figure 1 shows the individual channels.

Fig. 1. A lung histopathological image captured at 20x magnification and stained with H&E stain is used to represent different channels in RGB and CIE L*a*b* colour spaces. (Color figure online)
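For illustration, the conversion of a single pixel from RGB to CIE L*a*b* can be sketched as below. The sketch assumes that the camera output is standard sRGB with a D65 reference white, which the text does not state; the function name `srgb_to_lab` is hypothetical.

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIE L*a*b* (D65 reference white)."""

    def linearise(c):
        # Undo the sRGB transfer curve (assumes sRGB-encoded input).
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    def f(t):
        # CIE L*a*b* companding function with its linear segment near zero.
        delta = 6.0 / 29.0
        if t > delta ** 3:
            return t ** (1.0 / 3.0)
        return t / (3.0 * delta ** 2) + 4.0 / 29.0

    rl, gl, bl = linearise(r), linearise(g), linearise(b)
    # Linear sRGB to CIE XYZ.
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl
    # Normalise by the D65 white point.
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    lightness = 116.0 * fy - 16.0
    a_star = 500.0 * (fx - fy)
    b_star = 200.0 * (fy - fz)
    return lightness, a_star, b_star
```

Under these assumptions a bluish, nucleus-like pixel gives a negative b* value and a pinkish, cytoplasm-like pixel gives a positive a* value, which is the separation exploited in the following sections.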

2.3 Continuous Wavelet Transform

The Marr wavelet, also known in some literature as the Mexican hat wavelet, is a real, rotation-invariant wavelet (see Fig. 2(a) for a representation).

Fig. 2. (a) A 2D Marr wavelet. (b) Isotropic Morlet wavelet. (c) A cropped lung histopathology image. (d) 3D plot of the coefficients obtained by convolving the image with a Marr wavelet \( \left( {a = 6} \right) \). (Color figure online)

A 2D Marr wavelet is defined as:

$$ \psi \left( {x,y} \right) = \left( {2 - x^{2} - y^{2} } \right)\exp \left[ { - \frac{1}{2}\left( {x^{2} + y^{2} } \right)} \right] $$
(1)

The Marr wavelet was selected for its good localization and for its ability to represent the nucleus effectively in the transform domain, as is evident from Fig. 2(c) and (d). Nuclear features are essential, and they are available only in the L* and b* colour channels: L* carries luminance, and with H&E the nuclei are stained blue/purple, which is darker than the pale pink/red of the cytoplasm.

The b* channel records changes from the bluish to the yellowish colour component, so nuclei are also visible in this channel (as is evident from Fig. 1). The Marr wavelet is therefore used for these two colour channels.

The isotropic Morlet wavelet is a complex-valued wavelet. A simple Morlet wavelet is a plane wave modulated by a Gaussian envelope; it is well localized in the frequency domain, with power only near its fundamental frequency. An isotropic wavelet is given by [21]:

$$ \psi \left( {x,y} \right) = \pi^{ - 1/4} \exp \left[ { - i\omega_{0} \left( {x + y} \right)} \right]\exp \left[ { - \left( {x^{2} + y^{2} } \right)/2} \right] $$
(2)

where \( \omega_{0} = \left( {0,\omega_{0} } \right) \) is a wave vector with \( \omega_{0} > 5.5 \). Phase information is important for texture characterisation; since the Morlet wavelet is complex-valued, both the phase and the magnitude can be computed.

This wavelet is used on the a* channel, which encodes non-nuclear information, since a* captures the change of colour from reddish to greenish and most non-nuclear material appears pale pink or pale red (see Fig. 1).
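As an illustration of how magnitude and phase follow from a complex wavelet, the sketch below evaluates Eq. (2) as printed and computes a single coefficient at the centre of a small patch. The value \( \omega_{0} = 6 \), the dilation by a scale `a` with a `1/a` normalisation, and the names `morlet` and `morlet_coefficient` are assumptions made for the example only.

```python
import cmath
import math


def morlet(x, y, omega0=6.0):
    """Complex wavelet of Eq. (2): a plane wave times a Gaussian envelope."""
    envelope = math.exp(-0.5 * (x * x + y * y))
    return math.pi ** -0.25 * cmath.exp(-1j * omega0 * (x + y)) * envelope


def morlet_coefficient(patch, a, omega0=6.0):
    """Wavelet coefficient at the centre of a square patch with odd side length."""
    half = len(patch) // 2
    total = 0j
    for v in range(-half, half + 1):
        for u in range(-half, half + 1):
            # Dilate by the scale a and use the complex conjugate of the wavelet.
            weight = morlet(u / a, v / a, omega0).conjugate() / a
            total += patch[v + half][u + half] * weight
    return total
```

Given a coefficient `c`, `cmath.polar(c)` returns the pair (magnitude, phase). A flat patch gives a magnitude close to zero, because the wavelet has almost no mean for \( \omega_{0} > 5.5 \), whereas a patch that oscillates at the wavelet frequency gives a large magnitude.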

Let \( {\text{I}}_{h} \) be an image. The 2D CWT of the image is given by:

$$ {\mathcal{C}}_{\psi } \left( {a,x^{\prime},y^{\prime}} \right) = \int\limits_{{R^{2} }} {{\text{\rm I}}_{h} \left( {x,y} \right)\psi_{{a,x^{\prime},y^{\prime}}} \left( {x,y} \right)dx dy} $$
(3)

where \( {\mathcal{C}}_{\psi } \left( {a,x^{\prime},y^{\prime}} \right) \) is the wavelet coefficient at location \( x^{\prime},y^{\prime} \) and scale \( a \) (\( a > 0 \)); for an M × N image, MN coefficients are extracted. \( \psi \) is the complex conjugate of the wavelets defined in Eqs. (1) and (2). The range of the scale parameter plays an important part in our analysis. Using many scales is computationally intensive, and not every scale represents the details of lung histopathology: small scales capture overly fine detail and large scales too little. We empirically found \( a = 3 \) to 6 suited to our work.
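A discrete version of Eq. (3) for the Marr wavelet can be sketched as follows. The sketch assumes the dilated wavelet \( \psi_{a,x^{\prime},y^{\prime}}(x,y) = a^{-1}\psi((x - x^{\prime})/a,(y - y^{\prime})/a) \), a kernel truncated at four times the scale, and zero padding outside the image; none of these choices is stated in the text, and the function names are hypothetical. Because the Marr wavelet is real and symmetric, correlation and convolution coincide here.

```python
import math


def marr(x, y):
    """2D Marr (Mexican hat) wavelet of Eq. (1)."""
    r2 = x * x + y * y
    return (2.0 - r2) * math.exp(-0.5 * r2)


def marr_kernel(a, half_width=None):
    """Sample (1/a) * marr(u/a, v/a) on an integer pixel grid."""
    if half_width is None:
        half_width = int(math.ceil(4 * a))  # assumed truncation radius
    grid = range(-half_width, half_width + 1)
    return [[marr(u / a, v / a) / a for u in grid] for v in grid]


def cwt2d(image, kernel):
    """One coefficient per pixel; pixels outside the image count as zero."""
    height, width = len(image), len(image[0])
    k = len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for yp in range(height):
        for xp in range(width):
            acc = 0.0
            for dv in range(-k, k + 1):
                yy = yp + dv
                if yy < 0 or yy >= height:
                    continue
                for du in range(-k, k + 1):
                    xx = xp + du
                    if 0 <= xx < width:
                        acc += image[yy][xx] * kernel[dv + k][du + k]
            out[yp][xp] = acc
    return out


def cwt_scales(image, scales=(3, 4, 5, 6)):
    """Coefficient maps for the scales a = 3 to 6 used in this work."""
    return {a: cwt2d(image, marr_kernel(float(a))) for a in scales}
```

The direct double loop is only practical for small images; for 256 × 256 blocks an FFT-based convolution would normally be preferred. A compact dark or bright blob, such as a nucleus, produces a strong response at its centre surrounded by a ring of opposite sign.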

2.4 Wavelet Coefficients Modelling and Similarity Measurement

The marginal distributions of the Marr and isotropic Morlet wavelet coefficients are long-tailed, bell-shaped and centred around zero (see Fig. 3). A GGD [22] is used to model such distributions.

Fig. 3. (a) Histogram of the Marr wavelet coefficients of one subband for a random 256 × 256 image in the L* colour channel, selected from the ADC and SCC data sets. (b) Histogram of the magnitude of the isotropic Morlet wavelet coefficients of one subband for random 256 × 256 images in the a* colour channel from ADC, SCC and ULR.

The GGD approximates the distribution of the wavelet transform coefficients with two parameters:

$$ p\left( {x;\alpha ,\beta } \right) = \frac{\beta }{{2\alpha\Gamma \left( {1/\beta } \right)}}e^{{ - \left( {\left| x \right|/\alpha } \right)^{\beta } }} $$
(4)

where \( \Gamma \) is the gamma function, \( \alpha \) is the scale parameter, which models the width of the peak of the probability density function (PDF), and \( \beta \) is the shape parameter, which is inversely proportional to the rate at which the peak decays. The distributions in Fig. 3(b) are obtained from the images with the isotropic Morlet wavelet, and their shapes could be modelled by a Laplace distribution. Since the Laplace distribution is a special case of the GGD (\( \beta = 1 \)), only the GGD is considered. The parameters \( \alpha \) and \( \beta \) are estimated by the maximum likelihood estimator (MLE) [22]. The scale and shape parameters are used as features for the classifier. Statistical measures [23] of the GGD, namely variance, kurtosis and entropy, are also used as input features for the classifier.
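One way to obtain the maximum likelihood estimates is sketched below. It is not necessarily the estimator of [22]: for a fixed \( \beta \) the ML scale has the closed form \( \alpha^{\beta} = (\beta/n)\sum |x_i|^{\beta} \), so the sketch maximises the resulting profile likelihood over \( \beta \) by a golden-section search. The search interval, the assumption that the profile likelihood has a single maximum in it, and all function names are assumptions; the variance, kurtosis and entropy expressions are the standard closed forms for the density of Eq. (4).

```python
import math


def ggd_pdf(x, alpha, beta):
    """Density of Eq. (4)."""
    norm = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return norm * math.exp(-((abs(x) / alpha) ** beta))


def _alpha_given_beta(samples, beta):
    """Closed-form ML scale for a fixed shape."""
    mean_power = sum(abs(x) ** beta for x in samples) / len(samples)
    return (beta * mean_power) ** (1.0 / beta)


def _profile_loglik(samples, beta):
    """Log-likelihood per sample with alpha replaced by its ML value."""
    alpha = _alpha_given_beta(samples, beta)
    return (math.log(beta) - math.log(2.0 * alpha)
            - math.lgamma(1.0 / beta) - 1.0 / beta)


def fit_ggd(samples, lo=0.2, hi=5.0, iterations=50):
    """Return (alpha, beta) from a golden-section search over beta."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - ratio * (hi - lo)
    d = lo + ratio * (hi - lo)
    fc, fd = _profile_loglik(samples, c), _profile_loglik(samples, d)
    for _ in range(iterations):
        if fc > fd:  # the maximum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - ratio * (hi - lo)
            fc = _profile_loglik(samples, c)
        else:        # the maximum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + ratio * (hi - lo)
            fd = _profile_loglik(samples, d)
    beta = 0.5 * (lo + hi)
    return _alpha_given_beta(samples, beta), beta


def ggd_statistics(alpha, beta):
    """Variance, kurtosis and differential entropy of the fitted GGD."""
    g1, g3, g5 = (math.gamma(k / beta) for k in (1.0, 3.0, 5.0))
    variance = alpha ** 2 * g3 / g1
    kurtosis = g5 * g1 / g3 ** 2
    entropy = 1.0 / beta - math.log(beta / (2.0 * alpha * g1))
    return variance, kurtosis, entropy
```

For \( \beta = 1 \) and \( \alpha = 1 \) these expressions reduce to the Laplace values (variance 2, kurtosis 6), and for \( \beta = 2 \) to the Gaussian case.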

A distance measure was used to quantify the difference between two empirical distributions. The distance was calculated with the Kullback-Leibler divergence (KL-D). Since KL-D is not symmetric and therefore cannot serve as a metric, a symmetric version [24] was implemented for this work. The Jensen-Shannon divergence (J-divergence) for multiple probability distributions [25], which is symmetric, is used to calculate the similarity of more than two distributions. The goodness of fit of the GGD model to the observed distribution is quantified with the symmetric KL-D and the \( \chi^{2} \) test.
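The two divergences can be sketched for histograms that share the same bins. The symmetrisation used here, the sum of the two directed divergences, is one common choice and is an assumption, since the text does not say which form [24] defines; the small constant `eps` that guards against empty bins is likewise an assumption. The multi-distribution Jensen-Shannon divergence is written in its entropy form, with equal weights by default.

```python
import math


def normalise(counts):
    """Turn histogram counts into a probability vector."""
    total = float(sum(counts))
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """Directed KL divergence; eps avoids division by an empty bin of q."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def symmetric_kl(p, q):
    """Assumed symmetrisation: KL(p||q) + KL(q||p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def js_divergence(distributions, weights=None):
    """Jensen-Shannon divergence of several distributions.

    Computed as H(sum_i w_i p_i) - sum_i w_i H(p_i).
    """
    n = len(distributions)
    if weights is None:
        weights = [1.0 / n] * n
    bins = len(distributions[0])
    mixture = [sum(w * d[i] for w, d in zip(weights, distributions))
               for i in range(bins)]
    return entropy(mixture) - sum(w * entropy(d)
                                  for w, d in zip(weights, distributions))
```

A value of zero means that all the distributions coincide; for two distributions with disjoint support and equal weights the Jensen-Shannon divergence reaches its maximum of ln 2.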

3 Results and Discussion

This section provides evidence that the proposed method is implemented correctly and that the features used for classification can indeed discriminate between different lung tissue textures. The results show the accuracy of our method in classifying ADC, SCC and ULR in the data set detailed in Table 1. The SVM classifier is used to classify the NSCLC subtype, with input feature vectors obtained from the various methods used in this study.

3.1 Goodness of Fit

The goodness of fit of the model to the empirical distribution of the coefficients is measured with the symmetric KL-divergence and the \( \chi^{2} \) test at the 5% significance level. The model fitted to the observed distribution was taken as the null hypothesis. Over all the data, 96.75% of the cases accepted the null hypothesis, i.e. their chi-square values were below the critical value of the \( \chi^{2} \) distribution, \( \chi_{{\left( {0.05} \right)}}^{2} = 3.841 \) with one degree of freedom. Table 2 reports the goodness of fit in terms of the symmetric KL-divergence and Pearson’s \( \chi^{2} \) values. The GGD model fits the isotropic Morlet distributions more accurately, since for the given data set these distributions are nearly symmetric about zero (see Fig. 3(b)).
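The acceptance rule can be illustrated with Pearson's statistic on binned counts. How the coefficient histograms were binned is not stated in the text, so the sketch simply takes observed and expected counts as given; the function names and the constant name are hypothetical, and 3.841 is the critical value quoted above.

```python
CRITICAL_CHI2 = 3.841  # 5% level, one degree of freedom, as quoted in the text


def pearson_chi2(observed, expected):
    """Pearson's statistic: sum over bins of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e
               for o, e in zip(observed, expected) if e > 0)


def accepts_null(observed, expected, critical=CRITICAL_CHI2):
    """True when the fitted model is not rejected at the chosen level."""
    return pearson_chi2(observed, expected) < critical
```

For example, observed counts of 48 and 52 against expected counts of 50 and 50 give a statistic of 0.16, so the model is accepted, whereas 35 and 65 give 9.0 and the model is rejected.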

Table 2. Symmetric KL-D and Pearson’s \( \chi^{2} \) values for the distributions created by the coefficients of Marr wavelet and isotropic Morlet wavelet fitted with a GGD model, averaged over the data set.

3.2 Similarity Measurement of Empirical Distributions of Various Classes

The similarity of the distributions within a class (ADC, SCC or ULR) is high: the J-divergence (J-D) over all the distributions of a class, for a given wavelet function and scale parameter, is low, as shown in Table 3.

Table 3. Intraclass similarity calculated from J-divergence for multiple distributions for Marr and isotropic Morlet wavelet with different colour channels and scales.

Because intraclass similarity is high, a distribution need not be compared with every distribution of the other class to calculate interclass similarity. Instead, a distribution from one class is compared by KL-D with \( n^{\prime} \) distributions from the other class, where \( n^{\prime} \ll N \) and N is the total number of distributions in that class, and the average of these KL-D values is taken as the KL-D value of the distribution with respect to the other class. Table 4 shows the interclass KL-D variations; since such variations exist, these values were used as features for the classifier.

Table 4. Similarity between various classes represented as KL-D value.

3.3 Classification Results

The features for the SVM are selected through Recursive Feature Elimination (RFE). There are eight distinct features: the shape and scale parameters from the GGD modelling, three symmetric KL-D values (one for each of the three classes) and three statistical measures (variance, kurtosis and entropy). These eight values are extracted for four scales (a = 3, 4, 5, 6) and each of the three colour channels, giving 8 × 4 × 3 = 96 features in total. The data set defined in Sect. 2.1 is used, and the system is validated by ten-fold cross-validation. The effect of different feature combinations on classifying ULR versus malignant tissue (ADC and SCC) is shown in Fig. 4(a).
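The layout of the feature vector implied by this description can be sketched as follows. The ordering of the entries and every identifier below are hypothetical; only the counts (eight measures, four scales, three colour channels) come from the text.

```python
from itertools import product

CHANNELS = ("L*", "a*", "b*")
SCALES = (3, 4, 5, 6)
MEASURES = ("ggd_scale", "ggd_shape",            # GGD parameters
            "kld_adc", "kld_scc", "kld_ulr",     # symmetric KL-D to each class
            "variance", "kurtosis", "entropy")   # statistical measures


def feature_names():
    """Names of all features, ordered by channel, then scale, then measure."""
    return ["%s|a=%d|%s" % (channel, scale, measure)
            for channel, scale, measure in product(CHANNELS, SCALES, MEASURES)]


def feature_vector(measurements):
    """Flatten a dict keyed by (channel, scale, measure) into one vector."""
    return [measurements[key] for key in product(CHANNELS, SCALES, MEASURES)]
```

RFE then repeatedly trains the classifier and discards the lowest-ranked features until a subset of the desired size remains.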

Fig. 4. (a) Accuracy for different features calculated by SVM with RFE, a total of 96 features are used to classify ULR from malignant (ADC and SCC) tissue. (b) The accuracy level of SVM for classifying ADC vs SCC, using the RFE technique for feature selection.

To classify the ULR, a subset of 41 of the 96 features is used. With these 41 features an accuracy of 96.2% for ULR and 95.1% for malignant tissue is achieved (Fig. 5(a)). Some textures in the ULR may resemble malignant texture, since a ULR consists of many different tissue elements: some show mild necrosis, while others are normal tissue that was damaged during sample preparation or is undergoing mitosis.

Fig. 5. (a) Accuracy of classifying ULR and malignant tissue using SVM with 41 features. (b) Classification accuracy of SVM in identifying ADC and SCC using 62 features.

The classification accuracy of the SVM for ADC versus SCC is highest with 62 of the 96 features; Fig. 4(b) shows how the accuracy varies across feature sets. An accuracy of 77.2% is achieved for ADC and 75.8% for SCC (see Fig. 5(b)). These results are satisfactory, since even pathologists can disagree on a conclusive diagnosis of these two subtypes of NSCLC, because the organisation of the tissue structures is complex and varies with the stage of the cancer. Molecular analysis is often carried out to obtain a definitive diagnosis.

The proposed method is also compared with the Gray Level Co-occurrence Matrix (GLCM), computed at four angles \( \left( {0^{ \circ } ,45^{ \circ } ,90^{ \circ } \;and\; 135^{ \circ } } \right) \) with four properties, viz. energy, contrast, correlation and homogeneity. With GLCM the accuracies for ULR, malignant tissue, ADC and SCC are 78.8%, 81.1%, 65.9% and 67.8% respectively.
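For reference, the GLCM baseline can be sketched as below. Several details are assumptions because the text does not give them: a pixel distance of 1, a non-symmetric matrix, an image already quantised to integer grey levels, energy taken as the sum of squared entries (some libraries return its square root), and a correlation of 1 for a constant image. All names are hypothetical.

```python
import math

# (row offset, column offset) for each angle at an assumed distance of 1.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def glcm(image, levels, d_row, d_col):
    """Normalised co-occurrence matrix of grey-level pairs at one offset."""
    counts = [[0.0] * levels for _ in range(levels)]
    height, width = len(image), len(image[0])
    total = 0
    for r in range(height):
        for c in range(width):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < height and 0 <= c2 < width:
                counts[image[r][c]][image[r2][c2]] += 1.0
                total += 1
    return [[v / total for v in row] for row in counts]


def glcm_properties(p):
    """Energy, contrast, correlation and homogeneity of a normalised GLCM."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    energy = sum(v * v for _, _, v in cells)
    contrast = sum(v * (i - j) ** 2 for i, j, v in cells)
    homogeneity = sum(v / (1.0 + (i - j) ** 2) for i, j, v in cells)
    mu_i = sum(i * v for i, _, v in cells)
    mu_j = sum(j * v for _, j, v in cells)
    var_i = sum(v * (i - mu_i) ** 2 for i, _, v in cells)
    var_j = sum(v * (j - mu_j) ** 2 for _, j, v in cells)
    if var_i == 0.0 or var_j == 0.0:
        correlation = 1.0  # convention for a constant image
    else:
        covariance = sum(v * (i - mu_i) * (j - mu_j) for i, j, v in cells)
        correlation = covariance / math.sqrt(var_i * var_j)
    return {"energy": energy, "contrast": contrast,
            "correlation": correlation, "homogeneity": homogeneity}


def glcm_features(image, levels):
    """Four properties at each of the four angles: 16 values in total."""
    features = []
    for angle in (0, 45, 90, 135):
        d_row, d_col = OFFSETS[angle]
        props = glcm_properties(glcm(image, levels, d_row, d_col))
        features.extend(props[name] for name in
                        ("energy", "contrast", "correlation", "homogeneity"))
    return features
```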

The shape and scale parameters of the GGD have a pronounced effect on classification accuracy, and the features from the L* and b* channels play an important role in differentiating ULR from ADC and SCC in the proposed method.

4 Conclusion

In this work, we proposed a method to classify the two important subtypes of NSCLC, i.e. adenocarcinoma and squamous cell carcinoma. The features used to classify ADC and SCC are not clinical diagnostic features but features extracted automatically from an image by a wavelet function. Since colour plays a role in a pathologist's reading of histological slides, we used the colour information provided by the H&E stain. The digitized colour images were transformed into the L*a*b* colour space, which helped to separate the nucleus of a cell from its surroundings. The results are very promising, as these subtypes of lung cancer are characterized without any prior knowledge of their morphology being coded into the system.