Fourier transform infrared spectroscopic imaging of colon tissues: evaluating the significance of amide I and C–H stretching bands in diagnostic applications with machine learning
Fourier transform infrared (FTIR) spectroscopic imaging of colon biopsy tissues in transmission was combined with machine learning to classify different stages of colon malignancy. Two approaches, one optical and one computational, were applied to eliminate the scattering background in the measurements, and their results were compared with those of the machine learning model without scattering correction. Several data processing pathways were implemented in order to obtain a prediction model of high accuracy. This study demonstrates, for the first time, that the C–H stretching and amide I bands are of little to no significance in the classification of colon malignancy, based on the Gini importance values from random forest (RF). The best prediction outcome is found when supervised RF classification is carried out in the fingerprint region of the spectral data between 1500 and 1000 cm−1 (excluding the contribution of the amide I and II bands). An overall prediction accuracy higher than 90% is achieved with RF. The results also show that dysplastic and hyperplastic tissues are well distinguished, which indicates that the important differences between hyperplastic and dysplastic colon tissues lie within the fingerprint region of FTIR spectra. In this study, computational correction performed better than optical correction, but the findings show that the disease states of colon biopsies can be distinguished effectively without elimination of the Mie scattering effect.
Keywords: Fourier transform infrared spectroscopic imaging; Colon polyps and cancer; Correcting lens approach; Machine learning; K-means clustering; Random forest supervised classification
Colon cancer is a disease of the large intestine in which abnormal cells divide uncontrollably. Most cases of colon cancer begin as a small adenomatous polyp on the inner lining of the colon. In the UK, colon cancer is the fourth most common cancer with 16,000 deaths every year, making it the second most common cause of cancer death in 2016. Early detection of colon cancer can help reduce mortality and morbidity. The current diagnostic approach for this disease includes biopsy collection during colonoscopy or surgery, followed by histopathology. In recent years, FTIR spectroscopy has been shown to be a promising technique for enhancing clinical diagnosis in a label-free way by investigating the chemical content of the biopsy samples [5, 19, 25, 27].
Although FTIR spectroscopy has potential as a clinical prognostic tool, there are several challenges associated with it, most notably the reflection and scattering contribution (‘dispersion artefact’) that arises from the spatial inhomogeneity of the sample. In fact, the dispersion artefact is found to be largely dominated by resonant Mie scattering in pure transmission experiments, as opposed to measurements in transflection mode, where the reflection artefact becomes significant. The scattering contribution can lead to spectral distortion, for example, a decrease in the absorbance of the amide I band, and to a greater extent, manifest itself as a derivative-like baseline at the high wavenumber side of the amide I band. It can also result in significant frequency shifting of spectral bands that are used extensively to classify biological specimens. Correct interpretation of FTIR spectra therefore requires correction of the dispersion artefact. The origin of the dispersion artefact and its subsequent effects are detailed in published articles [6, 7].
The resonant Mie scattering (RMieS) algorithm developed by Bassan et al. has proved successful at correcting the ‘dispersion artefact’. This algorithm is used in this manuscript; however, the correction algorithm is computationally intensive and time consuming. In addition, physical alteration of the imaging set-up for measurements in transmission, by employing an additional lens on top of the window to form a pseudo-hemisphere, has been shown to be effective at producing aberration-free and high-quality spectra from tissues and from cells [14, 15, 20, 36]. The other challenge in FTIR spectroscopic measurements is the presence of spectral bands of water vapour in the sample spectra, which hampers the analysis of protein secondary structure in the amide I region (1700–1600 cm−1) [1, 2, 18]. This water vapour interference can be minimised by computational subtraction of the pure water vapour spectrum from the sample spectrum, with the algorithm described by Brunn et al.
FTIR spectra contain a wealth of information about the sample. As such, in the analysis of spectra of biological systems, multivariate statistics and machine learning algorithms are frequently applied to extract the important information. The two main strategies in chemometrics used to analyse FTIR spectral data are unsupervised learning and supervised learning. The variety of the methods is detailed by Goodacre. The aim of this paper is to utilise a well-established machine learning approach, random forest in this case, to examine the spectral ranges that could potentially contain the most important spectral biomarkers that distinguish between colon specimens of various degrees of malignancy.
Materials and methods
The samples are formalin-fixed paraffin-embedded (FFPE) colon biopsies at different disease stages of malignancy (hyperplasia, dysplasia, and cancer), provided by St. Mary’s Hospital (Imperial College London, UK), following standard clinical protocols. The samples were microtomed at 3 μm thickness from a specimen block and mounted onto a 2-mm-thick CaF2 window (Crystran Ltd., UK) for FTIR analysis. The adjacent section was mounted onto a glass slide, stained with haematoxylin and eosin (H&E) and assessed by a trained pathologist. The sections deposited on CaF2 windows were deparaffinised as per the procedure described by Song et al. [34, 35] and stored with a desiccant before use.
FTIR spectroscopic imaging measurements
The experiments were carried out in transmission mode at × 15 magnification (NA = 0.4), with a Hyperion 3000 FTIR microscope coupled to a Tensor 27 FTIR spectrometer (Bruker Corp.). A liquid nitrogen cooled 64 × 64-pixel focal plane array (FPA), which has a field of view of 170 × 170 μm2, was used for simultaneous acquisition of the FTIR spectral dataset. As imaging was combined with mapping, 3 × 3 individual images were stitched into one, resulting in a total measured area of 510 × 510 μm2 for each tissue. The spectral images from 12 sample areas were acquired. A new background was recorded before measuring each individual image. All measurements were taken in the mid-IR range from 3900 to 900 cm−1, at 4 cm−1 spectral resolution and with 521 co-added scans. An additional CaF2 lens, which has been shown to significantly reduce Mie scattering, was also employed for imaging of the exact same tissue areas. The design and set-up of the lens for combining imaging with mapping were described in detail by Kimber et al. Briefly, the added lens is kept in focus with an external holder whilst the stage is shifted in the x- and y-directions so that different areas can be measured.
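The imaging geometry quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function and variable names are hypothetical, and the derived pixel size is the nominal projected size (field of view divided by detector elements), not a statement about the achieved spatial resolution.

```python
# Values taken from the text: 64 x 64 FPA, 170 um field of view, 3 x 3 mosaic.
FPA_PIXELS = 64            # detector elements per side
FIELD_OF_VIEW_UM = 170.0   # field of view per side of one image, in micrometres
TILES_PER_SIDE = 3         # 3 x 3 images stitched into one mosaic


def mosaic_geometry(fpa_pixels, fov_um, tiles_per_side):
    """Return nominal pixel size, mosaic side length and spectra per mosaic."""
    pixels_per_side = fpa_pixels * tiles_per_side
    return {
        "pixel_um": fov_um / fpa_pixels,          # nominal projected pixel size
        "side_um": fov_um * tiles_per_side,       # side length of stitched area
        "spectra_per_mosaic": pixels_per_side ** 2,
    }


geometry = mosaic_geometry(FPA_PIXELS, FIELD_OF_VIEW_UM, TILES_PER_SIDE)
print(geometry)
```

Under these numbers each stitched area holds far more spectra than the 500 per class that are later sampled for the models, which is why random sub-sampling is needed.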
The additional lens implemented to correct for the chromatic aberration in infrared measurement is referred to as ‘correcting lens’ from this point onwards in this paper. To the authors’ knowledge, the assessment of the performance of the correcting lens has not been closely examined with advanced machine learning approaches.
Data processing and chemometric analytical procedure
The spectral data were processed with MATLAB R2018b (The MathWorks, Inc.). The spectral data in the ranges of 1800–1000 cm−1 and 3000–2800 cm−1 were used for further analysis. The region between 2800 and 1800 cm−1 contains no important spectral information whilst the region > 3000 cm−1 is sensitive to water content within the tissues. Baseline correction and water vapour subtraction were not applied to the data. Second derivatives of the obtained spectra were calculated with Savitzky-Golay 9-point smoothing and were then vector normalised. The spectra, second derivatives, and normalised second derivative data were then separately subjected to unsupervised machine learning, in this instance, the K-means clustering algorithm (tested for 2 to 6 clusters, each with 5 replicates and unlimited iterations until the solution converges to a local minimum). A total of 2000 individual sample spectra were recorded and used for machine learning. Only eight chemical images of the different tissue sections are shown here for demonstration purposes. Training and test models were created, each made up of 500 random spectra sampled from each cluster without replacement, for tissue at the same disease stage. In other words, each model consists of 2000 spectra (500 for healthy (H), 500 for hyperplastic polyps (HY), 500 for dysplastic polyps (D), and 500 for cancer sections (C)), with the stages identified by H&E staining. The models were built from different individuals, ensuring that the inter-patient variability is included in the study. Employing machine learning to study imaging data has been demonstrated in previous works [10, 17].
Figure 1 shows all the pathways that are tested with the unsupervised and supervised approaches via training and re-testing of the spectral data based on Gini importance (see figure in section ‘Gini index obtained from RF classifier’). The features in this study are not independent of one another; thus, important spectral ranges are discussed instead of single features. The paper starts off by describing Mie scattering and the effect of the correcting lens on the FTIR spectra of colon tissue, followed by unsupervised learning on the second derivative spectra without RMieS correction. With the supervised classifier, the performance of the models with and without RMieS correction is compared and discussed to establish a proof of concept that RMieS correction might not be necessary in this case study. On top of that, this paper serves to demonstrate, via feature selection in machine learning, that the amide I band plays very little role in the differences between specimens. The results are detailed as follows.
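The pre-processing step (9-point Savitzky-Golay second derivative followed by vector normalisation) can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB code used in the study. It assumes a quadratic/cubic polynomial fit, which the text does not specify, and it simply drops the four points at each edge of the spectrum; the function names are hypothetical.

```python
import math

# Standard Savitzky-Golay convolution weights for a 9-point window,
# second derivative, quadratic/cubic fit.
# ASSUMPTION: the polynomial order is not stated in the text.
SG9_SECOND_DERIVATIVE = [28, 7, -8, -17, -20, -17, -8, 7, 28]
SG9_NORMALISER = 462.0


def second_derivative(absorbance, step=1.0):
    """Smoothed second derivative of an evenly spaced spectrum.

    `step` is the wavenumber spacing between data points. The four points at
    each edge are dropped because the window does not fit there.
    """
    half = len(SG9_SECOND_DERIVATIVE) // 2
    result = []
    for i in range(half, len(absorbance) - half):
        window = absorbance[i - half:i + half + 1]
        acc = sum(w * a for w, a in zip(SG9_SECOND_DERIVATIVE, window))
        result.append(acc / (SG9_NORMALISER * step ** 2))
    return result


def vector_normalise(values):
    """Scale a vector to unit Euclidean length (returned unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0.0:
        return list(values)
    return [v / norm for v in values]


# Synthetic check: a parabola 0.5 * x^2 has a second derivative of 1 everywhere.
parabola = [0.5 * x * x for x in range(-10, 11)]
d2 = second_derivative(parabola)
d2_normalised = vector_normalise(d2)
```

Vector normalisation after differentiation removes overall intensity differences (for example from thickness variation) so that clustering responds to band shape rather than to total absorbance.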
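For readers unfamiliar with Gini importance: in a random forest it is commonly computed as the decrease in Gini impurity produced by splits on a given feature, accumulated over the trees. The sketch below shows only the impurity decrease for a single threshold split on one feature; it is not the study's implementation, the function names are hypothetical, and the data are synthetic.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(values, labels, threshold):
    """Impurity decrease when samples are split at `values <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted_children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted_children


# Synthetic example: one feature that separates two classes, one that does not.
labels = ["H", "H", "C", "C"]
informative = gini_decrease([0.1, 0.2, 0.8, 0.9], labels, threshold=0.5)
uninformative = gini_decrease([0.1, 0.8, 0.2, 0.9], labels, threshold=0.5)
print(informative, uninformative)
```

Because neighbouring wavenumbers are strongly correlated, the importance of a band can be spread over many adjacent features, which is one reason to discuss spectral ranges rather than individual wavenumbers.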
Results and discussion
Physical and computational correction of Mie scattering effect
As can be seen in the curves of pixel count (Fig. 4), the integrated absorbance of the nucleic acid band at 1271–1184 cm−1 is lowest for healthy colon biopsy when a 95% confidence interval is taken, and likewise for the amide I band. The opposite is observed for the lipid spectral band within 2944–2880 cm−1, for which the lowest integrated absorbance is found in cancer tissues. This is in agreement with the high nuclear-to-cytoplasmic ratio observed in colon cancer tissues as well as the loss of normal glandular architecture. The inner lining or mucosa of healthy colon is lined with columnar epithelium and a large number of goblet cells, where numerous secretory vesicles containing mucus (glycoprotein) are present, in addition to the secreted mucin in the intestinal epithelial surface layer. Mucus is a complex biochemical layer made up of carbohydrates, antimicrobial peptides, immunoglobulins, electrolytes, and lipids. In diseased tissue, however, the goblet cells are not differentiated well enough to perform their function; instead, they become highly proliferating cells with a high metabolic rate, which might progress to cancer (an aggregation of undifferentiated cells).
The difference between different stages of cancer is also highlighted in the mean spectra obtained after taking their second derivatives. The evaluation of the variation is not very straightforward, hence the need for machine learning to perform the task of classification of colon disease. The interpretation of the second derivative spectra is not included in the main discussion as machine learning only requires the input of ‘features’, which are the absorbances at various wavenumbers, and ‘labels’, the stages of disease. The second derivative spectral bands and their corresponding band assignments are, nonetheless, provided in ESM Fig. S4 and Table S2 to demonstrate the potential variation that might be picked up by the machine learning classification model. The most significant differences lie in the peak shift and the intensity of the trough of the second derivative data.
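The integrated absorbance of a band, as used for the chemical images, can be approximated with the trapezoidal rule. The sketch below is a simplified illustration: it integrates the raw absorbance between the band limits without a local baseline (the text states that no baseline correction was applied, but whether the integration itself used a baseline is not stated), and the function name and synthetic spectrum are hypothetical.

```python
# Band limits taken from the text (cm^-1).
NUCLEIC_ACID_BAND = (1184, 1271)
LIPID_BAND = (2880, 2944)


def integrated_absorbance(wavenumbers, absorbance, low, high):
    """Trapezoidal area under the spectrum for low <= wavenumber <= high.

    ASSUMPTION: no baseline is subtracted before integration.
    """
    points = sorted(
        (w, a) for w, a in zip(wavenumbers, absorbance) if low <= w <= high
    )
    area = 0.0
    for (w0, a0), (w1, a1) in zip(points, points[1:]):
        area += 0.5 * (a0 + a1) * (w1 - w0)
    return area


# Synthetic spectrum: constant absorbance of 0.5 on a 2 cm^-1 grid.
wavenumbers = list(range(1000, 1401, 2))
absorbance = [0.5] * len(wavenumbers)
band_area = integrated_absorbance(wavenumbers, absorbance, *NUCLEIC_ACID_BAND)
print(band_area)
```

Applying the same function to every pixel of an image and mapping the result gives a chemical image of that band.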
Gini index obtained from RF classifier
From the mean second derivative spectra obtained by averaging all the pixels within the same cluster, the tissue regions are effectively classified into low and high lipid absorbance regions (cluster 1 and cluster 2 respectively), which are fed into the supervised learning algorithm separately. This reinforces the findings by Song et al. that spectral bands of lipid are still useful biomarkers for intra-tissue classification, despite the lower Gini importance index. The lipid spectral region contains a wealth of information. Bassan et al. have also demonstrated that the high wavenumber spectral range (O−H, N−H, and C−H stretches occurring at ca. 3800–2500 cm−1) is useful for the generation of false colour classification images of breast tissue microarrays on glass substrates. These bands are free from interference by the spectral bands of water vapour and by Mie scattering, with the only possible variation coming from the deparaffinisation process on the formalin-fixed tissues. This variation is controlled and minimised by strictly adhering to the deparaffinisation protocol.
It is possible that the cancerous tissues are more susceptible to change during the solvent-based removal of material that is required prior to the paraffin embedding process. The FFPE process requires fixation of fresh tissue in formalin for 6 to 24 h, followed by multiple washes in ethanol/water with increasing ethanol concentration until the water has been removed. Xylene, or possibly isopropanol, is then used to remove the ethanol, taking with it much of the fat within the natural tissue. Finally, the tissue is soaked in molten paraffin, usually at 60 °C. Precautions were taken to conduct the de-waxing process in a closely controlled manner, so that each of the three samples was treated in the same way; however, the manner in which the FFPE was first conducted is out of our control, including the amount of fats and other materials that may have been removed in that process. That said, surprisingly, similar observations, namely that this wavenumber region (3000–2800 cm−1) differs between normal and cancer samples, were made on prostate cancer tissues that were supplied by different pathologists but de-waxed with the same procedure. Thus, the explanation that tissues of different malignancy retain various amounts of fats after deparaffinisation essentially still offers a different kind of ‘key biomarker’ for cancer differentiation in FTIR imaging studies.
Supervised machine learning
The random forest classifier was shown to be an efficient supervised machine learning technique for the classification of spectral data in previous studies [3, 17, 33]. In this study, second derivative data (for measurements with and without the correcting lens) from various spectral ranges were used to train the algorithm: model 1, 1800–1000 cm−1 and 3000–2800 cm−1 (all ranges); model 2, 1800–1000 cm−1 only (fingerprint region with amide bands); model 3, 1500–1000 cm−1 only (fingerprint region); and model 4, 3000–2800 cm−1 only (lipid region). To clarify, re-training and re-testing of the RF models is still required after Gini selection to objectively assess the prediction performance; hence, the results are organised in the way shown in the workflow (Fig. 1) in this manuscript.
Figure 7 shows that the overall prediction accuracy is higher for data in cluster 2, the region of higher lipid absorbance, than in cluster 1. A comparison of the performance of measurements with and without the correcting lens can be made by analysing cluster 2, which reveals that, apart from model 1 and model 4, the measurements with the correcting lens, despite its ability to minimise Mie scattering at the edges of the tissues, generally underperform compared with measurements without the added lens. The lowest accuracy of cluster 2 prediction is obtained from model 3 with the added lens. This is because, whilst the added lens approach removes the scattering effect and thus improves the quality of the amide I band, the spectra collected in the range of 1100–1000 cm−1 suffer from enhanced noise, which is not an issue with the computational approach. This happens because the additional stacking of the lens on top of the CaF2 substrate (from the way the correcting lens is set up) reduces the throughput of light. Due to the lower photon counts that pass through the sample and the fact that CaF2 has a cut-off at ~ 900 cm−1 in transmission, the spectral quality in the low wavenumber region deteriorates significantly compared with the set-up without the correcting lens.
Model 2 (with lens) gives a slightly better performance when the amide bands are factored into consideration, as the added lens is shown to improve the absorbance of the amide I band. Model 4, which considers the data exclusively from the lipid region, is unaffected by the noise introduced by the extra lens configuration, and model 1, which takes into consideration all the spectral regions, shows similar performance with and without the additional lens, for the reasons discussed above. Instead of CaF2, a pseudo-hemispherical ZnS lens with an infrared cut-off at ~ 700 cm−1 was suggested to improve the spectral quality. However, Mie scattering correction does not play a significant role in optimising the performance of the supervised learning, as reinforced by the finding that the highest prediction accuracy of 92.7% can be achieved with model 3 (fingerprint region). In other words, the second derivative spectral data within 1500–1000 cm−1 from the cluster of high lipid absorbance alone are sufficient to achieve effective discrimination of all the different grades of colon cancer, as the fingerprint region is least affected by Mie scattering. The interference of the water vapour spectra within this region is also minimal (as shown in ESM Fig. S5). Therefore, removal of the Mie scattering effect is not necessary, as the amide spectral range (1700–1500 cm−1) does not need to be included in data analysis at all, as demonstrated here.
From these results, it is apparent that healthy and malignant tissues are easily distinguished from other stages of the disease, whereas dysplastic tissue is often misclassified as hyperplasia if the correct spectral range is not used. The discrimination of hyperplastic and dysplastic tissues relies heavily on differences within 1500–1000 cm−1, possibly arising from the change in concentration of nucleic acids and carbohydrates in the tissues, and these tissues can be classified at a high accuracy when only the fingerprint region is used. Hyperplasia and dysplasia exhibit very similar spectral patterns above 1500 cm−1; hence, they are best differentiated from each other when the amide and lipid bands, which have higher absorbance and would dominate over the nucleic acid bands when no vector normalisation is carried out, are eliminated from the training dataset (model 3). The results from supervised learning give a significant insight into assessing the spectral biomarkers of colon cancer.
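The four models differ only in which wavenumbers are passed to the classifier, so the selection step can be sketched as a simple mask over the wavenumber axis. The code below is an illustrative sketch: the wavenumber grid (4 cm−1 spacing from 900 to 3900 cm−1) and the function name are assumptions, and the random forest training itself is not shown.

```python
# Spectral ranges (cm^-1) for each model, as listed in the text.
MODEL_RANGES = {
    "model 1": [(1000, 1800), (2800, 3000)],  # all ranges
    "model 2": [(1000, 1800)],                # fingerprint region with amide bands
    "model 3": [(1000, 1500)],                # fingerprint region
    "model 4": [(2800, 3000)],                # lipid (C-H stretching) region
}


def select_features(wavenumbers, spectrum, ranges):
    """Keep only the data points whose wavenumber lies in one of the ranges."""
    return [
        value
        for w, value in zip(wavenumbers, spectrum)
        if any(low <= w <= high for low, high in ranges)
    ]


# ASSUMPTION: a hypothetical axis with one data point every 4 cm^-1.
wavenumbers = list(range(900, 3901, 4))
spectrum = [0.0] * len(wavenumbers)  # placeholder second derivative values

feature_counts = {
    name: len(select_features(wavenumbers, spectrum, ranges))
    for name, ranges in MODEL_RANGES.items()
}
print(feature_counts)
```

Each reduced feature vector would then be used to train and test a separate classifier, so that the accuracies of the four models can be compared on equal terms.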
The findings are reinforced by comparing the prediction outcome with that obtained from spectral data after correction with the RMieS algorithm (and without the correcting lens). After RMieS correction, the prediction accuracy for the fingerprint region with amide bands improves significantly compared with the second derivative data before correction (from 81 to 91%), owing to the correction of the amide I band, although it remains slightly lower than that of the fingerprint region alone (92%). Correction with the RMieS algorithm is computational whilst correction with the added lens is a practical optical approach; thus, as expected, the RMieS algorithm provides a more precise solution, which indeed yields a better overall prediction accuracy. The confusion matrices for both cases are provided in ESM Fig. S2 and Fig. S3 respectively. In this study, both the correcting lens and RMieS correction are shown to be useful for correcting the scattering effect on the amide I band, but they might not be necessary if classification of the stages of colon adenocarcinoma via machine learning is the main objective, as the training model without any correction for Mie scattering is sufficient to yield an accuracy comparable with that after correction.
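Overall prediction accuracy and the per-class behaviour discussed here are both read off a confusion matrix. The sketch below shows how such a matrix and the derived accuracy can be computed; the labels are synthetic and chosen only to illustrate a dysplasia-to-hyperplasia misclassification, and they are not results from this study.

```python
CLASSES = ["H", "HY", "D", "C"]  # healthy, hyperplastic, dysplastic, cancer


def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


def overall_accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Synthetic labels (NOT data from the study): one dysplastic spectrum is
# predicted as hyperplastic.
true_labels = ["H", "H", "HY", "HY", "D", "D", "C", "C"]
predicted = ["H", "H", "HY", "HY", "HY", "D", "C", "C"]

matrix = confusion_matrix(true_labels, predicted, CLASSES)
accuracy = overall_accuracy(matrix)
print(matrix, accuracy)
```

The off-diagonal entry in the dysplasia row is the kind of error that the choice of spectral range is intended to reduce.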
Conclusion and outlook
Spectral data of colon biopsies obtained with a correcting lens for FTIR imaging show a significant reduction in spectral aberrations owing to the suppression of Mie scattering, as was shown in our studies with other types of cancer tissues. Optical modification of FTIR spectroscopic imaging with a CaF2 correcting lens has the advantage that the Mie scattering correction algorithm does not need to be performed. However, in this study the correction effect was not as good as that of the computational method. Here, we report the insignificance of the role of the amide I band in machine learning for the first time. Importantly, the findings show that the disease states can be distinguished without resorting to the correction of the Mie scattering effect. By using K-means clustering and an RF classifier with Gini importance-based feature selection, our work has demonstrated that optimisation of the training model by refining the selected range of FTIR spectral data can alter the prediction outcome.
The novelty of this work lies in showing that the best prediction outcome for the studied colon biopsy samples was obtained when unsupervised learning of the C–H stretching bands is coupled with supervised learning of the spectral region between 1500 and 1000 cm−1. Hence, whilst the C–H stretching region is useful for intra-tissue segmentation, only the spectral range of 1500–1000 cm−1 is important for supervised machine learning. The amide I band can be excluded from data analysis altogether, as evidenced by the Gini indices obtained in this work. In addition, reliance on the C–H stretching spectral region (3000–2800 cm−1) alone in supervised learning gives the worst prediction. This exploratory study involving a manageable number of datasets successfully highlights the extraction of the most meaningful parts of the spectral data, which sets a framework for further validation of the predictive ability of a more sophisticated deep learning model in future work.
To summarise, further application of this method to an unknown colon biopsy sample is straightforward and potentially fully automated with simple programming. Initial K-means clustering (with the number of clusters set to two) on the C–H stretching bands alone will pick up regions of high lipid absorbance, which will subsequently be fed into the already trained RF model that predicts the malignancy stage of the specimen. The findings, though significant, are limited to FTIR spectroscopic imaging of colon biopsies. Furthermore, Mie scattering is more pronounced in single-cell imaging than in tissue imaging; the results of this study are strictly limited to differentiation of disease progression in colon tissue specimens.
We thank Prof. Peter Gardner (University of Manchester, UK) for providing us with the RMieS–EMSC algorithm for correction of Mie scattering in this study.
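The first stage of the proposed workflow, two-cluster K-means on the C–H stretching data followed by selection of the high-lipid cluster, can be sketched as below. This is a simplified, standard-library illustration and not the MATLAB routine used in the study: the function names, the iteration cap, the per-pixel `lipid_scores` used to decide which cluster is the high-lipid one, and the toy data are all assumptions. The trained RF model that would receive the selected pixels is not shown.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans_two_clusters(vectors, replicates=5, seed=0, max_iter=100):
    """Lloyd's algorithm with k = 2; returns the labels of the replicate with
    the lowest within-cluster sum of squares.

    ASSUMPTION: iterations are capped at `max_iter` as a safeguard.
    """
    rng = random.Random(seed)
    best_labels, best_cost = None, float("inf")
    for _ in range(replicates):
        centres = [list(v) for v in rng.sample(vectors, 2)]
        labels = None
        for _ in range(max_iter):
            new_labels = [
                0 if squared_distance(v, centres[0]) <= squared_distance(v, centres[1]) else 1
                for v in vectors
            ]
            if new_labels == labels:  # assignments stable: converged
                break
            labels = new_labels
            for k in (0, 1):
                members = [v for v, lab in zip(vectors, labels) if lab == k]
                if members:
                    centres[k] = mean_vector(members)
        cost = sum(squared_distance(v, centres[lab]) for v, lab in zip(vectors, labels))
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels


def high_lipid_pixels(ch_vectors, lipid_scores):
    """Indices of the pixels in the cluster with the higher mean lipid score."""
    labels = kmeans_two_clusters(ch_vectors)
    means = []
    for k in (0, 1):
        scores = [s for s, lab in zip(lipid_scores, labels) if lab == k]
        means.append(sum(scores) / len(scores) if scores else float("-inf"))
    high = 0 if means[0] >= means[1] else 1
    return [i for i, lab in enumerate(labels) if lab == high]


# Toy data: three low-lipid and three high-lipid pixels (synthetic values).
ch_vectors = [[0.10, 0.12], [0.11, 0.10], [0.09, 0.11],
              [0.50, 0.55], [0.52, 0.49], [0.48, 0.51]]
lipid_scores = [sum(v) for v in ch_vectors]
selected = high_lipid_pixels(ch_vectors, lipid_scores)
print(selected)
```

In a full pipeline, the 1500–1000 cm−1 second derivative spectra of the selected pixels would be passed to the trained classifier to predict the disease stage.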
Compliance with ethical standards
The Research Ethics Code for the colon polyp is 14/EE/0024 and was approved by Imperial College London Research and Ethics Committee.
Conflict of interest
The authors declare that they have no conflict of interest.
- 3. Balbekova A, Lohninger H, van Tilborg GAF, Dijkhuizen RM, Bonta M, Limbeck A, et al. Fourier transform infrared (FT-IR) and laser ablation inductively coupled plasma–mass spectrometry (LA-ICP-MS) imaging of cerebral ischemia: combined analysis of rat brain thin cuts toward improved tissue classification. Appl Spectrosc. 2018;72(2):241–50.
- 11. Bowel cancer statistics. Cancer Research UK. 2015 [cited 2019 Jan 8]. Available from: https://www.cancerresearchuk.org/health-professional/cancer-statistics/statistics-by-cancer-type/bowel-cancer
- 27. Li Q, Hao C, Kang X, Zhang J, Sun X, Wang W, et al. Colorectal cancer and colitis diagnosis using Fourier transform infrared spectroscopy and an improved K-nearest-neighbour classifier. Sensors [Internet]. 2017;17(12). Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5750796/
- 29. MATLAB Statistics and Machine Learning Toolbox Release 2018b, The MathWorks, Inc., Natick, Massachusetts, United States.
- 37. Stuart BH. Infrared spectroscopy: fundamentals and applications [Internet]. Chichester: Wiley; 2004 [cited 2019 Feb 11]. (Analytical Techniques in the Sciences). Available from: http://doi.wiley.com/10.1002/0470011149
- 38. Tests to Detect Colorectal Cancer and Polyps [Internet]. National Cancer Institute. 2016 [cited 2019 Jan 8]. Available from: https://www.cancer.gov/types/colorectal/screening-fact-sheet
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.