Introduction

The coronavirus disease 2019 (COVID-19) is highly infectious (R0 = 3) and caused by SARS-CoV-2, a single-stranded RNA virus whose name stands for "severe acute respiratory syndrome coronavirus 2." The disease leads to complications such as pneumonia, acute respiratory distress syndrome (ARDS), cardiac damage, acute strokes, and even systemic hyper-inflammation syndrome, which, in turn, leads to multiorgan failure [1]. As of 20 August 2020, nearly 23 million people had been infected by COVID-19, and nearly 800,000 deaths had been recorded worldwide [2]. Most of the mortalities occurred within eight countries, namely the USA, Brazil, the UK, Mexico, Italy, France, India, and Spain [2].

COVID-19 affects the lungs and causes respiratory difficulties. Common symptoms include breathlessness, dry cough, fatigue, and fever [3]. Relatively uncommon symptoms include a loss of taste or smell, sore throat, and vomiting [4]. The danger posed by COVID-19, as well as its spread, is worsened by the fact that many infected people are asymptomatic [3]. COVID-19 attacks the pulmonary tissues of the lungs, resulting in ARDS [5], and a considerable percentage of patients end up needing ventilator support [6]. Many of the initial victims of COVID-19 in China were hospitalised because they exhibited lower respiratory tract (LRT) symptoms [3,7], though these symptoms varied considerably among patients. Some patients exhibited minimal symptoms, while others suffered from hypoxia due to ARDS. For some patients, LRT symptoms progressed to ARDS within nine days [7]. It has also been found that patients suffering from COVID-19-induced ARDS are prone to organ failure [8,9].

Radiologists primarily use radiography, computerised tomography (CT), or ultrasound to diagnose lung disease [10,11,12]. These methods allow symptomatic patients to be tested for COVID-19 quickly when tests like reverse transcription polymerase chain reaction (RT-PCR) are not available [13]. Researchers have demonstrated that CT is a more sensitive COVID-19 detection method than traditional techniques for symptomatic patients [14]. One recent study showed that chest radiography could not be used to detect the opaque image features of COVID-19 [15]. Lung ultrasound can be used as an alternative to CT to detect COVID-19, although CT is still considered the gold standard for detecting pulmonary infections [16].

Apart from conventional techniques, many researchers have employed artificial intelligence (AI)-based machine learning (ML), deep learning (DL), and transfer learning (TL) techniques to diagnose COVID-19. One group of researchers proposed a novel weakly supervised DL technique to classify COVID-19 infection from lung CT images; the method was also used to localise the inflammation caused by COVID-19 [17]. In other work, Xiao et al. developed a multiple-instance learning module based on ResNet34 to predict the severity of COVID-19 cases using lung CT scans [18].

Meanwhile, other researchers used the UNet++ architecture for segmenting COVID-19-infected lung areas using CT images [19]. They transformed their study into an online platform to provide fast COVID-19 diagnostic tools that are accessible worldwide [20]. Another group of researchers created a DL and "deep reinforcement learning" model that can automatically quantify COVID-19-related lung abnormalities such as ground-glass opacities and consolidations [21,22]. Their proposed architecture produces two metrics that accurately quantify the spread of COVID-19.

Several other studies have proposed new methods for diagnosing COVID-19 using TL on lung CT scans. TL is used when COVID-19 data are scarce or when existing deep learning models can be improved by judiciously reusing them [22,23,24]. However, TL works efficiently only if the model is trained on data similar to the target problem [25] (i.e., COVID-19 lung CT data). Otherwise, performance gains are minimal or insignificant.

In this study, we compared six state-of-the-art AI models (two traditional ML models, two TL models, and two DL models) using K-fold cross-validation to solve the COVID-19 detection problem on lung CT data. To the best of our knowledge, no study has benchmarked the comparative efficacy of traditional machine learning, deep learning, and transfer learning architectures on COVID-19 lung CT data; doing so is one objective of the present study. Another important objective is to estimate COVID-19 severity from the output class probability values of the AI models and then clinically validate these estimates against radiologists' greyscale feature scores. As part of the clinical validation, we demonstrate the correlation between the AI-derived severity and ground-glass opacity (GGO) values, thus validating our hypothesis on COVID severity estimation. We also performed 2D and 3D bispectrum analyses to classify COVID pneumonia (CoP) patients using CT images. Our results show that even though TL can reduce the training time of a model, DL and ML models match or surpass TL on the performance benchmarks of COVID-19 classification.

The aggressiveness of COVID-19 can be assessed using imaging-based tests. Just as troponin release indicates that a heart attack is likely, a hyper-intensity distribution in lung CT (which cannot be inferred from a swab sample) can indicate COVID-19 severity, allowing more aggressive care to be given to the patient. The main clinical advantage of CT-based imaging is therefore the determination of how aggressive the patient's care needs to be.

A second benefit of this study is the development of an AI-based tool that avoids the bias of the expert radiologist or pulmonologist. Owing to fatigue from long hospital shifts, readings can vary from radiologist to radiologist and even within a single radiologist, the so-called inter- and intra-observer variability. AI-based solutions can overcome this major weakness. Third, if troponin is released while the COVID-19 pneumonia CT shows GGO, a heart attack is also likely. Lastly, if the CT shows pathology, the patient has pneumonia, and it is therefore important to quantify the risk using CT.

The rest of the paper is organised as follows. Section 2 discusses the pathophysiology of COVID-19 cases that develop into ARDS. Section 3 overviews the methodology. Section 4 discusses the experimental results using the K10 protocol and bispectrum analysis. The AI models' performance is evaluated in Sect. 5 based on the ROC curve and multiple classification metrics. We discuss our findings in Sect. 6. Sections 7 and 8 provide the conclusions and references, respectively.

Methodology

Patient demographics

CT images were collected from 130 patients. There were 100 CoP patients (68 males and 32 females) aged 17–93 years (mean age = 61.49 ± 16 years). The remaining 30 cases (nine males and 21 females), aged 17–93 years (mean age = 51.4 ± 2 years), were non-COVID pneumonia (NCoP) patients.

Data acquisition and baseline characteristics

The methodology of this study consists of the design and development of a computer-aided diagnosis (CADx) system with three components, divided by functionality. The first component is region-of-interest extraction, which envelops the CT lung region. The second component is the automatic classification of CoP vs NCoP patients. The final stage of the CADx system is a performance evaluation that implements (1) a standardised analysis (e.g., ROC), (2) DOR validation (see Fig. S8, Online Resources 1), and (3) CoP validation using a bispectrum analysis paradigm. Before we dive into these three subsystems, we present the patient demographics and data acquisition systems.

Data acquisition

CT images were collected using a Philips Ingenuity Core CT Scanner while patients were in a deep inspiration breath-hold (DIBH) supine position. The patients were not given any oral contrast or intravenous agents. The CT scan was done at 120 kV and 225 mAs. The spiral pitch factor, gantry rotation time, and detector configuration were fixed at 1.08, 0.5 s, and 65 × 0.625, respectively. A 768 × 768 lung window and a 512 × 512 mediastinal window were used to reconstruct 1-mm-thick images with a soft-tissue kernel. The CT images were reviewed using twin 35 × 43 EIZO PACS displays with a 2048 × 1536 matrix. The final data comprised 2788 CT images for CoP patients and 990 CT images for NCoP patients. For the 100 COVID-19 patients, we took 27–28 scans per patient, giving between 100 × 27 = 2700 and 100 × 28 = 2800, i.e., 2788 CT scans in total. Similarly, for the NCoP patients, we took around 33 scans for each of the 30 patients, resulting in 30 × 33 = 990 CT scans.

Baseline characteristics

The baseline characteristics of the Italian cohort's COVID-19 data are presented in Table 1. We used the R package to perform a t-test on the data, with the level of significance set to P ≤ 0.05; a minimal sketch of this test is given after Table 1. The table shows the essential characteristic traits of CoP patients. The baseline characteristics reflect the visual characteristics of the CT lung data (rows #3 to #6). Ground-glass opacity (GGO) is significant in differentiating between the CoP and NCoP classes (P = 0.00001). Lung consolidations (CONS) also differentiate the two classes (P = 0.00453). The pleural effusion (PLE) attribute is likewise significant in the classification of CoP and NCoP patients (P = 0.00413). The most common physiological symptom of CoP is fever, which also correlates with body temperature (P = 0.00313).

Table 1 Baseline characteristics of CoP and NCoP patients
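
For illustration, the significance test can be sketched in Python (rather than R) as follows; the CSV file and column names are hypothetical stand-ins for the cohort data, not the study's actual files.

```python
# Hedged sketch: Welch's two-sample t-test on one baseline
# characteristic (GGO), mirroring the R-based analysis above.
# The CSV file and column names are illustrative assumptions.
import pandas as pd
from scipy import stats

df = pd.read_csv("baseline_characteristics.csv")   # hypothetical file
cop_ggo = df.loc[df["label"] == "CoP", "GGO"]
ncop_ggo = df.loc[df["label"] == "NCoP", "GGO"]

t_stat, p_value = stats.ttest_ind(cop_ggo, ncop_ggo, equal_var=False)
print(f"GGO: t = {t_stat:.3f}, P = {p_value:.5f}")  # significant if P <= 0.05
```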

Three kinds of AI architectures for classification

We shortlisted two representative candidates from the ML algorithms, namely k-nearest neighbours (k-NN) and random forest (RF). The developed framework is a modified version of our previous work [26].

For TL, we utilised the VGG19 and InceptionV3 pre-trained models [27] (see Figs. S5 and S6, Online Resources 1) and changed only the model top. VGG19 is a 19-layer deep model consisting of sixteen convolution layers to extract visual features, five max-pool filters to reduce the spatial size of the extracted features, and three fully connected layers for classifying the image. InceptionV3 is a 42-layer deep model consisting of 11 inception modules (each comprising multiple convolution layers and max-pooling filters), followed by three fully connected layers and a softmax activation layer.

The initial layers of the TL models were made non-trainable, and only the last layers were made trainable. The reason for not training the entire network in the case of transfer learning is that it saves computation time, because the network is already able to extract generic features from images and does not have to learn this from scratch. A neural network works by abstracting and transforming information in steps: in the initial layers, the extracted features are generic and independent of any particular task, while the later layers are tuned to a specific task. So, by freezing the initial stages, we obtain a network that can already extract meaningful general features, and we unfreeze only the last few stages (or just the new untrained layers), which are then tuned for our paradigm. It is not recommended to unfreeze all layers when the model contains new untrained layers: these layers would train as if initialised with random (rather than pre-trained) weights, defeating the basic idea of transfer learning.
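
As a minimal sketch of this freezing strategy, the following Keras snippet replaces the VGG19 top with a new head; the input size and head dimensions are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Hedged sketch: freeze the pre-trained VGG19 convolutional base and
# train only a new, randomly initialised classification head.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG19

base = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze all pre-trained convolution layers

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(128, activation="relu"),   # new trainable head
    layers.Dense(2, activation="softmax"),  # CoP vs NCoP probabilities
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```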

For DL, we developed our custom architectures (CNN and iCNN), consisting of a multi-layer convolution network (see Fig. S7 and Table S5, Online Resources 1). It contains three convolution layers, each followed by a max-pooling filter, and two fully connected layers. A two-class probability score is obtained by passing the output to a softmax activation function. In iCNN, we slightly changed the ReLU activation function in the hidden layers to σ = (max(0, x))^1.00001. Here, x is the input value, σ is the activated output value, max is a function that returns the maximum of zero and the input value, and the exponent 1.00001 slightly scales the output.

Several lightweight convolutional neural network models with 3, 4, and 5 convolution layers were tested for COVID-19 identification, and these models provided very good results, with the 3-convolution-layer model giving the best accuracy. In the proposed three-convolution-layer model, hidden layers 1, 2, and 3 contain 32, 16, and 8 hidden units, respectively, and each convolution layer is followed by a max-pooling layer. After the last max-pooling layer, a flatten layer converts the 2D feature maps into a 1D vector, which is densely connected to a layer with 128 hidden units, followed by the output layer. To provide nonlinearity in the model, the modified ReLU activation function described above is used in the hidden layers; a sketch of this architecture follows.
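
The following Keras sketch illustrates the three-convolution-layer design with the modified ReLU; kernel sizes and the input shape are our illustrative assumptions.

```python
# Hedged sketch of the proposed three-convolution-layer architecture,
# using the iCNN activation sigma = (max(0, x))^1.00001.
import tensorflow as tf
from tensorflow.keras import layers, models

def modified_relu(x):
    # ReLU output raised to a slightly super-linear exponent
    return tf.pow(tf.nn.relu(x), 1.00001)

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation=modified_relu,
                  input_shape=(128, 128, 1)),            # hidden layer 1: 32 units
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(16, (3, 3), activation=modified_relu), # hidden layer 2: 16 units
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(8, (3, 3), activation=modified_relu),  # hidden layer 3: 8 units
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                    # 2D feature maps -> 1D vector
    layers.Dense(128, activation=modified_relu),         # dense layer, 128 units
    layers.Dense(2, activation="softmax"),               # two-class probability score
])
```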

Results

Accuracy of the two ML, two TL, and two DL models

We compared the K10 classification accuracy of all six AI models on the COVID-19 data, as shown in Table S2 (Online Resources 1). Our observations demonstrate that the accuracies follow the order DL > TL > ML. The DL-based iCNN and CNN architectures had accuracies of 99.69 ± 0.66% and 99.53 ± 1.05%, respectively, making them the two most accurate of the six tested models. Of the TL architectures, only VGG19 fared well against the DL architectures, with a classification accuracy of 99.53 ± 0.75%; the other TL architecture (InceptionV3) achieved a classification accuracy of only 94.84 ± 2.85%. The two ML architectures varied considerably in performance: RF scored 96.84 ± 1.28%, while k-NN scored only 74.58 ± 2.24%. The mean accuracy figures of all six AI models are summarised in Fig. 1.

Fig. 1

Mean K10 classification accuracies (in %) of two ML, two TL, and two DL architectures. The bar chart is presented in increasing order of accuracy
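
As a minimal sketch of the K10 protocol under stated assumptions (synthetic stand-in features rather than our CT features), the tenfold cross-validated accuracy for the RF baseline can be computed as follows.

```python
# Hedged sketch: K10 (stratified tenfold) cross-validation accuracy,
# shown for the RF baseline on stand-in data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, n_features=64, random_state=0)  # stand-in data

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracies = []
for train_idx, test_idx in skf.split(X, y):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    accuracies.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(f"K10 accuracy: {np.mean(accuracies)*100:.2f} ± {np.std(accuracies)*100:.2f}%")
```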

CT lung characterisation using bispectrum analysis

We characterised CoP and NCoP CT lung tissues using bispectrum analysis based on the higher-order spectrum (HOS). Bispectrum analysis is based on the principle of coupling between components of spectral signals. If there is a sudden change in greyscale image density (as is the case for COVID-19-infected tissues), then higher bispectrum (B) values are generated. This property of bispectrum analysis can be exploited to identify COVID-19-infected tissue quickly. This analysis is intended to identify NCoP and CoP patients without using AI-based techniques.

Generally, COVID-19-infected lungs are characterised by a hyper-intensity region. We separated those pixels from the lung CT images and passed them into a Radon transform, whose projections act as the 1-D signals from which the HOS generates B values. The images of CoP patients have much higher B values. The 2D and 3D bispectrum plots for CoP and NCoP patients are shown in Figs. 2 and 3; a minimal sketch of the computation follows Fig. 3.

Fig. 2

Comparison of bispectrum (2D) plots of CoP and NCoP patients

Fig. 3

Comparison of bispectrum (3D) plots of CoP and NCoP patients
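
A minimal sketch of this pipeline, under stated assumptions (a synthetic slice, an assumed intensity threshold, and a single projection angle), is given below; it uses the direct FFT estimate B(f1, f2) = X(f1) X(f2) X*(f1 + f2).

```python
# Hedged sketch: hyper-intensity masking, Radon projection, and a
# direct-FFT bispectrum estimate B(f1, f2) = X(f1) X(f2) conj(X(f1+f2)).
import numpy as np
from skimage.transform import radon

def bispectrum(signal, nfft=64):
    X = np.fft.fft(signal, n=nfft)
    half = nfft // 2
    B = np.zeros((half, half), dtype=complex)
    for i in range(half):
        for j in range(half):
            B[i, j] = X[i] * X[j] * np.conj(X[(i + j) % nfft])
    return np.abs(B)  # magnitude surface of the kind plotted in Figs. 2 and 3

ct_slice = np.random.rand(256, 256)          # stand-in lung CT slice
mask = ct_slice > 0.7                        # assumed hyper-intensity threshold
projection = radon(ct_slice * mask, theta=[0.0], circle=False)[:, 0]
B = bispectrum(projection)                   # CoP slices yield higher B values
```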

Performance evaluation of AI models and their clinical validation

Receiver operating characteristics

The ability of all six AI models to differentiate the CoP and NCoP data sets is illustrated in Fig. 4. We used the K10 protocol to compute receiver operating characteristic (ROC) curves. As expected, the simplest ML model (k-NN) performed the worst, achieving an area under the curve (AUC) of just 0.744 (P < 0.0001). The best-performing model was the novel iCNN DL model, whose AUC was 0.993 (P < 0.0001). In order of increasing AUC, the remaining models were the TL-based InceptionV3, the ML-based RF, the TL-based VGG19, and our custom DL CNN.

Fig. 4

ROC plots for the six AI models (two ML, two TL, and two DL), along with their corresponding AUC values
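
A minimal sketch of the AUC computation from pooled out-of-fold probability scores follows; the labels and scores are stand-in values, not our experimental outputs.

```python
# Hedged sketch: ROC curve and AUC from out-of-fold probability scores.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                     # stand-in labels
y_score = np.array([0.1, 0.4, 0.85, 0.9, 0.7, 0.3, 0.95, 0.2])  # stand-in scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC plot
print(f"AUC = {roc_auc_score(y_true, y_score):.3f}")
```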

A comparison of six AI models based on multiple classification metrics

We compared the six AI models on a COVID-19 test set containing 377 samples (99 NCoP and 278 CoP samples). We chose ten classification metrics for this comparison: sensitivity, specificity, precision, negative predictive value (NPV), false positive rate (FPR), false discovery rate (FDR), false negative rate (FNR), F1 score, Matthews correlation coefficient (MCC), and Cohen's kappa coefficient. Cohen's kappa and the F1 score are performance metrics calculated from the true positive, false positive, true negative, and false negative counts. The F1 score [37] can be calculated using the formula:

$$ F_{1} = \frac{\text{TP}}{\text{TP} + \frac{1}{2}\left( \text{FP} + \text{FN} \right)} $$
(1)

We adopted the Matthews correlation coefficient [28] for quantifying the quality of the binary classification, since it is widely used in machine learning. The biochemist Brian W. Matthews introduced this measure in 1975. Given the truth table values TP (true positive), FP (false positive), TN (true negative), and FN (false negative), MCC is expressed mathematically in Eq. 2.

$$ \text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{\left( \text{TP} + \text{FP} \right)\left( \text{TP} + \text{FN} \right)\left( \text{TN} + \text{FP} \right)\left( \text{TN} + \text{FN} \right)}} $$
(2)

Note that MCC represents the correlation between the predicted and observed binary classifications. It returns a value between −1 and +1: +1 represents perfect prediction, and −1 represents total disagreement between prediction and observation.
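
For illustration, the metrics of Eqs. 1 and 2 can be computed directly from a confusion matrix, as in the following sketch with stand-in predictions.

```python
# Hedged sketch: F1 (Eq. 1), MCC (Eq. 2), and Cohen's kappa from
# stand-in ground-truth labels and model predictions (1 = CoP).
from sklearn.metrics import (cohen_kappa_score, confusion_matrix,
                             f1_score, matthews_corrcoef)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")
print(f"F1    = {f1_score(y_true, y_pred):.3f}")
print(f"MCC   = {matthews_corrcoef(y_true, y_pred):.3f}")
print(f"Kappa = {cohen_kappa_score(y_true, y_pred):.3f}")
```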

The results of the study are summarised in Table 2. Both the DL models (CNN and iCNN) and one of the TL models (VGG19) performed equally well. Both ML models (RF and k-NN) and the second TL model (InceptionV3) did not perform well in comparison with the DL models.

Table 2 Comparison of the six AI models on the basis of multiple classification metrics

COVID risk stratification

Figure 5 presents the COVID-19 risk levels of patients as predicted by our custom CNN DL model. We created the frequency distribution (Fig. 5a) by using a softmax function in the output layer of the model so that the model produces a probability score (ranging from 0 to 1) indicating a patient's COVID-19 risk. We divided the overall probability range into ten bins and assigned each CT image sample to one of the bins based on the output of the model. We considered three levels of risk: low risk (probability score of 0 to 0.3), moderate risk (0.3 to 0.7), and high risk (0.7 to 1). A cumulative distribution plot of all 3778 lung CT samples is given in Fig. 5b; this distribution was computed by cumulatively summing the sample counts across the COVID-19 risk probability bins, as sketched after Fig. 5.

Fig. 5

COVID risk assessment: a frequency distribution of COVID-19 risk for CoP and NCoP patients; b cumulative distribution of COVID-19 risk
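
The binning and cumulative summation can be sketched as follows; the probability scores here are random stand-ins for the CNN outputs.

```python
# Hedged sketch: ten-bin frequency and cumulative distributions of
# CNN probability scores, plus the three-level risk assignment.
import numpy as np

scores = np.random.rand(3778)                # stand-in CNN probability scores
bins = np.linspace(0.0, 1.0, 11)             # ten equal-width probability bins
counts, _ = np.histogram(scores, bins=bins)  # frequency distribution (Fig. 5a)
cumulative = np.cumsum(counts)               # cumulative distribution (Fig. 5b)

# low: [0, 0.3), moderate: [0.3, 0.7), high: [0.7, 1]
risk = np.select([scores < 0.3, scores < 0.7], ["low", "moderate"], default="high")
```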

Clinical validation of COVID risk stratification

For each patient, the correlation of ground-glass opacity (GGO) values with the CNN model output was determined. For this, the mean probability score over all CT scan slices of a patient was calculated and compared with that patient's GGO value. Similarly, the mean bispectrum value for each patient was calculated and compared with the GGO values. CONS values were also tested for their correlation with COVID severity and bispectrum values. A list of all patients' GGO, CONS, severity, and bispectrum B values is given in Table S3 (Online Resources 1), and the correlations between these fields are given in Table S4 (Online Resources 1). A minimal sketch of the per-patient correlation is shown below.
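
Under stated assumptions (hypothetical per-patient slice scores and GGO values), the severity-vs-GGO correlation can be computed as follows.

```python
# Hedged sketch: per-patient severity (mean slice probability) vs
# radiologist GGO score, correlated with Pearson's r.
import numpy as np
from scipy.stats import pearsonr

# stand-in slice probability scores for five hypothetical patients
slice_probs = {
    "P1": [0.91, 0.88, 0.95], "P2": [0.20, 0.35, 0.15],
    "P3": [0.75, 0.80, 0.70], "P4": [0.10, 0.05, 0.12],
    "P5": [0.60, 0.55, 0.65],
}
ggo = np.array([3.5, 1.0, 3.0, 0.5, 2.0])    # stand-in GGO scores per patient

severity = np.array([np.mean(v) for v in slice_probs.values()])
r, p_value = pearsonr(severity, ggo)
print(f"Severity vs GGO: r = {r:.3f}, P = {p_value:.5f}")
```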

The linear association between COVID severity and GGO is shown in Fig. 6, that between bispectrum (B) values and GGO in Fig. 7, and that between bispectrum and COVID severity in Fig. 8.

Fig. 6

Association between GGO and COVID severity

Fig. 7

Association between GGO and bispectrum B values

Fig. 8

Association between COVID severity and bispectrum B values

Discussion

In this study, we tested our two custom DL models against two state-of-the-art TL models, using two popular ML models as baselines, to resolve the CoP vs NCoP classification problem. We used the K10 protocol to compare these models' accuracy, with COVID-19 data collected from patients in accordance with the applicable privacy laws. Our relatively simple nine-layer iCNN model was the most accurate of the investigated models, achieving the highest AUC of 0.993 (P < 0.0001). Surprisingly, we found that architectures even more straightforward than the iCNN model (e.g., RF) can match the state-of-the-art TL models (e.g., InceptionV3) in terms of accuracy and AUC when used for COVID-19 classification. The TL models' unremarkable performance could be because these models were not trained on CT images or any other radiology data. Moreover, the high separability in the training data, which is captured by the other AI models, is not exploited by the TL models.

The COVID risk stratification for each patient was validated by showing a strong correlation with the ground-glass opacity values of the patient's CT scans. Similarly, the bispectrum values were validated against the GGO values. The clinical tests also reveal which AI models have similar classification capabilities and which differ significantly in accuracy; this is clearer than a visual inspection of each AI model's accuracy and standard deviation values.

Benchmarking

Table 3 presents benchmarking data comparing the six AI models examined in our research with those considered in existing work on COVID-19 classification. We shortlisted four criteria for benchmarking: (1) the COVID-19 dataset used, (2) the AI model used, (3) the accuracy of the proposed models, and (4) any other performance measures used by the authors. Rows R1 to R5 present the work of other researchers, and row R6 represents our research. It can be observed that the performance of our custom iCNN model is on par with the models proposed by other researchers.

Table 3 Benchmarking of six AI models with the existing work on COVID-19 classification

3D validation

The lung CT data of our Italian cohort were processed so that we could evaluate the degradation and fibrosis of the lung parenchyma of CoP vs NCoP patients (Fig. 9). We used an image segmentation tool to process the data in DICOM format. Using profile lining, we applied segmentation based on the Hounsfield value (grey value) of the pixels belonging to the lung section [33]. A stacking process [34] was then applied to obtain a union, forming a 3D volume of the segmented region of interest [35]. This was followed by region growing to develop the region of interest (in this case, the lung). The 3D volume was computed for the grown region to evaluate the volume and spatial distribution of the lung parenchyma. We computed the spatial distribution of the parenchyma associated with the rear end of the lung, because the influence of the spike proteins of COVID-19 is more significant in the deeper volume of the lung parenchyma [36]. A minimal sketch of this pipeline follows Fig. 9.

Fig. 9

(a1), (a2), and (a3): CoP lung samples showing the degradation and fibrosis of lung parenchyma; (b1), (b2), and (b3): three NCoP lung samples
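
The following sketch illustrates the Hounsfield-thresholding and volume-computation steps under stated assumptions (a synthetic HU stack, an assumed lung HU window, and assumed voxel spacing); it is not the exact tool used in the study.

```python
# Hedged sketch: HU thresholding, largest-component selection (a simple
# analogue of region growing), and 3D volume computation.
import numpy as np
from scipy import ndimage

volume_hu = np.random.randint(-1000, 400, size=(200, 512, 512))  # stand-in HU stack
lung_mask = (volume_hu > -950) & (volume_hu < -300)  # assumed lung HU window

labels, n = ndimage.label(lung_mask)                 # connected components
sizes = ndimage.sum(lung_mask, labels, range(1, n + 1))
lung = labels == (np.argmax(sizes) + 1)              # keep the largest component

voxel_mm3 = 0.7 * 0.7 * 1.0   # assumed in-plane spacing x 1-mm slice thickness
print(f"Segmented lung volume: {lung.sum() * voxel_mm3 / 1e6:.2f} L")
```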

Interpretation

DL models, particularly the CNN model that we used, are very good at recognising the spatial features of images without human intervention, which supports our hypothesis. Both of our custom models likely performed well because of the visual features of COVID-19 in the lung CT images (e.g., ground-glass opacities, consolidations, and pleural effusions), which are very distinct for CoP compared to NCoP. This notion is supported by the data representing the baseline characteristics of the patients. For traditional ML classifiers to work efficiently, their features need to be handcrafted, and their performance depends on the ingenuity of the model's designer. TL models work better than DL models when relatively little data and training time are available; however, they must be pre-trained on a dataset similar to the one for which they are to be used. This limits the application of TL models in medical imaging unless such a model has been pre-trained on similar data.

Strengths, weakness, and extensions

Strengths: The architectures that we designed and developed in this work are relatively simple and easy to use in research and clinical settings. Even without augmentation, we demonstrated that their classification accuracies are high enough to be considered within the clinical range according to recent publications. Although the pilot trials were successful, the data sets that we used could be more balanced and more multi-ethnic.

Weakness: Due to the lack of additional non-COVID pneumonia data sets, the current models could not be tested more broadly; we intend to extend this work to multiclass paradigms in future research [37]. Due to the data sets' limitations regarding "censorship" and "survival", it was not possible to compute survival analyses such as hazard curves and survival curves. However, we will collect this information in the future, even though vaccine distribution has started.

Extensions: Even though the pilot study showed powerful results, a more robust automated segmentation step could be designed using stochastic segmentation strategies [38,39,40]. Extensive ML features can be computed under the ML framework in future work [41,42]. More validations using multimodality spatial images, such as PET and CT based on registration methods, can be conducted [43,44]. Superior lung CAD models can be designed to improve scientific validation [12,45]. Since AI has developed quickly and more transfer learning approaches are now available, the TL models could be extended using pre-trained weights [37]. While the six AI models were tried on a single data set, a multi-centre study could be conducted using the same models to avoid any bias. Thus, the current study can be a launching pad for multi-centre, multimodality, multi-ethnic, and multi-regional analysis.

Conclusion

We presented six AI-based models for CoP vs NCoP classification using CT lung scans taken from an Italian cohort. The proposed CNN-based AI model outperformed the TL and ML systems that were investigated. Further, we showed that, using higher-order spectra, the bispectrum can differentiate CoP patients from NCoP patients, further validating our hypothesis. As part of the clinical validation, a novel COVID risk factor calculation was introduced using the CNN output probability values and validated against the GGO values of all patients.

Our AI system was implemented on a multi-GPU system such that online processing took only a few seconds per scan. The system can be extended to multiclass data sets that also include community-acquired pneumonia or interstitial viral pneumonia. The system was validated against well-accepted existing data sets (e.g., a biometric data set and a DL animal data set).