1 Introduction

COVID-19 is caused by a new member of the coronavirus family associated with severe acute respiratory syndrome (SARS-CoV), which has been named SARS-CoV-2 [1]. The outbreak appeared in China at the end of 2019 and was notified to the world on December 31 of that year; since then, millions of people have been infected with the disease (Footnote 1). The main symptoms are fever, sore throat, dry cough, muscle ache, and acute respiratory distress [2].

The rapid spread of the coronavirus and the serious effects it causes in humans make an early diagnosis of the disease imperative [3]. To this day, the gold standard for detecting the presence of the virus is the Reverse Transcription Polymerase Chain Reaction (RT-PCR) test. The underlying PCR technique was devised in the 1980s by Kary Mullis, Nobel laureate in Chemistry; it amplifies a small amount of DNA into millions of copies, so that there is enough material to analyze. The sampling process introduces very high variability, which depends on the site where the sample is taken, the personnel taking it and the person's viral load at that time [4]. Furthermore, PCR testing is time-consuming, requiring around 6 to 9 h to confirm infection [5]. In addition, the tests have a sensitivity of between 60 and 70%, depending on the stage of the disease [6].

One alternative for the detection of positive patients is the analysis of medical images [7]. The typical characteristics of the images and their evolution play an important role in the detection and management of the disease. Specialists rely on radiological studies, either chest X-rays (CXR) or computed tomography (CT), to follow the evolution of the disease. In a CT image, the structures that overlap between slices are removed, which improves image contrast and makes the internal anatomy more apparent. Studies confirm visible abnormalities in radiographic images, making them an important decision-making tool for human specialists [8]. However, 50% of patients have a normal CT scan within the first two days after symptoms of COVID-19 appear [8]. It is important to note that some patients have a positive PCR but do not develop signs or symptoms of the disease. These patients have normal radiographic studies and therefore cannot be detected as positive from an image of their lungs.

The use of CT as a diagnostic method for COVID-19 has several drawbacks. In many hospitals the equipment needed to acquire the image is not available, and a tomographic study is expensive. The dose of ionizing radiation delivered to the patient is relatively high. The disinfection time between patients for the CT equipment and the room is approximately 15 min. CXR images, on the other hand, have some advantages over CT that make this modality more widely accessible to patients. For example, the technology is available in most health care facilities. There is a portable modality that avoids moving the patient, which minimizes the possibility of spreading the virus. In addition, CXR exposes the patient to a lower dose of ionizing radiation and is cheaper than a CT scan.

In both cases, the main role of diagnosis lies in the presence of radiologists for image analysis. However, the COVID-19 findings are in many cases very subtle. Expert radiologists are able to identify only 65% of positive patients [9]. One way to mitigate this drawback would be the application of Artificial Intelligence (AI) techniques. In this way, clinicians can be equipped with an X-ray imaging-based early warning tool for the detection of COVID-19.

Following this idea, a large number of researchers have been working on the automatic classification of COVID-19 from CXR images [10, 11, 12, 13, 14, 15, 16, 17, 18]. These studies report systems with high performance rates. In fact, these results are well above those obtained by experienced radiologists [9]. This issue must be handled with care in order not to generate false expectations in the area [19]. Therefore, this research critically analyzes the main methodologies and results achieved in the works published to date on the subject. Both studies published in refereed journals and those in digital repositories have been taken into account. The aim of the research is to present to the scientific community a summary of the work developed on this topic worldwide in the year 2020 and, in addition, to explain critically why, in the opinion of the authors of this work, most of this research leads to unreliable results. This is the main difference between our research and other review studies such as [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], which analyze automatic classification of COVID-19 using CXR images: none of them addresses the problems related to the lack of generalization reported in several papers [31]–[34].

2 Use of AI in CXR image classification

Computer vision (CV) tasks in recent years have been dominated by deep learning (DL) techniques, implemented by deep neural networks (DNN) [35]. Compared with traditional neural networks, DNN are able to extract hidden and sophisticated structures (both linear and non-linear features) contained in the raw data. This ability is intrinsically related, on the one hand, to their capacity to build their own internal representation and, on the other hand, to their ability to generalize any kind of knowledge. They are also extremely flexible in the types of data they can support. Moreover, their learning procedure can be adapted to a great variety of learning strategies, from unsupervised to supervised techniques, going through intermediate strategies. Specifically, convolutional neural networks (CNN), which specialize in the classification of images, have been used. DL has been favored by three fundamental factors. The first is the increase in existing data in the present digital age, which provides large data sets for training these algorithms. The second is the increase in computing capacity, with the use of specialized processors such as GPUs (graphics processing units) and TPUs (tensor processing units) and of advanced processing techniques such as batch partitioning, in particular on parallel and distributed architectures, which allow DNN models to scale better when dealing with large amounts of data. The third is the high performance achieved in complicated applications that are difficult for humans to explain [36]. Technological applications of DL include audio processing, text analysis, natural language processing and image recognition, among others [37].

These capabilities suggest that DL could be an ideal candidate to support radiologists in their diagnosis. In fact, one of the tasks addressed has been the automatic classification of CXR images. Sets of this type of images are available (Footnote 2), on which many researchers have proposed novel solutions that improve on the visual analysis that could be done a priori of the different pathologies. In addition, work has been done to identify the different types of pneumonia from these images [38]. The results of using CNN to diagnose disease have been promising, but models trained on X-rays from one hospital or group of hospitals have not yet been shown to work equally well in different hospitals [39]. Among the existing limitations are the biases that the image sets may contain [40, 41]. For example, the works [39, 42] report discrepancies in the results achieved when the DL algorithms are trained and evaluated on sets that do not come from the same source. Specifically, the work [42] used four sets, A, B, C and D. When training and evaluating on set A (using appropriate techniques to divide the set), the results were higher than when training on sets B, C and D and evaluating on set A.

That is, CNN performance estimates based on test data from the same CXR systems used for model training may exaggerate the likely performance in actual clinical routine. For example, it was shown that the site of acquisition, with respect to both the CXR system used and the specific department within a hospital, can be predicted with very high precision [39]. This should be taken into account when training models of this type, as the network can learn the source of the images rather than the pathology being identified. On the other hand, normally, the greater the amount of data (images) used to train the algorithm, the greater the power of generalization it should have [43]. However, this is not entirely true in these cases, because of possible biases. These are related to imbalances in the amounts of positive and negative images used for training, which most of the time have different origins, and to the different characteristics of the images in each set, caused by differences in mAs, kVp, detection geometry, image size, pixel intensity, artifacts and labels, among others. If not handled properly, these biases can lead to erroneous results, as will be discussed in the following sections.
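The cross-source evaluation described for sets A, B, C and D can be illustrated with a minimal sketch. It is not the protocol of [39] or [42]; it only assumes that each image carries a label and the name of its source, and it holds out one source at a time so that the test images never share an origin with the training images. The tuple layout, the function name and the toy samples are hypothetical.

```python
from collections import defaultdict


def leave_one_source_out(samples):
    """Yield (held_out_source, train, test) for every source in the data.

    `samples` is a list of (image_id, label, source) tuples; this layout is
    an assumption made for the illustration.
    """
    by_source = defaultdict(list)
    for sample in samples:
        by_source[sample[2]].append(sample)
    for held_out in sorted(by_source):
        test = by_source[held_out]
        train = [s for src, group in by_source.items()
                 if src != held_out for s in group]
        yield held_out, train, test


# Toy samples from four hypothetical sources named after the sets in the text
samples = [
    ("a1", 1, "A"), ("a2", 0, "A"),
    ("b1", 1, "B"), ("b2", 0, "B"),
    ("c1", 0, "C"), ("d1", 1, "D"),
]
splits = list(leave_one_source_out(samples))
```

Comparing the score obtained on each held-out source with the score of a conventional random split would expose the kind of discrepancy reported above.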

3 CXR and CT in AI models for COVID-19 classification

Diagnosing COVID-19 from CXR images is a complicated task for radiologists. They must identify typical patterns of the disease that are often shared with other types of viral pneumonia, which leads to errors in their diagnosis. A more accurate alternative for disease detection is CT imaging. This technique is considered the most accurate in identifying typical findings in the lungs of COVID-19 [44] and plays a fundamental role in the diagnosis and evaluation of COVID-19 pneumonia [45]. Note that ground glass opacities in the periphery of the right lower lobe on CT, which is one of the typical findings of the disease, are often not visible on CXR [46].

Contrary to what has been explained, the results reported to date seem to be more favorable for CXR than for CT. A comprehensive review of the main sets of images, methods and performance indices achieved in automatic classification is presented in papers [23, 26]. For example, [26] reviews a total of 80 articles published between February 21 and June 20, 2020. Of these works, 52 use CXR images, 30 use CT and 2 use both types of images. Taking into account the performance indices reported in the studies consulted, automatic classification using CXR achieves better results than classification using CT: the average accuracy (Acc) is 90% for CT and 96% for CXR. These results coincide with those reported in [23, 27], where the performance indices of the models were also higher using CXR images than using CT images. In [22], works [47, 48, 49, 50, 51, 52] were reviewed and it was observed that they were based on small and poorly balanced data sets, with questionable evaluation procedures and without a plan for their inclusion in clinical workflows.

Several advances related to the automatic classification of CXR and CT images for the detection of COVID-19 have been reported in the scientific literature [20, 21, 23]–[28]. These review works constitute a starting point, since they systematize the main knowledge achieved so far. The main objective of reviewing them was to learn from the successes and errors of previous research, and to identify aspects that have been overlooked or little studied.

The first published work that reviews the progress made using X-ray images to detect COVID-19 was [20]. This research also explains the role of AI in the prognosis of outbreaks of the disease. As one of the existing challenges to achieve a correct classification using CXR images, it raises the need for large quantities of quality images, which, in general, are not available in international databases. The studies analyzed were [10, 53, 54, 55, 56]. In these investigations, the number of positive images used in training was less than 100, which greatly limits the generalization power of the models under the CNN paradigm. These studies performed binary classification (COVID-19 vs Normal). Since COVID-19 is a type of pneumonia, a more challenging task is to identify, among the different types of pneumonia, those caused by coronavirus.

The medical imaging scientific community has been assisted by AI in managing COVID-19, an issue reflected in [21]. Segmentation methods are needed for the identification of COVID-19, and they must be applied in two directions: first to determine the region of the lungs, and second to locate the lesions that appear within them. However, segmentation in CXR images is a more challenging task than in CT. In CT, each slice excludes the information that lies above and below it, which improves image contrast. In CXR images, on the other hand, the ribs and soft tissues are projected in 2D, producing an overlap of information that affects the image contrast. According to [21], no method had yet been developed to segment CXR images specifically for COVID-19. In fact, the investigations reviewed in that work [18, 47, 54, 56] do not use segmentation methods to locate the region of the lungs, nor to locate the lesions in them. It should be mentioned that, because of the dissimilar manifestations of the disease, it is difficult to select regions of interest with useful findings for classification, since they can appear in almost all regions of the lungs. Note that the disease has to be diagnosed using only an image that contains the region of the lungs, that is, its bounding box. According to these studies, the COVID-19 positive CXR images used in the experimentation came mostly from the set collected by Cohen [57], which contained 70 images of positive patients. The works [23, 26, 28] confirm this set of images, available on GitHub (Footnote 3), as the most used, followed by the sets available on Kaggle (Footnotes 2 and 4).

In [25], works published in reliable databases such as IEEE Xplore, Web of Science, Science Direct, PubMed and Scopus are analyzed. The study reviewed 11 articles, of which only 6 are based on CXR to identify COVID-19: [15, 16, 17] and [58, 59, 60]. It confirmed that the quality and size of the existing images for the task differ greatly from one set to another, and that the number of images available for experimentation is limited. Among the proposed alternatives are data augmentation and the segmentation of regions of interest (ROI). According to the authors, one of the important aspects in obtaining reliable models is the selection and pre-processing of image sets.

There is a consensus among all these studies that the results obtained in the diagnosis of the disease based on CT and CXR medical images are encouraging. Likewise, they criticize the limited number of positive images available to evaluate the robustness of the methods correctly, or to obtain models with enough power of generalization to be used in clinical settings. Due to this lack of images, the approaches used do not take into account the patients' disease, important information that physicians must handle. In [61] it is stated that the most common causes of risk of bias in diagnostic models based on medical images are the lack of information to evaluate selection bias and the lack of a clear report of the image annotation procedures and quality control.

Due to the high complexity of DNN, in which a large number of parameters need to be determined or tuned, deep learning methods usually require a large number of training samples. However, previous work agrees that the shortage of images for training has led research to advance with the small sets of images available and to apply data augmentation techniques when possible. Even so, that research does not discuss the limitations of the approaches used for the automatic classification of COVID-19, nor does it question the high performance achieved by the methods used. It should be taken into account that the results obtained by human specialists from CXR are far below those obtained using AI techniques. Furthermore, although CT is considered the most accurate technique for identifying typical COVID-19 findings, the best results using AI techniques are obtained with CXR.

4 Biases in the CXR image sets used

One of the fundamental aspects of achieving a significant contribution of AI in the battle against the coronavirus is the compilation of an adequate set of images in terms of quality and quantity. Despite the high number of patients with COVID-19 worldwide, there is no free set of CXR images with the quality necessary to build an AI-based diagnostic system with clinical value for the detection and follow-up of this disease. Radiologists have expressed concern about the limited availability of images to train AI-based models and the possible bias in these models [61], mainly related to the place of origin of the COVID-19 positive images.

On the other hand, it is the right of patients to decide when, how, and to what extent others can access their medical information. Therefore, the informed consent of the patient must be obtained when their data is used for scientific research purposes. In this case, a process is carried out that includes anonymizing the data. In our view, this is the main reason for the relatively low availability of data at present. Hospitals generally protect their patients' confidential information, as improper handling of data over networks can lead to legal problems.

Since the publication by Cohen et al. [57], in which a set of COVID-19 positive images was made freely available to the international scientific community, a large number of works have applied AI techniques to the automatic classification of the illness. To this day, this is the main source of COVID-19 positive images freely available worldwide. The formula used by most investigations to increase the number of negative images (those that do not present COVID-19) has been to add images from sets available from other sources, which have a different origin. This way of generating the sets introduces serious problems that affect the results of the algorithms. For example, if there is any bias in the data set, such as corner labels, typical characteristics of a medical device, or other factors such as similar age of patients, same sex, etc., the classification model learns to recognize these biases rather than focusing on the findings it is meant to detect. In fact, the images contain little or no metadata on age, gender, pathologies present in the subjects, or other information needed to detect this type of bias.

Another aspect that can introduce biases in the sets is the acquisition parameters, such as mAs and kVp, which the deep model could learn to discriminate. That is, a model can group images according to the scanner used for the exam; if some scan configurations correspond to all the pneumonia examples, they generate a spurious correlation that the model can exploit to produce apparently favorable classification accuracy. Another example is the textual labeling in the images: if all negative examples contain similar markings, the deep model could learn to recognize this characteristic instead of focusing on the lung content. In addition, these sets of images do not represent the different degrees of severity of the disease in equal proportion; the majority of patients are in an advanced stage of the disease, where the signs are more pronounced [62].

Due to the above, it is suspected that the high performance values obtained so far by AI techniques are mainly due to the fact that the images present marked differences that make the learning task easy for the algorithm. In [31] the current assessment protocols for the identification of COVID-19 from CXR images are strongly criticized, mainly for using the complete image without selecting the region of the lungs, for keeping the labels on the images and, especially, for not using an evaluation set that comes from a source not used in training. That study tests whether the CNN used was able to classify images that did not contain the region of the lungs. The lung region was replaced by a black square and, even so, the classification was successful, with an Acc greater than 95%. This demonstrated that the classification algorithms are learning patterns from the set of images that do not correlate with the presence of the disease to be detected. The heterogeneity of the images makes the CNN learn characteristics that do not belong to COVID-19 itself [31, 33, 34]. Because of the page limit, Table 1 is restricted to the works published in peer-reviewed journals that follow this methodology of selecting images from different sources to create their sets of images. This way of evaluating the algorithms does not guarantee their generalizability, as will be discussed in later sections. Note that the number of images per class presented in the table refers to the number used at the time of publication of the cited study; these amounts may have changed since then.
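The occlusion test attributed to [31] can be sketched as follows. This is an illustration under stated assumptions, not the code of that study: the occluded area is a centered rectangle whose size (60% of each dimension) is chosen arbitrarily, the images are synthetic, and `corner_classifier` is a hypothetical stand-in for a trained CNN that has learned a corner label instead of lung findings.

```python
import numpy as np


def occlude_center(image, fraction=0.6):
    """Return a copy of `image` with a centered black rectangle.

    `fraction` is the share of each dimension that is blacked out; 0.6 is an
    arbitrary value for this sketch, not a parameter taken from [31].
    """
    h, w = image.shape[:2]
    box_h, box_w = int(h * fraction), int(w * fraction)
    top, left = (h - box_h) // 2, (w - box_w) // 2
    occluded = image.copy()
    occluded[top:top + box_h, left:left + box_w] = 0
    return occluded


def accuracy(predict, images, labels):
    """Fraction of images for which `predict` returns the true label."""
    hits = sum(int(predict(img) == y) for img, y in zip(images, labels))
    return hits / len(labels)


# Synthetic data in which the class is encoded only by a bright corner marker
rng = np.random.default_rng(0)
images, labels = [], []
for label in (0, 1) * 10:
    img = rng.integers(0, 200, size=(64, 64)).astype(np.uint8)
    if label == 1:
        img[:8, :8] = 255
    images.append(img)
    labels.append(label)


def corner_classifier(img):
    # Hypothetical stand-in for a CNN that relies on the corner label
    return int(img[:8, :8].mean() > 250)


acc_full = accuracy(corner_classifier, images, labels)
acc_occluded = accuracy(corner_classifier,
                        [occlude_center(i) for i in images], labels)
```

If the accuracy barely changes after the central region is removed, as happens with this toy classifier, the model cannot be relying on the content of the lungs.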

Table 1 Main papers published in peer-reviewed journals for COVID-19 detection using CXR

Another important aspect that works against the good performance and reliability of the proposed systems is the large number of artifacts that the images contain. Many of the COVID-19 positive images show intubated patients, electrodes and their cables, pacemakers, bras (in women) and zippers, among others. This can be another considerable source of bias since, when images acquired under other conditions are classified, failing to take these characteristics into account could lead to false negatives. A detailed description of the characteristics of the image sets used in COVID-19 identification studies appears in [63]. That research highlights the biases in each of these sets that can confuse the algorithms. In [59] three sets of public access images are combined. The positive images were obtained by combining the images available on GitHub (Footnote 3) and Kaggle (Footnote 5), 76 and 219 respectively. The normal class contains 65 images and the pneumonia class contains 98 images. The image set used is available from Kaggle (Footnote 6). Figure 1 shows a selection of these images. There are marked differences among the groups of images, perceptible even to an untrained human eye, that are not related to differences produced by the diseases they contain. For example, in (a) a light-colored label always appears at the top left. Also, the black background in (a) cannot be seen in the rest of the images. On the other hand, the pulmonary structures in (c) are totally different from the rest, since they belong to children.

Fig. 1 Representation of three groups of images: (a) images positive for COVID-19, (b) normal images and (c) images with pneumonia of another type. Taken from [59]

There is no doubt that these sets of images are important for COVID-19 identification studies. However, great attention must be paid to how they are used. Most of the investigations that use sets obtained in a similar way to that explained above obtain very high performance indices. Note that the sensitivity of human specialists is around 65% [9]. All of the above suggests that it is necessary to investigate and work on the digital pre-processing of the images used to train and validate the systems, with the aim of eliminating the origin biases in the data, which cause the algorithms to overfit and leave them with little or no generalizability for clinical use.

5 Pre-processing and data augmentation

Medical images can be affected by various sources of distortion and artifacts. As a consequence, the visual evaluation of these images by human specialists, or by AI algorithms, becomes a difficult task. Therefore, one of the initial tasks to obtain better results is the pre-processing of the image. In DL environments, large amounts of images are required to train properly and to avoid overfitting. Such amounts are generally not available in medical settings, which has led to the use of a variety of techniques. One of the variants used to avoid overfitting of DL algorithms has been to enlarge the set of images [69]. This technique is called data augmentation and consists of applying transformations to the images in order to enlarge the set to be used. The main modifications made to the image set, both as part of its pre-processing and to increase its size, are discussed below.

The techniques used in the training process to enlarge the set of images include shifting the image a number of pixels along rows and/or columns, flipping it horizontally and/or vertically, and rotating it in all directions [70]. Other variants have also been applied, such as modifying the intensity of the pixels [58, 71] and different types of filtering [72].
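The geometric transformations listed above can be illustrated with a minimal NumPy sketch. The shift amounts and the helper names are hypothetical, and rotation is restricted here to multiples of 90 degrees because arbitrary angles require interpolation (for instance with `scipy.ndimage.rotate`), which is left out to keep the example self-contained.

```python
import numpy as np


def shift(image, rows, cols):
    """Translate a 2D image by (rows, cols) pixels, filling gaps with zeros."""
    h, w = image.shape
    out = np.zeros_like(image)
    src = image[max(0, -rows):h - max(0, rows), max(0, -cols):w - max(0, cols)]
    r0, c0 = max(0, rows), max(0, cols)
    out[r0:r0 + src.shape[0], c0:c0 + src.shape[1]] = src
    return out


def augment(image):
    """Return a few transformed copies of a square grayscale image."""
    return [
        shift(image, 2, -3),    # move 2 rows down and 3 columns left
        np.fliplr(image),       # horizontal flip
        np.flipud(image),       # vertical flip
        np.rot90(image, k=1),   # 90-degree rotation
    ]


image = np.arange(16, dtype=np.uint8).reshape(4, 4)
variants = augment(image)
```

Whether a given transformation is appropriate is a separate question: a horizontal flip, for example, moves the heart to the opposite side of the chest.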

Although CXR images are grayscale, some studies have used techniques to recolorize them. In [73] four pre-processing and data augmentation schemes are tested on the image set: using the original image without any pre-processing, applying the CLAHE technique [74], complementing the image, and finally combining these modifications in each of the channels. Another alternative has been to use fuzzy color techniques, as presented in [59]. New images have also been generated with the generative adversarial network (GAN) technique [75]. In [33] a variant of the GAN technique is used to generate two images per class, which are not interpretable by humans but help to improve the performance of the algorithms from 77% to 81% effectiveness.

Another of the applied techniques is the modification of the intensity of the pixels, from the adjustment of the contrast, or simply increasing or decreasing the intensity by a certain amount. In [15], the histogram equalization is carried out as a pre-processing stage, then a gamma correction of its intensity with γ = 0.5 to increase the contrast in the darker regions, which belong to lung, followed by a resizing to 256 × 256 pixels. This results in the intensities of the pixels for the heart and lungs having similar distributions in their histograms in different sets of images. This step should compensate for biases due to differences in the mAs and kVp acquisition parameters among the different image sets.
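The sequence described for [15] (histogram equalization, gamma correction with γ = 0.5 and resizing to 256 × 256 pixels) can be sketched in NumPy as follows. This is an illustration, not the implementation of that work: the images are assumed to be 8-bit grayscale, nearest-neighbor interpolation is an arbitrary choice, and the input is synthetic.

```python
import numpy as np


def equalize_histogram(image):
    """Histogram equalization of an 8-bit grayscale image via a lookup table."""
    hist = np.bincount(image.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    denom = max(int(cdf[-1] - cdf_min), 1)
    lut = np.clip(np.round((cdf - cdf_min) / denom * 255), 0, 255)
    return lut.astype(np.uint8)[image]


def gamma_correct(image, gamma=0.5):
    """Apply out = in ** gamma on [0, 1]; gamma < 1 brightens dark regions."""
    normalized = image.astype(np.float64) / 255.0
    return np.round(normalized ** gamma * 255).astype(np.uint8)


def resize_nearest(image, size=(256, 256)):
    """Nearest-neighbor resize (interpolation method assumed for the sketch)."""
    h, w = image.shape
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return image[rows[:, None], cols]


def preprocess(image):
    return resize_nearest(gamma_correct(equalize_histogram(image), gamma=0.5))


rng = np.random.default_rng(0)
cxr = rng.integers(0, 128, size=(300, 400)).astype(np.uint8)  # synthetic dark image
out = preprocess(cxr)
```

With γ = 0.5 an input intensity of 64 maps to 128, which shows how the darker lung regions are stretched over a wider range of values.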

In the detection of COVID-19 from CXR, several pre-processing methods have been applied to the images, either to extract their characteristics or to use them directly as input to CNNs. Because the images are heterogeneous in their dimensions, one of the first steps is to resize them, generally to 224 × 224 × 3 or 229 × 229 × 3 pixels, since most pre-trained CNNs use these fixed sizes as input. Image normalization has also been applied, using the mean and standard deviation obtained from the ImageNet image set [76]. However, better results have been reported when the network is first trained from scratch to identify pneumonia and transfer learning is then applied [12]. The CNN most used in this task has been ResNet, with different numbers of layers; its use has been reported in a total of 27 articles [26]. In other cases, the image is resized according to the input size of the proposed network architecture. For example, in [67] it is resized to 512 × 512 pixels. Something similar is done in [77], using images with three channels (RGB). In [18] it is resized to 480 × 480 × 3 pixels and in [78] to 200 × 200 pixels. Reducing the dimensions of the images lightens the computational cost of CNN training. Note that the CNN-based algorithms used in these tasks sometimes have more than 14 million parameters [16, 66].
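A minimal sketch of the resizing and ImageNet normalization step is given below. The per-channel mean and standard deviation are the values commonly used with ImageNet pre-trained networks and are not taken from the reviewed works; replicating the grayscale image into three channels and using nearest-neighbor resizing are assumptions of the example.

```python
import numpy as np

# Commonly used ImageNet channel statistics (RGB, images scaled to [0, 1])
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])


def to_network_input(gray, size=224):
    """Resize an 8-bit grayscale CXR, replicate it to 3 channels, normalize."""
    h, w = gray.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = gray[rows[:, None], cols].astype(np.float32) / 255.0
    rgb = np.repeat(resized[:, :, None], 3, axis=2)  # shape: size x size x 3
    return (rgb - IMAGENET_MEAN) / IMAGENET_STD


white = np.full((300, 400), 255, dtype=np.uint8)  # synthetic all-white image
x = to_network_input(white)
```

The output has shape 224 × 224 × 3, matching the input size mentioned in the text for most pre-trained CNNs.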

One problem that arises when images are resized is that algorithms generally work with square images, but the images used are not always square, so the aspect ratio of the image has to be modified. One alternative is reported in [10], where the images are scaled to a ratio of 1:1.5, leaving 200 × 266 pixels, and those that did not fit this scale were filled with zeros. This step can introduce a bias in the learning of the network: if the images that come from one data set share dimensions that differ from those of the other sets, the zero padding marks them with their origin.
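The padding bias can be made concrete with a small sketch. It assumes, for illustration only, that the target canvas is 200 rows by 266 columns and that two hypothetical sources deliver images with different native shapes; the helper names are not taken from [10].

```python
import numpy as np


def pad_to_shape(image, target_h, target_w):
    """Center the image on a zero canvas of the target shape (no cropping)."""
    h, w = image.shape
    if h > target_h or w > target_w:
        raise ValueError("image is larger than the target canvas")
    canvas = np.zeros((target_h, target_w), dtype=image.dtype)
    top, left = (target_h - h) // 2, (target_w - w) // 2
    canvas[top:top + h, left:left + w] = image
    return canvas


def zero_border_fraction(image):
    """Fraction of rows and of columns that are entirely zero."""
    empty_rows = np.all(image == 0, axis=1).mean()
    empty_cols = np.all(image == 0, axis=0).mean()
    return float(empty_rows), float(empty_cols)


source_a = np.full((200, 266), 120, dtype=np.uint8)  # already fits the canvas
source_b = np.full((200, 200), 120, dtype=np.uint8)  # square, needs padding
padded_a = pad_to_shape(source_a, 200, 266)
padded_b = pad_to_shape(source_b, 200, 266)
```

Here `zero_border_fraction` separates the two sources perfectly without looking at any anatomy, which is exactly the kind of shortcut a network could learn.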

In order to balance the training set, the data augmentation is performed before training [14, 58]. The combination of increasing the data and the balance of the classes improves the performance of the algorithms, reaching approximately 98% Acc in both investigations. However, it is not correct to also increase the test set, as is done in [58], since images that do not belong to a real set are being evaluated. Therefore, these reported results do not guarantee reliability in the final model.
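The point about not augmenting the test set can be illustrated with a sketch in which the split is performed first and the augmentation is applied only to the training part. The function names, the 20% test fraction and the toy augmentation, which only tags image identifiers, are hypothetical.

```python
import random


def split_then_augment(samples, augment, test_fraction=0.2, seed=0):
    """Hold out the test set first, then augment only the training part."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    augmented_train = train + [new for s in train for new in augment(s)]
    return augmented_train, test


def toy_augment(sample):
    # Hypothetical augmentation: tags the sample instead of transforming pixels
    image_id, label = sample
    return [(image_id + "_flip", label), (image_id + "_shift", label)]


samples = [("img%02d" % i, i % 2) for i in range(20)]
train, test = split_then_augment(samples, toy_augment)
```

Because the split precedes the augmentation, the test set contains only original images and no transformed copy of a test image can appear in the training set.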

The pre-processing stage corrects the intensity of the pixels to avoid appreciable differences between the groups of images that make up these sets. However, many of the investigations do not remove the marks in the images that can help the network to determine which class an image belongs to, without this being related to the disease to be classified. One alternative to alleviate this weakness is to use only the region that delimits the lungs, which requires applying a segmentation method. The advantages of performing this step are discussed in the next section.

6 Segmentation of the lung region

Among the alternatives used to eliminate biases related to the labels of the images, it is proposed to work only with an image that contains the region of the lungs. Segmentation separates the image into different regions, each made up of a set of pixels that share certain common characteristics. In image processing, this simplifies the representation of the image into something more useful and easier to use. Segmentation can aid in more reliable detection of COVID-19 by extracting the region of the lungs, so that areas that do not belong to the region of interest (ROI) are left out of the analysis. Studies that correctly use these methods to extract the region of the lungs before learning include [15], [32, 33, 34] and [79, 80, 81, 82].

Segmentation can be done manually by human specialists, but it is a time-consuming task. In [17] the images used are manually cropped to avoid these biases. However, there are currently segmentation algorithms capable of doing this automatically, and some DL algorithms have shown good results in segmentation tasks. In [15] the algorithms FC-DenseNet67, FC-DenseNet103 and U-Net are compared for segmenting the region of the lungs in CXR images. No significant differences were found between the behavior of the last two techniques. In fact, most studies that segment the lungs use U-Net or one of its variants [26]. Figure 2 shows one of the variants used by the researchers, which starts from a complete CXR image and arrives at a cropped image that contains only the region of the lungs.

Fig. 2 Process of extraction of the region of the lungs. U-Net is applied as a segmentation method and a cropped image is obtained
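The cropping step that follows segmentation can be sketched as below. The example assumes that a binary lung mask is already available, for instance from a U-Net whose inference is not shown; the function name, the optional margin and the toy mask are hypothetical.

```python
import numpy as np


def crop_to_mask(image, mask, margin=0):
    """Crop `image` to the bounding box of the nonzero pixels of `mask`."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        raise ValueError("mask is empty")
    top = max(int(rows[0]) - margin, 0)
    bottom = min(int(rows[-1]) + 1 + margin, image.shape[0])
    left = max(int(cols[0]) - margin, 0)
    right = min(int(cols[-1]) + 1 + margin, image.shape[1])
    return image[top:bottom, left:right]


image = np.arange(100, dtype=np.uint8).reshape(10, 10)
mask = np.zeros((10, 10), dtype=bool)
mask[2:7, 1:4] = True   # toy left lung
mask[3:8, 6:9] = True   # toy right lung
cropped = crop_to_mask(image, mask)
```

The crop keeps the bounding box that encloses both lungs, so corner labels and other marks near the image border fall outside the retained region.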

In [81] a new strategy based on CNN ensembles is successfully applied. It is shown that applying transfer learning over a similar domain, iteratively pruning the layers of the CNNs that do not activate, and finally combining the algorithms yielded good results in the identification of COVID-19. To remove irrelevant information from the image and ensure reliable DL models, U-Net was applied as a segmentation method. The images used belong to four repositories available online: Pediatric CXR21 [83], RSNA (Footnote 2) [84], which contains images from ChestX-ray8 [85], Twitter COVID-19 (Footnote 7) and GitHub (Footnote 3). A split was performed at the patient level using 90% for training and 10% for testing.
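A patient-level split such as the one mentioned for [81] can be sketched as follows. The record layout of (patient identifier, image identifier) pairs and the function name are assumptions; the 10% test fraction is applied to patients, so the proportion of images may differ slightly when patients have different numbers of images.

```python
import random


def patient_level_split(records, test_fraction=0.1, seed=0):
    """Split (patient_id, image_id) records so no patient is in both parts."""
    patients = sorted({patient for patient, _ in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [r for r in records if r[0] not in test_patients]
    test = [r for r in records if r[0] in test_patients]
    return train, test


# Toy data: 20 hypothetical patients with 3 images each
records = [("p%02d" % (i // 3), "img%03d" % i) for i in range(60)]
train, test = patient_level_split(records)
```

Splitting by patient rather than by image prevents several images of the same person from appearing on both sides of the split, which would inflate the test score.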

The study [80] proposed a cascade model to assist doctors in the diagnosis of COVID-19. In the first stage, a SEME-ResNet50 architecture classifies images into three classes: normal, bacterial pneumonia, and viral pneumonia. In the second stage, SEME-DenseNet161 distinguishes whether a viral pneumonia is COVID-19 or not. To exclude the influence of non-pathological features, the images are pre-processed with U-Net in the second stage. The results show an accuracy of 85.6% in the first stage (determining the type of pneumonia) and 97.1% in the second stage (identifying COVID-19).
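
The control flow of such a two-stage cascade can be sketched as follows. This is only an outline of the decision logic: the three callables stand in for the trained networks (SEME-ResNet50, U-Net and SEME-DenseNet161 in [80]), and the stub functions, dictionary keys and label strings are hypothetical placeholders, not the authors' implementation.

```python
def cascade_diagnose(image, stage1, segment, stage2):
    """Two-stage cascade: pneumonia type first, then COVID-19 vs. other viral.

    stage1(image)  -> "normal", "bacterial pneumonia" or "viral pneumonia"
    segment(image) -> lung-only image, applied only before the second stage
    stage2(lungs)  -> True if the viral pneumonia is classified as COVID-19
    """
    label = stage1(image)
    if label != "viral pneumonia":
        return label
    lungs = segment(image)
    return "COVID-19" if stage2(lungs) else "other viral pneumonia"


# Stub models that read the answer from the toy record (illustration only).
def stub_stage1(image):
    return image["type"]


def stub_segment(image):
    return image


def stub_stage2(image):
    return image["covid"]


cases = [
    {"type": "normal", "covid": False},
    {"type": "bacterial pneumonia", "covid": False},
    {"type": "viral pneumonia", "covid": False},
    {"type": "viral pneumonia", "covid": True},
]
results = [
    cascade_diagnose(c, stub_stage1, stub_segment, stub_stage2) for c in cases
]
```

Because the second stage only sees images that the first stage labels as viral pneumonia, the accuracy of each stage is measured on a different population, and errors of the first stage propagate to the final decision.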

In [32] the effect of lung segmentation on CNN-based identification of COVID-19 in CXR images is evaluated. U-Net was used for image segmentation, and three popular CNN models (Inception, ResNet and VGG) were used for classification. Two explainable artificial intelligence methods were used to visualize the areas on which the models based their classification. The study also evaluated the impact of building image sets from different sources, as well as the generalizability of the models. However, only the COVID-19 positive images came from different sources; the negative images came only from RSNA2. When the whole image is used, the main features on which the networks base their classification appear mostly outside the lung region and are related to marks present in the images. In addition, an experiment was conducted to determine whether a network could identify the database each image came from. The result was an F1-Score of 0.92 using the complete images and 0.7 using the segmented images. This shows that segmentation reduces the bias whereby algorithms learn to identify the source of an image, which is correlated with its label. However, it also shows that, even after segmenting the lung region, the network was still able to identify the source set.
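
The source-identification experiment can be read as a probe: each image is relabeled with the database it came from, a classifier is trained on those labels, and a high F1-Score signals that source-specific cues are present. The sketch below shows only the scoring step, with toy labels; how [32] averaged the F1-Score across sources is not stated here, so the per-class definition used is an assumption.

```python
def f1_score(y_true, y_pred, positive):
    """F1-Score of one class, computed from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy probe output: true source of each image vs. the source predicted
# by a classifier trained to recognise the database (hypothetical values).
true_source = ["A", "A", "A", "B", "B", "B"]
predicted_source = ["A", "A", "B", "B", "B", "A"]
f1_source_a = f1_score(true_source, predicted_source, positive="A")
```

A probe score close to the value expected by chance would indicate that the pre-processing removed the source cues; the 0.7 reported for segmented images is still well above that level.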

These results suggest that CNNs are learning patterns that are not directly related to the pathologies shown in the images. When the full images are used, the networks learn characteristics outside the lung region. An adequate evaluation protocol is therefore needed to determine the generalizability of the methods.

7 External set for evaluation of trained models

The studies discussed so far did not evaluate their algorithms on an external set, that is, one that does not come from any of the sources used in the training stage. Therefore, the generalizability of these models to new images from other sources is unknown. The investigations presented below follow the previous approach but use their own images to evaluate the proposed systems. In these cases, the results fall short of the high performance reported by the majority of investigations, whose evaluation set is a partition of the same image set used for training.

In [82] a cascade architecture to identify COVID-19 was presented. In the first stage, the lungs are segmented, using U-Net to predict the segmentation mask. This removes information in the images that is unnecessary for classifying COVID-19 or another disease. To prevent the system from learning inconsistent characteristics, the next step identifies whether there is any indication of pneumonia in the lung region, through a binary classification into "Normal" or "Pneumonia" with DenseNet-121 used incrementally as the CNN. In the following stage, the system attempts to classify whether the pneumonia is due to COVID-19 or to another cause. The public repositories used were PadChest [86], RSNA2 and GitHub3. In addition, three other image sets from hospitals in Taiwan, called NTUH, TMUH and NHIA, were used; these are not available internationally. Training and testing were carried out independently on the public and private sets. When the models were trained on the public sets and validated and tested on a partition of the same sets, the results were very good. This was not the case when the evaluation was carried out on the private sets, where the results were considerably lower. Sensitivity and specificity were 85.26% and 85.86%, respectively, when the public repository was used as the test set, whereas on the private repository sensitivity decreased to 50% and specificity to 40%, values consistent with random classification. To improve the results, the sets were mixed by adding images from the private set to the training of the models. This time similar values were obtained on both test sets: sensitivity and specificity were 91.43% and 99.44%, respectively, on the test set composed of images from public repositories, and 100% and 75% on the test set of images from the private repository.
This last evaluation variant does not seem adequate, since there is no external evaluation set: the same training and evaluation protocol is followed with images that come from the same sets, and it has been shown that this variant overestimates the results.
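
For reference, sensitivity and specificity are computed from the confusion matrix as TP / (TP + FN) and TN / (TN + FP). The sketch below uses toy labels chosen only to reproduce rates of 50% and 40%; they are not the counts of the private set in [82], and the function name is hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


# Toy example: 4 positives (2 detected) and 5 negatives (2 rejected).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 1, 0, 0]
sensitivity, specificity = sensitivity_specificity(y_true, y_pred)

# Balanced accuracy is the mean of the two rates; 0.5 is the chance level.
balanced_accuracy = (sensitivity + specificity) / 2
```

With a sensitivity of 50% and a specificity of 40%, the balanced accuracy is 45%, which is below the 50% expected from random guessing in a two-class problem.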

In [33] the high sensitivity reported by most models for the classification of COVID-19 is called into question. A new image set called COVIDGR-1.0 was used, containing 754 images distributed into 377 positives and 377 negatives. All images were obtained on the same CXR equipment with the same settings, and all are postero-anterior (PA) views. The positive images were divided according to severity into 76 normal, 80 mild, 145 moderate and 76 severe. This stratification of the positive class made it possible to analyze the behavior of the models according to the severity of the disease. Two of the best-performing models, COVID-Net [18] and COVID-CAPS [87], both trained on the COVIDx set [18], were evaluated. The experiments show that these models are unable to determine the presence of COVID-19 in the COVIDGR-1.0 set, since the reported Acc is approximately 50%. The COVID-Net, COVID-CAPS and ResNet-50 models were re-trained on the new set, and the results were somewhat higher, with an Acc of 65%, 61% and 72%, respectively. The newly proposed model, called COVID-SDNet, surpassed the previous models, reaching an Acc of 77%. An analysis by severity level showed that the model correctly detects 88% of moderate and 97% of severe cases. However, only 66% of the mild images and 38% of the normal ones were classified correctly, because images without marked disease findings are difficult for the systems to detect. In another experiment, the PCR-positive cases with normal radiographs were removed from the image set, and the results showed an increase in the performance indexes. The study shows that most of the models proposed to date, trained and evaluated on heterogeneous image sets, lack the capacity for generalization.
However, the study did not evaluate the proposed model on an external validation set, so there is no evidence of its generalization ability.
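
A severity-stratified analysis of this kind amounts to computing the detection rate separately within each severity group of the positive class. The sketch below is a minimal illustration with toy data; the function name and the example values are hypothetical and do not reproduce the figures reported in [33].

```python
from collections import Counter


def detection_rate_by_severity(severities, detected):
    """Fraction of positive cases detected within each severity group.

    severities: severity label of each COVID-19 positive image.
    detected:   True if the model classified that image as positive.
    """
    total = Counter(severities)
    hits = Counter(s for s, d in zip(severities, detected) if d)
    return {s: hits[s] / total[s] for s in total}


# Toy positive cases, two per severity level (illustrative values only).
severities = ["normal", "normal", "mild", "mild",
              "moderate", "moderate", "severe", "severe"]
detected = [False, False, True, False, True, True, True, True]
rates = detection_rate_by_severity(severities, detected)
```

Reporting the rate per group, rather than a single sensitivity, makes visible that an overall figure can be dominated by the moderate and severe cases while mild and radiologically normal cases are largely missed.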

The study in [34] follows the same line. A new set called CORDA, obtained in Italy and containing 447 images from 386 patients, is used for the evaluation. Extensive experiments were carried out by combining different image sets for training and testing. Two of the models with the best reported performance, COVID-Net [18] and ResNet-18, were evaluated. Even after histogram equalization followed by segmentation of the lung region, applied to try to remove the biases of the image sets, it was not possible to train models that generalize. An AUC of 0.55 and 0.61 was obtained for COVID-Net and ResNet-18, respectively, when evaluating on the CORDA set. These results indicate that the algorithms learn characteristics related to the source data set rather than to the disease being classified. Therefore, an appropriate evaluation strategy is essential to build reliable models. One way to achieve a more reliable evaluation protocol is to separate the training and test images so that the test images have a different origin from the images used in training.
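
One way to implement such a protocol is a leave-one-source-out split, in which every image from one repository or hospital is held out for testing and the model is trained on the remaining sources. The sketch below assumes each record carries a `source` field; the record layout, function name and source names are hypothetical.

```python
def leave_one_source_out(records):
    """Yield (held_out_source, train, test) with source-disjoint partitions.

    records: list of dicts, each with at least a "source" key.
    """
    sources = sorted({r["source"] for r in records})
    if len(sources) < 2:
        raise ValueError("at least two sources are needed")
    for held_out in sources:
        train = [r for r in records if r["source"] != held_out]
        test = [r for r in records if r["source"] == held_out]
        yield held_out, train, test


# Toy records from two public repositories and one hospital.
records = [
    {"image": "a1.png", "source": "public_A", "label": 1},
    {"image": "a2.png", "source": "public_A", "label": 0},
    {"image": "b1.png", "source": "public_B", "label": 0},
    {"image": "h1.png", "source": "hospital_X", "label": 1},
    {"image": "h2.png", "source": "hospital_X", "label": 0},
]
folds = list(leave_one_source_out(records))
```

For the held-out performance to be interpretable, the test source should contain both positive and negative images; otherwise a model that merely recognises the source would still appear to classify correctly.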

8 Conclusions

Only a limited number of COVID-19 positive CXR images are freely available on the internet for use by the scientific community. Most studies complete their data with negative images from other sources, and the images differ markedly from one set to another. This leads to very good results in the automatic classification of COVID-19 when the evaluation uses a subset of images from the same sets used for training. However, several studies report little or no generalization ability when the trained models are evaluated on their own image sets. Even the models trained with pre-processing techniques intended to remove the biases of the data sets showed limited results. Therefore, most of the results reported so far in the scientific literature correspond to models that learn characteristics of the sets on which they were trained. In the absence of an adequate evaluation protocol, most of the models developed still have little value in clinical settings.