Introduction

The emergence of artificial intelligence (AI) in the field of medical imaging has led to several breakthroughs [1, 2]. AI has already proven to be advantageous for computer-aided diagnosis in medical imaging, such as for the differential diagnosis of coronavirus disease 2019 [3], skin cancer [4], and diabetic retinopathy [5]. Moreover, it has been developed to help identify imaging-based biomarkers, leading to an improvement in the prognosis of, for example, lung cancer [6, 7], gliomas [8], and nasopharynx cancer [9]. Deep learning is an indispensable part of AI and has been reported to be extremely effective in several medical imaging-related tasks, such as image segmentation, registration, fusion, annotation, computer-aided diagnosis and prognosis analyses, lesion and landmark detection, and microscopic imaging analysis. In such studies, deep learning networks have shown capabilities to automatically extract characteristic features from images, including explicit features, such as the location, distribution, and volume size of lesions, and implicit features at different levels, which were deduced using nonlinear, independent discriminant, and invariant properties. The end-to-end automatic feature extraction does not involve human interaction, and the extracted features are the most implicit. Although the implicit features may be difficult to interpret, they are determinant for the performance of convolutional neural networks (CNNs) and play critical roles in many medical applications [10, 11].

The development of deep learning depends on the availability of a huge amount of data. It is usually challenging to gather a large cohort of patients with survival follow-up after administering the same therapeutic regime. Clinical trials are often associated with incomplete or missing follow-up due to factors such as insufficient follow-up time, patient tolerance, and compliance. This consequently hampers extensive development of deep learning methods for predicting therapeutic prognosis. Maximizing the utility of data gathered by clinical trials is thus a key area of research.

Data augmentation methods such as deformation or generative adversarial networks are often applied to support the development of deep learning methods in the field of image analysis [12]. However, the relationship among imaging, therapy, and survival is more complex than general image analyses. The increased physiological complexity makes it difficult to synthesize meaningful data for training. Furthermore, errors in data preparation may mislead algorithmic development [13]. Weakly supervised classification methods have been established using unlabeled data for regularization under particular distributional assumptions, such as cluster or smoothness assumption; however, the performance relies on the fidelity of the assumption [14,15,16], and it is usually challenging to find a proper assumption in real application. In contrast, positive–negative unlabeled (PNU) classification [15] is a weakly supervised strategy to deal with a tough task with less knowledge regarding data distribution and, therefore, is less restricted in complex applications. Despite these advantages, because PNU classification is generally applied for classification problems based on low-dimensional feature vectors [15], it is not straightforward to apply this classification to imaging data for survival follow-up in order to improve therapeutic prognosis.

Extranodal natural killer/T cell lymphoma, nasal type (ENKTL) is a rare type of lymphoma with poor survival outcome [17,18,19]. It constitutes <1% of all lymphomas in Western countries and 3–9% of all malignant lymphomas in Asia [18, 20, 21]. Several investigations have identified that almost all ENKTL lesions are fluorodeoxyglucose (FDG) avid [22, 23]. In patients with ENKTL, the use of 18F-FDG positron emission tomography/computed tomography (PET/CT) for staging is widespread [24,25,26]. Nevertheless, many contradictions exist pertaining to the value of 18F-FDG PET/CT in predicting the prognosis of ENKTL [22, 27,28,29,30]. Some studies [31, 32] have reported that maximum standardized uptake value (SUVmax) of pretreatment 18F-FDG PET/CT is not a statistically significant predictor of overall survival and progression-free survival (PFS), suggesting that tumor 18F-FDG uptake may not reflect the aggressive biologic behavior of ENKTL. However, other studies have reported contradictory results [30, 33], finding that high tumor 18F-FDG uptake was closely associated with unfavorable treatment and survival outcomes. Chang et al. [34] reported that baseline whole-body total lesion glycolysis (TLG) was a good predictor of PFS and overall survival in patients with ENKTL. However, treatment plans were not uniform in these studies, potentially affecting the treatment outcome and predictive value of pretreatment 18F-FDG PET/CT. Prospective studies have also assessed the prognostic value of 18F-FDG PET/CT in ENKTL [31, 35, 36], but given the uncertainty in the reported results, this value remains unclear. A novel solution is accordingly needed. Although deep learning has been advantageous in assisting molecular imaging to optimize therapeutic prognosis [9], it is extremely difficult to develop appropriate deep learning methods for this rare condition with only a limited number of cases.

We herein propose a weakly supervised deep learning (WSDL) method based on PNU classification to maximize the utility of incomplete and missing follow-up data so as to predict the prognosis of ENKTL. We investigated the accuracy and robustness of this data enhancement strategy on a retrospective cohort to test a therapeutic regime for ENKTL.

Material and methods

Patients

One hundred and sixty-seven patients with histopathologically diagnosed ENKTL who were recruited at Shanghai Ruijin Hospital from June 2011 to October 2020 were retrospectively included. Patients who had undergone surgical resection, radiotherapy, chemotherapy, and/or bone marrow transplantation as well as those with other malignancies were excluded. All patients underwent whole-body 18F-FDG PET/CT for initial staging before therapy and were then treated with a therapeutic regime of methotrexate, etoposide, dexamethasone, and pegaspargase (MESA). Eighty-four patients were followed up for at least 2 years. Among them, 49 received sandwiched radiotherapy for the involved local focus 21 days after two cycles of MESA. They were treated with a linear accelerator producing 6 MV photons. The radiotherapy dose was 50 Gy in 25 fractions, delivered once a day, 5 fractions per week. Chemotherapy was restarted 28 days after radiotherapy.

Of the 84 patients, 64 were randomly included in the training set; the remaining 20 were unobserved and included in the test set. The ratio of relapse to non-relapse individuals was kept the same in the test and training sets to avoid an extreme imbalance problem. PFS was the major endpoint. Recurrence and lymphoma infiltration were mainly diagnosed based on imaging methods and pathology. The remaining 83 patients without follow-up information or followed up for <2 years were also included in the training set using the proposed WSDL method. To further test the generalization of the WSDL method, data pertaining to the 83 patients were derived from three types of scanners: Scanner 1 (Discovery VCT, GE Healthcare, USA, 39 patients), 2 (Discovery MI, GE Healthcare, USA, 29 patients), and 3 (Biograph Vision, SIEMENS, Germany, 15 patients). The training set thus ultimately comprised 147 patients (Fig. 1).
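The ratio-preserving random split described above can be illustrated with a short sketch. This is not the authors' code: the function name `stratified_split`, the seed, and the label counts are hypothetical and chosen only so that the example yields a 64/20 split.

```python
import random


def stratified_split(labels, n_test, seed=0):
    """Split indices into train/test so each class keeps roughly its overall share."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    n_total = len(labels)
    test = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Number of test cases drawn from this class, proportional to class size.
        k = round(n_test * len(idxs) / n_total)
        test.extend(idxs[:k])

    test_set = set(test)
    train = [i for i in range(n_total) if i not in test_set]
    return train, sorted(test)


# Hypothetical labels (1 = relapse, 0 = non-relapse); counts are illustrative only
# and are NOT the study's actual relapse counts.
labels = [1] * 42 + [0] * 42
train_idx, test_idx = stratified_split(labels, n_test=20)
```

Because the per-class count is rounded, the test set can differ from `n_test` by a case or two when class proportions do not divide evenly; a production split would need to handle that explicitly.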

Fig. 1

A flow chart depicting the study plan. ENKTL: extranodal natural killer/T cell lymphoma, nasal type

The clinical features of the 84 patients, including gender, age, serum lactate dehydrogenase levels, Eastern Cooperative Oncology Group (ECOG) score, Ki67, β2-microglobulin, Epstein–Barr virus DNA, and B symptoms, were recorded. Ann Arbor stage, SUVmax, mean SUV (SUVmean), metabolic tumor volume (MTV), and TLG extracted from 18F-FDG PET/CT were also measured. All procedures in the study were performed in accordance with the ethical standards of the committee from Ruijin Hospital, Shanghai Jiao Tong University, School of Medicine. Written informed consent was obtained from all patients before treatment. Among the 84 patients enrolled in the clinical trial, 58 were alive (12 presented with persistent or recurrent disease at the last follow-up), and 26 had died due to a tumor-related disease. The clinical characteristics of patients in the training and test sets have been summarized in Table 1; data pertaining to the 83 patients diagnosed with ENKTL but with missing or incomplete follow-up information are also listed.

Table 1 Clinical characteristics of patients

18F-FDG PET/CT and preprocessing

Patients were required to fast for at least 6 h before 18F-FDG PET/CT, and the serum glucose level was maintained under 7.0 mmol/L. Whole-body PET from the head to thigh was performed 1 h after intravenously administering 5–6 MBq of 18F-FDG per kilogram of body weight. For Scanner 1, PET was performed in the 3D mode with an acquisition time of 2 min per bed position covering the same field as the CT scan. CT was performed using the following parameters: 120–180 mA, 140 kV, gantry rotation speed of 0.8 s, and axial section thickness of 3.75 mm. After correcting attenuation (based on CT), scatter, dead time, and random coincidences, PET images were reconstructed using 3D ordered-subset expectation maximization (OSEM) with a Gaussian filter (full width at half maximum of 6 mm), leading to images with voxel size of 5.47 mm. For Scanner 2, PET was performed in the 3D mode with an acquisition time of 1.5 min per bed position covering the same field as the CT scan. CT was performed using the following parameters: 120–180 mA, 140 kV, and gantry rotation speed of 0.8 s. PET images were reconstructed using the block-sequential regularized expectation maximization reconstruction algorithm (Q.clear, GE Healthcare, USA), which had a β value of 550 with a 256 × 256 matrix (pixel size = 2.7 × 2.7 mm², slice thickness = 2.79 mm). Finally, for Scanner 3, CT was performed using the following parameters: 146 mA, 120 kV, and spiral pitch factor of 1. Images were reconstructed using the 3D ordinary Poisson OSEM algorithm, with four iterations and five subsets, application of time-of-flight resolution modeling, and no filtering. The obtained PET images had an image matrix of 440 × 440, pixel size of 1.6 × 1.6 × 1.5 mm, and slice thickness of 2.0 mm. Lymphoma lesions in the training set were manually delineated on the fusion map of PET/CT images using ITK-SNAP (v3.6.0) by a nuclear medicine physician with 15 years of experience [9].

WSDL for feature extraction

The WSDL method, based on Residual Network-18 (ResNet-18) [37], was proposed to predict disease prognosis by exploiting the unlabeled dataset (83 patients without follow-up information). The algorithm for the WSDL method is summarized as follows:

  • Input: 3D volumetric image I of size width × height × depth

  • Ensure: Image I is a rank 3 tensor

  1. Train deep convolutional neural networks (DCNNs) with labeled data to obtain the baseline model

  2. Use baseline DCNNs to extract features from labeled and unlabeled data

  3. Build the PNU classifier to generate implicit labels for unlabeled data

  4. Re-train DCNNs with labeled and unlabeled data to obtain the final prognosis

ResNet is a convolutional neural network architecture built from residual blocks with skip connections. DCNNs were constructed for deep learning feature extraction. They are a simplified version of ResNet-18 and were implemented using the Python Keras package with TensorFlow as the backend. The 83 patients with missing or incomplete follow-up data were included in the training set along with 64 patients with follow-up data. Labels for the 83 patients were implicitly derived using the PNU classifier during the training procedure, leading to maximized prediction probability. Further details are provided in Supplementary Materials.
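The control flow of the four-step procedure listed above can be sketched on toy data. This is only an illustration under strong assumptions: a nearest-centroid model stands in for both the DCNNs and the PNU classifier (so step 3 is plain pseudo-labeling, not PNU classification), the "features" are synthetic 2-D points, and all names (`fit_centroids`, `predict`) are hypothetical.

```python
import numpy as np


def fit_centroids(X, y):
    """Stand-in 'model': one mean vector per class (0 = non-relapse, 1 = relapse)."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])


def predict(centroids, X):
    """Assign each row of X to the class of its nearest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


rng = np.random.default_rng(0)
# Synthetic data: 64 labeled and 83 unlabeled cases (sizes mirror the text only).
X_lab = np.vstack([rng.normal(-2, 1, (32, 2)), rng.normal(2, 1, (32, 2))])
y_lab = np.array([0] * 32 + [1] * 32)
X_unl = np.vstack([rng.normal(-2, 1, (40, 2)), rng.normal(2, 1, (43, 2))])

# Step 1: train a baseline model on labeled data only.
baseline = fit_centroids(X_lab, y_lab)

# Step 2: extract features for labeled and unlabeled data
# (identity here; the paper uses the baseline DCNNs as the extractor).
F_lab, F_unl = X_lab, X_unl

# Step 3: generate implicit labels for the unlabeled cases
# (the paper uses a PNU classifier; nearest-centroid is a simplification).
y_unl = predict(baseline, F_unl)

# Step 4: re-train on labeled plus implicitly labeled data.
X_all = np.vstack([F_lab, F_unl])
y_all = np.concatenate([y_lab, y_unl])
final = fit_centroids(X_all, y_all)
```

The sketch shows only how the unlabeled cases enter the second training round; the actual network architecture and PNU objective are described in the Supplementary Materials and are not reproduced here.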

In total, 128 deep learning features were extracted from the output of the average pooling layer of DCNNs for PET/CT images in the training set, which were grouped into a 16 × 8 feature map for visualization. We herein propose a new biomarker in the form of prediction similarity index (PSI), which is the ratio of the positive predicted probability value to the negative predicted probability value. It was derived from these features to predict the probability of recurrence and non-recurrence. PSI of 1 was used to differentiate between positive and negative predictions. To determine the advantages of the WSDL method, we compared it with a conventional deep learning (CDL) method in which the same DCNNs were trained only on the 64 patients followed up for at least 2 years (Fig. 2).
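As a minimal sketch of the PSI definition given above, the ratio and the cut-off at 1 can be written as follows. The function names are hypothetical, and treating a PSI of exactly 1 as negative is an assumption, since the text only specifies PSI > 1 and PSI < 1.

```python
def prediction_similarity_index(p_positive, p_negative):
    """PSI = positive predicted probability / negative predicted probability."""
    if p_negative <= 0:
        raise ValueError("negative predicted probability must be > 0")
    return p_positive / p_negative


def classify_by_psi(psi, threshold=1.0):
    """Positive prediction if PSI > threshold; PSI == threshold is treated
    as negative here, which is an assumption not stated in the text."""
    return "positive" if psi > threshold else "negative"


# Illustrative probabilities only (not study data).
psi_high = prediction_similarity_index(0.75, 0.25)
psi_low = prediction_similarity_index(0.25, 0.75)
```

With two-class softmax outputs, the two probabilities sum to 1, so PSI > 1 is equivalent to a positive predicted probability above 0.5.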

Fig. 2

An illustration of the concept of the proposed weakly supervised deep learning method

Statistics

SPSS v23.0 (SPSS Inc., Chicago, IL, USA) and GraphPad Prism 8.0.1 (GraphPad, San Diego, USA) were used for statistical analyses. Univariate analysis using the Kaplan–Meier method was performed for each variable with a potential prognostic value. Time-dependent receiver operating characteristic (ROC) analysis was performed to evaluate the discriminative ability of PSI for the prognostic prediction of ENKTL. PSI-based PFS, prediction sensitivity and specificity, and accuracy of PSI were calculated. Differences in sensitivity and specificity between the WSDL and CDL methods were compared using the Fisher’s exact test. The log-rank test was used to compare differences in PFS between the groups (PSI > 1 and PSI < 1). Multivariate analysis using the Cox proportional hazards model was used to assess the independent effects of PSI and clinical parameters of the disease. P < 0.05 indicated statistical significance.
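For readers unfamiliar with the Kaplan–Meier method used above, the product-limit estimate can be sketched in a few lines. This is an illustrative re-implementation, not the SPSS or GraphPad procedure used in the study; the function name and the toy follow-up times are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : follow-up time per patient (e.g. months)
    events : 1 if progression was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Gather every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_removed
    return curve


# Toy follow-up data (illustrative only, not study data).
times = [6, 12, 12, 18, 24, 30]
events = [1, 1, 0, 1, 0, 0]
curve = kaplan_meier(times, events)
```

Censored patients lower the number at risk without lowering the survival estimate, which is how incomplete follow-up is handled in the PFS curves.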

Results

Extraction of deep learning features

One hundred and twenty-eight features were extracted from tumor ROIs outlined on 18F-FDG PET/CT scans of each patient using the proposed WSDL method. These ROIs were outlined based on lesion locations and shapes, while non-meaningful background was cut off. The 128 features were grouped into feature maps of 16 × 8 strips. The feature maps of the test set (n = 20) have been illustrated in Fig. 3. In general, characteristic differences between relapse and non-relapse patients could be visualized on these maps. The feature maps of the training set (n = 64) have been illustrated in Supplementary Figure S1 (relapse) and S2 (non-relapse), whereas those of the 83 patients with incomplete or missing follow-up data and who were imaged using the aforementioned scanners are illustrated in Figure S3. The feature maps of the test set (Figure S4) and training set (Figure S5 for relapse, Figure S6 for non-relapse) with the CDL method have also been illustrated in supplementary figures.

Fig. 3

Visualization of the feature maps (16 × 8) representing 128 features extracted by the proposed WSDL method in the test set. Each strip represents the feature map of a patient. Red arrows indicate the characteristic difference between the (A) relapse and (B) non-relapse groups in the test cohort. PSI results with incorrect predictions have been marked by red boxes

PSI as the prognostic score

Patients with PSI > 1 were assigned a positive prediction, while those with PSI < 1 were assigned a negative prediction. The ROC curves of the results of the WSDL and CDL methods were compared (Fig. 4). With the WSDL method, in the training and test sets, PSI achieved area under the curve (AUC) scores of 0.986 (P < 0.001, 95% CI, 0.957–1.000) and 0.875 (P = 0.005, 95% CI, 0.706–1.000), respectively, in the prediction of PFS, while with the CDL method, PSI achieved AUC scores of 0.995 (P < 0.001, 95% CI, 0.984–1.000) and 0.734 (P = 0.083, 95% CI, 0.479–0.989), respectively (AUC of the training set was calculated only based on data pertaining to the 64 patients). Table 2 shows accuracy and prognosis results. In the training set, the sensitivity of the WSDL method was superior to that of the CDL method (86.7% vs 73.3%, P = 0.048), while the methods showed the same specificity (100%). Due to the small number of patients in the test set, a statistical comparison of sensitivity and specificity was not feasible.
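The sensitivity, specificity, and AUC reported above can be illustrated with a small sketch. Note the simplifications: this computes an ordinary (not time-dependent) AUC via pairwise comparison, and the PSI values and outcomes are invented for illustration; the function names are hypothetical.

```python
def sensitivity_specificity(psi_values, relapsed, threshold=1.0):
    """Sensitivity and specificity of the rule 'PSI > threshold predicts relapse'."""
    tp = fn = tn = fp = 0
    for psi, y in zip(psi_values, relapsed):
        predicted_positive = psi > threshold
        if y and predicted_positive:
            tp += 1
        elif y:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def auc(psi_values, relapsed):
    """Probability that a relapse case has a higher PSI than a non-relapse
    case, counting ties as one half (equivalent to the ROC AUC)."""
    pos = [s for s, y in zip(psi_values, relapsed) if y]
    neg = [s for s, y in zip(psi_values, relapsed) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example values (not study data): 1 = relapse, 0 = non-relapse.
psi_values = [3.0, 1.5, 0.8, 0.4, 0.2, 1.2]
relapsed = [1, 1, 1, 0, 0, 0]
sens, spec = sensitivity_specificity(psi_values, relapsed)
area = auc(psi_values, relapsed)
```

Sensitivity and specificity depend on the chosen cut-off (PSI = 1 here), whereas the AUC summarizes discrimination across all possible cut-offs.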

Fig. 4

ROC curves comparing the predictive power of PSI for PFS in the training (A) and test (B) sets. ROC, receiver operating characteristic; AUC, area under the curve; PSI, prediction similarity index; WSDL, weakly supervised deep learning; CDL, conventional deep learning

Table 2 Deep learning feature-based detection efficiency and prognosis prediction

According to PSI, patients were divided into two groups: PSI > 1 and PSI < 1. The Kaplan–Meier survival analysis method was used to compare differences in PFS between the groups. We observed that patients with low PSI (PSI < 1) showed good prognosis and long PFS, while those with high PSI (PSI > 1) showed poor prognosis and short PFS. Figure 5 shows the Kaplan–Meier curves of PFS according to PSI. The extracted PSI was able to segregate patients in the training set with different PFS in case of both the WSDL (P < 0.0001) and CDL (P < 0.0001) methods (Fig. 5A and C). Similarly, in the test set, the WSDL (P = 0.0017) and CDL (P = 0.0177) methods could distinguish patients with different PFS (Fig. 5B and D).

Fig. 5

Kaplan–Meier estimates of PFS in the training (A) and test (B) sets of patients with high and low PSI. PFS, progression-free survival; PSI, prediction similarity index; WSDL, weakly supervised deep learning; CDL, conventional deep learning

Predictive value of other clinical and imaging parameters and integrated analysis

Major clinical factors, such as gender, serum lactate dehydrogenase levels, ECOG score, β2-microglobulin levels, and Epstein–Barr virus DNA, were significantly associated with PFS in univariate analysis. Conventional imaging parameters, including PET/CT-based Ann Arbor stage, MTV, and TLG, were also significantly associated with PFS in univariate analysis (refer to Table 3 for more details). Furthermore, we combined PSI with these clinical parameters to analyze the prognosis of ENKTL using the multivariate Cox proportional hazard model. We found that PSI was the only independent significant predictor of PFS. After adjustment for the cofactors listed above, PSI derived with the WSDL method (HR, 15.183; 95% CI, 5.479–42.077; P < 0.001) showed a stronger association with PFS than PSI derived with the CDL method (HR, 7.857; 95% CI, 3.276–18.843; P < 0.001).
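The relationship between a Cox hazard ratio and its confidence interval can be checked with a short sketch. It assumes the intervals are Wald-type, that is, symmetric on the log scale (HR = exp(β), CI = exp(β ± 1.96·SE)); the text does not state how the intervals were computed, and the function names are hypothetical.

```python
import math


def log_hr_and_se(hr, ci_low, ci_high, z=1.96):
    """Recover the Cox coefficient and its standard error from a reported
    HR and 95% CI, assuming a Wald-type interval on the log scale."""
    beta = math.log(hr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, se


def wald_ci(beta, se, z=1.96):
    """Rebuild the confidence interval for the hazard ratio from beta and SE."""
    return math.exp(beta - z * se), math.exp(beta + z * se)


# Values reported above for PSI with the WSDL and CDL methods.
beta_wsdl, se_wsdl = log_hr_and_se(15.183, 5.479, 42.077)
beta_cdl, se_cdl = log_hr_and_se(7.857, 3.276, 18.843)
low_wsdl, high_wsdl = wald_ci(beta_wsdl, se_wsdl)
low_cdl, high_cdl = wald_ci(beta_cdl, se_cdl)
```

Under this assumption the rebuilt intervals agree with the reported ones to within rounding, which is consistent with, but does not prove, Wald-type intervals. The wide intervals reflect the small cohort.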

Table 3 Univariate analysis involving patients with follow-up data

Discussion

The prognosis of high-risk ENKTL patients is generally poor [32, 38], and treating such patients is thus challenging. Although new regimes have been proposed, the response remains suboptimal due to strong disease heterogeneity [38]. Prognostic index of natural killer lymphoma (PINK) is a well-established index based on age, serum lactate dehydrogenase level, performance status, and disease stage. The PINK model [39] is based only on clinical information, and patients with the same PINK score may still have different prognoses. As a clinical molecular imaging method, 18F-FDG PET/CT shows good potential to help stratify patients and optimize prognosis for the treatment of many types of cancers [9, 40,41,42]. However, considering the low incidence of ENKTL, the potential of this method for predicting the prognosis of ENKTL remains poorly explored. Conventional 18F-FDG PET/CT-related parameters, such as SUVmax, SUVmean, MTV, and TLG, have been found to show a correlation with survival, but the results have been debatable [30, 31, 36, 43]. These parameters cannot facilitate a comprehensive image-based analysis of tumors and cannot be integrated in hematological guidelines [44] because prospective studies with larger cohorts of patients and methodological harmonization are needed [45]. Our univariate analysis indicated that SUVmax and SUVmean were not related to prognosis, while MTV and TLG were related to prognosis. However, multivariate analyses indicated that none of them were associated with prognosis. Considering the rarity of ENKTL, it is difficult to predict its prognosis, particularly in small cohorts of patients.

Considering the potential of AI in facilitating data analyses to discover useful information, we aimed to develop and validate AI methods to overcome the restriction of limited data availability and to explore the prognostic value of 18F-FDG PET/CT in ENKTL. We herein proposed an AI model that could utilize incomplete or missing follow-up data to enhance the prediction potential of deep learning methods. This improved prediction power of AI led to the extraction of feature maps from 18F-FDG PET/CT as effective surrogates for prognosis prediction in patients with ENKTL. Furthermore, the method could automatically discover characteristic features in metabolic imaging. Our results confirmed the benefits of AI for comprehensive imaging analyses, wherein the proposed PSI was better than conventional clinical parameters and other PET-related parameters for prognosis prediction.

AI methods tend to be biased toward texture rather than shape, while human cognitive processes function in the opposite manner [46]. Conventional 18F-FDG PET/CT-related parameters, such as Ann Arbor stage, SUVmax, SUVmean, MTV, and TLG, have been already covered within the AI framework, and their predictive performance is reportedly inferior to that of deep learning methods [47]. The current developments occurring within the field of AI can add value to conventional PET analyses. To avoid redundancy and correlation of tested data and to lower the number of parameters tested in view of the limited size of our cohort, Ann Arbor stage, MTV, and TLG were not included in multivariate analysis, although they were found to be related to prognosis in univariate analysis. For multivariate analysis, clinical prognostic factors and PSI were included. PSI eventually emerged as the only independent predictor of PFS.

Despite their potential, the application of AI-based methods to clinical trials remains challenging due to limited sample sizes. Deep learning research is particularly difficult for rare diseases such as ENKTL. Moreover, not all recruited patients can be finally enrolled due to missing or incomplete follow-up. Therefore, we developed a WSDL method in an attempt to solve this problem. During the training of WSDL, implicit labels are generated by exploring similarities among patients, and this diversity can be captured by a deep neural network. Most supervised data augmentation methods have been developed by using unlabeled data for regularization under particular distributional assumptions, such as cluster or smoothness assumption [48]. However, the performance of such a model can be considerably deteriorated if the real data distribution violates the assumed distribution [14]. In this study, the proposed WSDL method with integrated PNU strategy did not make additional assumptions about data distribution; therefore, the performance of prognosis prediction was efficiently and robustly improved. We conducted a pilot study to reutilize the data without follow-up information to boost the prediction accuracy of patient survival; consequently, the advantages of the proposed WSDL method were confirmed in our test set. By employing WSDL, prognoses of patients in the test set could be significantly differentiated, and the results were better than on using CDL. Therefore, the proposed WSDL method may act as a practical tool for developing individualized treatment strategies using clinical trial data.

Tumor heterogeneity in baseline PET/CT images may allow better signature characterization and improve prediction of therapy response and survival in malignant tumors [49, 50]. Ko et al. [49] investigated whether the textural features of pretreatment 18F-FDG PET images could predict the prognosis for ENKTL; they reported that dissimilarity and low-intensity short-zone emphasis were significant predictors of disease progression in patients with ENKTL and were able to improve their prognostic stratification. However, there were only 17 patients in this retrospective study and details pertaining to the regimen were not mentioned. In our study, PSI was validated as a potential index for risk stratification and future management of patients with ENKTL. Compared with texture analyses, the results of deep learning are more difficult to interpret. Deep learning–based radiomics studies [9] evidently draw several image-based texture parameters and the significance of many of them cannot be explained in a clinical perspective; this hinders the application in clinical routine. In addition to the proposed PSI, we also visualized the extracted features as strips of feature maps. Although these maps did not give us an in-depth insight into physiological interpretation, they did give us an additional view of recommendations derived from the black box, and the different activation patterns may facilitate quality control in practice. The feature maps were composed of multiple features, and, therefore, they contained more information than a single scalar value of PSI. An increase in the dimension of the features may improve prediction but may lead to overfitting. On the other hand, a single scalar value is convenient for clinical interpretation. Therefore, it may be practical to consider both PSI values and feature maps to gather better, more robust information.

This study had several limitations. First, although we employed WSDL to enhance data utilization, the sample size was still small, which may reduce the test power and predictive ability of deep learning methods. Similar to other studies based on rare diseases, the difference between overall survival and PFS was not great, and we did not perform overall survival-related survival analysis. We only performed survival analysis based on PFS. Second, tumors were outlined by a specialist in medical radiology and nuclear medicine. As with previous studies, interobserver variations may exist in the manual delineation and may influence the reported results [9]. Nevertheless, deep learning methods can automatically learn features included in the hidden layers of neural networks from imaging data, and they are less sensitive to segmentation variations [51, 52]. Third, study data were collected from a single center, and external validation is thus necessary to validate our findings. Finally, potential patient selection biases may exist because of the retrospective nature of this study.

To summarize, our proposed WSDL method was able to utilize incomplete or missing follow-up data to improve survival prediction. Deep learning involving 18F-FDG PET/CT provides an effective approach for prognosis prediction in patients with ENKTL. The identified feature maps and PSI may potentially assist the stratification of patients in therapy. Future prospective studies with external validation are nevertheless warranted to confirm our findings.