Background

Lymphoma is a haematopoietic malignancy, which can be broadly categorised into Hodgkin and non-Hodgkin disease. Hodgkin lymphoma (HL) accounts for approximately 10% of all newly diagnosed cases, and its hallmark is the presence of Hodgkin and Reed–Sternberg (HRS) cells [1]. HL can be further sub-divided based on morphology and immunohistochemistry into classical Hodgkin lymphoma (cHL), which has four further sub-categories, or nodular lymphocyte-predominant Hodgkin lymphoma (NLPHL) [1]. The majority (90%) of disease is due to cHL. HL is associated with a good prognosis having an overall 5-year survival of 86.6% [2]. Non-Hodgkin lymphoma (NHL) is the most prevalent form of lymphoma with over 50 sub-types, the most common being diffuse large B cell lymphoma (DLBCL) [3]. The overall 5-year survival rate is 72% for NHL but this varies by stage and subtype [2]. DLBCL has a 5-year survival of approximately 60–80%, which has improved since the use of anthracycline-containing chemotherapy and rituximab (R-CHOP) [2, 4].

There are several pretreatment clinical prognostic tools developed to stratify both DLBCL and HL. In 1993, Shipp et al. introduced the international prognostic index (IPI) for predicting overall survival in DLBCL patients based on a retrospective study of 2031 patients treated with CHOP. The IPI has been further refined with an age-adjusted version (aa-IPI), a revised version developed following the use of R-CHOP (R-IPI), and a version based on the National Comprehensive Cancer Network database (NCCN-IPI). HL disease can be split into early (stage I and II) or advanced (stage III or stage IV) with early being split into favourable or unfavourable depending on one of the many scoring systems including, but not limited to, the German Hodgkin Study Group (GHSG), European Organisation of Research and Treatment of Cancer (EORTC), Groupe d’Etudes des Lymphomes de l’Adulte (GELA), National Cancer Institute (NCI) or National Comprehensive Cancer Network 2010 (NCCN 2010) scores. However, given the variation in the prognostic groups derived from the different scoring systems, further information obtained from imaging may improve prognostication.

2-deoxy-2-[Fluorine-18]fluoro-D-glucose (FDG) positron emission tomography/computed tomography (PET/CT) is widely used for staging and response assessment in HL and NHL [5]. Response assessment PET/CT studies are performed at various time points, including during and after treatment [5]. The parameter most commonly used in assessment is the standardised uptake value (SUV) at sites of disease, which is compared to physiological activity in reference areas such as the mediastinal blood pool and liver and is reported using an ordinal (qualitative) scale (Deauville Score (DS)).

A variety of imaging-derived quantitative parameters have been reported in the literature with potential utility for predicting prognosis or treatment outcome. These metrics range from those based on tumour volume to metabolic features, including shape and texture. At present, none have been translated into routine clinical practice. The purpose of this study was to perform a systematic review of the literature reporting the use of quantitative imaging parameters derived from pretreatment FDG PET/CT for prediction of treatment outcome for HL and DLBCL. Due to the varied nature of NHL, DLBCL was chosen as it is the most common subtype of NHL.

Methods

Search strategy and selection criteria

A search of MEDLINE/PubMed, Web of Science, Cochrane, Scopus and clinicaltrials.gov databases was performed for articles on PET/CT imaging parameters in lymphoma treatment assessment. The search strategy included three primary operator criteria linked with the “AND” function. The first criterion consisted of “lymphoma”, the second of “PET” or “positron emission tomography”, and the third of “outcome”, “prognosis”, “prediction”, “parameter”, “radiomics”, “machine learning”, “deep learning” or “artificial intelligence”. Case studies, articles not published in English, phantom studies, studies not assessing treatment outcomes using baseline imaging in HL or DLBCL, studies assessing primary anatomical presentations of lymphoma or HIV-related lymphoma, mixed pathology studies and studies assessing novel treatments were excluded. After duplicates were excluded, studies were screened for eligibility based on the title and abstract, and subsequently on the full text. The references of the articles included in the systematic review were manually reviewed to identify further publications which met the inclusion criteria. The results were stored in bibliographic management software. Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) criteria were adhered to [6].

Quality assessment

The Quality in Prognosis Studies (QUIPS) tool, which considers six areas (inclusion, attrition, prognostic factor measurement, confounders, outcome measurement, and analysis and reporting), was used to evaluate validity and bias [7]. Prompting questions and modifications applied to the QUIPS tool are detailed in Supplemental Table 1. Two authors (RF and AS) independently reviewed all studies which met inclusion criteria and scored each of the six domains as high, moderate or low risk of bias. Any discrepancies were resolved by consensus. Overall risk of bias for each paper was further categorised based on the following criteria: if all domains were classified as low risk, or there was up to one moderate risk, the paper was classified as low risk of bias. If one or more domains were classified as high risk, the paper was classified as high risk of bias. All papers in between were classified as having moderate risk of bias [8].
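The overall classification rule described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name and the lower-case rating labels are assumptions made for the example and are not part of the QUIPS tool.

```python
from collections import Counter

VALID_RATINGS = {"low", "moderate", "high"}


def overall_risk_of_bias(domain_ratings):
    """Combine six per-domain ratings into one overall rating.

    Rule as described in the text: any high-risk domain gives 'high';
    all low, or at most one moderate, gives 'low'; otherwise 'moderate'.
    """
    if len(domain_ratings) != 6:
        raise ValueError("expected one rating for each of the six domains")
    if not set(domain_ratings) <= VALID_RATINGS:
        raise ValueError("ratings must be 'low', 'moderate' or 'high'")

    counts = Counter(domain_ratings)
    if counts["high"] >= 1:
        return "high"
    if counts["moderate"] <= 1:
        return "low"
    return "moderate"
```

For example, a paper rated low in four domains and moderate in two would be classified as moderate overall under this rule.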

Results

Results are current to July 2020. The database search strings yielded 2717 results after duplicates were excluded. Following screening and assessment of eligibility, 41 articles meeting the study inclusion criteria were included. Figure 1 details the study selection.

Fig. 1

PRISMA flow diagram illustrating the methodology for study selection for the systematic review of lymphoma imaging parameters. BMI bone marrow involvement, Relapse indicates studies investigating previously treated cases

Quality assessment

No studies showed low risk of bias in all six domains (Supplemental Table 2). Only two studies demonstrated a low risk for participation; no studies had a low risk in attrition, prognostic measurement, outcome measurement or confounding factors; 33 studies had low risk for analysis and reporting. All studies were assessed as having either moderate (24/41, 59%) or high (17/41, 41%) overall risk of bias. Of the high risk studies, 6 had high risk scores of bias in participation, 5 in attrition, 8 in prognostic measurement, 8 in outcome measurement, 10 in confounding factors and 7 in analysis and reporting categories.

All studies were retrospective, with 28/41 single centre. Six reports were based on retrospective analysis of trial data from prospective studies. Four studies stated that their scanning protocol was compliant with the European Association of Nuclear Medicine (EANM) guidelines; 10/41 did not take into consideration important confounders such as different treatment regimes, stage, prognostic scores or histology. Only six studies defined the method for calculation of SUV, and seven studies used a validation cohort to test the predictive models (Table 1). Of the radiomic studies, one referenced the image biomarker standardisation initiative (IBSI) within the discussion, but none of the papers explicitly stated that they had complied with IBSI guidelines.

Table 1 Overview of study design and risk of bias for each of the studies included in the systematic review

As there were no studies deemed to be of low risk for overall bias, a decision was made to include the high risk studies in the systematic review, as removal of these would introduce its own inherent bias.

Metabolic parameters

SUV is the commonest metric extracted from PET studies. This represents a ratio of radioactivity at a given image location compared to injected whole-body radioactivity [50]. There are several iterations of SUV, including the maximum or mean SUV within a contoured area (SUVmax and SUVmean), or SUVpeak which is the average SUV of a region of interest centred on the highest uptake region within the contoured area. SUV supports other metabolic parameters such as metabolic tumour volume (MTV), which is the volume of disease contoured at a specified SUV threshold, and total lesion glycolysis (TLG), which is the MTV multiplied by SUVmean. Published evidence regarding metabolic parameters used in the pretreatment assessment of lymphoma is summarised below.
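As a minimal sketch of how these quantities relate, the Python below computes a body-weight normalised SUV and then SUVmax, SUVmean, MTV and TLG from a list of voxel SUVs. The function names, the units and the "at or above threshold" inclusion rule are assumptions made for illustration; decay correction and SUVpeak are deliberately left out.

```python
def suv_body_weight(activity_kbq_per_ml, injected_dose_mbq, body_weight_kg):
    """Body-weight SUV: tissue activity concentration divided by
    injected dose per gram of body weight.

    Assumes the dose is already decay-corrected to scan time and
    that 1 mL of tissue weighs about 1 g.
    """
    dose_kbq = injected_dose_mbq * 1000.0
    weight_g = body_weight_kg * 1000.0
    return activity_kbq_per_ml / (dose_kbq / weight_g)


def metabolic_metrics(suv_voxels, voxel_volume_cm3, threshold):
    """Return SUVmax, SUVmean, MTV (cm^3) and TLG for the voxels
    whose SUV is at or above the chosen threshold."""
    included = [v for v in suv_voxels if v >= threshold]
    if not included:
        return {"suv_max": 0.0, "suv_mean": 0.0, "mtv": 0.0, "tlg": 0.0}
    suv_mean = sum(included) / len(included)
    mtv = len(included) * voxel_volume_cm3
    return {
        "suv_max": max(included),
        "suv_mean": suv_mean,
        "mtv": mtv,
        "tlg": mtv * suv_mean,  # TLG is MTV multiplied by SUVmean
    }
```

Because MTV and TLG both depend on which voxels pass the threshold, the choice of threshold directly changes the reported values.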

SUV metrics for prediction of outcome

a) DLBCL

The majority of studies assessing the use of baseline SUVmax in DLBCL report no significant ability to predict progression-free survival (PFS) or overall survival (OS) (Table 2). Forest plots illustrating hazard ratios (HR) for PFS and OS are shown in Figs. 2 and 3. From the results included in the forest plots, the overall HR was 1.35 (95% CI 1.06–1.76) for PFS and 1.52 (95% CI 1.15–2.02) for OS. However, there is considerable heterogeneity, specifically in the PFS analysis (I² = 77%), and reporting bias is present because a number of studies which did not report any significance did not provide the results required to calculate an HR.
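To make the pooled HR and I² statistics concrete, the sketch below pools hazard ratios on the log scale with inverse-variance weights and derives I² from Cochran's Q. It assumes a fixed-effect model and standard errors back-calculated from 95% confidence intervals; the pooling model actually used for the forest plots is not stated here, so this should be read as an illustration of the general technique only. The function name and input format are hypothetical.

```python
import math


def pool_hazard_ratios(studies, z=1.96):
    """Fixed-effect inverse-variance pooling of hazard ratios.

    studies: list of (hr, ci_lower, ci_upper) tuples with 95% CIs.
    Returns the pooled HR, its 95% CI and the I^2 statistic (%).
    """
    log_hrs, weights = [], []
    for hr, lower, upper in studies:
        # Standard error of log(HR) recovered from the CI width
        se = (math.log(upper) - math.log(lower)) / (2 * z)
        log_hrs.append(math.log(hr))
        weights.append(1.0 / se ** 2)

    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, log_hrs)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    # Cochran's Q and I^2 = max(0, (Q - df) / Q)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_hrs))
    df = len(studies) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    return {
        "hr": math.exp(pooled),
        "ci_lower": math.exp(pooled - z * pooled_se),
        "ci_upper": math.exp(pooled + z * pooled_se),
        "i2": i2,
    }
```

Studies that report no HR cannot contribute a weight at all, which is why the omission of non-significant results biases a pooled estimate of this kind.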

Table 2 Studies assessing the use of standardised uptake value (SUV) in predicting outcomes in diffuse large B cell lymphoma (DLBCL) and Hodgkin lymphoma (HL)
Fig. 2

Forest plot demonstrating hazard ratios for progression-free/event-free survival for patients with DLBCL using a dichotomous cut-off value derived from SUVmax. Studies which do not provide hazard ratios are included but no estimate is given

Fig. 3

Forest plot demonstrating hazard ratios for overall survival for patients with DLBCL using a dichotomous cut-off value derived from the SUVmax. Studies which do not provide hazard ratios are included but no estimate is given

Of the studies which showed a prognostic ability for SUVmax, Gallicchio et al. reported that this was the only imaging parameter able to predict PFS when compared to TLG and MTV in a small study of 52 DLBCL patients (26 early and 26 advanced stage); a higher SUVmax was associated with a longer PFS, with a hazard ratio (HR) of 0.13 (0.04–0.46) [25]. A study by Kwon et al. assessing 92 DLBCL patients (54 stage I/II, 38 stage III/IV) reported that a SUVmax of 10.5 was significant in predicting PFS, but this was not an independent prognostic predictor at multivariate analysis with clinical factors such as age, lactate dehydrogenase (LDH) level, stage, IPI score or Eastern Cooperative Oncology Group (ECOG) status [32]. Conversely, Miyazaki et al. demonstrated that SUVmax was an independent predictor of 3-year PFS and R-IPI [38]. Chang et al. found that tumour SUVmax >19 was a significant predictor of 3-year PFS, whereas the SUVmax of sternal uptake was an independent predictor of 3-year OS in a study of 70 DLBCL patients [18]. The most extensive study evaluating SUVmax as a predictor of PFS and OS was performed by Ceriani et al. with a test cohort of 141 patients and a validation cohort of 113 patients, both containing a similar mix of stage and prognostic scores. SUVmax was not significant in predicting PFS or OS in either cohort [16].

b) HL

Five studies have assessed the use of SUVmax as a predictive parameter in HL patients, with only one reporting significance (Table 2). The largest, by Akhtari et al., showed no significant ability of SUVmax to predict PFS and OS in 267 stage I and II HL patients (74 early favourable) [12]. These findings were concordant with a study by Cottereau et al., who also found no significant ability of SUVmax to predict PFS or OS in 258 stage I and II patients. Angelopoulou et al. reported that SUVmax was a significant predictor of 5-year PFS in a study of 162 patients with a split of stages (stage I/II = 76, stage III/IV = 86) [14]. The cohort was stratified into three risk groups, SUVmax <9, 9–18 and >18, with 5-year PFS rates of 93%, 81% and 58%, respectively; multivariate analysis was not performed. Albano et al. studied the prognostic ability of liver to lesion SUV ratio and blood pool to lesion ratio in 123 older (age >65 years) HL patients [13]. They found that both parameters were significant (at univariate analysis) for PFS and OS. They also demonstrated these metrics to be independent prognostic markers when analysed with tumour stage, German Hodgkin Study Group (GHSG) risk group, MTV and TLG for PFS, and tumour stage, GHSG risk group and Deauville score for OS.

Factors affecting SUV such as scanner spatial resolution, image acquisition and PET reconstruction parameters combined with a relatively small number of events, variation in the number of early and advanced patients, differences in treatment and definition of PFS all influence the results [51, 52]. This is reflected by the variation in cut-off/threshold values used to risk-stratify patients within each of the studies.

Metabolic tumour volume and total lesion glycolysis for prediction of outcome

a) DLBCL

The potential utility of baseline MTV and TLG for predicting PFS and OS in patients with DLBCL has been reported in multiple studies (Table 3, Figs. 4 and 5). However, similar to SUVmax, there is heterogeneity in the cut-off values used, which has led to variability in the reported survival rates between groups. Overall, the HR for MTV in PFS was 3.47 (CI 95% 2.80–4.30) and 4.20 (CI 95% 2.80–4.30) for OS. Again, reporting bias is present because a number of studies which did not report any significance did not provide the results required to calculate an HR.

Table 3 Studies assessing the use of metabolic tumour volume (MTV) and total lesion glycolysis (TLG) in predicting outcomes in diffuse large B cell lymphoma (DLBCL)
Fig. 4

Forest plot demonstrating hazard ratios for progression-free survival for patients with DLBCL using a dichotomous cut-off value derived from the metabolic tumour volume. Studies which do not provide hazard ratios are included but no estimate is given

Fig. 5

Forest plot demonstrating hazard ratios for overall survival for patients with DLBCL using a dichotomous cut-off value derived from the metabolic tumour volume. Studies which do not provide hazard ratios are included but no estimate is given

One of the largest studies, by Song et al., evaluated 169 patients with DLBCL (stage II and III without extranodal disease) treated with R-CHOP [44]. Patients with an MTV of <220 cm³ had significantly better PFS and OS: 89.8 versus 55.6%, and 93.2 versus 58.0%, respectively [44]. MTV was predictive of PFS and OS regardless of stage. MTV remained significant when assessed using multivariate Cox regression with stage III disease, HR = 5.30 (95% CI 2.51–11.16) and HR = 7.01 (95% CI 2.90–16.93) for 3-year PFS and 3-year OS, respectively. In another study, Song et al. reported that MTV was a prognostic predictor in 107 patients with bone marrow involvement (BMI); patients with an MTV of >601.2 cm³ and BMI had worse PFS and OS compared to those with a smaller MTV and BMI [42]. Again, this was demonstrated to be an independent predictor when analysed with IPI, bulky disease, BMI, involved marrow MTV and >2 cytogenetic abnormalities, with an HR = 5.21 (95% CI 2.54–10.69) and HR = 5.33 (95% CI 2.60–10.90) for PFS and OS, respectively. However, there was no significant difference in survival between the smaller MTV with BMI group and a comparison cohort of patients without BMI.

MTV summarises disease burden; however, it does not account for spread. Cottereau et al. studied four different spatial metrics besides TLG and MTV in 95 DLBCL patients on baseline scans to determine if a predictive model could be created [20]. The spatial parameters consisted of Dmax (distance between the two lesions furthest apart), Dmax bulk (distance between the largest lesion and the lesion furthest away from it), SPREADbulk (sum of all distances between bulky lesions) and SPREAD (sum of all distances between lesions). They found that a model combining MTV and Dmax could significantly distinguish between three prognostic groups. The low-risk group (MTV <394 cm³ and Dmax <58 cm) had a 4-year PFS of 94% and OS of 97%; the intermediate group (either MTV >394 cm³ or Dmax >58 cm) had a 4-year PFS of 73% and OS of 88%; and the high-risk group (MTV >394 cm³ and Dmax >58 cm) had a 4-year PFS of 50% and OS of 53%.
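The dissemination metrics and the three-group model can be illustrated with a short Python sketch. Several choices here are assumptions rather than details taken from the study: lesions are represented by centroid coordinates in centimetres, distances are Euclidean, SPREAD is taken as the sum over every pair of lesions, and a value exactly equal to a cut-off is treated as not exceeding it. The function names are hypothetical.

```python
import math
from itertools import combinations


def dmax_and_spread(centroids_cm):
    """Return (Dmax, SPREAD) for a list of lesion centroids.

    Dmax is the largest distance between any two lesions; SPREAD is
    the sum of the distances between all pairs of lesions.
    """
    distances = [math.dist(a, b) for a, b in combinations(centroids_cm, 2)]
    if not distances:  # zero or one lesion: no inter-lesion distance
        return 0.0, 0.0
    return max(distances), sum(distances)


def risk_group(mtv_cm3, dmax_cm, mtv_cutoff=394.0, dmax_cutoff=58.0):
    """Assign low / intermediate / high risk from the number of
    adverse factors (MTV above cut-off, Dmax above cut-off)."""
    adverse = int(mtv_cm3 > mtv_cutoff) + int(dmax_cm > dmax_cutoff)
    return ("low", "intermediate", "high")[adverse]
```

For instance, a patient with an MTV of 500 cm³ and a Dmax of 10 cm has one adverse factor and would fall in the intermediate group under this scheme.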

Zhou et al. reported that although high baseline MTV and TLG were associated with poorer prognosis, only TLG was an independent predictor of PFS and OS in a study of 91 patients [49]. In this study, patients who demonstrated complete or partial remission were more likely to relapse if they had a high baseline TLG (40 versus 9%, p = 0.012). A possible explanation for the discrepancy between the prognostic ability of MTV and TLG in this study may be related to the correlation between MTV and TLG, confounded by relatively small sample sizes. Kim et al. evaluated TLG calculated using different MTVs derived using 25, 50 and 75% SUVmax thresholds in a mixed cohort (n = 140) of early and advanced stage DLBCL patients being treated with R-CHOP [31]. They found that all methods for calculating TLG were predictive of 2-year PFS, but only TLG50 was predictive of 2-year OS.

Ilyas et al. also studied variation in segmentation technique and its potential impact on predicting outcome in 147 DLBCL patients (46 stage I/II, 101 stage III/IV) all treated with R-CHOP [27]. The four segmentation techniques consisted of a threshold of SUV 2.5 on two software packages (PETTRA and Hermes), 41% SUVmax on Hermes software and an uptake higher than the SUVmean of a 3-cm³ region of interest (ROI) within the right lobe of the liver (PERCIST) using the Hermes software. They found strong agreement between all four methods, with the lowest intraclass correlation coefficient (0.86) being between the PERCIST and 41% SUVmax thresholds. They also reported similar receiver operating characteristic (ROC) curves for the four methods, with the area under the curve (AUC) ranging from 0.74 to 0.76 for PFS, and 0.71 to 0.75 for OS. All four methods were significant predictors of PFS and OS. However, as stated in the paper, no method is likely to apply to all patients generally. Large heterogeneous masses are likely to be undersized with percentage thresholds, low uptake lesions may be missed using a standard threshold method and disease involving the liver may impede its use as the background value. This may have a more significant impact when further metrics are introduced, such as those based on texture, when the size of the contour can also influence the reported values. The segmentation technique of choice also needs to be easily replicated.

Recently, Capobianco et al. assessed the use of artificial intelligence (AI) using a convolutional neural network (CNN) to segment the MTV [15]. They found that AI-derived MTV correlated with reference MTV derived by two independent readers with a classification accuracy of 85%. Automatic segmentation is a key step required to enable implementation of MTV or TLG into clinical practice.
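The way different thresholding rules change the contoured volume can be shown with a toy Python sketch. The method labels, the function names and the voxel values are invented for illustration, and voxels at or above the cut-off are counted as included; none of this reproduces the behaviour of a specific software package.

```python
def threshold_for(method, lesion_suvs, liver_suv_mean=None):
    """Return the SUV cut-off implied by a named segmentation method."""
    if method == "fixed_2.5":
        return 2.5
    if method == "41pct_suvmax":
        return 0.41 * max(lesion_suvs)
    if method == "liver_mean":
        if liver_suv_mean is None:
            raise ValueError("liver_suv_mean is required for this method")
        return liver_suv_mean
    raise ValueError(f"unknown method: {method}")


def segmented_volume(method, lesion_suvs, voxel_volume_cm3, liver_suv_mean=None):
    """Volume (cm^3) of voxels at or above the method's cut-off."""
    cutoff = threshold_for(method, lesion_suvs, liver_suv_mean)
    return sum(1 for v in lesion_suvs if v >= cutoff) * voxel_volume_cm3


# Hypothetical voxel SUVs: one hot, heterogeneous lesion and one faint lesion
hot_heterogeneous = [3.0, 4.0, 6.0, 30.0]
low_uptake = [2.0, 2.2, 2.4]
```

With these invented values, the 41% SUVmax rule keeps only the hottest voxel of the heterogeneous lesion (cut-off 12.3), whereas the fixed 2.5 threshold keeps all four voxels; the faint lesion is missed entirely by the fixed threshold but fully included by the percentage rule.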

b) HL

Fewer studies have investigated the predictive ability of MTV and TLG in HL patients than in DLBCL (Table 4, Figs. 6 and 7). This is likely due to the higher survival rate of HL limiting the number of events demonstrated in a single centre, and the variation in treatments and scoring systems for favourable and unfavourable disease, which affect multi-centre studies. The majority of studies involved patients on an adaptive ABVD treatment regime, and results may not be transferable to patients being treated with an adaptive BEACOPP regime. This confounding issue was highlighted in a study by Mettler et al., who assessed the prognostic ability of MTV in 310 patients with advanced HL being treated with eBEACOPP using four different contouring methods involving summation of the volume of each disease site using different defined thresholds: 41% SUVmax of each disease site, a threshold of liver SUVmax, a threshold of liver SUVmean and a fixed threshold of 2.5 SUV [35]. They found that MTV was predictive of interim PET response regardless of segmentation methodology; however, none was able to predict OS and PFS reliably. The divergent findings compared to previous studies are likely related to low event numbers and the use of a different treatment regime. Albano et al. demonstrated the significant ability of both MTV and TLG derived from 41% SUVmax in predicting PFS in both univariate and multivariate analysis in a cohort of 123 elderly patients with a mix of different treatment regimens. However, neither TLG nor MTV was predictive of OS. Cottereau et al. and Akhtari et al. both assessed the ability of MTV in cohorts of patients consisting of stage I and II disease [12, 21]. Cottereau et al. found that MTV derived from >2.5 SUV was significant in predicting 5-year PFS and OS and was significant in multivariate analysis when assessed with different early disease scoring systems. Akhtari et al. found that MTV and TLG derived from >2.5 SUV thresholding and manual soft tissue contouring were significant predictors of 5-year PFS. Reporting bias is present because a number of studies which did not report any significance did not provide the results required to calculate an HR. The overall HR for MTV in PFS was 2.13 (CI 95% 1.53–2.96) and 2.13 (1.43–3.16) in OS. Both were associated with high levels of heterogeneity, I² = 74% for PFS and I² = 70% for OS.

Table 4 Studies assessing the use of metabolic tumour volume (MTV) and total lesion glycolysis (TLG) in predicting outcomes in Hodgkin lymphoma (HL)
Fig. 6

Forest plot demonstrating hazard ratios for progression-free survival for patients with HL using a dichotomous cut-off value derived from the metabolic tumour volume. Studies which do not provide hazard ratios are included but no estimate is given

Fig. 7

Forest plot demonstrating hazard ratios for overall survival for patients with HL using a dichotomous cut-off value derived from the metabolic tumour volume. Studies which do not provide hazard ratios are included but no estimate is given

Similar to DLBCL, clinical implementation of MTV and TLG in HL depends on reaching a consensus regarding segmentation methodology, as each method yields different measured volumes, and will be facilitated by an automated process. However, variation in treatment is also likely to have an impact, and this aspect needs assessing in larger multi-centre studies.

Textural and shape analysis for outcome prediction

Textural analysis or radiomics relates to transformation of images into mineable high-dimensional data permitting invisible feature extraction, analysis and modelling for non-invasive phenotyping and outcome prediction [53]. Radiomic features can be studied in isolation or increasingly are being combined with clinical and genomic features as part of the rapidly expanding field of integrated diagnostics [54].

Aide et al. studied the use of PET/CT-derived textural features, clinical and imaging parameters to predict 2-year PFS in DLBCL patients [10]. They split patients into training (n = 105) and validation sets (n = 27) and found that Long-Zone High-Grey Level Emphasis (LZHGE) was the only independent predictor when analysed with IPI and MTV. On the validation set, it was found that a high LZHGE (>1,264,925.92) was associated with a 2-year PFS of 60%, whereas patients with a low LZHGE had a PFS of 94.1%. The study has some limitations, as only the largest area of disease was analysed, a breakdown of disease stage was not presented and 14 patients did not have standard (R-CHOP) therapy. Another study by Aide et al. investigated the diagnostic and prognostic value of axial skeletal textural features derived from PET/CT in patients with DLBCL in a retrospective cohort of 82 patients [11]. The CT dataset was initially contoured using a segmentation threshold of >150 Hounsfield units (HU) with the spinal column and half of the pelvis included. They reported that the first-order parameter skewness had the highest AUC for predicting BMI and that a cut-off value of 1.26 produced a sensitivity, specificity, PPV and NPV of 82, 82, 62 and 93%, respectively. In addition, a skewness value of <1.26 was associated with a greater 2-year PFS and OS. This was true even for the 60 patients without BMI. The study had a low event rate (22 patients had BMI), which limits the ability to create a robust prognostic model.
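The four diagnostic accuracy measures quoted for the skewness cut-off all derive from a 2 × 2 table of test result against reference standard. The sketch below shows the arithmetic; the function name is hypothetical and the counts in the usage note are invented for a cohort of the same size, not taken from the study.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from 2x2 table counts.

    tp/fn: diseased cases above/below the cut-off;
    fp/tn: non-diseased cases above/below the cut-off.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Illustrative counts only: 22 positive and 60 negative cases
example = diagnostic_accuracy(tp=18, fp=11, fn=4, tn=49)
```

With only 22 positive cases, moving a handful of patients across the cut-off shifts sensitivity by several percentage points, which illustrates why a low event rate limits the robustness of any derived threshold.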

Lue et al. investigated the use of 11 first-order, 39 higher-order features and 400 wavelet features for predicting PFS and OS in 42 HL patients (20 stage I/II, 22 stage III/IV) with 21 events within the cohort (12 relapses, 9 deaths) [34]. They found 173 radiomic features, which were significant predictors of progression after correction for multiple testing. To avoid multicollinearity, they only selected the top two features according to the AUC from each group to be included in the univariate and multivariate analysis. MTV was selected based on previous studies. They demonstrated that SUV kurtosis, stage and intensity non-uniformity (INU) derived from Grey-Level Run Length Matrix (GLRLM) were independent predictors of PFS and only disease stage and INU derived from GLRLM were independent predictors of OS.

Decazes et al. retrospectively studied PET/CT scans of 215 DLBCL patients to assess the utility of total tumour surface (TTS) and tumour volume surface ratio (TVSR) as predictive biomarkers [23], TVSR being the ratio between MTV and TTS. MTV had the highest AUC for both OS and PFS (0.71 and 0.67) when compared to TTS (0.69 and 0.66) and TVSR (0.65 and 0.61) [23]. It was reported that TVSR, MTV, IPI and type of chemotherapy were all independent prognostic parameters. Milgrom et al. investigated the use of a support vector machine model based on first- and second-order radiomic features derived from baseline PET/CT to predict relapse or refractory disease in 167 stage I-II HL patients with mediastinal involvement [37]. Ten of the groups formed the training set, and two were designated the validation set, with each group containing a single event (n = 12). Five features were selected as the most predictive (SUVmax, MTV, InformationMeasureCorr1, InformationMeasureCorr2 and InverseVariance derived from GLCM 2.5). InformationMeasureCorr1 and InformationMeasureCorr2 are the first and second measures of theoretic correlation, and Inverse-Variance is weighting of random variables to minimise variance. By combining these features, the AUC for predicting relapse for patients with mediastinal disease was 0.95. This outperformed TLG and MTV. This work highlights the potential for using AI methods in lymphoma assessment. However, the study is limited to HL with mediastinal involvement and, again, had a small number of events.

Senjo et al. demonstrated that a high metabolic heterogeneity (MH) was a predictor of 5-year PFS and OS in DLBCL across both training (n = 86) and validation cohorts (n = 64) treated at two centres [41]. They found that MH remained a significant predictor for 5-year OS for both cohorts when analysed in multivariate analysis with an ECOG score of >2, and an LDH with a relative risk of 4.75 (95% CI 1.25–18.1) and relative risk of 4.92 (95% CI 1.09–17.03) in the training and validation groups, respectively. A model was created which combined MH and MTV, which successfully risk stratified the combined training and validation cohorts into three risk groups: low MH and low MTV, low MH and high MTV or high MH and low MTV, and high MH and high MTV, with the 5-year OS being 90.4 vs. 69.5 vs. 34.8%, respectively; P < 0.001 and 5-year PFS, 84.1 vs. 43.6 vs. 27.0%, P < 0.001 respectively.

Current limitations and future challenges

One issue that needs to be addressed when using imaging parameters derived from PET for predictive modelling is the relatively low spatial resolution, which influences how much of the avidity is included within a volume when different thresholding techniques are utilised (Fig. 8) [55]. Meignan et al. used a phantom model to validate their MTV thresholding method for a patient cohort [56]. They found that a 41% SUVmax threshold gave the best concordance between contoured and actual volumes. 41% SUVmax thresholding also gave the best agreement between reviewers using the Lin concordance correlation coefficient (ρc = 0.986, CI 0.97–0.99). However, for successful clinical implementation, the time taken to apply the thresholding method, as well as its accuracy, needs to be considered. The use of a semi-automated method such as the one reported by Burggraaff et al. [57] or a deep learning derived volume as reported by Capobianco et al. is required [15]. Predictive models also need to be tested and adapted for new treatments or histological markers [58]. The ability to predict worse outcomes could allow for future treatment stratification. There is an area of unmet need, with few active studies at present. There are currently only two open/recruiting studies listed on clinicaltrials.gov assessing PET/CT parameters for outcome prediction in DLBCL, and no registered studies assessing outcomes in HL patients.
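For readers unfamiliar with the Lin concordance correlation coefficient, the sketch below implements its standard definition, ρc = 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²), using population (1/n) variances. The function name and the example volumes in the usage note are hypothetical and do not come from any of the cited studies.

```python
def lin_concordance(x, y):
    """Lin's concordance correlation coefficient between two sets of
    paired measurements (e.g. volumes contoured by two reviewers)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Penalises both poor correlation and systematic offset between raters
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)
```

Unlike the Pearson correlation, this coefficient falls below 1 when one reviewer contours systematically larger volumes than the other: `lin_concordance([1, 2, 3], [2, 3, 4])` gives 4/7 even though the two series are perfectly correlated.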

Fig. 8

Selected axial (a–c) and coronal (d) slices from an FDG PET/CT study from a patient with DLBCL demonstrating three different contouring methods (green = 41% SUVmax; red = 1.5 x SUVmean of the liver; purple = 4.0 SUV). For smaller lesions, the 41% SUVmax contour is larger than the other two methods (black arrow and arrowhead). For larger, more heterogeneous lesions, the 41% SUVmax contour is the smallest of the three (blue arrow)

Other important limitations of the published work highlighted in this systematic review are variability in methodology and lack of external validation (Table 5). This presents a number of opportunities for the future (Table 5). Further study is needed into the use of AI for imaging-based outcome prediction in lymphoma, which may permit more accurate prediction of prognosis/treatment outcome. This might also facilitate more efficient image analysis and actionable clinical decision support, potentially guiding tailored treatment for individual patients. However, large volumes of data are required to train algorithms, which can then be rigorously validated for reproducibility and generalisability; this will require cross-institutional collaboration via imaging networks to support the establishment of multi-centre trials. Implementation studies and health economic research will also be critical for clinical adoption by demonstrating that any AI application is reliable and value-based.

Table 5 Limitations of the current literature and opportunities for future work

All the described limitations have led to a moderate or high risk of bias within the literature as evaluated with our QUIPS tool. The decision to retain papers with a high risk of bias was taken as it was felt that excluding them would itself introduce bias into the review. However, this does mean the results need to be interpreted with caution. Further work in this area is clearly warranted, and efforts should be made when designing future studies to carefully consider the methodology employed so as to minimise the risk of bias which is prevalent in this field of work to date.

Conclusion

Multiple reports suggest the potential utility of various PET/CT-derived imaging parameters in lymphoma outcome modelling. Most studies are retrospective and lack external validation of described models. Robustness across different scanning protocols and institutions has also not been verified, and clinical implementation remains a future aspiration. AI techniques may offer a potential solution to some limitations of predictive modelling in this clinical scenario and warrant further evaluation.