Introduction

Prostate cancer is the second most common cancer in men worldwide, with 1.3 million new cases diagnosed in 2018 [1, 2]. The worldwide incidence rates significantly increased during the last decade, most likely due to the wider application of prostate-specific antigen (PSA) screening [2]. While the 10-year survival rate of prostate cancer is approximately 90%, advanced or late-stage prostate cancer may be life-threatening, in particular, in metastasized stages of the disease [3].

The 5-year risk stratification in patients with primary prostate cancer is mainly built on clinical stage, PSA, and Gleason scores derived from invasive biopsy samples [4]. Despite having profound effects on treatment planning and, thus, on patients' quality of life, this approach has a number of limitations [3, 5]. First, Gleason scoring relies on biopsy sampling and hence can neither assess the entire prostate nor fully characterize the heterogeneity of any pertinent tumor [6]. In addition, transrectal biopsy sampling has been associated with side effects such as haematospermia or haematuria [3]. Second, previously published risk classification systems were reported to have a tendency to grade primary prostate cancer incorrectly [3]. In patients with a high risk score and no metastatic disease, radical prostatectomy is the treatment of choice [7] despite the risk of potential overtreatment [8] and, at the same time, a 20–40% chance of biochemical recurrence (BCR) [9, 10].

Combined positron emission tomography/computed tomography (PET/CT) or PET/magnetic resonance imaging (PET/MRI) using radiotracers targeting prostate-specific membrane antigen (PSMA) can help to localize suspicious lesions in the prostate [11, 12]. PSMA-PET in combination with CT has been reported to improve primary tumor localization [13] and the diagnosis of recurrent prostate cancer [14, 15] in patients after radical prostatectomy, even at low PSA levels [16]. Similarly, PSMA-PET/MRI was shown to support the diagnosis of intermediate and high-risk patients as well as to detect tumor recurrence [13]. Nevertheless, the diagnosis of primary prostate cancer is still based on core-needle biopsy, with non-invasive imaging playing a role in the visual identification of lesions and/or in image guidance for biopsy sampling [17, 18].

Recently, radiomics has been argued to add value to diagnostic pathways and patient management [19]. Various studies have investigated the correlation of PSMA expression and clinical end-points in prostate cancer patients [14, 15]. Furthermore, radiomics combined with machine learning in MRI [20, 21] as well as in PET/CT [22, 23, 24] demonstrated the feasibility of establishing novel in vivo prediction models for prostate cancer risk assessment.

In light of the potential of combining PET/MR imaging, radiomics and machine learning (ML), the objectives of this study were as follows: (a) to establish and cross-validate prostate lesion low-vs-high risk in vivo ML predictive models built on PET/MRI radiomics, (b) to establish and validate biochemical recurrence and overall patient risk (OPR) models that utilize in vivo ML scores instead of biopsy grades together with PSA and clinical stage, and (c) to compare the above patient risk models to the standard risk stratification.

Materials and methods

Patient data

Patients were selected from the database (n = 122) of a mono-centric pilot study for a prospective randomized trial (clinicaltrials.gov NCT02659527) conducted between 2014 and 2015. Fifty-two of the 122 patients underwent surgery; in these patients, PET/MRI, PSA values, pre-operative biopsy results, and post-operative whole-mount histopathology were documented [15] (Table 1). All 52 patients underwent a dual-tracer, fully integrated PET/MRI scan ([18F]FMC and [68Ga]Ga-PSMA-11 sequentially). This study, however, only included the [68Ga]Ga-PSMA-11 PET image as well as the transverse relaxation time-weighted (T2w) and apparent diffusion coefficient (ADC) MRI sequences in the analysis (Supplement: Table 1). All patients were treated with radical prostatectomy according to guideline recommendations [3]. All surgical specimens were processed in whole-mount sections according to the institution’s standard pathologic procedures. Staging and grading were performed according to the UICC TNM classification and the WHO/ISUP 2005 system, respectively [25]. The study was approved by the local institutional ethical committee and patients provided their written informed consent. See Fig. 1 for the CONSORT study diagram.

Table 1 Characteristics of the 52 patients involved in this study, at the time of radical prostatectomy (RP)
Fig. 1 The analysis workflow of the collected dataset. The pre-study of the prospective randomized trial NCT02659527 provided data records of 122 patients between 2014 and 2015. Patients who had dual-tracer positron emission tomography/magnetic resonance imaging (PET/MRI), prostate-specific antigen (PSA) screening, and whole-mount histopathology obtained after surgery were included in the analysis (n = 52). Only [68Ga]Ga-PSMA-11 PET, apparent diffusion coefficient (ADC), and transverse relaxation time-weighted (T2w) MRI images were selected for radiomic analysis. Overall, 121 PET/MRI-positive lesions were delineated in the 52 patients, followed by radiomic feature extraction. The 121 lesions underwent area under the receiver operator characteristics curve (AUC) analysis of prostate-specific membrane antigen (PSMA) standardized uptake values (SUV) and lesion volume. A Monte Carlo (MC) cross-validation scheme was utilized to generate patient training and validation sets 1000 times. This MC scheme was utilized to build lesion low-vs-high (LH) prediction models via machine learning (MLH). Biochemical recurrence (BCR, n = 36) and overall patient risk (OPR, n = 50) patient prediction models were built across the same MC folds (MBCR and MOPR respectively). All machine learning models underwent confusion matrix analytics, sham data analysis, and AUC analysis across MC folds. BCR and OPR were also predicted by the standard D’Amico score

Delineation

Delineation and annotation of prostate lesions on PET/MR images were performed using the Hybrid 3D software ver. 4.0.0 (Hermes Medical Solutions, Stockholm, Sweden). Here, [68Ga]Ga-PSMA-11 PET and T2w as well as ADC MR images were viewed side-by-side with the annotated, whole-mount histopathological slices. Delineation was done over the [68Ga]Ga-PSMA-11 image using standard three-dimensional iso-count VOIs (Fig. 2). The initial lesion delineations were cross-examined and, if required, corrected manually as part of an independent review process performed by PET and MRI specialists. This step resulted in 121 lesions in total. An additional reference region was defined in the gluteus muscle to normalize the standardized uptake value (SUV) of [68Ga]Ga-PSMA-11 and the T2w arbitrary voxel values to the mean of their respective reference background [26].

Fig. 2 (A) Positron emission tomography/magnetic resonance imaging (PET/MRI) views of a prostate cancer patient with volumes of interest (VOIs) drawn over lesions with Gleason 4 (red) and high-grade PIN (blue) patterns. Standard iso-count 3D VOIs were drawn over the [68Ga]Ga-PSMA-11 PET in the Hermes Hybrid 3D software. First row: [68Ga]Ga-PSMA-11 PET; second row: apparent diffusion coefficient (ADC) MRI; third row: fused [68Ga]Ga-PSMA-11 PET and transverse relaxation time-weighted (T2w) MRI images. Note that each image is represented in its own frame of reference, while the fused PET/MRI view is aligned to the frame of reference of the T2-weighted MRI. Hence, the cross-sections of the drawn VOIs look different on each view. (B) An example histopathological slice with the same color codes as in the PET/MRI views (red: Gleason 4, blue: high-grade PIN)

Feature extraction

Each image was resampled to 2.0 × 2.0 × 2.0 uniform voxel resolution via ordinary Kriging interpolation [27, 28]. Radiomic features with “very strong” or “strong” consensus values according to the Imaging Biomarker Standardization Initiative (IBSI) guidelines were extracted from the 121 resampled [68Ga]Ga-PSMA-11, T2w and ADC lesions by the MUW Radiomics Engine (ver. 2.0), which was validated based on IBSI standards [29] (Supplement Table 1). Conventional standardized uptake values including SUVmax, SUVpeak, SUVmean, and SUVTLG were merged with the 442 extracted radiomic features to compose a feature vector of length 446 for each lesion. While total lesion glycolysis (TLG) was originally proposed for [18F]FDG, it was included in our analysis as a descriptor of [68Ga]Ga-PSMA-11 accumulation in prostate lesions.

Feature redundancy reduction

Feature redundancy ranking and reduction were done across the 446 features by covariance matrix analysis [19], where two features were considered redundant if their absolute Pearson correlation coefficient was higher than 0.75. This step resulted in keeping 80 features for further analysis.
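
The correlation-based filtering described above can be illustrated with a minimal sketch. The greedy keep-or-drop order used here is an assumption for illustration only; the source does not specify how redundant features were ranked. The function names and the toy feature values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant feature counts as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def reduce_redundancy(features, threshold=0.75):
    """Keep a feature only if |r| <= threshold against every feature kept so far.

    `features` maps a feature name to its values across lesions. Dictionary
    order stands in for the redundancy ranking (an assumption of this sketch).
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy values for three features measured in five lesions.
toy = {
    "suv_max": [1.0, 2.0, 3.0, 4.0, 5.0],
    "suv_peak": [1.1, 2.1, 2.9, 4.2, 5.0],   # nearly collinear with suv_max
    "adc_iqr": [2.0, -1.0, 0.5, -1.5, 1.0],  # weakly related to suv_max
}
kept = reduce_redundancy(toy)
```

In this toy case the second feature is dropped because it is almost perfectly correlated with the first, while the third is retained.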

Reference standard

The respective whole-mount histopathology patterns of each delineated lesion were dichotomized as low risk (≤ Gleason 3, prostatic intraepithelial neoplasia (PIN), prostatitis, benign prostatic hyperplasia (BPH)) or high risk (≥ Gleason 4). Furthermore, BCR and OPR reference values were established for each patient. BCR was defined as two consecutive PSA values above 0.2 ng/ml. Follow-up was generally every 3 months for the first 2 years, then semiannually until the fifth year, then annually. Mean follow-up was 41 months. OPR was defined as high if BCR was positive or if the node stage (clinical or pathological) or the metastasis stage (clinical or pathological) was positive.

Statistical analysis in [68Ga]Ga-PSMA-11

The area under the receiver operator characteristic curve (AUC) was calculated for conventional SUVs and the volume of each delineated lesion in the [68Ga]Ga-PSMA-11 image to estimate their performance in predicting low-vs-high lesion risk. This process included SUVmax, SUVpeak, SUVTLG, and lesion volume values.
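
As an illustration of how a single scalar such as SUVmax can be scored against the low-vs-high reference, the sketch below computes the AUC through its rank interpretation: the probability that a randomly chosen high-risk lesion has a larger value than a randomly chosen low-risk lesion. This is a generic formulation, not necessarily the software used in the study, and the numbers are hypothetical.

```python
def auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical SUVmax values and reference labels (1 = high risk, 0 = low risk).
suv_max = [3.1, 8.4, 5.0, 12.2, 4.2, 9.9]
high_risk = [0, 1, 0, 1, 1, 0]
value = auc(suv_max, high_risk)  # 6 of 9 positive-negative pairs are ordered correctly
```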

Cross-validation scheme

A Monte Carlo (MC) cross-validation scheme was utilized to randomly assign training and validation roles to the 52 patients 1000 times. In each fold, five patients were selected for the validation role, while the remaining patients were assigned the training role. Assigning roles at the patient level was necessary to avoid mixing lesions from the same patient between training and validation. No repetitions were allowed during the generation of MC folds; thus, each of the 1000 fold configurations with its training-validation selection was unique.
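
A minimal sketch of such a patient-level fold generator is given below, assuming patients are identified by integer IDs and that repeated validation selections are simply rejected and redrawn. The function name, the seed, and the rejection strategy are illustrative assumptions.

```python
import math
import random

def monte_carlo_folds(patient_ids, n_folds, n_validation, seed=0):
    """Draw unique patient-level training/validation splits."""
    patients = sorted(patient_ids)
    if n_folds > math.comb(len(patients), n_validation):
        raise ValueError("not enough distinct validation sets")
    rng = random.Random(seed)
    seen, folds = set(), []
    while len(folds) < n_folds:
        validation = frozenset(rng.sample(patients, n_validation))
        if validation in seen:
            continue  # reject a repeated configuration and draw again
        seen.add(validation)
        training = [p for p in patients if p not in validation]
        folds.append((training, sorted(validation)))
    return folds

# 52 patients, 1000 folds, five validation patients per fold.
folds = monte_carlo_folds(range(52), n_folds=1000, n_validation=5)
```

Lesions would then inherit the role of their patient in each fold, so that no patient contributes lesions to both sets.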

Machine learning scheme

A mixed ensemble learning scheme built on random forest (RF) classifiers was utilized to build models for predicting lesion LH, patient BCR as well as OPR (models denoted as MLH, MBCR and MOPR respectively) [26, 30, 31]. Nine RFs with various hyperparameters were configured for each of the three model schemes (Supplemental Table 2). The final prediction was provided by the majority vote of the respective nine RFs. This approach was chosen to minimize hyperparameter bias and to increase predictive performance [32]. Furthermore, the average predictive score of the nine RFs provided a continuous value between 0.0 and 1.0 reflecting the prediction certainty of the mixed ensemble. This value could therefore be subjected to AUC analysis across MC folds.
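
The aggregation step of the mixed ensemble can be sketched as follows. The sketch assumes that each of the nine RFs outputs a score between 0.0 and 1.0 for the positive class and that a member votes positive at a score of 0.5 or higher; both the threshold and the function name are assumptions, and the RF training itself is not shown.

```python
def ensemble_predict(member_scores, threshold=0.5):
    """Combine member scores into a majority-vote label and a mean score."""
    if not member_scores:
        raise ValueError("at least one member score is required")
    votes = sum(1 for s in member_scores if s >= threshold)
    label = 1 if votes > len(member_scores) / 2 else 0
    certainty = sum(member_scores) / len(member_scores)
    return label, certainty

# Hypothetical scores of nine random forests for one lesion.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]
label, certainty = ensemble_predict(scores)  # five of nine members vote positive
```

The label is used for confusion matrix analytics, while the mean score is the continuous value suited to AUC analysis.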

Lesion low-vs-high risk prediction

Training and validation lesion sets were generated according to the pre-generated MC scheme roles to train and validate the MLH models in each MC fold. In order to keep model complexity minimal and to reduce the chance of overfitting, the five top-ranking features were selected by R-squared ranking in the training dataset prior to establishing the MLH lesion model in each fold [33]. The same five features were then selected from the respective validation dataset for evaluation. Validation model performance was estimated via confusion matrix analytics across the predictions of the validation cases of the MC folds [26]. The MLH scheme also underwent AUC analysis by evaluating the predictive performance of the averaged score of its nine RFs across the MC validation cases. Last, to estimate the effect of sham data on the MLH model, confusion matrix analytics were also performed over randomly permuted labels across all MC folds [24, 34].
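
The per-fold feature selection can be sketched as below. The sketch assumes that the R-squared of a feature is the squared Pearson correlation between the feature and the binary reference label in the training set, which is one common reading of univariate R-squared ranking; the source does not spell out the exact formula. Names and toy values are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between a feature and the labels."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant column carries no information
    return (sxy * sxy) / (sxx * syy)

def top_features(train_features, train_labels, k=5):
    """Rank features on the training set only and return the k best names."""
    ranked = sorted(
        train_features,
        key=lambda name: r_squared(train_features[name], train_labels),
        reverse=True,
    )
    return ranked[:k]

# Toy training data: three features, four lesions; k=2 for brevity.
labels = [0, 0, 1, 1]
train = {
    "f_a": [1.0, 2.0, 3.0, 4.0],  # R^2 = 0.8
    "f_b": [0.0, 0.0, 1.0, 1.0],  # R^2 = 1.0
    "f_c": [1.0, 0.0, 1.0, 0.0],  # R^2 = 0.0
}
selected = top_features(train, labels, k=2)
```

The returned names are then used to subset both the training and the validation lesions of the same fold, so the validation data never influence the ranking.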

Feature weighting

The importance of each feature in predicting lesion low-vs-high risk was determined by counting the occurrence of all selected features across the MC folds by the R-squared ranking approach.
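
A minimal sketch of this occurrence counting, assuming the selected feature names of each fold are available as a list (the names below are hypothetical):

```python
from collections import Counter

def feature_occurrence(selected_per_fold):
    """Count in how many folds each feature was among the selected ones."""
    counts = Counter()
    for selected in selected_per_fold:
        counts.update(set(selected))  # a feature counts once per fold
    return counts

# Hypothetical selections from three folds.
selections = [["a", "b"], ["a", "c"], ["a", "b"]]
counts = feature_occurrence(selections)
```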

Patient biochemical recurrence and overall risk prediction

Patient risk models for predicting BCR and OPR were established (MBCR and MOPR respectively) analyzing the PSA, the enumerated clinical stage (Supplemental Table 3), and a composite MLH score (CLH) per patient calculated by eq. 1.

$$ CLH=\sum_{i=1}^{k}\frac{M_{LH}(i)\,v_i}{V} \tag{1} $$

where k is the number of lesions in the given patient, MLH(i) is the predicted low-vs-high risk score of lesion i provided by the MLH model of the given fold, vi is the volume of lesion i, and \( V={\sum}_{i=1}^k{v}_i \) is the sum of lesion volumes in the given patient.
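
Eq. 1 is a volume-weighted mean of the lesion scores. A short worked sketch with hypothetical values for a patient with two lesions:

```python
def composite_lh_score(lesion_scores, lesion_volumes):
    """Volume-weighted mean of lesion MLH scores for one patient (Eq. 1)."""
    if not lesion_scores or len(lesion_scores) != len(lesion_volumes):
        raise ValueError("one score and one volume per lesion are required")
    total_volume = sum(lesion_volumes)
    if total_volume <= 0:
        raise ValueError("total lesion volume must be positive")
    weighted = sum(s * v for s, v in zip(lesion_scores, lesion_volumes))
    return weighted / total_volume

# Two lesions with scores 0.9 and 0.2 and volumes 3.0 and 1.0 (arbitrary units):
# (0.9 * 3.0 + 0.2 * 1.0) / 4.0 = 0.725
clh = composite_lh_score([0.9, 0.2], [3.0, 1.0])
```

The larger lesion dominates the composite score, which is the intended effect of the volume weighting.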

Training and validation patient sets containing the above value triplets were generated according to the pre-generated MC scheme roles to train and validate the MBCR and MOPR models in each MC fold. If a patient with a validation role in the given fold had no BCR or OPR reference value available, that patient was excluded from the respective cross-validation of the given patient model.

To handle class imbalance, the training set was corrected by the synthetic minority oversampling technique (SMOTE) [24, 35], independently for the MBCR and the MOPR training. Confusion matrix analytics were calculated across the validation sets of all MC folds of the MBCR and MOPR model schemes. The same process was repeated with permuted reference labels across the MC folds to estimate the effect of sham data. Both the MBCR and MOPR models underwent AUC analysis across the MC cross-validation folds.
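
The core idea of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified stand-alone version for illustration, not the implementation used in the study: it ignores feature scaling, uses Euclidean distance, and works on hypothetical (PSA, enumerated clinical stage, CLH) triplets.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points between minority samples and neighbours."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are required")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the line from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class training patients: (PSA, stage code, CLH).
minority_patients = [(6.5, 2.0, 0.3), (9.1, 3.0, 0.7), (7.8, 2.0, 0.5), (12.0, 4.0, 0.9)]
synthetic = smote(minority_patients, n_new=4)
```

Oversampling is applied to the training patients of a fold only, so the validation patients remain untouched real cases.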

Results

Patients

Of the 52 patients, 36 had BCR information and 50 had OPR information available at the time of conducting the study. At the time of radical prostatectomy, the average PSA was 7.5. The most common pathologic stages were stage 2 (n = 20, 38%) followed by 3b (n = 17, 33%) and 3a (n = 11, 21%). Total Gleason score occurrences were GS ≥ 8 (n = 35, 67%) followed by GS 7 (n = 14, 27%) and GS 6 (n = 3, 6%) (Table 1). The 121 delineated lesions represented a wide range of benign and malignant pathological alterations (Table 2). The most common pattern was Gleason 4 (n = 50, 41%), followed by Gleason 3 (n = 17, 14%) and Gleason 5 (n = 11, 9%). Low-vs-high risk pattern regions were represented with balanced occurrences (n = 61-vs-60) (Table 2).

Table 2 Characteristics of the 121 delineated lesions in the 52 patients

Statistical analysis in [68Ga]Ga-PSMA-11

The AUC values of the SUV metrics were 0.80 for SUVmax, 0.74 for SUVpeak, and 0.64 for SUVTLG. Lesion volume presented an AUC of 0.53. In contrast, the low-vs-high lesion prediction model (MLH) demonstrated a cross-validation AUC of 0.86, which was higher than that of any conventional [68Ga]Ga-PSMA-11 value (Fig. 3).

Fig. 3 Area under the receiver operator characteristics curves (AUC) of conventional standardized uptake values (SUV) as well as lesion volume together with the machine learning low-vs-high lesion risk scores. Note that the MLH AUC performance is a conservative estimate, as it is a Monte Carlo cross-validation AUC, while the SUV and volume curves were measured directly from the whole dataset

Lesion low-vs-high risk prediction

The MLH model validation performance as per the MC cross-validation scheme yielded 71% sensitivity, 90% specificity, 88% positive predictive value, 75% negative predictive value, 81% accuracy, and 0.86 AUC. Sham data analysis revealed 0.52 AUC for permuted labels in the MLH model.
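
For reference, the confusion matrix analytics reported here are defined as in the sketch below, which pools binary predictions and reference labels and derives the five metrics. The input values are hypothetical and do not reproduce the study data.

```python
def ratio(numerator, denominator):
    """Safe division that returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, NPV and accuracy for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }

# Hypothetical pooled validation predictions (1 = high risk, 0 = low risk).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = confusion_metrics(y_true, y_pred)
```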

Feature weighting and distribution

Overall, seven features were selected at least once across the 1000 MC folds via the R-squared ranking method. The features that were always selected (n = 1000) were the coefficient of variation and the gray level co-occurrence matrix (GLCM) information correlation type 1 from the [68Ga]Ga-PSMA-11 image. [68Ga]Ga-PSMA-11 SUVmax was the third most frequently selected feature (n = 974), followed by the interquartile range of the ADC image (n = 886). GLCM joint entropy and SUVmean from the [68Ga]Ga-PSMA-11 image were moderately prominent (n = 573 and n = 509 respectively). The lowest-ranking feature (n = 58) was high gray zone emphasis in the [68Ga]Ga-PSMA-11 image (Fig. 4).

Fig. 4 Occurrence of the highest ranked features across the 1000-fold Monte Carlo cross-validation scheme. PSMA: [68Ga]Ga-PSMA-11 positron emission tomography (PET); stat.cov: coefficient of variation; cm.info.corr.1: gray level co-occurrence matrix information correlation type 1; ADC: apparent diffusion coefficient; stat.iqr: interquartile range; cm.joint.entr: gray level co-occurrence matrix joint entropy; dzm.hgze: gray level distance zone matrix high gray zone emphasis

Patient biochemical recurrence and overall risk prediction

The cross-validation performance revealed an average validation accuracy of 89% and 91% as well as AUC of 0.90 and 0.94 for the MBCR and MOPR patient models respectively. The MOPR model outperformed the MBCR model with 94% specificity, 93% positive predictive value, and with 87% sensitivity. The performance of MOPR and MBCR with sham data revealed 0.54 and 0.56 AUC respectively. See Fig. 5 for the detailed performance values of the MBCR and MOPR models.

Fig. 5 Left: validation performance estimations of predicting biochemical recurrence (BCR) by the MBCR and clinical standard models. Right: validation performance estimations of predicting overall patient risk (OPR) by the MOPR and clinical standard models. SENS: sensitivity; SPEC: specificity; ACC: accuracy; PPV: positive predictive value; NPV: negative predictive value. Confusion matrix values are in percentages. Note that the confusion matrix analytics of the standard risk estimator were calculated on the whole dataset, as it is an established model, while the performance of the MBCR and MOPR models was calculated through Monte Carlo cross-validation

Discussion

In this study, we investigated the feasibility of predicting prostate lesion-specific low-vs-high risk built on PET/MRI radiomics and patient-specific biochemical recurrence as well as overall patient risk. We demonstrated excellent cross-validation performances for MLH (AUC 0.86) as well as for MBCR (AUC 0.90) and MOPR (AUC 0.94). Based on the above approaches and our achieved model performances, we consider that our findings have important clinical implications in the field of primary prostate cancer risk assessment as they point towards the feasibility to estimate lesion and patient risks in vivo.

In addition to establishing the above models with radiomics and machine learning, conventional [68Ga]Ga-PSMA-11 SUV and volume analyses were also conducted. These analyses revealed that SUVmax had the highest predictive power (AUC 0.80) to classify low-vs-high prostate lesions, followed by SUVpeak and SUVTLG, while lesion volume had no significant predictive power (AUC 0.53). These findings are in line with previous analyses performed in PET/CT [24].

Feature ranking across our Monte Carlo folds demonstrated that [68Ga]Ga-PSMA-11 is the most important in vivo feature source to establish lesion risk prediction models compared to ADC and T2w MRI features. The highest-ranking [68Ga]Ga-PSMA-11 features were either simple statistical values such as the coefficient of variation and SUVmax or simple second-order textural ones such as information correlation from the GLCM feature category. Information correlation is a GLCM feature reflecting the information content (a.k.a. entropy) of voxel neighborhood connectivity occurrences; thus, it is a basic heterogeneity descriptor. This feature was previously also identified as highly robust across various PET imaging centers [36]. The feature ranking across MC folds identified SUVpeak, SUVTLG, and volume as low-ranking; however, SUVmax was among the highest-ranking ones. While the potential of PSMA SUVmax in characterizing prostate cancer has been presented [37, 38], Cysouw et al. concluded in a recent study that prostate risk in PSMA PET can be better characterized by textural parameters than by SUVmax [24]. They utilized [18F]-DCFPyL PET/CT and reported an AUC of 0.81 to differentiate high-risk (GS ≥ 8) and low-risk prostate cases. Our findings, on the other hand, demonstrate that conventional SUV parameters in combination with simple textural features can yield high-performing models in [68Ga]Ga-PSMA-11 PET/MRI to characterize prostate risk.

While no T2w feature was selected as high-ranking, the ADC interquartile range (also referred to as the “robust” value range) was selected as high-ranking. Prior studies focusing on ADC analysis to predict prostate lesion risk consistently identified ADCmin, ADCmean as well as ADCmedian [20, 39] as highly predictive (AUC range 0.72–0.90). We consider that the above findings and ours describe the same phenomenon, namely, the strong predictive ability of simple ADC values without the need to incorporate second- or higher-order radiomic features in the analysis. The above findings in prior reports demonstrate the predictive performance of PSMA PET and ADC MR images individually. Hence, we hypothesize that the high performance of our MLH model is due to the fact that it combines both [68Ga]Ga-PSMA-11 PET and ADC MRI features in one model scheme.

Further to the above findings, we also established patient biochemical recurrence (MBCR) and overall patient risk (MOPR) models. In order to provide an in vivo score per patient in lieu of biopsy grades in these models, we created a CLH score which weighted each MLH score per lesion with its respective volume in each patient. Since volume was identified as non-predictive to classify low-vs-high risk in prostate lesions (AUC 0.53), we assumed that the volume effect [40] in our high-ranking features was negligible, and thus, that lesion volume was independent of our lesion MLH scores. This assumption allowed us to utilize volume as a weight factor for each lesion MLH score to compose the patient-specific CLH score. The resulting CLH score in combination with PSA and clinical stage values yielded high-performing MBCR and MOPR models (0.90 and 0.94 cross-validation AUCs respectively). We assume that the accuracy increase of +20% and +21% in our MBCR and MOPR models compared to standard risk estimation is due to the following reasons. First, the clinical standard utilizes Gleason patterns from biopsy to describe lesion pattern risks in the prostate [41]. Biopsy is considered imperfect as it may not be able to describe the overall heterogeneity of the prostate lesions [19, 42]. In contrast, our CLH score could characterize whole prostate lesions in vivo. Second, the clinical standard categorizes the PSA, the Gleason, and the clinical stage values independently into three categories (low, medium, and high risk). In contrast, we incorporated PSA, clinical stage, and the CLH score without re-binning them, thus avoiding potential information loss. Third, the clinical standard score acts as a maximum filter across its pre-binned risk categories to estimate overall risk to the patient. In contrast, the random forest ensemble logic in our MBCR and MOPR models could describe more complex relationships among PSA, clinical stage, and our in vivo CLH score. Our results demonstrate that such relationships may indeed be present and that building on those relationships may lead to in vivo risk predictive models in prostate cancer patients with the potential to eliminate the need for biopsy sampling in the future.

This study had a number of limitations. First, it built on a single-center cohort; however, due to utilizing a pre-generated MC fold scheme for all training and validation processes, no training and validation samples were mixed between the lesion and patient predictors. In addition, the utilized data preparation (redundancy reduction, feature ranking, and class imbalance correction) as well as training (mixed ensemble) and validation (1000-fold CV, sham data analysis) approaches minimized the chances of false discoveries. Second, due to the dual-tracer study design from which our images were taken, the [68Ga]Ga-PSMA-11 scans were not entirely free of [18F]FMC uptake remnants. Nevertheless, [18F]FMC can be regarded as an irreversible tracer [43] and, thus, the [18F]FMC uptake in terms of tissue-to-lesion ratio is not expected to change until the [68Ga]Ga-PSMA-11 examination. Last, only patients with proven prostate cancer who had undergone radical prostatectomy were included. Nevertheless, this selection criterion was necessary to acquire stable ground truth for lesion labeling.

Conclusions

This study demonstrates the feasibility of [68Ga]Ga-PSMA-11 PET/MRI in combination with radiomics and machine learning to non-invasively deliver both lesion characterization and risk prediction on par with preoperative invasive biopsy in patients with primary prostate cancer. Prospective multicentric studies are required to investigate the reproducibility and clinical utility of this approach.