Introduction

Neurofibromatosis-1 (NF1) is an inherited disease characterised by multiple neurofibromas in which there is an increased risk of malignant transformation to malignant peripheral nerve sheath tumours (MPNSTs) [1]. Non-invasive differentiation of benign symptomatic neurofibromas from those with malignant transformation is a clinical challenge. Standardised uptake value (SUV) or tumour-to-liver ratio measurements from 18F-fluorodeoxyglucose (18F-FDG) positron emission tomography (PET) have previously been described as an accurate method to detect MPNSTs in this patient group [2,3,4,5,6]. Qualitative scoring of heterogeneity of 18F-FDG PET on a three-point scale has also been described, where MPNSTs displayed a more heterogeneous uptake of tracer, with similar discriminatory power to maximum SUV (SUVmax) [7].

There is increasing interest in the quantitative measurement of heterogeneity in medical images of cancer patients, including computed tomography (CT), magnetic resonance imaging (MRI) and PET. There is evidence that the use of heterogeneity parameters may improve characterisation, segmentation, prognostication and therapy response assessment compared to standard metrics such as size or lesion activity [8,9,10,11,12]. The most commonly used methods involve the measurement of statistically based parameters including first-, second- and high-order features. First-order features include global parameters such as SUV but also heterogeneity parameters, such as standard deviation (SD), first-order entropy and first-order uniformity. These are derived from intensity volume histograms of a tumour volume of interest (VOI) [8, 10, 12]. Second-order features, most often derived from grey-level co-occurrence matrices (GLCM), measure the relationship between pairs of voxels [13]. High-order features, most often derived from neighbourhood grey-tone difference matrices (NGTDM), measure the relationship between three or more voxels in the same or adjacent planes [14].

Our hypothesis was that quantitative heterogeneity parameters from 18F-FDG PET could improve differentiation of benign symptomatic neurofibromas from MPNSTs compared to standard PET metrics such as SUV. Our aim was to compare the discriminative ability of these parameters in a retrospective cohort of patients with NF1 whose tumours had been well characterised.

Patients and methods

A cohort of 54 consecutive patients with NF1 and clinical suspicion of malignant transformation of symptomatic neurofibromas, referred from our national neurofibromatosis service for 18F-FDG PET/CT scans, was identified. There were 30 male (mean age 34.7 years, range 12 to 73 years) and 24 female patients (mean age 35.5 years, range 9 to 86 years). An institutional review board waiver was obtained for retrospective analysis of these data.

18F-FDG PET/CT scans were all acquired to the same protocol in the same institution on one of two scanners (Discovery VCT or DST, GE Healthcare, Chicago, IL, USA) which were cross-calibrated to within 3% [15]. Patients were fasted for at least 6 h prior to administration of 350 (+/− 10%) MBq 18F-FDG (scaled to body weight/70 in paediatric patients), and scans were only acquired if the blood glucose measurement was less than 10 mmol/l. Scans were acquired according to the institutional standard clinical protocol for NF1 patients, with an acquisition at approximately 1.5 h (101.5 +/− 15 min) from the upper thigh to the base of skull followed by an acquisition at approximately 4 h (251.7 +/− 18.4 min) of the symptomatic tumour site only, all at 5 min per bed position [2]. Images were all reconstructed using an ordered subset expectation maximisation algorithm (2 iterations, 20 subsets) with a reconstructed slice thickness of 3.27 mm and pixel size 4.7 mm. The CT component of the scans was acquired at 120 kVp and 65 mAs without administration of oral or intravenous contrast agent.

The reconstructed PET datasets were imported into in-house texture analysis software implemented in MATLAB (Release 2016a, The MathWorks, Inc., Natick, MA, USA). Voxel intensities within the symptomatic tumour VOI were resampled to yield 64 discrete bins. Whilst most patients had multiple neurofibromas, only the symptomatic tumours were analysed. Since many of the tumours showed only low-grade FDG uptake, it was not possible to adequately segment the tumour regions directly from the PET data by freehand or by using semi-automated methods such as percentage threshold or fuzzy locally adaptive Bayesian methods [16]. Regions of interest were, therefore, drawn on the corresponding CT images, where tumours were more easily defined (Fig. 1), by an experienced operator with radiology and nuclear medicine training and over 20 years' experience. To assess inter-observer variability, a random subset of 16 patients had VOIs defined on 1.5- and 4-h scans by a separate operator blinded to the initial observer measurements and clinical data.
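The resampling of VOI intensities to 64 discrete bins can be illustrated with a minimal Python sketch. It assumes equal-width bins spanning the minimum to the maximum intensity within the VOI; the text does not state which binning scheme the in-house MATLAB software used, and the function name `discretise` is hypothetical.

```python
import numpy as np

def discretise(voi_values, n_bins=64):
    """Map VOI intensities to integer grey levels 0..n_bins-1.

    Assumption: equal-width bins between the VOI minimum and maximum.
    """
    v = np.asarray(voi_values, dtype=float)
    lo, hi = v.min(), v.max()
    if hi == lo:
        # A perfectly uniform VOI collapses to a single grey level.
        return np.zeros(v.shape, dtype=int)
    levels = np.floor(n_bins * (v - lo) / (hi - lo)).astype(int)
    # The maximum value would land on n_bins, so fold it into the top bin.
    return np.minimum(levels, n_bins - 1)
```

With this scheme the lowest voxel in the VOI always maps to level 0 and the highest to level 63, so the resulting grey levels describe relative rather than absolute uptake.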

Fig. 1

18F-FDG PET and CT (left) with corresponding images with ROIs (right). A symptomatic but benign left posterior thigh neurofibroma (SUVmax = 2.83)

As well as SUVs (mean, maximum and peak, all normalised to body weight in kilogrammes), three first-order (SD, entropy and uniformity), four second-order GLCM parameters (contrast, entropy, uniformity and homogeneity) and four high-order NGTDM parameters (coarseness, contrast, busyness and complexity) were calculated from the resulting VOIs. Second-order features were calculated from GLCMs measuring the grey-level distribution between pairs of voxels and high-order features were derived from three-dimensional matrices taking into consideration neighbouring voxels in adjacent planes. All these features have been previously described in detail [13, 14] and the chosen parameters have previously shown utility and/or robustness when used in clinical 18F-FDG PET data of cancers [17,18,19,20,21,22].
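As an illustration of the first- and second-order parameters named above, the following Python sketch computes them from discretised grey levels. It is not the in-house implementation: it assumes the commonly used textbook definitions, builds a symmetric GLCM for a single two-dimensional offset rather than over a three-dimensional VOI, and takes the SD of the discretised levels rather than of the SUVs. The function names are hypothetical, and NGTDM features are omitted for brevity.

```python
import numpy as np

def first_order(levels, n_bins):
    """SD, entropy and uniformity from the grey-level histogram."""
    levels = np.asarray(levels).ravel()
    p = np.bincount(levels, minlength=n_bins) / levels.size
    nz = p[p > 0]  # skip empty bins to avoid log2(0)
    return {
        "sd": float(levels.std()),
        "entropy": float(-(nz * np.log2(nz)).sum()),
        "uniformity": float((p ** 2).sum()),
    }

def glcm(image, n_bins, offset=(0, 1)):
    """Symmetric, normalised co-occurrence matrix for one 2D offset."""
    img = np.asarray(image)
    dr, dc = offset
    rows, cols = img.shape
    # a holds the reference voxels, b their neighbours at the offset.
    a = img[max(0, -dr):rows - max(0, dr), max(0, -dc):cols - max(0, dc)]
    b = img[max(0, dr):rows - max(0, -dr), max(0, dc):cols - max(0, -dc)]
    m = np.zeros((n_bins, n_bins))
    np.add.at(m, (a.ravel(), b.ravel()), 1)  # count each voxel pair
    m = m + m.T                              # count pairs in both directions
    return m / m.sum()

def glcm_features(p):
    """Contrast, entropy, uniformity and homogeneity of a GLCM."""
    i, j = np.indices(p.shape)
    nz = p[p > 0]
    return {
        "contrast": float((p * (i - j) ** 2).sum()),
        "entropy": float(-(nz * np.log2(nz)).sum()),
        "uniformity": float((p ** 2).sum()),
        "homogeneity": float((p / (1 + np.abs(i - j))).sum()),
    }
```

A homogeneous region concentrates the GLCM on its diagonal, giving low contrast and high homogeneity, whereas a region in which neighbouring voxels differ widely spreads the counts away from the diagonal.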

Statistical analysis was performed using SPSS (v22, Chicago, IL, USA) and MedCalc (v16.8.4, Ostend, Belgium) software. The data distributions were tested for normality using the Shapiro–Wilk test. As the data were not normally distributed, differences between benign and malignant tumours were tested with the Mann–Whitney U test for each parameter and correlations between parameters were tested with Spearman correlation. Receiver operating characteristic (ROC) curves were also used to compare the ability of each parameter to classify tumours as benign or malignant and the areas under the ROC curves (AUROC) were calculated. Comparisons between AUROC were made as described by DeLong et al. [23]. Separate assessment was made by combining SUVmax with other parameters that did not show a correlation with SUVmax. Statistical significance was assumed when p < 0.05. Inter-observer variation was assessed with intra-class correlation coefficients (ICCs).
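The Mann–Whitney U statistic and the AUROC are closely related: dividing U by the product of the two group sizes gives the empirical AUROC, that is, the probability that a randomly chosen malignant tumour has a higher parameter value than a randomly chosen benign one. The Python sketch below illustrates this relationship only; the analyses in this study were performed in SPSS and MedCalc, and the function name `auroc` is hypothetical.

```python
def auroc(benign, malignant):
    """Empirical AUROC from pairwise comparisons (ties count as half).

    The running total `u` is the Mann-Whitney U statistic for the
    malignant group; AUROC = U / (n_malignant * n_benign).
    """
    u = 0.0
    for m in malignant:
        for b in benign:
            if m > b:
                u += 1.0
            elif m == b:
                u += 0.5
    return u / (len(malignant) * len(benign))
```

An AUROC of 1.0 means every malignant value exceeds every benign value, while 0.5 corresponds to no discriminative ability.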

Results

Thirty patients had benign tumours and 24 had MPNSTs confirmed either histologically (n = 30) or by at least 5 years of follow-up (n = 24). Thirty-six symptomatic tumours were on the trunk and 18 in the extremities.

Good inter-observer agreement was found for measurement of all parameters with ICC varying from 0.86 (NGTDM contrast and GLCM contrast) to 1.0 (SUVmax and SUVpeak) on 1.5-h and 4-h scans. Median (and range) malignant and benign tumour volumes were 60.0 cm3 and 23.2 cm3, respectively (8.3–303.9 and 3.3–164.1 cm3, respectively, p = 0.004).

On 1.5-h scans, there was a significant difference between benign and malignant tumours for all SUV and other first-order parameters and for all four high-order parameters, but for none of the second-order parameters. At 4 h, the results were the same, except that second-order entropy was significantly different and high-order contrast was not (Table 1). Of the percentage changes between 1.5 and 4 h, only those in SUVmean and SUVpeak showed significant differences between benign and malignant lesions (Table 1). For ROC analysis, SUV and other first-order parameters, second-order entropy and all high-order parameters showed ability to discriminate at 1.5 and 4 h (except high-order contrast at 4 h; Table 2). SUVmax showed the highest AUROC at 1.5 h (0.992) and SUVpeak at 4 h (0.997), closely followed by SUVmax (0.996). Of the other first-order features, SD showed the best discrimination (0.967 and 0.99 at 1.5 and 4 h, respectively; Fig. 2). Of the high-order features, coarseness showed the best discrimination (0.894 and 0.888 at 1.5 and 4 h, respectively; Table 2; Fig. 3). The percentage change in SUVmean and SUVpeak showed some discriminatory ability (AUROC 0.722 and 0.688, respectively; Table 2).

Table 1 Differences between benign and malignant tumours for each heterogeneity parameter at 1.5 and 4 h post-injection of 18F-FDG and for percentage change in values between 1.5 and 4 h
Table 2 Area under receiver operating characteristic curves (AUROC), sensitivity, specificity, PPV, NPV and accuracy at 1.5 and 4 h post-injection of 18F-FDG and statistical comparison with SUVmax AUROC
Fig. 2

ROC curves for SUVmax and first-order parameters (SD, entropy and uniformity) at 1.5 h. See Table 2 for AUROCs. There was no statistically significant difference between SUVmax AUROC and the other first-order parameter AUROCs (all p > 0.05)

Fig. 3

ROC curves for SUVmax and high-order parameters (coarseness, contrast, complexity, busyness) at 1.5 h. See Table 2 for AUROCs. There was a statistically significant difference between SUVmax AUROC and the other high-order parameter AUROCs (coarseness p = 0.019, contrast p < 0.0001, busyness p = 0.0009, complexity p = 0.0002)

Most parameters showed significant correlations with SUVmax except the GLCM parameters and NGTDM contrast. GLCM parameters performed poorly in discriminating tumours and, so, were not further assessed, but the combined parameter SUVmax/NGTDM contrast was further evaluated to see if there was incremental value from this combination (Tables 1 and 2). Whilst combining the parameters in this way showed a better performance than NGTDM contrast alone, it did not show any additional value over SUVmax.

Discussion

This study has shown that MPNSTs in patients with NF1 display greater heterogeneity of 18F-FDG uptake than benign symptomatic neurofibromas as measured by a number of global first-order features (including SD, entropy and uniformity) as well as local high-order features (including coarseness, contrast, busyness and complexity). To our knowledge, only qualitative measures of heterogeneity have previously been described in this scenario, where a qualitative heterogeneity score showed similar sensitivity to but lower specificity than SUVmax [7]. With regard to other primary soft tissue tumours, a previous study has shown that heterogeneity parameters from 18F-FDG PET can differentiate benign from malignant musculoskeletal tumours better than SUVmax (p = 0.004) [24]. Another study showed that heterogeneity of 18F-FDG uptake and tumour grade in sarcomas were the only independent prognostic factors predicting overall survival (p < 0.001 and 0.004, respectively), whereas SUVmax and tumour type were not [25]. It is hypothesised that increased heterogeneity of 18F-FDG uptake within tumours is related to variations in cell density and proliferation as well as more heterogeneous underlying biology, including angiogenesis and hypoxia, and that this is why heterogeneous tumours behave more aggressively [26, 27].

Our study also showed that MPNSTs had significantly higher 18F-FDG accumulation than benign neurofibromas as measured by SUV parameters, a finding that has been previously reported [2,3,4]. Whilst SUVmax showed excellent ability to discriminate MPNSTs from symptomatic benign neurofibromas as determined by AUROC (0.992, 0.996 at 1.5 and 4 h, respectively), the SUVmax AUROC was not significantly different from that of SD, entropy or uniformity, but was significantly higher than that of all high-order features (Table 2; Figs. 2 and 3). The percentage change in SUV and heterogeneity parameters between 1.5- and 4-h scans did not show any superiority in discriminating benign from malignant tumours compared to the parameters alone.

Our study is potentially limited by its retrospective nature, but our results should be representative as this was a cohort of patients referred for clinical assessment of symptomatic neurofibromas that were suspected of malignant transformation. However, it may not necessarily be possible to extrapolate the findings to other tumour types. Whilst semi-automated methods of tumour segmentation on 18F-FDG PET images are preferred and are likely to show even better inter-observer agreement, we were unable to apply these methods due to difficulty in defining tumours with low uptake on the PET scans. Nevertheless, VOI definition from the CT images proved straightforward and showed good inter-observer reproducibility. In addition, whilst all image sets were checked qualitatively for registration of the PET and CT data by an experienced observer, we cannot exclude small amounts of mis-registration due to patient movement.

Conclusion

In patients with NF1, MPNSTs showed greater heterogeneity and greater levels of 18F-FDG uptake than benign symptomatic neurofibromas. First-order heterogeneity parameters were as discriminative as SUVmax. Although high-order features also showed the ability to differentiate benign and malignant tumours, they had lower discriminatory ability than SUVmax.