1 Introduction

Deep learning has long been a popular topic in the scientific community owing to its significant impact in many fields [1,2,3]. A new wave of deep learning has been sweeping the medical imaging field [4,5,6,7,8,9,10], and some deep learning techniques, such as image reconstruction and computer-aided diagnosis, have already been implemented in commercial systems [11,12,13,14]. In this review, we focus on deep learning for positron emission tomography (PET) imaging within the field of medical imaging [15,16,17,18,19,20,21,22,23,24].

PET is a molecular imaging method for visualizing and quantifying the distribution of radioactive tracers labeled with positron-emitting radioisotopes, such as fluorine-18 (18F), oxygen-15 (15O), nitrogen-13 (13N), and carbon-11 (11C), administered to living human participants [25]. Because PET can observe tracer kinetics, it is used not only for the diagnosis of cancer [26, 27] and neurodegenerative diseases, such as Alzheimer's disease [28, 29], but also for fundamental research, such as studies of brain function [30, 31]. PET is a unique imaging modality capable of tracking picomole-order quantities of molecules; however, its image noise is severe compared with that of other tomographic modalities, such as X-ray computed tomography (CT), because the measured data contain fewer counts. Image noise degrades quantitative accuracy and lesion detectability, potentially leading to missed lesions. One straightforward strategy for improving PET image quality (i.e., suppressing PET image noise) is to increase the amount of tracer administered to the individual. This is often difficult in practice because of the increased radiation exposure [32] and the limited count-rate capabilities of PET scanners. Therefore, there is a demand for noise reduction techniques that do not increase the injected dose. It is no exaggeration to state that the development history of PET imaging has been a battle against noise.

In this review, we highlight the algorithms used for PET imaging and systematically describe the history of PET image reconstruction and post-processing denoising algorithms, from early analytical methods to the latest advances in deep learning technology. Section 2 describes the basic principles of PET imaging, including PET imaging models, conventional analytical and statistical PET image reconstruction algorithms, and an overview of deep learning-based PET imaging algorithms. Section 3 reviews deep learning for PET image denoising, and Sects. 4, 5, and 6 review deep learning for direct, iterative, and dynamic PET image reconstruction, respectively. Finally, we conclude this review by providing future perspectives on PET imaging and deep learning technology.

2 Basic principles of PET imaging

This section briefly reviews the history of PET image reconstruction prior to the advent of deep learning techniques. Figure 1 summarizes the evolution of PET image reconstruction between the 1980s and 2000s.

Fig. 1

Demonstration of various PET image reconstruction algorithms, from FBP to recent iterative reconstruction algorithms, applied to the same simulation data generated from BrainWeb (https://brainweb.bic.mni.mcgill.ca/brainweb/)

Between the 1970s and 1980s, researchers developed analytical reconstruction methods, such as the filtered backprojection (FBP) algorithm, for tomographic imaging systems such as X-ray CT and PET [33,34,35]. FBP is an analytical method that models the relationship between the image and the tomographic measurement data through an integral equation [36] as follows:

$$\begin{array}{c}Y\left(r,\phi \right)={\int }_{-\infty }^{\infty }X\left(r{\text{cos}}\phi -s{\text{sin}}\phi ,r{\text{sin}}\phi +s{\text{cos}}\phi \right)ds,\end{array}$$
(1)

where \(X\left(u,v\right)\) is a two-dimensional (2D) image and \(Y\left(r,\phi \right)\) holds one-dimensional (1D) projections for each view angle, known as a sinogram. Each projection is an integral along the \(s\)-axis of the image rotated by an angle \(\phi\). This integral transformation is known as the Radon transform [37] or the X-ray transform. The principle behind FBP is the projection slice theorem, which establishes a one-to-one mapping between the 2D Fourier transform of \(X\left(u,v\right)\) and the 1D Fourier transform of \(Y\left(r,\phi \right)\) with respect to \(r\). The most commonly used analytical method is FBP, which is calculated as follows:

$$\begin{array}{c}\begin{array}{c}X\left(u,v\right)={\int }_{0}^{\pi }{\left.{Y}_{filtered}\left(r,\phi \right)\right|}_{r=u{\text{cos}}\phi +v{\text{sin}}\phi }d\phi ,\\ {Y}_{filtered}\left(r,\phi \right)={\int }_{-\infty }^{+\infty }G\left(\xi ,\phi \right)\left|\xi \right|{\text{exp}}\left(2\pi i\xi r\right)d\xi ,\\ G\left(\xi ,\phi \right)={\int }_{-\infty }^{+\infty }Y\left(r,\phi \right){\text{exp}}\left(-2\pi i\xi r\right)dr,\end{array}\end{array}$$
(2)

where \(i\) is the imaginary unit, \(\xi\) is the frequency-domain variable, and the high-pass filter \(\left|\xi \right|\) is called a ramp filter. Although the ramp filter is an analytically derived necessity, enhancing the high-frequency components tends to produce severe noise. Therefore, frequency cutoff techniques and various filters have been developed to reduce high-frequency noise, although they sacrifice spatial resolution. In principle, the analytical method is known for its high speed, linearity, and quantitative accuracy; however, it is susceptible to noise and produces streak artifacts in low-count situations, as shown in Fig. 1.
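To make Eq. (2) concrete, the following is a minimal 2D FBP sketch in NumPy/SciPy, assuming a square image and a rotation-based projector; the function names (`forward_project`, `ramp_filter`, `fbp`) are illustrative assumptions, not taken from any PET toolkit.

```python
import numpy as np
from scipy.ndimage import rotate

def forward_project(image, angles_deg):
    """Radon transform of Eq. (1): sum the rotated image along one axis."""
    return np.stack(
        [rotate(image, a, reshape=False, order=1).sum(axis=0)
         for a in angles_deg], axis=1)             # (r, phi) sinogram

def ramp_filter(sinogram):
    """Filter each projection with |xi| in the frequency domain (Eq. (2))."""
    freqs = np.fft.fftfreq(sinogram.shape[0])
    return np.real(np.fft.ifft(
        np.fft.fft(sinogram, axis=0) * np.abs(freqs)[:, None], axis=0))

def fbp(sinogram, angles_deg):
    """Backproject the ramp-filtered projections over 0-180 degrees."""
    n_r = sinogram.shape[0]
    filtered = ramp_filter(sinogram)
    recon = np.zeros((n_r, n_r))
    for k, a in enumerate(angles_deg):
        smear = np.tile(filtered[:, k], (n_r, 1))  # constant along the ray
        recon += rotate(smear, -a, reshape=False, order=1)
    return recon * np.pi / len(angles_deg)
```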

Early PET systems had septa inserted between detector rings to shield gamma rays from oblique directions, and the measured data to be processed were limited to a 2D plane. In the 1980s, researchers developed 3D PET image reconstruction methods [38,39,40], and since the 1990s, 3D acquisition has been performed by removing the septa or making them retractable to switch between 2D and 3D modes [41, 42]. While projections over 0 to 180° in the directions perpendicular to the axial direction form a complete set for reconstructing a 3D image, 3D acquisition significantly improves sensitivity by also measuring oblique projections. To fully utilize 3D projection data, image reconstruction methods that account for this data redundancy are required. We should note, in addition, that scatter correction is essential for quantification because of the increased scatter components in 3D projection data; in contrast, in 2D acquisition the septa shield most components scattered inside a patient's body. Scatter components are estimated and subtracted from the projection data before analytical image reconstruction. There are many studies on the estimation and impact of scatter [43,44,45,46,47,48,49], but they are outside the scope of this article. One of the most widely used analytical image reconstruction methods for 3D PET is the 3D reprojection (3DRP) algorithm [39], which estimates the missing truncated data in 2D projections so that 3D FBP can be applied. The 3D FBP is an extension of 2D FBP to three dimensions, in which the Colsher filter [38] is applied to 2D parallel projections for each projection direction parameterized by azimuthal and co-polar angles. Note that the set of 2D parallel projections for a co-polar (oblique) angle of 0 is equivalent to the stack of sinograms for 2D FBP (direct sinograms). The filtered 2D parallel projections are then back-projected into the image domain. Although 3D FBP requires projection data without truncation, projections at oblique angles have unmeasured regions for objects inside cylindrical PET scanners. The truncated data are therefore estimated from a 3D image reconstructed as a stack of 2D images by applying 2D FBP to the direct sinograms. Directly processing 3D projection data with the 3DRP method imposed a heavy computational burden on computers of the early 1990s. Therefore, the Fourier rebinning (FORE) method [50, 51] was developed to rebin 3D projection data into a stack of 2D sinograms, allowing the 3D image reconstruction problem to be decomposed into a set of 2D image reconstructions. Not only 2D FBP but also iterative 2D image reconstruction methods can be applied following FORE [52,53,54]. As a result of the maturation of 3D image reconstruction, modern PET scanners no longer use septa. On recent computers, 3D reconstruction methods have become tractable, and even iterative methods are used in practice.

Between the 1980s and 1990s, iterative reconstruction methods, such as the maximum likelihood expectation maximization (MLEM) algorithm [55,56,57], were developed to incorporate statistical and physical models into image reconstruction. In the EM iterative framework, the relationship between a tomographic image and the measured data is modeled through a system of linear equations and the Poisson distribution [58] as follows:

$$\begin{array}{*{20}c} {{\varvec{y}} = Poisson \left( {{\varvec{Ax}} + \overline{\user2{b}}} \right),} \\ \end{array}$$
(3)

or

$$\begin{array}{*{20}c} {y_{i} = Poisson\left( {\mathop \sum \limits_{j = 1}^{J} a_{ij} x_{j} + \overline{b}_{i} } \right),} \\ \end{array}$$
(4)

where \({\varvec{x}}={\left({x}_{1},{x}_{2},\cdots ,{x}_{J}\right)}^{T}\) is a vector of voxel values of the image, \({\varvec{y}}={\left({y}_{1},{y}_{2},\cdots ,{y}_{I}\right)}^{T}\) is a vector of sampled values of the projection data, \(\overline{{\varvec{b}} }={\left({\overline{b} }_{1},{\overline{b} }_{2},\cdots ,{\overline{b} }_{I}\right)}^{T}\) is a vector of expected values of background components, such as scatter and random coincidence events, which can be estimated using scatter and randoms modeling methods, and \({\varvec{A}}\in {\mathbb{R}}^{I\times J}\) is a system matrix in which each element \({a}_{ij}\) is the probability that two gamma rays emitted from the \(j\)-th voxel are detected by the \(i\)-th line-of-response (LOR). The negative log-likelihood function of data \({\varvec{y}}\) under image \({\varvec{x}}\) is defined as follows:

$$\begin{array}{*{20}c} {L\left( {{\varvec{y}}{|}{\varvec{x}}} \right) = - \log P\left( {{\varvec{y}}{|}{\varvec{x}}} \right) = C - \mathop \sum \limits_{i = 1}^{I} \left\{ {y_{i} \log \left( {\mathop \sum \limits_{j = 1}^{J} a_{ij} x_{j} + \overline{b}_{i} } \right) - \left( {\mathop \sum \limits_{j = 1}^{J} a_{ij} x_{j} + \overline{b}_{i} } \right)} \right\},} \\ \end{array}$$
(5)

where \(P\left({\varvec{y}}|{\varvec{x}}\right)\) is the probability of sampling \({\varvec{y}}\) under \({\varvec{x}}\), and \(C\) is a constant. The MLEM algorithm estimates the image by minimizing Eq. (5) using the following iterative update:

$$\begin{array}{*{20}c} {x_{j}^{{\left( {k + 1} \right)}} = \frac{{x_{j}^{\left( k \right)} }}{{\mathop \sum \nolimits_{i = 1}^{I} a_{ij} }}\mathop \sum \limits_{i = 1}^{I} \frac{{a_{ij} y_{i} }}{{\mathop \sum \nolimits_{{j^{\prime} = 1}}^{J} a_{{ij^{\prime}}} x_{{j^{\prime}}}^{\left( k \right)} + \overline{b}_{i} }},} \\ \end{array}$$
(6)

where \(k\) denotes the iteration number. The MLEM algorithm achieves better image quality than the FBP algorithm by leveraging a statistical noise model for PET, as shown in Fig. 1. After the introduction of the MLEM algorithm, the ordered subset expectation maximization (OSEM) algorithm [59], a block-iterative reconstruction method that divides the projection data into subsets and updates the image for each subset, was developed to accelerate convergence. Furthermore, Tanaka and Kudo proposed the dynamic row-action maximum likelihood algorithm (DRAMA) [60, 61]. The DRAMA algorithm improves convergence speed by controlling an optimal relaxation factor deduced by balancing the noise propagation from each subset to the final reconstructed image [61, 62]. We should note that these algorithms can also be applied to 3D PET data and even time-of-flight (TOF) PET data by properly modeling the system matrix.
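As an illustration of the update in Eq. (6), a minimal MLEM loop might look as follows, assuming the system matrix fits in memory as a NumPy (or scipy.sparse) array; all names are illustrative.

```python
import numpy as np

def mlem(A, y, b_bar, n_iter=50, eps=1e-12):
    """MLEM for y ~ Poisson(Ax + b_bar), following the update of Eq. (6)."""
    sens = A.T @ np.ones_like(y)              # sensitivity image: sum_i a_ij
    x = np.ones(A.shape[1])                   # uniform initial image
    for _ in range(n_iter):
        y_bar = A @ x + b_bar                 # forward projection
        ratio = y / np.maximum(y_bar, eps)    # measured / expected
        x = x / np.maximum(sens, eps) * (A.T @ ratio)  # multiplicative update
    return x
```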

Between the 1990s and 2000s, iterative reconstruction methods integrating the point-spread function (PSF) were developed for dedicated PET [63,64,65,66], as well as for whole-body PET/CT [67]. The PSF can be modeled in projection space, image space, or both. An example of incorporating an image-space PSF is as follows [64]:

$$\begin{array}{*{20}c} {{\varvec{x}}^{{\left( {k + 1} \right)}} = \frac{{{\varvec{x}}^{\left( k \right)} }}{{{\varvec{H}}^{T} {\varvec{A}}^{T} {\mathbf{1}}}}{\varvec{H}}^{T} {\varvec{A}}^{T} \frac{{\varvec{y}}}{{{\varvec{AHx}}^{\left( k \right)} + \overline{\user2{b}}}},} \\ \end{array}$$
(7)

where \({\varvec{H}} \in {\mathbb{R}}^{J \times J}\) is a matrix comprising the PSF kernel in the image space, and \({\mathbf{1}}\) is a vector of ones. In Eq. (7), the division and multiplication between vectors are element-wise. Note that Eq. (7) reduces to Eq. (6) when \({\varvec{H}}\) is the identity matrix. PSF-based image reconstruction primarily reduces image noise and enhances contrast, as well as improving spatial resolution, as shown in Fig. 1. The PSF kernel increases the correlation between voxels and reduces their variance. From this point of view, the image-space PSF can be considered a variant of the basis function approach [68, 69].
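A sketch of the image-space PSF update of Eq. (7) follows, modeling \({\varvec{H}}\) as a Gaussian convolution for illustration (so \({\varvec{H}}^{T} = {\varvec{H}}\) up to boundary effects); the kernel choice and function names are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def psf_em_step(x, A, y, b_bar, shape, sigma=1.0, eps=1e-12):
    """One EM update of Eq. (7) with an image-space PSF H (Gaussian blur)."""
    H = lambda v: gaussian_filter(v.reshape(shape), sigma).ravel()
    sens = H(A.T @ np.ones_like(y))           # H^T A^T 1 (H symmetric)
    y_bar = A @ H(x) + b_bar                  # A H x + b
    return x / np.maximum(sens, eps) * H(A.T @ (y / np.maximum(y_bar, eps)))
```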

Parallel to the development of statistical and physical model-based iterative reconstruction, maximum a posteriori (MAP) reconstruction methods that incorporate image priors, such as smoothness, into maximum likelihood estimation have been developed [63, 70,71,72,73,74,75]. The MLEM algorithm exhibits an unfavorable property whereby noise and edge artifacts tend to increase as the iterations progress [76, 77]. Practical solutions therefore involve early stopping of iterations and/or post-smoothing using a Gaussian filter [78]. MAP reconstruction presents an alternative solution that often achieves a more favorable balance between noise and contrast than the above-mentioned techniques [79]. The posterior probability of image \({\varvec{x}}\) given data \({\varvec{y}}\) is expressed through Bayes’ theorem as follows:

$$\begin{array}{*{20}c} {P\left( {{\varvec{x}}{|}{\varvec{y}}} \right) = \frac{{P\left( {{\varvec{y}}{|}{\varvec{x}}} \right)P\left( {\varvec{x}} \right)}}{{P\left( {\varvec{y}} \right)}},} \\ \end{array}$$
(8)

where \(P\left( {\varvec{x}} \right)\) is the prior probability of image \({\varvec{x}}\). The prior probability is assumed to be the following exponential function called the Gibbs distribution:

$$\begin{array}{*{20}c} {P\left( {\varvec{x}} \right) = \frac{1}{Z}exp\left( { - \beta U\left( {\varvec{x}} \right)} \right),} \\ \end{array}$$
(9)

where \(Z\) is a partition function that normalizes the distribution so that the probabilities sum to 1, and \(U\left( {\varvec{x}} \right)\) is an energy function designed to take small values for plausible images. The negative log-posterior is defined as follows:

$$\begin{array}{*{20}c} { - \log P\left( {{\varvec{y}}{|}{\varvec{x}}} \right) - \log P\left( {\varvec{x}} \right) = L\left( {{\varvec{y}}{|}{\varvec{x}}} \right) + \beta U\left( {\varvec{x}} \right),} \\ \end{array}$$
(10)

where \(\beta\) is a hyperparameter that adjusts the influence of the prior distribution. Various MAP estimations are obtained depending on the choice of prior in the form of the Gibbs distribution. A commonly used energy function for the Gibbs distribution is as follows:

$$\begin{array}{*{20}c} {U\left( {\varvec{x}} \right) = \mathop \sum \limits_{j} \mathop \sum \limits_{{j^{\prime} \in N_{j} }} \omega_{{jj^{\prime}}} V\left( {x_{j} - x_{{j^{\prime}}} } \right),} \\ \end{array}$$
(11)

where \(V\left( \cdot \right)\) is a potential function, \(N_{j}\) is the set of neighboring voxels of the \(j\)-th voxel, and \(\omega_{jj^{\prime}}\) is a weight between neighboring voxels, typically defined as the inverse of the distance between them. Examples of potential functions include the quadratic and relative difference [74] functions, as follows:

$$\begin{array}{*{20}c} {\begin{array}{*{20}c} {{\text{Quadratic}}} & {\left( {x_{j} - x_{{j^{\prime}}} } \right)^{2} } \\ {{\text{Relative}} {\text{difference}}} & {\frac{{\left( {x_{j} - x_{{j^{\prime}}} } \right)^{2} }}{{\left( {x_{j} + x_{{j^{\prime}}} } \right) + \gamma \left| {x_{j} - x_{{j^{\prime}}} } \right|}}} \\ \end{array} ,} \\ \end{array}$$
(12)

where \(\gamma\) is a hyperparameter controlling the shape of the relative difference function. To minimize the negative log-posterior, Green’s one-step-late method [72] is commonly used, as follows:

$$\begin{array}{*{20}c} {x_{j}^{{\left( {k + 1} \right)}} = \frac{{x_{j}^{\left( k \right)} }}{{\mathop \sum \nolimits_{i = 1}^{I} a_{ij} + \beta \left. {\frac{{\partial U\left( {\varvec{x}} \right)}}{{\partial x_{j} }}} \right|_{{{\varvec{x}} = {\varvec{x}}^{\left( k \right)} }} }}\mathop \sum \limits_{i = 1}^{I} \frac{{a_{ij} y_{i} }}{{\mathop \sum \nolimits_{{j^{\prime} = 1}}^{J} a_{{ij^{\prime}}} x_{{j^{\prime}}}^{\left( k \right)} + \overline{b}_{i} }}.} \\ \end{array}$$
(13)

In PET image reconstruction, the use of MAPEM with a quadratic prior provides a smoother image than the MLEM algorithm in low-count situations, as shown in Fig. 1.
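The one-step-late update of Eq. (13) with the quadratic prior of Eqs. (11) and (12) can be sketched as follows, assuming unit weights on a 4-neighborhood 2D grid (with wrap-around boundaries via `np.roll` for brevity); the constant factor in the prior gradient is absorbed into \(\beta\).

```python
import numpy as np

def quadratic_prior_grad(x2d):
    """Gradient of U(x) = sum over neighboring pairs of (x_j - x_j')^2
    on a 4-neighborhood; boundaries wrap via np.roll (a simplification)."""
    g = np.zeros_like(x2d)
    for shift, axis in [(1, 0), (-1, 0), (1, 1), (-1, 1)]:
        g += 2.0 * (x2d - np.roll(x2d, shift, axis=axis))
    return g

def mapem_osl_step(x, A, y, b_bar, shape, beta, eps=1e-12):
    """One MAPEM iteration with Green's one-step-late denominator (Eq. (13))."""
    sens = A.T @ np.ones_like(y)                  # sum_i a_ij
    y_bar = A @ x + b_bar                         # expected projections
    denom = sens + beta * quadratic_prior_grad(x.reshape(shape)).ravel()
    return x / np.maximum(denom, eps) * (A.T @ (y / np.maximum(y_bar, eps)))
```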

With the emergence of PET/CT and PET/magnetic resonance imaging (MRI) scanners, MAPEM algorithms that incorporate additional anatomical information from CT and MR images [80,81,82,83,84] have also been developed. For example, MRI information can be incorporated into MAPEM by setting the weight \(\omega_{jj^{\prime}}\) based on the difference between the \(j\)-th and \(j^{\prime}\)-th voxel values of the MR image (as detailed in Sect. 5). As shown in Fig. 1, MR-guided MAPEM can provide images with enhanced smoothness while preserving organ boundaries.

Currently, the trajectory of PET image reconstruction is undergoing a deeper evolution, propelled by the integration of state-of-the-art deep learning technology in conjunction with computer vision techniques [85,86,87,88,89]. Figure 2 shows the classification of the deep learning methods for PET data used in this review, which are categorized into three distinct classes. First, the earliest deep learning methods for PET imaging primarily focused on post-processing for PET image denoising; notably, these methods do not strictly perform image reconstruction. Second, direct image reconstruction is a data-driven approach that learns a direct mapping from the sinogram to the PET image using training datasets of sinograms and reconstructed images. Third, iterative reconstruction is a hybrid approach that combines existing image reconstruction with neural-network-based image enhancement. We provide more details of deep learning-based PET imaging methods in the following sections.

Fig. 2

Overview of the deep learning methods for PET data, divided into three categories: post-processing (denoising), direct reconstruction, and iterative reconstruction methods using neural networks (NNs)

3 Deep learning for PET image denoising

Reconstructed PET images typically exhibit a low signal-to-noise ratio owing to physical degradation factors and limited statistical counts. Low-dose radiotracers or short scan times that reduce patient burden further degrade PET images, potentially affecting diagnostic accuracy. This remains a major challenge, and an effective restoration approach for low-quality PET images is essential. The restoration of PET images is sometimes included as a “reconstruction” process; however, this section focuses on restoration methods applied as post-processing after reconstruction, distinguishing them from reconstruction that generates images from measurement data.

Noise arises because image reconstruction is ill-conditioned: a small perturbation of the measurement data causes much larger perturbations in the image, as follows:

$$\begin{array}{*{20}c} {\hat{\user2{x}} = {\user2{x}} + {\user2{n}}, } \\ \end{array}$$
(14)

where \(\widehat{{\varvec{x}}}\), \({\varvec{x}}\), and \({\varvec{n}}\) are the degraded PET image, the true PET image, and the degradation component, respectively. PET image denoising (or restoration) is an inverse problem of restoring the original image from a degraded image additively mixed with statistical noise that is further complicated by the image reconstruction process. In recent years, deep learning approaches have been proposed to learn the relationship between \(\widehat{{\varvec{x}}}\) and \({\varvec{x}}\) through the following minimization problem:

$$\begin{array}{*{20}c} {\theta^{*} = \mathop {{\text{argmin}}}\limits_{\theta } E\left( {f\left( {\theta {|}\hat{\user2{x}}} \right);{\varvec{x}}} \right),} \\ \end{array}$$
(15)

where \(f\) represents a neural-network model with trainable parameters \(\theta\), and \(E\) is a loss function, such as the mean squared error (MSE) or mean absolute error. In general, deep learning-based PET image denoising aims to acquire a data-driven nonlinear mapping from low-quality to high-quality PET images. It provides better denoising performance while retaining spatial resolution and quantitative accuracy compared with classical denoising methods. In this section, we introduce deep learning-based PET image denoising methods built on the power of convolutional neural networks (CNNs), which specialize in image mappings, in the three categories covered below: supervised learning, self-supervised and unsupervised learning, and emerging approaches.

3.1 Supervised learning approach

Supervised learning is an approach used in machine learning to train models based on labeled data. PET image denoising tasks require huge datasets comprising pairs of high- and low-quality PET images, as in Eq. (14). The evolution of deep learning has transformed shallow CNNs, initially implemented with only a few convolutional layers, into architectures with deeper layers. This progress has enabled more potent PET image denoising, as evidenced by superior performance [90]. Building on these successes, CNN architectures have acquired more complex features and developed into structures specialized for image denoising and medical image processing, as shown in Fig. 3. Among them, the U-Net proposed by Ronneberger et al. [91] and the 3D U-Net proposed by Çiçek et al. [92] for semantic segmentation are widely used for PET image denoising [93,94,95]. A typical U-Net architecture consists of a contracting path that captures context from the input image and a symmetric expanding path that up-samples the extracted feature maps. In addition, the U-Net architecture introduces skip connections that pass the feature maps at each resolution of the contracting path to the expanding path. Residual learning [96] has also been proposed, in which the network outputs the noise component contained in the image, based on the idea that it is easier to retain only the latent noise than the complex visual features of the PET image in the hidden layers [97,98,99,100]. Perceptual loss, which is based on high-level feature representations extracted from a VGG16 pre-trained on ImageNet, has been shown to improve the visual quality of PET images compared with general loss functions, such as the MSE [101]. Recently, the widespread use of PET/CT and PET/MR scanners has facilitated the simultaneous acquisition of functional and anatomical images. Therefore, PET image denoising has also been performed by incorporating multimodal anatomical information, such as CT [102, 103] or MR images [104,105,106,107,108,109,110,111,112], thereby achieving superior denoising performance compared with PET alone.
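A minimal supervised training loop for the objective of Eq. (15) in PyTorch might look as follows; `DenoiseCNN` is a small stand-in for any image-to-image network such as a U-Net, and the random tensors stand in for paired low-/full-count patches.

```python
import torch
import torch.nn as nn

class DenoiseCNN(nn.Module):
    """Small stand-in for a denoising network such as a U-Net."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))
    def forward(self, x):
        return self.net(x)

# Stand-in paired patches; in practice these come from low-/full-count scans.
low = torch.randn(8, 1, 64, 64)    # low-quality input x_hat
high = torch.randn(8, 1, 64, 64)   # high-quality target x

model = DenoiseCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()             # the loss E in Eq. (15)

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(low), high)   # E(f(theta | x_hat); x)
    loss.backward()
    optimizer.step()
```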

Fig. 3

Overview of the various deep learning architectures for PET image denoising. (a) U-Net model. (b) Multi-modal network using anatomical information. (c) GAN model. (d) Vision Transformer (ViT) model. (e) Swin Transformer image restoration network (SwinIR). © 2023 SNCSC. Reprinted with permission from Wang et al. [159]

The advent of generative adversarial networks (GANs) has led to breakthroughs in the field of image generation [113]. A GAN consists of two competing neural networks: a generator and a discriminator. In addition to adopting a network capable of image-to-image translation, such as a U-Net, as the generator, GAN-based denoising can be regarded as a training method that adds an adversarial loss based on the output of the discriminator. GAN training proceeds until the label data are no longer distinguishable from the output images of the CNN, thereby synthesizing denoised PET images with less spatial blur and better visual quality [114,115,116]. Common models for GAN-based denoising include the conditional GAN [117] and Pix2Pix [118], and incorporating various network structures [119, 120] and additional loss functions, such as least squares [121, 122], task-specific perceptual loss [123], pixel-wise loss [124], and the Wasserstein distance with a gradient penalty [125], has been reported to improve denoising performance. CycleGAN consists of two generator–discriminator pairs with a cycle-consistency loss [126]; it can be trained for denoising without the direct pairing between degraded and original PET images that is conventionally essential (Fig. 4) [127,128,129,130,131].
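A schematic adversarial training step for GAN-based denoising, using the least-squares GAN loss mentioned above together with a pixel-wise loss, might be sketched as follows; `generator` and `discriminator` are placeholders for any suitable networks, and the loss weighting is an assumption.

```python
import torch

def lsgan_denoise_step(generator, discriminator, g_opt, d_opt, low, high,
                       pix_weight=10.0):
    """One adversarial step with a least-squares GAN loss plus a pixel-wise
    loss; real images are pushed toward 1, denoised outputs toward 0."""
    # --- discriminator update ---
    d_opt.zero_grad()
    fake = generator(low).detach()                 # stop gradients to G
    d_loss = ((discriminator(high) - 1.0) ** 2).mean() \
             + (discriminator(fake) ** 2).mean()
    d_loss.backward()
    d_opt.step()
    # --- generator update: fool D while staying close to the target ---
    g_opt.zero_grad()
    fake = generator(low)
    g_loss = ((discriminator(fake) - 1.0) ** 2).mean() \
             + pix_weight * torch.mean((fake - high) ** 2)
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```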

Fig. 4

Examples of denoised whole-body 18F-FDG PET images by supervised learning approaches. Sample images showing (a) CT, (b) full-count PET, (c) low-count PET, and denoised PET images corresponding to (d) U-Net, (e) GAN, and (f) CycleGAN. (g) Line profiles in the sagittal section. © 2019 IOP Publishing. Reprinted with permission from Lei et al. [127]

3.2 Self-supervised and unsupervised learning approaches

Collecting a large number of high-quality PET images for supervised learning is particularly difficult in clinical practice. Furthermore, the generalization performance for various PET tracers can be poor, and denoised images may carry inherent biases when applied to data not represented in the training set, such as other diseases, scanners, and noise levels. To overcome these challenges, self-supervised and unsupervised learning approaches have attracted steadily growing interest. Self-supervised learning generally refers to training algorithms that use self-labels automatically generated from unannotated data. Noise2Noise is a representative self-supervised denoising approach that restores clean images from multiple independent corrupted images [132] and has also been reported to be effective for PET image denoising [133, 134]. To avoid the constraint of Noise2Noise, which requires more than one noise realization, Noise2Void, an unsupervised approach using a blind-spot network design [135], has also been applied to PET image denoising [136].

Among the unsupervised learning approaches, the deep image prior (DIP), which uses the CNN structure itself as an intrinsic regularizer and does not require a prior training dataset [137], has achieved excellent performance in PET image denoising [138,139,140]. DIP training is formulated as follows:

$$\begin{array}{*{20}c} {\theta^{*} = \mathop {{\text{argmin}}}\limits_{\theta } \Vert {\varvec{x}}_{0} - f\left( {\theta {|}{\varvec{z}}} \right)\Vert, {\varvec{x}}^{*} = f\left( {\theta^{*} {|}{\varvec{z}}} \right),} \\ \end{array}$$
(16)

where \(\Vert \bullet \Vert\) is the L2 loss, \(f\) represents the CNN model with trainable parameters \(\theta\), the training label \({{\varvec{x}}}_{0}\) is the noisy PET image, and \({\varvec{z}}\) is the network input. After reaching an optimal stopping point, the CNN outputs the final denoised PET image \({{\varvec{x}}}^{*}\). Conditional DIP (CDIP) [141, 142], which uses anatomical information instead of random noise as the network input, improves denoising performance, and an attention mechanism that weights the multi-scale features extracted from the anatomical image guides the spatial details and semantic features of the image more effectively (Fig. 5) [143]. A four-dimensional DIP can perform end-to-end dynamic PET image denoising by introducing a feature extractor and several dozen reconstruction branches [144]. Recently, a model pre-trained using population information from a large number of existing datasets has been shown to improve DIP-based PET image denoising [145]. Furthermore, a self-supervised pre-training model acquired transferable and generalizable visual representations from only low-quality PET images, achieving robust denoising performance for various PET tracers and scanner data [146].
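A minimal DIP loop for Eq. (16) in PyTorch might look as follows; the random tensors stand in for the noisy image and network input, the small CNN stands in for the DIP architecture, and the iteration count for early stopping is an assumption that, in practice, is the key hyperparameter.

```python
import torch
import torch.nn as nn

x0 = torch.randn(1, 1, 128, 128)   # stand-in for the noisy PET image x_0
z = torch.randn(1, 1, 128, 128)    # fixed random network input z

# Small stand-in CNN f(theta|z); in practice a U-Net-like architecture is used.
model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for it in range(2000):             # stop early, before the noise is re-fit
    optimizer.zero_grad()
    loss = torch.mean((model(z) - x0) ** 2)   # L2 loss of Eq. (16)
    loss.backward()
    optimizer.step()

x_star = model(z).detach()         # denoised image x* = f(theta*|z)
```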

Fig. 5

Examples of denoised brain 18F-florbetapir PET images by unsupervised learning approaches. From left to right, the sample images show the MR, standard-count PET, noisy PET, and denoised PET images corresponding to the Gaussian filter (GF), image-guided filter (IGF), DIP, MR-DIP (CDIP), and MR-guided deep decoder (GDD). © 2021 Elsevier. Reprinted with permission from Onishi et al. [143]

3.3 Emerging approaches

Currently, deep learning-based PET image restoration technology has already been implemented in commercial PET scanners [13, 14, 147] with Food and Drug Administration-cleared commercially available software [148,149,150,151,152], making significant contributions to clinical practice. Moreover, deep learning continues to develop rapidly, with emerging approaches and novel applications frequently being proposed.

The transformer architecture revolutionized sequence tasks with self-attention, efficiently capturing distant dependencies [153]. In particular, the Vision Transformer (ViT) [154] and Swin Transformer [155] handle both local and global features more effectively than CNNs. These transformer models have been adapted for PET image denoising and, in some cases, have been reported to outperform CNN-based denoising (Fig. 6) [156,157,158,159,160,161]. The emergence of diffusion models brought a further breakthrough in the field of image generation, following variational autoencoders and GANs, and the effectiveness of denoising diffusion probabilistic models [162] for PET image denoising has also been investigated [163, 164]. From the viewpoint of personal information protection, federated learning, which enables decentralized learning without the need to export clinical data, is beginning to be applied to PET image denoising [165, 166]. In addition, uncertainty estimation [167, 168] and noise-aware networks [169,170,171] can provide additional value to conventional denoising methods. The advancement of state-of-the-art PET scanners, currently represented by total-body PET systems [172], will pave the way for further applications of deep learning.

Fig. 6

Examples of denoised 18F-FDG PET images by emerging approaches. Each column (a) to (d) indicates a different patient or organ. From left to right, the sample images show the standard-count PET, low-count PET, and denoised PET images corresponding to the enhanced deep super-resolution network (EDSR), EDSR-ViT, GAN, U-Net, and Swin Transformer image restoration network (SwinIR). © 2023 SNCSC. Reprinted with permission from Wang et al. [159]

4 Deep learning for direct PET image reconstruction

Deep learning-based direct PET image reconstruction is a data-driven approach in which the reconstructed PET image \({\varvec{x}}\) is directly transformed from the measurement data \({\varvec{y}}\) through a neural-network model \(f\) with trainable weights \(\theta\). This is expressed as a problem of minimizing the following objective function:

$$\begin{array}{*{20}c} {\theta^{*} = \mathop {{\text{argmin}}}\limits_{\theta } E\left( {f\left( {\theta {|}{\varvec{y}}} \right);{\varvec{x}}} \right),} \\ \end{array}$$
(17)

where \(E\) is a loss function, such as the MSE. The direct reconstruction approach differs completely from previous approaches in that it attempts to learn the image reconstruction mechanism entirely from the training dataset, without involving physical models such as forward projection or backprojection.

The earliest direct image reconstruction algorithm in the field of nuclear medicine was probably the single-photon emission CT (SPECT) image reconstruction algorithm using a perceptron with two hidden layers proposed by Floyd in 1991 [173], before the advent of deep learning. In this method, a four-layer perceptron was used, consisting of an input layer that treats the measurement sinogram as 1D data, one trainable hidden layer, another hidden layer with fixed weights for the backprojection calculation, and an output layer, as shown in Fig. 7. This pioneering work demonstrated that a data-driven FBP method could be realized in which the first hidden layer works as a trainable kernel in the projection data space and the second hidden layer performs the backprojection. As would be expected, the trained kernel in the projection data space resembles the kernel corresponding to a ramp filter in the frequency domain.

Fig. 7

Schematic illustration of the earliest direct image reconstruction algorithm for SPECT by Floyd in 1991 [173]. The network realizes a data-driven FBP method in which the first hidden layer works as a trainable kernel in the projection data space, and the second hidden layer works as a backprojection. Note that in the actual implementation the first hidden layer performed 1D filtering using trainable weights shared across angles

Twenty-seven years later, the automated transform by manifold approximation (AUTOMAP) proposed by Zhu et al. in 2018 [174] led to the development of more modern direct image reconstruction algorithms using both fully connected (FC) layers and CNNs. The AUTOMAP architecture introduces dense (FC) connections in the first and second layers of the network, as shown in Fig. 8. An interesting aspect of these dense connections is that they can act as a global inverse transformation from the measurement data to the reconstructed MR and PET images.

Fig. 8

Schematic illustration of the AUTOMAP architecture by Zhu et al. in 2018 [174]. The network introduces dense connections in the first and second layers of the neural-network structure, which can work as a global inverse transformation from the measurement data to the reconstructed images
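To illustrate the role of the dense layers, a schematic AUTOMAP-style model might be sketched in PyTorch as follows; the layer sizes are illustrative only and far smaller than those of the original architecture [174].

```python
import torch
import torch.nn as nn

class TinyAutomap(nn.Module):
    """AUTOMAP-style sketch: dense layers learn a global sinogram-to-image
    transform, followed by local convolutional refinement."""
    def __init__(self, n_sino=64 * 64, img=64):
        super().__init__()
        self.img = img
        self.fc = nn.Sequential(                  # global (dense) transform
            nn.Linear(n_sino, img * img), nn.Tanh(),
            nn.Linear(img * img, img * img), nn.Tanh())
        self.conv = nn.Sequential(                # sparse local refinement
            nn.Conv2d(1, 64, 5, padding=2), nn.ReLU(),
            nn.Conv2d(64, 64, 5, padding=2), nn.ReLU(),
            nn.Conv2d(64, 1, 7, padding=3))
    def forward(self, sino):                      # sino: (batch, 1, n_r, n_phi)
        h = self.fc(sino.flatten(1))
        return self.conv(h.view(-1, 1, self.img, self.img))
```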

Inspired by the success of AUTOMAP, Häggström et al. proposed the DeepPET method for direct PET image reconstruction using a fully convolutional network architecture [175], as shown in Fig. 9. The DeepPET architecture consists of an encoder–decoder network that mimics a modified VGG16 network [176] and includes several improvements to address the challenges of using fully convolutional networks for direct image reconstruction. First, the encoder initially utilizes larger convolution filter kernel sizes to perform wider operations in the sinogram space, similar to the global operation of the dense connections in the AUTOMAP architecture. Second, a deeper layered network structure allows the bottleneck to obtain better latent representations; the size of the bottleneck feature map is 18 × 17 × 1024, indicating that almost no spatial information of the input sinogram remains. DeepPET has far fewer trainable parameters than AUTOMAP, which uses dense connection layers (approximately 60 million for DeepPET versus approximately 800 million for AUTOMAP [18]), and can therefore be trained with a smaller dataset. DPIR-Net, a network structure similar to DeepPET, improves PET image quality by adding perceptual and adversarial losses to the loss function [177]. In addition, direct image reconstruction for long axial field-of-view PET scanners has been developed [178]. DirectPET, which incorporates a Radon inversion layer connecting a masked region of the sinogram to a local patch of the image as a PET physical model within the neural network, has also been proposed [179, 180]. Furthermore, direct image reconstruction methods using modern network structures, such as transformer networks and physics-informed networks, have been developed [181, 182].

Fig. 9

Schematic illustration of the DeepPET architecture by Häggström et al. in 2019 [175]. Each arrow collectively represents two convolution layers. The encoder part initially utilizes larger convolution filter kernel sizes, 7 × 7 for the red arrow and 5 × 5 for the blue arrows, to perform wider operations in the sinogram space, similar to the global operation with dense connections in the AUTOMAP architecture

These direct PET image reconstruction algorithms are expected to become the next generation of fast and accurate image reconstruction methods; however, they have some limitations. Consider the DeepPET reconstruction results shown in Fig. 10 as an example. At first glance, the direct reconstruction algorithm produces good PET images from sinograms. However, the detailed structures may differ from those obtained using the OSEM algorithm. This discrepancy may arise because obtaining an accurate inverse transformation from sinograms to reconstructed images in a purely data-driven manner is challenging. Consequently, these algorithms may generate artifacts or false structures in the reconstructed PET images. Another critical challenge is that these algorithms are limited to 2D image reconstruction owing to graphics processing unit memory capacity. In addition, they require a large number of training datasets to learn the backprojection task in a data-driven manner.

Fig. 10

Input sinograms and reconstructed results of the DeepPET method [175]. Columns correspond to the input sinogram, FBP, OSEM, and DeepPET results (left to right). © 2019 Elsevier. Reprinted with permission from Häggström et al. [175]

Another strategy for direct PET image reconstruction involves the use of TOF information. In general, acquired PET data (in list-mode format) are first histogrammed into the sinogram space. However, a different strategy directly histograms the acquired PET data in image space, producing what is known as a histo-image [183]. Whiteley et al. proposed the FastPET method, which obtains accurate PET images with a fast calculation time from a histo-image blurred by the TOF resolution along the LOR direction of each event (Fig. 11) [184]. FastPET differs from other direct image reconstruction methods, such as AUTOMAP and DeepPET, in that its framework takes input in image space rather than sinogram space. This implies that FastPET can employ CNNs, such as the U-Net structure. The advantage of this strategy is that it can be easily extended to 3D PET data because the input data and network structure are quite small compared with those of other direct image reconstruction algorithms. Furthermore, some improved methods use the direction information of the acquired PET data by dividing the histo-image into several projection angles (Fig. 12) [185,186,187].

Fig. 11

Schematic illustration of the FastPET framework for TOF-PET image reconstruction by Whiteley et al. in 2021 [184]

Fig. 12

Results of FastPET reconstruction. Columns correspond to the phantom image, list-mode DRAMA, and FastPET without and with direction information (left to right). Reconstructed images are annotated with the mean and standard deviation of the contrast recovery coefficients (CRCs) of three tumor regions. The use of directional information (Ote and Hashimoto [186]) improves the reconstruction performance of FastPET [184]. The figure is reprinted with modification from the work of Ote and Hashimoto [186]

5 Deep learning for iterative PET image reconstruction

Deep learning-based iterative PET image reconstruction is a hybrid approach that combines existing iterative PET image reconstruction algorithms, based on physical and statistical models, with deep learning algorithms. There are two main approaches: one incorporates a neural network as an equality constraint, and the other integrates a neural network into the objective function as a penalty. The former, synthesis-based approach to PET image reconstruction is represented by the following equation:

$$\begin{array}{*{20}c} {\hat{\user2{x}} = \mathop {{\text{argmin}}}\limits_{{\varvec{x}}} L\left( {{\varvec{y}}{|}{\varvec{x}}} \right),\quad s.t.\;{\varvec{x}} = f\left( {\theta {|}{\varvec{z}}} \right),} \\ \end{array}$$
(18)

where \(L\) is the negative Poisson log-likelihood function and \({\varvec{z}}\) is the input to the neural network \(f\) with trainable weights \(\theta\). A simpler solution utilizes a model \(f\) pre-trained for the PET image denoising task and updates the reconstructed PET image. This optimization problem is solved such that the measurement data align with the projection of the denoised PET image output by the neural network. In other words, the denoised PET image from the neural network is kept as consistent with the measurement data as possible, even though it is the output of neural-network denoising.

Gong et al. proposed an iterative PET image reconstruction algorithm using a synthesis-based prior [188]. The algorithm transforms the constrained optimization problem of Eq. (18) into an unconstrained optimization problem using the augmented Lagrangian format, which is solved using the alternating direction method of multipliers (ADMM) algorithm [189]. The reconstructed results of Gong et al. achieved superior performance in terms of the tradeoff between lesion contrast and white matter noise, as shown in Figs. 13 and 14. This framework can be readily extended by modifying the denoiser network. For example, Xie et al. replaced the network with a GAN generator incorporating a self-attention mechanism [190] to enhance image quality without introducing blurring [191]. This method produced a better tradeoff between lesion contrast recovery and background noise than the other methods. Alternatively, high-quality PET images can be obtained without any prior training dataset by introducing a DIP as the constraint, which uses the intrinsic prior of the CNN structure [192]. Ote et al. implemented 3D list-mode PET image reconstruction using DIP [193] by replacing the negative log-likelihood function in Eq. (18) with a list-mode log-likelihood function [194]. Additionally, some iterative reconstruction methods have been proposed that use only backpropagation, without any explicit backprojection process (Figs. 15 and 16) [195,196,197].
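The ADMM scheme can be sketched as follows, assuming a fixed pre-trained `denoiser` that stands in for the network-update substep of Gong et al. [188]; the x-step uses the separable EM surrogate of the log-likelihood plus the quadratic ADMM penalty, which admits a closed-form positive root.

```python
import numpy as np

def admm_recon(A, y, b_bar, denoiser, shape, rho=1.0, n_outer=30, eps=1e-12):
    """Schematic ADMM for: min_x L(y|x) s.t. x = f(theta|z) (Eq. (18))."""
    x = np.ones(A.shape[1])
    mu = np.zeros_like(x)                       # scaled dual variable
    f = denoiser(x.reshape(shape)).ravel()      # network output
    sens = A.T @ np.ones_like(y)                # sum_i a_ij
    for _ in range(n_outer):
        # x-step: EM surrogate plus quadratic penalty; solve the separable
        # quadratic for its positive root (cf. Gong et al. [188])
        y_bar = A @ x + b_bar
        em = x / np.maximum(sens, eps) * (A.T @ (y / np.maximum(y_bar, eps)))
        t = rho * (f - mu) - sens
        x = (t + np.sqrt(t * t + 4.0 * rho * sens * em)) / (2.0 * rho)
        # f-step: re-apply the (fixed) denoiser to the shifted estimate
        f = denoiser((x + mu).reshape(shape)).ravel()
        # dual update
        mu = mu + x - f
    return x
```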

Fig. 13

Reconstructed results of the iterative PET image reconstruction algorithm using CNN representation [188]. Columns represent the high-count ground truth, EM reconstruction with Gaussian filtering, fair penalty-based penalized reconstruction, dictionary learning-based reconstruction [198], CNN denoising, and the proposed iterative PET image reconstruction using CNN. © 2019 IEEE. Reprinted with permission from Gong et al. [188]

Fig. 14

Tradeoffs of the iterative PET image reconstruction algorithm using CNN representation [188] between the lesion contrast recovery (CR) and the standard deviation (STD) of the white matter region. Legends represent the same methods as in Fig. 13. © 2019 IEEE. Reprinted with permission from Gong et al. [188]

Fig. 15

Overview of the iterative PET image reconstruction using the DIP framework [195]. This is a simple image reconstruction method incorporating a forward projection model as a loss function through backpropagation. © 2022 IEEE. Reprinted with permission from Hashimoto et al. [195]

Fig. 16

Reconstructed results of the iterative PET image reconstruction using the DIP framework [195]. Columns represent the MR image, ground truth, FBP, MLEM with Gaussian filtering, DIP reconstruction by Gong et al. [192], and the proposed methods with random-noise and MRI inputs [195]. The proposed method with MRI input is visually close to the ground truth. © 2022 IEEE. Reprinted with permission from Hashimoto et al. [195]

Next, we consider the latter approach for iterative PET image reconstruction, which uses an analysis-based prior as follows:

$$\begin{array}{*{20}c} {\hat{\user2{x}} = \mathop {{\text{argmin}}}\limits_{{\varvec{x}}} L\left( {{\varvec{y}}{|}{\varvec{x}}} \right) + \beta R\left( {\varvec{x}} \right),} \\ \end{array}$$
(19)

where \(R\) is an energy function whose influence is modulated by the regularization parameter \(\beta\). For example, consider a simple case in which the energy function is as follows:

$$\begin{array}{*{20}c} {R\left( {\varvec{x}} \right) = \mathop \sum \limits_{j = 1}^{J} \left( {f\left( {\theta {|}{\varvec{z}}} \right)_{j} - x_{j} } \right)^{2} ,} \\ \end{array}$$
(20)

where \(j\) denotes the voxel index. Intuitively, Eq. (19) is less constraining than synthesis-based PET image reconstruction because the optimization only ensures that the reconstructed PET image does not deviate far from the neural-network output in the image space [18], without requiring equality with the network output.
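Plugging the gradient of Eq. (20), which is \(2\left( {x_{j} - f\left( {\theta {|}{\varvec{z}}} \right)_{j} } \right)\), into the one-step-late update of Eq. (13) gives the following sketch; `f_out` denotes the precomputed network output, and all names are illustrative.

```python
import numpy as np

def osl_network_prior_step(x, A, y, b_bar, f_out, beta, eps=1e-12):
    """One OSL update (Eq. (13)) with R(x) = sum_j (f_j - x_j)^2 (Eq. (20));
    f_out is the precomputed network output f(theta|z)."""
    sens = A.T @ np.ones_like(y)
    y_bar = A @ x + b_bar
    denom = sens + beta * 2.0 * (x - f_out)    # dR/dx evaluated at x^(k)
    return x / np.maximum(denom, eps) * (A.T @ (y / np.maximum(y_bar, eps)))
```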

Mehranian and Reader proposed PET image reconstruction via FBSEM-Net [199], which uses a forward–backward splitting algorithm [200]. The FBSEM-Net architecture is illustrated in Fig. 17. Using PET-MR data, FBSEM-Net can enhance PET image quality compared with other conventional reconstruction algorithms, as shown in Fig. 18. Kim et al. proposed a deep learning-based iterative PET image reconstruction method [201] that introduces local linear fitting, inspired by guided filtering [202], into the energy function to reduce bias in blind denoising, as follows:

$$\begin{array}{*{20}c} {R\left( {\varvec{x}} \right) = \frac{1}{2}\Vert{\varvec{x}} - q \odot f\left( {\theta {|}{\varvec{x}}} \right) - b\Vert_{2}^{2} ,} \\ \end{array}$$
(21)

where \({\varvec{x}}\) is the PET image, \(f\) is the denoiser network with weights \(\theta\), \(\odot\) is the Hadamard product, and \(q\) and \(b\) denote the local linear fitting coefficients. The method is divided into substeps for the denoiser network and the local linear fitting using the ADMM algorithm. Gong et al. proposed MAPEM-Net, which can be easily implemented by incorporating a potential function into the neural-network optimization [203]. In addition, various other iterative image reconstruction algorithms have been proposed for PET and SPECT [204,205,206,207,208,209,210,211,212,213,214,215,216].

Fig. 17

Overview of FBSEM-Net [199]. The method can control the regularization parameter in the fusion block as a trainable weight. © 2021 IEEE. Reprinted with permission from Mehranian and Reader [199]

Fig. 18

Reconstructed results of FBSEM-Net [199]. The suffixes -p and -pm indicate the use of PET data alone and PET plus MRI data as input, respectively. © 2021 IEEE. Reprinted with permission from Mehranian and Reader [199]

6 Deep learning for dynamic PET image reconstruction

PET can be used to analyze the temporal pharmacokinetics of tracers through continuous measurement after the administration of radiopharmaceuticals. Usually, kinetic parameters, such as the net influx rate Ki, are estimated by fitting compartment models voxel-by-voxel to dynamic PET images reconstructed over short time frames. Alternatively, direct parametric reconstruction algorithms for dynamic PET data have been developed to enable accurate noise modeling [217,218,219].

With the advancement of deep learning, several dynamic PET image reconstruction methods using CNNs have been proposed [220,221,222,223]. Li et al. extended the DeepPET algorithm to direct parametric image reconstruction from short-frame sinograms without using an input function [224]. Gong et al. introduced direct linear parametric PET image reconstruction using a nonlocal DIP architecture [225] with a linear kinetic model layer [226].
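As an illustration of such a linear kinetic model, the Patlak graphical method estimates the influx rate Ki and an intercept by per-voxel least squares; the sketch below assumes late-frame tissue curves `tac`, an arterial input function `cp`, and frame times `t`, all of which are illustrative stand-ins.

```python
import numpy as np

def patlak_fit(tac, cp, t):
    """Least-squares Patlak fit: C_T(t) ~ Ki * int_0^t C_p + V * C_p(t).
    tac: (n_frames, n_voxels) tissue curves; cp: (n_frames,) input function;
    t: (n_frames,) frame mid-times. Returns per-voxel (Ki, V)."""
    cp_int = np.concatenate(
        [[0.0], np.cumsum(0.5 * (cp[1:] + cp[:-1]) * np.diff(t))])
    X = np.stack([cp_int, cp], axis=1)         # (n_frames, 2) design matrix
    coef, *_ = np.linalg.lstsq(X, tac, rcond=None)
    ki, v = coef                               # each of shape (n_voxels,)
    return ki, v
```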

As another application of dynamic PET, dual-tracer PET imaging can measure two PET tracers in a single scan, which may be useful for diagnosing and tracking diseases [227, 228]. Deep learning has also been reported to be useful for these approaches [229,230,231,232,233,234].

7 Conclusion and future perspectives

We have conducted a comprehensive review of deep learning-based PET image denoising and reconstruction. The strides made in deep learning-based PET image reconstruction are remarkable. Recent advancements in PET scanner innovation are equally impressive and have aligned seamlessly with progress in deep learning technology. One recent breakthrough in PET hardware is total-body PET geometry [235,236,237], which obtains high-sensitivity PET data and can provide extremely low-noise training datasets for deep learning-based PET image reconstruction [238]. Another noteworthy innovation is the TOF technology discussed in Sect. 4. Along with advancements in PET detectors [239,240,241,242], ultrafast TOF detectors with approximately 30 ps resolution have been developed, enabling reconstruction-free positron emission imaging [243]. The synergy between state-of-the-art TOF and deep learning technologies has pushed the limits of TOF performance [244,245,246]. Undoubtedly, the integration of deep learning will play a pivotal role in enhancing the performance of not only PET imaging but also signal processing [247,248,249,250].