1 Introduction

Sparse data-acquisition protocols are widely used in magnetic resonance imaging (MRI) in order to shorten scanning times. In contrast, in computed tomography (CT), the data acquisition process is fast, while reducing radiation exposure is an important clinical issue. One possible way to reduce radiation exposure is to decrease the tube current of the X-ray emitting source. However, the direct consequence is decreased image quality due to higher image noise. In this paper, we use a sparse view data-acquisition scheme to reach a significant reduction of radiation exposure in CT. This can be achieved by masking the X-ray source at certain angular positions during the rotation of the scanner, thereby preventing some X-rays from passing through the patient. Using standard algorithms, images reconstructed from sparse view data exhibit undersampling artifacts which are related to the scanner geometry as well as the sub-sampling scheme used for data acquisition.

Recently, deep neural networks have been shown to be a promising alternative to current state-of-the-art iterative methods for the reconstruction from heavily undersampled CT data. In particular, the U-net [6] has shown excellent performance in the restoration of undersampled images in CT and MRI [4]. However, these standard network designs can be viewed as post-processing methods, as the network used to remove the artifacts is the only learned component in the reconstruction pipeline. As a consequence, these methods may lack data consistency. In this paper we propose a new network architecture for image reconstruction from undersampled data in sparse view CT. Our network structure is inspired by the network cascade developed in [7] and consists of a cascade of convolutional neural networks and data consistency layers which minimize a properly chosen functional. However, while the approach in [7] is based on the isometry of the full MRI forward operator, our data consistency layer is directly applicable to general inverse problems as well. Furthermore, the fully convolutional neural networks (FCNNs) with residual connections are replaced by U-nets. For different, gradient-descent-like data consistency layers, see [2, 3].

1.1 Sparse View Computed Tomography

Here and in what follows we work in the discrete setting. By \(\mathbf {x}\in \mathbb R^n\) we denote the vectorized image of size \(m \times m\), with \(m^2=n\), which represents the two-dimensional X-ray attenuation function, and we write \(\mathbf {y}\in \mathbb R^d\) for a fully sampled sinogram. Further, we use \(\mathbf {R}\) to denote the discretized forward operator of a CT scanner, i.e. the discrete X-ray transform specified by the scanner’s geometry. We denote the pseudoinverse of the discretized forward operator by \(\mathbf {R}^{\dag }\). Note that the continuous form of the Radon transform is injective but not surjective. Therefore, we may assume that the Radon transform \(\mathbf {R}\) is sampled sufficiently finely such that the discretized full data operator is injective but not surjective as well. In any case, the approach presented below works for an arbitrary discrete transform \(\mathbf {R}\in \mathbb R^{d \times n}\).

Assume the data is measured only for lines corresponding to a subset \(I \subset J \triangleq \{1,\ldots ,d\} \), where J is the full set of projections. The corresponding discretized sparse data forward operator can be modeled by \(\mathbf {R}_I= \mathbf {S}_I\mathbf {R}\), where the sub-sampling operator is given by

$$\begin{aligned} \mathbf {S}_I\mathbf {y}(i) \triangleq {\left\{ \begin{array}{ll} \mathbf {y}(i) &{}\text { if } i \in I \\ 0 &{}\text { if } i \in I^c: = J \setminus I. \end{array}\right. } \end{aligned}$$
(1)

The sparse data image reconstruction problem then consists of recovering the image \(\mathbf {x}\in \mathbb R^n\) from the measured subset of projections \(\mathbf {y}_I\triangleq \mathbf {S}_I\mathbf {y}\), i.e. we want to solve

$$\begin{aligned} \mathbf {R}_I\mathbf {x}= \mathbf {y}_I. \end{aligned}$$
(2)
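The sub-sampling operator in (1) and the sparse forward operator in (2) can be illustrated on a toy problem. The following is a minimal sketch, assuming a small random matrix as a stand-in for the discrete X-ray transform; the helper name `subsampling_matrix`, the sizes and the index set are hypothetical and chosen only for illustration. Note that indices are 0-based in the code, whereas the text uses 1-based indices.

```python
import numpy as np

def subsampling_matrix(d, I):
    """Diagonal matrix S_I from (1): keeps entries with index in I, zeroes the rest."""
    s = np.zeros(d)
    s[list(I)] = 1.0
    return np.diag(s)

rng = np.random.default_rng(0)
d, n = 8, 4                        # toy sizes (assumption)
R = rng.standard_normal((d, n))    # stand-in for the discrete X-ray transform
I = [0, 3, 6]                      # measured indices (hypothetical)

S_I = subsampling_matrix(d, I)
R_I = S_I @ R                      # sparse data forward operator R_I = S_I R

x = rng.standard_normal(n)         # toy image vector
y_I = R_I @ x                      # zero-filled sparse data, right-hand side of (2)
print(y_I)
```

The vector `y_I` has the same length as a full sinogram, with zeros at all positions that are not measured.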

2 Proposed Network Architecture

In the full data case, (2) can be solved by filtered back-projection, which is a stable numerical implementation of \(\mathbf {R}^{\dag }\). However, in the sparse view case we have \(|I |\ll |I^{\mathrm {c}} |\) and the application of \(\mathbf {R}^{\dag }\) to data \(\mathbf {y}_I\) yields images with severe artifacts. Images of diagnostic quality can usually be obtained by iterative reconstruction methods designed to minimize \(\mathcal {R}(\mathbf {x}) + \lambda \Vert \mathbf {R}_I\mathbf {x}- \mathbf {y}_I\Vert _{p}^p\), where \(\mathcal {R}(\mathbf {x})\) is a regularizer and \(\Vert \cdot \Vert _{p}\) denotes a norm which ensures data consistency. Typical choices for the regularizer are the total variation, or the \(\ell ^1\)-norm with respect to a frame or a trained dictionary. As a drawback, these methods are usually computationally expensive since they rely on repeated applications of the forward and adjoint operators. Furthermore, using regularization based solely on prior assumptions will likely bias the result.
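To make the cost of iterative methods concrete, the following sketch minimizes a functional of the above form by plain gradient descent. It is an illustration only: the regularizer is swapped for the simple quadratic \(\Vert \mathbf {x}\Vert _2^2\) (not the total variation or \(\ell ^1\) choices named above), \(p=2\) is assumed, and the operator is a small random matrix. All names and sizes are hypothetical. Each iteration applies the forward operator once and the adjoint once.

```python
import numpy as np

def tikhonov_gradient_descent(R_I, y_I, lam, n_iter=2000):
    """Minimize ||x||_2^2 + lam * ||R_I x - y_I||_2^2 by gradient descent."""
    x = np.zeros(R_I.shape[1])
    # Lipschitz constant of the gradient; 1/L is a safe step size.
    L = 2.0 * (1.0 + lam * np.linalg.norm(R_I, 2) ** 2)
    for _ in range(n_iter):
        residual = R_I @ x - y_I                         # forward operator
        grad = 2.0 * x + 2.0 * lam * (R_I.T @ residual)  # adjoint operator
        x = x - grad / L
    return x

rng = np.random.default_rng(0)
d, n, lam = 12, 6, 1.0                  # toy sizes and weight (assumptions)
R = rng.standard_normal((d, n))         # stand-in for the forward operator
mask = np.arange(d) % 2 == 0            # every second projection is measured
R_I = mask[:, None] * R                 # rows outside I are set to zero
x_true = rng.standard_normal(n)
y_I = R_I @ x_true

x_gd = tikhonov_gradient_descent(R_I, y_I, lam)
# For this quadratic stand-in the minimizer is also available in closed form.
x_closed = np.linalg.solve(np.eye(n) + lam * R_I.T @ R_I, lam * R_I.T @ y_I)
print(np.max(np.abs(x_gd - x_closed)))
```

For realistic CT operators no such closed form is practical, and the repeated operator evaluations dominate the run time.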

Methods based on neural networks, as for example in [4], propose non-iterative regularization approaches. Given an estimated solution \(\mathbf {x}_I\) of (2), regularized images are obtained as the output of a CNN \(f\) which is previously trained on a dataset of pairs \((\mathbf {x}_I,\mathbf {x}_{\mathrm {full}})\), where \(\mathbf {x}_{\mathrm {full}}\) is an image obtained from the reconstruction of a fully-sampled measurement. Such a procedure amounts to a subsequent regularization of the initial solution \(\mathbf {x}_I\) rather than a joint minimization of \(\mathcal {R}(\mathbf {x}) + \lambda \Vert \mathbf {R}_I\mathbf {x}- \mathbf {y}_I\Vert _{p}^p\). Therefore, following [7], we propose to train different networks interleaved with data consistency (DC) layers.

2.1 Data Consistency Layer

Let \(f_{\varTheta }\) be a previously trained CNN with parameters \(\varTheta \). Given measured data \(\mathbf {y}_I\), we can apply a CNN to map \(\mathbf {x}_I\) to its corresponding label, i.e. \( f_{\varTheta }(\mathbf {x}_I) \simeq \mathbf {x}_{\mathrm {full}}\) where \(\mathbf {x}_I\triangleq \mathbf {R}^{\dag } \mathbf {y}_I\). However, the CNN reconstruction \(f_{\varTheta }(\mathbf {x}_I)\) may not satisfy the data consistency condition \(\mathbf {R}_I(f_{\varTheta }(\mathbf {x}_I)) \simeq \mathbf {y}_I\).

In order to improve data consistency, we define a new reconstruction \(f_{\mathrm {dc}}(\mathbf {x}_{\mathrm {cnn}}, \mathbf {y}_I, \lambda ) \triangleq \mathbf {R}^{\dag } ( \mathbf {z}_\mathrm{dc} ) \) where \(\mathbf {z}_\mathrm{dc} \in \mathbb R^d\) is the minimizer of the functional given by

$$\begin{aligned} F_{\varTheta , \mathbf {y}_I, \mathbf {x}_\mathrm{cnn}, \lambda }(\mathbf {z}) \triangleq ||\mathbf {R}( \mathbf {x}_\mathrm{cnn}) - \mathbf {z}||_2^2 + \lambda ||\mathbf {y}_I- \mathbf {S}_I\mathbf {z}||_2^2 \,, \end{aligned}$$
(3)

with \(\mathbf {x}_\mathrm{cnn} = f_{\varTheta }(\mathbf {x}_I)\) denoting the output of the trained CNN.

Here, the term \(||\mathbf {y}_I- \mathbf {S}_I\mathbf {z}||_2^2 \) enforces data consistency and \(||\mathbf {R}( \mathbf {x}_\mathrm{cnn}) - \mathbf {z}||_2^2 \) uses \(\mathbf {x}_\mathrm{cnn}\) to regularize in Radon space. In contrast to [7], where the regularization term \(\Vert \mathbf {x}_\mathrm {cnn} - \mathbf {x}\Vert _2^2\) in image space is used, the proposed regularization in data space yields the following representation of the DC layer for general, possibly non-orthogonal transforms.

Theorem 1

Let \(\mathbf {R}\in \mathbb R^{d \times n}\) be a real valued matrix and \(\mathbf {R}_I= \mathbf {S}_I\mathbf {R}\), where \(\mathbf {S}_I\) is the subsampling operator defined in (1). The data consistency layer \(f_{\mathrm {dc}}(\mathbf {x}_{\mathrm {cnn}}, \mathbf {y}_I, \lambda ) \) is well defined by (3) and takes the explicit form

$$\begin{aligned} f_{\mathrm {dc}}(\mathbf {x}_{\mathrm {cnn}}, \mathbf {y}_I, \lambda ) = \mathbf {R}^{\dag } \big ( \mathbf A\mathbf {R}\mathbf {x}_{\mathrm {cnn}} + \frac{\lambda }{1+\lambda } \mathbf {y}_I\big ), \end{aligned}$$
(4)

where \(\mathbf A= {\text {diag}}(a_1, \dots , a_d)\) is a diagonal matrix of size \(d \times d\) with diagonal entries \(a_i = 1\) if \(i \not \in I\) and \(a_i = 1/(1+\lambda )\) otherwise.

Proof

The functional in (3) takes the separable form \(\sum _{i \in J} |\mathbf {R}\mathbf {x}_\mathrm{cnn}(i) - \mathbf {z}(i) |_2^2 + \lambda |\mathbf {y}_I(i)- (\mathbf {S}_I\mathbf {z})(i) |_2^2\). Hence, the minimizer of \( F_{\varTheta , \mathbf {y}_I, \mathbf {x}_\mathrm{cnn}, \lambda }\) is unique and can be found by component-wise minimization. Elementary computations show (4).
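The elementary computation can be sketched as follows, writing \(r_i \triangleq (\mathbf {R}\mathbf {x}_\mathrm{cnn})(i)\) and \(z_i \triangleq \mathbf {z}(i)\) (shorthand introduced here only for brevity) and using that \(\mathbf {y}_I(i) = 0\) for \(i \not \in I\):

$$\begin{aligned} i \not \in I:&\quad (r_i - z_i)^2 + \lambda \, \mathbf {y}_I(i)^2 \text { is minimal for } z_i = r_i, \\ i \in I:&\quad \frac{\mathrm {d}}{\mathrm {d} z_i} \big [ (r_i - z_i)^2 + \lambda (\mathbf {y}_I(i) - z_i)^2 \big ] = 0 \;\Longleftrightarrow \; z_i = \frac{1}{1+\lambda }\, r_i + \frac{\lambda }{1+\lambda }\, \mathbf {y}_I(i). \end{aligned}$$

Both cases are collected in \(\mathbf {z}_\mathrm{dc} = \mathbf A\mathbf {R}\mathbf {x}_{\mathrm {cnn}} + \frac{\lambda }{1+\lambda } \mathbf {y}_I\), and applying \(\mathbf {R}^{\dag }\) gives (4).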

The matrix \(\mathbf A\) ensures that, when the i-th projection is not available from the measurements, \((\mathbf {R}\mathbf {x})(i)\) is directly estimated from the projection data of the output of the CNN. Otherwise, \((\mathbf {R}\mathbf {x})(i)\) is calculated as a linear combination of the CNN coefficient \(\mathbf {R}\mathbf {x}_{\mathrm {cnn}}(i) \) and the measured coefficient \(\mathbf {y}_I(i)\). Note that the evaluation of (4) requires the application of the pseudoinverse, which might be numerically unstable. In the numerical implementation, the pseudoinverse \(\mathbf {R}^{\dag }\) is replaced by an appropriate regularization. We emphasize that this issue is not present in MRI reconstruction, as the corresponding full data operator is bijective and the inverse well-conditioned. Therefore, the extension of the corresponding data consistency layer from MRI to CT is a non-trivial issue.
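The explicit form (4) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the implementation used in the experiments: the forward operator is a small dense random matrix, and `np.linalg.pinv` stands in for the regularized pseudoinverse mentioned above. The function name `dc_layer` and the boolean `mask` encoding of the index set are hypothetical.

```python
import numpy as np

def dc_layer(x_cnn, y_I, mask, R, R_pinv, lam):
    """Evaluate (4): R^dag (A R x_cnn + lam / (1 + lam) * y_I).

    mask is a boolean vector of length d that is True where i is in I.
    Returns the image-space output and the data-space minimizer z_dc of (3).
    """
    a = np.where(mask, 1.0 / (1.0 + lam), 1.0)       # diagonal of A
    z_dc = a * (R @ x_cnn) + lam / (1.0 + lam) * y_I
    return R_pinv @ z_dc, z_dc

rng = np.random.default_rng(1)
d, n, lam = 10, 5, 4.0                   # toy sizes and weight (assumptions)
R = rng.standard_normal((d, n))          # stand-in for the forward operator
R_pinv = np.linalg.pinv(R)               # dense pseudoinverse, toy setting only
mask = np.zeros(d, dtype=bool)
mask[::3] = True                         # hypothetical index set I

x_true = rng.standard_normal(n)
y_I = np.where(mask, R @ x_true, 0.0)    # zero-filled measured data
x_cnn = x_true + 0.1 * rng.standard_normal(n)   # mock CNN output

x_dc, z_dc = dc_layer(x_cnn, y_I, mask, R, R_pinv, lam)
print(x_dc)
```

For large \(\lambda \) the measured entries of `z_dc` approach `y_I`, while the unmeasured entries are always taken from the forward projection of the CNN output.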

2.2 U-Nets Cascade

Here, the term U-net refers to any residual encoder-decoder network architecture with a structure similar to the one presented in [4]. However, in our experiments we vary the number of stages which are used to encode the input, the number of convolutional layers per stage, the initial number of feature maps which are extracted from the input and the factor by which the number of feature maps is increased after each max-pooling layer. In order to satisfy the data consistency condition \(\mathbf {R}_I(f_{\varTheta }(\mathbf {x}_I)) \simeq \mathbf {y}_I\), we propose to construct a sequence of U-nets which are interleaved with DC layers as described in Subsect. 2.1. While the U-nets tackle the removal of the undersampling artifacts, the DC layers account for data consistency in Radon space. Figure 1 shows the structure of a U-nets cascade, where each U-net consists of three encoding stages and two convolutional layers per stage.

Fig. 1. A cascade of U-nets with intermediate data consistency layers.
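The alternation of sub-networks and DC layers can be sketched as a simple loop. This is a minimal sketch under several assumptions: the sub-networks are replaced by identity placeholders instead of trained U-nets, the operator is a small dense random matrix with `np.linalg.pinv` as pseudoinverse, and a DC layer is applied after every sub-network, including the last one, which the text does not specify. All function names are hypothetical.

```python
import numpy as np

def dc_layer(x_cnn, y_I, mask, R, R_pinv, lam):
    """DC layer in the explicit form (4); mask is True where i is in I."""
    a = np.where(mask, 1.0 / (1.0 + lam), 1.0)
    return R_pinv @ (a * (R @ x_cnn) + lam / (1.0 + lam) * y_I)

def cascade(x_I, y_I, mask, R, R_pinv, networks, lam):
    """Apply each sub-network followed by a data consistency layer."""
    x = x_I
    for net in networks:
        x = net(x)                                   # artifact removal step
        x = dc_layer(x, y_I, mask, R, R_pinv, lam)   # data consistency step
    return x

def identity_net(x):
    """Placeholder for a trained U-net (assumption, no learning involved)."""
    return x

rng = np.random.default_rng(2)
d, n, lam = 10, 5, 4.0                   # toy sizes and weight (assumptions)
R = rng.standard_normal((d, n))
R_pinv = np.linalg.pinv(R)
mask = np.zeros(d, dtype=bool)
mask[::2] = True                         # hypothetical index set I
x_true = rng.standard_normal(n)
y_I = np.where(mask, R @ x_true, 0.0)

x_I = R_pinv @ y_I                       # initial reconstruction from sparse data
x_out = cascade(x_I, y_I, mask, R, R_pinv, [identity_net] * 3, lam)
print(x_out)
```

In a trained cascade each placeholder would be a separate U-net, and the loop would be part of the computational graph so that gradients pass through the DC layers.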

3 Numerical Experiments

3.1 Dataset

We test our proposed network architecture on a dataset consisting of cardiac CT images from 52 patients. The 3D volumes contain between 240 and 640 slices per patient. For each slice, the undersampled data \(\mathbf {y}_I\) is generated according to a parallel-beam geometry where we cover a half rotation of 180\(^{\circ }\) of the scanner with only 32 angles. The images \(\mathbf {x}_I\) are obtained by applying filtered back-projection \(\mathbf {R}^{\dag }\) with a Ram-Lak filter to \(\mathbf {y}_I\). The operator \(\mathbf {R}\) is assumed to perform 512 projections. We use the images of 40 patients for training, 6 for validation and 6 for testing. For computational reasons and in order to allow us to build neural networks with a certain depth, the images are first downsampled from \(512\times 512\) to \(256\times 256\) pixels.
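The angular sub-sampling can be sketched as a mask on a sinogram array. The sketch below assumes that the 512 projections refer to 512 projection angles, that the 32 retained angles are uniformly spaced, and that the sinogram is stored with one row per angle; none of these details is stated explicitly above. The detector count and the constant sinogram are placeholders.

```python
import numpy as np

n_angles_full = 512     # projections of the full operator (from the text)
n_angles_sparse = 32    # retained angles (from the text)
n_det = 64              # detector pixels per angle (hypothetical)

step = n_angles_full // n_angles_sparse       # keep every 16th angle (assumption)
kept = np.arange(0, n_angles_full, step)      # indices of the retained angles

mask = np.zeros((n_angles_full, n_det), dtype=bool)
mask[kept, :] = True                          # index set I, one row per angle

sinogram = np.ones((n_angles_full, n_det))    # placeholder for a full sinogram
y_I = np.where(mask, sinogram, 0.0)           # zero-filled sparse view data
print(len(kept), mask.mean())
```

With these numbers only one sixteenth of the full data is retained.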

3.2 Network Architectures and Training

In all our experiments we train the U-nets cascade to minimize the \(L_2\)-error between the predicted output of the cascade and the corresponding label. All architectures are trained for 20 epochs by stochastic gradient descent. When one single U-net is used, we decrease the learning rate from \(10^{-7}\) to \(10^{-9}\). For all other architectures which contain the operators \(\mathbf {R}\) and \(\mathbf {R}^{\dag }\), a more conservative learning rate which is decreased from \(10^{-10}\) to \(10^{-14}\) has to be chosen for numerical stability. The network architectures are implemented in TensorFlow and the scanner geometry, the forward and the pseudoinverse operators \(\mathbf {R}\) and \(\mathbf {R}^{\dag }\) are implemented in ODL [1]. We parametrize a U-net cascade according to the following hyperparameters:

  • U - the number of U-nets employed in the cascade

  • E - the number of stages used for the encoding of each U-net

  • C - the number of convolutional layers per stage for each U-net

  • K - the number of feature maps which are initially extracted from the input of each U-net

  • F - the factor by which the number of feature maps is increased after the max-pooling layers of each U-net.

For example, U1 E5 C4 K64 F2 denotes a single U-net architecture similar to the one presented in [4]. On the other hand, U4 E1 C4 K64 denotes a FCNN cascade as discussed in [7]. Note that, in such a case, we omit the hyperparameter F in the notation, since due to the absence of max-pooling layers, the number of extracted feature maps stays constant over the different stages.
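The notation can be decoded mechanically, which is convenient when many configurations are compared. The following helper is a hypothetical sketch and not part of the described implementation; it returns `None` for F when the factor is omitted.

```python
import re

def parse_cascade(name):
    """Parse a string such as 'U4 E3 C2 K32 F2' into a dict of integers."""
    params = {}
    for token in name.split():
        match = re.fullmatch(r"([UECKF])(\d+)", token)
        if match is None:
            raise ValueError(f"unrecognized token: {token!r}")
        params[match.group(1)] = int(match.group(2))
    missing = set("UECK") - set(params)
    if missing:
        raise ValueError(f"missing hyperparameters: {sorted(missing)}")
    params.setdefault("F", None)   # F is omitted when there is no max-pooling
    return params

print(parse_cascade("U1 E5 C4 K64 F2"))
print(parse_cascade("U4 E1 C4 K64"))
```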

For a fair comparison, we try to keep the number of trainable parameters approximately equal for the architectures we compare. Note that due to the large number of possible combinations of hyperparameters, it is computationally demanding to conduct experiments which clearly reveal the effect of each hyperparameter. However, we identify the presence of max-pooling layers as the main difference between the proposed U-net cascade and the cascade in [7] in terms of the feature extraction operations of the subnetworks. Therefore, in order to reach a certain number of trainable parameters, we always favour increasing the number of encoding stages over increasing the number of convolutional layers per stage, the number of extracted feature maps or the factor by which they are increased after the max-pooling layers.

For the evaluation of the performance of the network we report the peak signal-to-noise ratio (PSNR), the relative \(L_2\)-error (NRMSE), the structural similarity index measure (SSIM) and the Haar-wavelet based perceptual similarity index measure (HPSI, [5]) which has been reported to achieve higher correlation with human opinion scores than SSIM on various benchmark databases.
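Two of these measures are simple enough to state directly. The sketch below uses common definitions as an assumption: NRMSE as the \(L_2\)-norm of the error divided by the \(L_2\)-norm of the reference, and PSNR with the maximum of the reference image as peak value. The exact conventions used for the reported numbers are not specified here. SSIM and HPSI are omitted because they require considerably more machinery.

```python
import numpy as np

def nrmse(x, ref):
    """Relative L2-error between a reconstruction x and a reference image."""
    return np.linalg.norm(x - ref) / np.linalg.norm(ref)

def psnr(x, ref, peak=None):
    """Peak signal-to-noise ratio in dB; peak defaults to ref.max() (assumption)."""
    if peak is None:
        peak = ref.max()
    mse = np.mean((x - ref) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.array([[0.0, 1.0], [1.0, 0.0]])   # toy reference image
x = ref + 0.1                              # toy reconstruction with constant error
print(nrmse(x, ref), psnr(x, ref))
```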

Fig. 2. Comparison of different cascades. 32-views FBP-reconstruction (first column), ground truth (second column), U4 E1 C4 K64 (third column), U4 E4 C2 K32 (fourth column). The red circles indicate newly introduced or not correctly removed artifacts from the reconstruction with the FCNNs-cascade. (Color figure online)

3.2.1 Effect of the U-Net:

Here, we investigate the effect of replacing the FCNNs discussed in [7] by U-nets. Table 1 lists the average of the aforementioned quantitative measures over the test set. In terms of PSNR, SSIM and NRMSE, both cascades deliver similar results. On the other hand, we report a statistically significant increase of the mean value of HPSI for all tested U-nets cascades (\(p < 0.001\) in all cases). Figure 2 shows two examples of reconstructed images of the test set. Due to the relatively small number of trainable parameters and the high undersampling factor, neither approach entirely removes the undersampling artifacts, and both fail to recover fine details. Note, however, that the cascade with the FCNNs even introduces new artifacts. This phenomenon can be observed in several images reconstructed with the FCNNs cascade. On the other hand, the U-nets cascade seems to better preserve the overall structure of the images.

Table 1. Comparison of the proposed U-nets cascade with a cascade of FCNNs with residual connections. The measures are averaged over the test set.

Fig. 3. Variation of the length of the cascade. Ground truth (top left), FBP-reconstruction from undersampled data (bottom left), U1 E3 C2 K64 F2-reconstruction (top middle), U2 E3 C4 K32 F2-reconstruction (bottom middle), U3 E3 C3 K64-reconstruction (top right), U4 E3 C2 K32 F2-reconstruction (bottom right). The yellow arrows point at the left coronary artery. (Color figure online)

Table 2. Variation of the length of the U-nets cascade. The measures are averaged over the test set.

3.2.2 Effect of the Cascade:

In this experiment, we test different network architectures where we vary the length of the cascade. Figure 3 shows an image reconstructed with different network cascades. The results show that the left coronary artery is better visible in the images reconstructed with the U-nets cascades compared to a single U-net. In contrast to the results presented in [7], increasing the length of the cascades does not further improve the results. We attribute this to the fact that the inversion of the Radon-transform is ill-posed and therefore, numerical errors due to the inversion of \(\mathbf {R}\) prevail over the presence of the data consistency layers. However, when we replace a single U-net by a U-nets cascade, the network’s performance statistically significantly increases (\(p< 0.001\)) with respect to all measures except for SSIM, where a single U-net yields the best results, see Table 2.

3.3 Conclusion

In this work, we have presented a new network architecture for image reconstruction in sparse view CT. Replacing the FCNNs by U-nets in the cascade in [7] visually improves the reconstruction in sparse view CT. The proposed U-nets cascade outperforms the single U-net architecture with respect to all reported quantitative measures except for SSIM and better preserves fine anatomic details. By adapting the data-acquisition process and the index set I, the architecture is directly applicable to other limited data inverse problems such as limited angle CT where we expect the method to deliver even better results as the portion of measured data which can be used in the reconstruction is significantly larger. Furthermore, we expect the extension of the network cascade employing U-nets as sub-networks also to further improve the image reconstruction in MRI.