
1 Introduction

Speed-of-sound (SoS) ultrasound computed tomography (USCT) is a promising imaging modality that generates maps of the speed of sound in tissue as an imaging biomarker. Potential clinical applications include the differentiation of tumorous breast lesions [3], breast density assessment [13, 15], and the staging of musculoskeletal [11] and non-alcoholic fatty liver disease [7], amongst others. For this, a set of time-of-flight (ToF) measurements through the tissue between pairs of transmit/receive elements of an ultrasonic array can be used for tomographic reconstruction. Various 2D and 3D acquisition setups have been proposed, including circular or dome-shaped transducer geometries, which provide a multilateral set of measurements that is convenient for reconstruction methods [8] but costly to manufacture and cumbersome to use. The hand-held reflector-based setup [10, 14] depicted in Fig. 1a uses a conventional portable ultrasound probe to measure ToF via wave reflections off a plate placed on the opposite side of the sample. Despite its simplicity, such a setup results in limited-angle (LA) CT, which requires prior assumptions as well as suitable regularization and numerical optimization techniques to produce meaningful reconstructions [14]. Such optimization techniques may not be guaranteed to converge, are often slow in runtime, and involve parameters that are difficult to set.

In this paper, we propose a problem-specific variational network [1, 5] for limited-angle SoS reconstruction, with parameters learned from numerous forward simulations. In contrast to machine learning methods based on sinogram inpainting [16] and reconstruction artefact removal [6] for LA-CT, we learn the reconstruction process end-to-end and show that this yields qualitative improvements over conventional reconstruction.

2 Methods

Using the wave reflection tracking algorithm described in [14], we measure the ToF \(\varDelta t\) between transmit (Tx) and receive (Rx) transducers in an \(M = 128\)-element linear ultrasound array (see Fig. 1a). Discretizing the corresponding ray paths using a Gaussian sampling kernel, the inverse of ToF can be expressed as a linear combination of tissue slowness values x [s/m], i.e. \((\varDelta t)^{-1}=\sum _{i\in \text {Ray}}l_i x_i\). Considering a Cartesian \(n_1 \times n_2 = P\) grid, we define the forward model

$$\begin{aligned} \mathbf {b}= \text {{diag}}(\mathbf {m})\mathbf {L}\mathbf {x}+ \mathcal {N}(\mathbf {0}, \sigma _N\mathbf {I}), \end{aligned}$$
(1)

where \(\mathbf {x}\in \mathbb {R}^{P}\) is the inverse SoS (slowness) map, \(\mathbf {L}\in \mathbb {R}^{ M^2\times P}\) is a sparse path matrix defined by the acquisition geometry and discretization scheme, \(\mathbf {m}\in \{0,1\}^{M^2}\) is the undersampling mask with zeros indicating a missing (e.g., unreliable) ToF measurement between the corresponding Tx-Rx pair, and \(\mathbf {b}\in \mathbb {R}^{M^2}\) is a zero-filled vector of measured inverse ToFs \((\varDelta t)^{-1}\). Reconstructing a slowness map \(\mathbf {x}\) amounts to inverting (1) and can be posed as the following convex optimization problem:

$$\begin{aligned} \hat{\mathbf {x}}(\mathbf {b}, \mathbf {m}; \lambda , \varvec{\nabla }) = \mathop {\mathrm {argmin}}\limits _{\mathbf {x}}\, \Vert \text {{diag}}(\mathbf {m})\mathbf {L}\mathbf {x}- \mathbf {b}\Vert _1 + \lambda \Vert \varvec{\nabla }\mathbf {x}\Vert _1, \end{aligned}$$
(2)

which we solve using the ADMM algorithm [2] with Cholesky factorization. Here \(\varvec{\nabla }\) is a regularization matrix and \(\lambda \) is the regularization weight.
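To make the forward model (1) concrete, the following Python sketch assembles a toy sparse path matrix for the reflector geometry and simulates masked, noisy measurements. It is an illustration only: the straight two-segment rays, the nearest-cell rasterization (instead of the Gaussian sampling kernel used above), the grid size, the element layout, and the noise handling are all simplifying assumptions.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

# Toy version of the forward model (1): an n1 x n2 slowness grid and an M-element
# array on the top edge; each Tx-Rx pair is modeled as a V-shaped ray reflected at
# the bottom plate, rasterized with nearest-cell hits.
n1, n2, M = 32, 32, 16
P = n1 * n2

def ray_cells(p0, p1, n_samples=200):
    """Grid cells visited by a straight segment from p0 to p1, with per-sample lengths."""
    t = np.linspace(0.0, 1.0, n_samples)
    pts = (1 - t)[:, None] * p0 + t[:, None] * p1
    rows = np.clip(pts[:, 0].astype(int), 0, n1 - 1)
    cols = np.clip(pts[:, 1].astype(int), 0, n2 - 1)
    seg = np.linalg.norm(p1 - p0) / n_samples          # length per sample (grid units)
    return rows * n2 + cols, np.full(n_samples, seg)

elem = np.linspace(0, n2 - 1, M)                       # element positions (columns)
r_idx, c_idx, vals = [], [], []
for tx in range(M):
    for rx in range(M):
        bounce = np.array([n1 - 1.0, 0.5 * (elem[tx] + elem[rx])])   # reflector hit
        for p0, p1 in [(np.array([0.0, elem[tx]]), bounce),
                       (bounce, np.array([0.0, elem[rx]]))]:
            idx, seg = ray_cells(p0, p1)
            r_idx += [tx * M + rx] * len(idx)
            c_idx += idx.tolist()
            vals += seg.tolist()
L = sp.csr_matrix((vals, (r_idx, c_idx)), shape=(M * M, P))   # sparse path matrix

# Ground-truth slowness map with a circular inclusion, then b = diag(m) L x + noise,
# zero-filled at missing Tx-Rx pairs.
x_true = np.full(P, 1 / 1500.0)
yy, xx = np.mgrid[0:n1, 0:n2]
x_true[((yy - 16) ** 2 + (xx - 16) ** 2 < 36).ravel()] = 1 / 1560.0
m = (rng.random(M * M) > 0.3).astype(float)            # ~30% missing measurements
b = m * (L @ x_true + rng.normal(0.0, 2e-8, M * M))    # zero-filled measurement vector
```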

It is common to choose a regularization matrix \(\varvec{\nabla }_\text {TV}\) that implements spatial gradients on the Cartesian grid, yielding total variation (TV) regularization [12]. TV efficiently recovers sharp image boundaries, but can introduce signal underestimation and staircase artefacts that are amplified by the limited-angle acquisition. In an attempt to remedy this problem, one can carefully construct a set of image filters that penalize problem-specific reconstruction artefacts. We follow [14] and use a regularization matrix \(\varvec{\nabla }_\text {MATV}\) that implements convolution with a set of weighted directional gradient operators. This weights the regularization according to known wave path information, such that locations with information from a narrower angular range are regularized more.
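The sketch below continues the previous one as a hedged illustration: it builds \(\varvec{\nabla }_\text {TV}\) as stacked finite differences, indicates the per-pixel re-weighting idea behind \(\varvec{\nabla }_\text {MATV}\) with placeholder uniform weights (the actual weights in [14] are derived from the ray geometry and are not reproduced here), and solves (2) with a generic convex solver rather than the ADMM scheme used in the paper. The regularization weight is an arbitrary example value, not a tuned one.

```python
import cvxpy as cp

# Plain TV operator: vertical and horizontal forward differences on the n1 x n2 grid.
Dv = sp.kron(sp.diags([-1, 1], [0, 1], shape=(n1 - 1, n1)), sp.eye(n2))
Dh = sp.kron(sp.eye(n1), sp.diags([-1, 1], [0, 1], shape=(n2 - 1, n2)))
nabla_tv = sp.vstack([Dv, Dh]).tocsr()

# MA-TV-style variant: each directional difference is re-weighted per pixel; uniform
# weights are used here as a stand-in for the geometry-derived weights of [14].
w_v = np.ones(Dv.shape[0])
w_h = np.ones(Dh.shape[0])
nabla_matv = sp.vstack([sp.diags(w_v) @ Dv, sp.diags(w_h) @ Dh]).tocsr()

# Reconstruction (2) with a generic convex solver for illustration only.
lam = 1e-3
x = cp.Variable(P)
objective = cp.norm1(cp.multiply(m, L @ x) - b) + lam * cp.norm1(nabla_tv @ x)
cp.Problem(cp.Minimize(objective)).solve()
x_hat = x.value.reshape(n1, n2)
```

Swapping nabla_tv for nabla_matv in the objective switches between the two regularizers without changing the rest of the problem.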

Fig. 1. (a) Acquisition setup and ray tracing discretization. (b) Structure of the variational network; tunable parameters of the layer are highlighted in red. (c) Samples from the synthetic training set \(\mathcal {T}\) and the testing set of geometric primitives \(\mathcal {P}\). (Color figure online)

2.1 Variational Network

Variational networks (VNs) are a class of deep learning methods that incorporate a parametrized prototype of a reconstruction algorithm in a differentiable manner. A successful VN architecture proposed by Hammernik et al. [5] for undersampled MRI reconstruction unrolls a fixed number of iterations of the gradient descent (GD) algorithm applied to a virtual optimization-based reconstruction problem. By unrolling the iterations of the algorithm into network layers (see Fig. 1b), the output is expressed as a formula parametrized by the regularization parameters and step lengths of this GD algorithm. These parameters are then tuned on retrospectively undersampled training data.

In contrast to the discrete Fourier transform, the design matrix of LA-CT is poorly conditioned, which compromises the performance of conventional GD. Therefore, we propose to enhance the VN in the following ways: (i) unroll GD with momentum, (ii) add a left diagonal preconditioner \(\mathbf {p}^{(k)}\in \mathbb {R}^{M^2}\) for the path matrix \(\mathbf {L}\), (iii) use an adaptive data consistency term \(\varphi _\text {d}^{(k)}\), and (iv) allow spatial filter weighting \(\mathbf {w}_i^{(k)}\in \mathbb {R}^{P}\). The resulting reconstruction network is defined in Algorithm 1 with tunable parameters \(\varTheta \), where each of the K variational layers contains \(N_f\) convolution matrices \(\mathbf {D}=\mathbf {D}(\mathbf {d})\) with \(N_c\times N_c\) kernels \(\mathbf {d}\) that are kept zero-centered and unit-norm via the re-parametrization \(\mathbf {d}=(\mathbf {d}'-\langle \mathbf {d}'\rangle )/\Vert \mathbf {d}'-\langle \mathbf {d}'\rangle \Vert _2\), where \(\langle \cdot \rangle \) denotes the mean value of a vector. Each filter \(\mathbf {D}\) is associated with a potential function \(\varphi _{\text {r}}\{.\}\) that is parametrized via cubic interpolation of control knots \({\varvec{\phi }}_\text {r}\in \mathbb {R}^{N_g}\) placed on a Cartesian grid on the \([-r, r]\) interval. Data term potentials \({\varphi }_\text {d}\{.\}\) are defined in the same way. The network is trained to minimize the \(\ell _1\)-norm of the reconstruction error on the training set \(\mathcal {T}\):

$$\begin{aligned} \min _\varTheta \mathop {\mathbb {E}}_{ \{\mathbf {b}, \mathbf {m}, \mathbf {x}^\star \} \in \mathcal {T}} \Vert \mathcal {V}(\mathbf {b}, \mathbf {m}; \,\varTheta ) - \mathbf {x}^\star \Vert _1. \end{aligned}$$
(3)
Algorithm 1. Variational reconstruction network \(\mathcal {V}(\mathbf {b}, \mathbf {m}; \varTheta )\).
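Since Algorithm 1 itself is not reproduced here, the following PyTorch sketch illustrates what one unrolled layer could look like based on the description above: a preconditioned gradient step on the data term, spatially weighted filter responses for the regularizer, and a momentum term. The exact update rule, the tanh stand-ins for the learned potentials \(\varphi _\text {d}\) and \(\varphi _\text {r}\) (which the paper parametrizes by cubic interpolation of control knots), the dense representation of \(\mathbf {L}\), and all shapes and initializations are assumptions for illustration, not the authors' exact layer.

```python
import torch
import torch.nn.functional as F

class VariationalLayer(torch.nn.Module):
    """Schematic sketch of one unrolled layer; the true Algorithm 1 may differ."""

    def __init__(self, n1, n2, M2, n_filters=50, kernel=5):
        super().__init__()
        self.n1, self.n2 = n1, n2
        self.d_raw = torch.nn.Parameter(torch.rand(n_filters, 1, kernel, kernel))
        self.w = torch.nn.Parameter(torch.rand(1, n_filters, n1, n2))   # spatial weights w_i
        self.p = torch.nn.Parameter(torch.rand(M2))                     # diagonal preconditioner p
        self.alpha = torch.nn.Parameter(torch.tensor(0.1))              # step size
        self.beta = torch.nn.Parameter(torch.tensor(0.5))               # momentum weight
        self.s_d = torch.nn.Parameter(torch.tensor(1.0))                # stand-in for phi_d
        self.s_r = torch.nn.Parameter(torch.rand(1, n_filters, 1, 1))   # stand-in for phi_r

    def kernels(self):
        d = self.d_raw - self.d_raw.mean(dim=(1, 2, 3), keepdim=True)   # zero-centered
        return d / d.flatten(1).norm(dim=1).view(-1, 1, 1, 1)           # unit-norm

    def forward(self, x, x_prev, b, m, L):
        # Data term: L^T diag(p) phi_d(diag(m) L x - b); phi_d replaced by a tanh here.
        res = m * (L @ x) - b
        grad_data = L.t() @ (self.p * torch.tanh(self.s_d * res))

        # Regularizer: sum_i D_i^T diag(w_i) phi_r(D_i x), with D_i realized as 5x5
        # convolutions and phi_r again replaced by a tanh stand-in.
        d = self.kernels()
        Dx = F.conv2d(x.view(1, 1, self.n1, self.n2), d, padding=2)     # filter responses
        act = self.w * torch.tanh(self.s_r * Dx)                        # spatially weighted
        grad_reg = F.conv_transpose2d(act, d, padding=2).view(-1)

        # Heavy-ball update: gradient step plus momentum on the previous iterate.
        x_next = x - self.alpha * (grad_data + grad_reg) + self.beta * (x - x_prev)
        return x_next, x
```

A full network would stack K such layers (e.g. in a torch.nn.ModuleList), passing each layer's output together with the previous iterate to the next layer, mirroring the unrolled scheme in Fig. 1b.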

Training. The dataset \(\mathcal {T}\) is generated using a fixed acquisition geometry with the reflector depth equal to the transducer array width. High-resolution (HR) 256 \(\times \) 256 synthetic inclusion masks are produced by applying a smooth deformation to an ellipse with random center, eccentricity, and radius. Two smooth slowness maps with random values from the [1/1650, 1/1350] interval are then blended with this inclusion mask, yielding a final slowness map \(\mathbf {x}^\star _\text {HR}\) (see Fig. 1c). The chosen range corresponds to observed SoS values for breast tissues of different densities and tumorous inclusions of different pathologies [4]. The forward path matrix \(\mathbf {L}_\text {HR}\) and a random incoherent undersampling mask \(\mathbf {m}\) are used to generate the noisy inverse ToF vector \(\mathbf {b}\) according to model (1) with \(\sigma _N = 2 \cdot 10^{-8}\). Finally, we downsample \(\mathbf {x}^\star _\text {HR}\) to \(n_1\times n_2\) size, yielding the ground truth map \(\mathbf {x}^\star \). About 10% of the maps do not contain inclusions. For each reconstruction problem the path matrix \(\mathbf {L}\) is normalized by its largest singular value, and the inverse ToFs are centered and scaled: \(\mathbf {b}' = \mathbf {b}- (\langle \mathbf {b}\rangle /\langle \mathbf {L}\mathbf {1}\rangle )\mathbf {L}\mathbf {1}\), \(\tilde{\mathbf {b}} = \mathbf {b}' / \text {std}(\mathbf {b}')\).
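A minimal sketch of this data-generation recipe is shown below. The paper only states that the deformation and the blended slowness maps are smooth, so the Gaussian-smoothed random displacement field, the smoothing of the two maps, and the size parameters are assumptions; the measurement normalization at the end follows the formulas given above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates, zoom

rng = np.random.default_rng(0)

def random_inclusion_mask(n=256, max_warp=20.0):
    """Ellipse with random center/eccentricity/radius, warped by a smooth deformation."""
    yy, xx = np.mgrid[0:n, 0:n].astype(float)
    cy, cx = rng.uniform(0.25 * n, 0.75 * n, size=2)
    ry, rx = rng.uniform(0.08 * n, 0.25 * n, size=2)
    ellipse = (((yy - cy) / ry) ** 2 + ((xx - cx) / rx) ** 2 < 1.0).astype(float)
    # Smooth random displacement field (the exact deformation model is not specified).
    dy = gaussian_filter(rng.standard_normal((n, n)), sigma=16) * max_warp
    dx = gaussian_filter(rng.standard_normal((n, n)), sigma=16) * max_warp
    return map_coordinates(ellipse, [yy + dy, xx + dx], order=1, mode="nearest")

def random_slowness_map(n=256, lo=1 / 1650, hi=1 / 1350):
    """Blend two smooth slowness maps through a warped-ellipse inclusion mask."""
    smooth = lambda: gaussian_filter(rng.uniform(lo, hi, size=(n, n)), sigma=32)
    bg, inc = smooth(), smooth()
    mask = random_inclusion_mask(n)
    return bg * (1 - mask) + inc * mask

x_hr = random_slowness_map()                    # x*_HR, 256 x 256
x_star = zoom(x_hr, 64 / 256, order=1)          # ground truth at n1 = n2 = 64

def normalize_measurements(b, L):
    """Center and scale inverse ToFs: b' = b - (<b>/<L 1>) L 1, then divide by std(b')."""
    ref = L @ np.ones(L.shape[1])
    b_centered = b - (b.mean() / ref.mean()) * ref
    return b_centered / b_centered.std()
```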

The configuration of the networks was the following: \(K=10\), \(N_f=50\), \(N_c=5\), \(n_1 = n_2=64\), \(N_g=55\). All parameters were initialized from \(\mathcal {U}(0, 1)\). We refer to this architecture as VNv4. Ablating the spatial filter weighting \(\mathbf {w}_i^{(k)}\) from VNv4, we get VNv3; additionally ablating the adaptive data potentials \(\varphi _\text {d}^{(k)}\), we get VNv2; further ablating the preconditioner \(\mathbf {p}^{(k)}\), we get VNv1; and finally unrolling GD without momentum, VNv0. To tune the aforementioned models we used \(10^5\) iterations of the Adam algorithm [9] with learning rate \(10^{-3}\) and batch size 25. Every 5000 iterations we readjust the potential functions' interval range r by setting it to the maximal observed value of the corresponding activation function argument.
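A hypothetical training loop matching the stated hyper-parameters could look as follows; here `model` is assumed to be a K-layer unrolled network like the sketch above and `loader` a data loader yielding batches of 25 tuples (b, m, x*) generated as in the previous sketch. The periodic readjustment of the potential range r is left as a placeholder, since it depends on how the potentials are implemented.

```python
from itertools import cycle, islice
import torch

def train(model, loader, n_iters=100_000, lr=1e-3, readjust_every=5000):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for it, (b, m, x_star) in enumerate(islice(cycle(loader), n_iters)):
        x_hat = model(b, m)                      # unrolled reconstruction V(b, m; Theta)
        loss = (x_hat - x_star).abs().mean()     # l1 reconstruction error, cf. eq. (3)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if (it + 1) % readjust_every == 0:
            pass  # re-set potential range r to the max observed activation argument
```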

3 Results

We compare TV and MA-TV against the VN architectures on (i) 200 samples from \(\mathcal {T}\) that were set aside and unseen during training, and (ii) a set \(\mathcal {P}\) of 14 geometric primitives depicted in Fig. 1c, using the following metrics:

$$\begin{aligned} \text { SAD}(\mathbf {x}, \mathbf {y})=\frac{\Vert \mathbf {x}- \mathbf {y}\Vert _1}{P}, \quad \text { CR} = \frac{2\left| \mu _\text {inc} - \mu _\text {bg} \right| }{\left| \mu _\text {inc}\right| + \left| \mu _\text {bg}\right| },\quad \text {CRf}=\frac{\text { estimated CR}}{\text { ground truth CR}}, \end{aligned}$$
(4)

where \(\mu _\text {inc}\) and \(\mu _\text {bg}\) are the mean values in the inclusion and background regions, respectively. The optimal regularization weight \(\lambda \) for the TV and MA-TV algorithms was tuned to give the best (lowest) SAD on the P3 image (see Fig. 2). Similarly to the training data generation, the forward model for the validation and test sets was computed on high-resolution images with normal noise and 30% undersampling.
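For reference, a direct NumPy transcription of the metrics in (4) might look as follows; the inclusion mask that defines \(\mu _\text {inc}\) and \(\mu _\text {bg}\) is assumed to be given as a boolean array.

```python
import numpy as np

def sad(x, y):
    """Sum of absolute differences per pixel: SAD = ||x - y||_1 / P."""
    return np.abs(x - y).mean()

def contrast_ratio(img, inc_mask):
    """CR = 2 |mu_inc - mu_bg| / (|mu_inc| + |mu_bg|) for a boolean inclusion mask."""
    mu_inc = img[inc_mask].mean()
    mu_bg = img[~inc_mask].mean()
    return 2 * abs(mu_inc - mu_bg) / (abs(mu_inc) + abs(mu_bg))

def cr_fraction(recon, gt, inc_mask):
    """CRf: contrast ratio of the reconstruction relative to the ground truth."""
    return contrast_ratio(recon, inc_mask) / contrast_ratio(gt, inc_mask)
```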

Quantitative evaluation on synthetic data is reported in Table 1 and shows that the proposed VNv4 network outperforms the conventional TV and MA-TV reconstruction methods both in terms of accuracy and contrast. Comparing the VN variants, it can be seen that richer architectures performed better. Figure 2 shows a qualitative evaluation of the reconstruction methods. VNv4 is able to reconstruct multiple inclusions (P5), handle smooth SoS variation (T1), and generally maintain inclusion position and geometry without hallucinating nonexistent inclusions. Although for some geometries (e.g. P4) the TV reconstruction has a lower SAD value, VNv4 provides better contrast, which allows the two inclusions to be separated. As expected from the limited-angle nature of the data, highly elongated inclusions that are parallel to the reflector either undergo axial geometric distortion (P1) or cannot be adequately reconstructed (T3) by any of the presented methods.

Fig. 2. Sound speed reconstructions of synthetic data from sets \(\mathcal {T}\), \(\mathcal {P}\) and a single natural image N1. Inclusions are delineated with red curves. For each image the transducer array is placed on top and the reflector on the bottom. (Color figure online)

Table 1. SoS reconstruction measures computed on 200 validation images from the training distribution \(\mathcal {T}\) and 14 test images from the set of geometric primitives \(\mathcal {P}\).
Fig. 3. Hand-held SoS mammography of the breast phantom. Stiff (red) and soft (green) inclusions were delineated in the B-mode image. (Color figure online)

Fig. 4. Live SoS imaging demonstration. (a) Experimental setup. (b) Sample outputs of B-mode and SoS video feedback. A non-echogenic stiff lesion is clearly delineated in the SoS image. (c) Computational benchmarks, also showing initialization and memory allocation times. After initialization, SoS reconstruction time is negligible compared to data transfer and reflector ToF measurement via dynamic programming [14].

Breast Phantom Experiment. We also compared the reconstruction methods using a realistic breast elastography phantom (Model 059, CIRS Inc.) that mimics glandular tissue with two lesions of different density. A portable ultrasound system (UF-760AG, Fukuda Denshi Inc., Tokyo, Japan) streams full-matrix RF ultrasound data over a high-bandwidth link to a dedicated PC, which performs the USCT reconstruction and outputs live SoS video feedback (cf. Fig. 4). We used an ultrasound probe (FUT-LA385-12P) with 128 piezoelectric transducer elements. For each frame, a total of 128 \(\times \) 128 RF lines are generated for all Tx/Rx combinations at an imaging center frequency of 5 MHz, digitized at 40.96 MHz. As seen in Fig. 3, VNv4 qualitatively outperforms both the TV and MA-TV methods, showing clearly distinguishable hard and soft lesions. The run-time of the MA-TV and TV algorithms on CPU is \(\sim \)30 s per image, while VN reconstruction takes \(\sim \)0.4 s on CPU and \(\sim \)0.01 s on GPU.

4 Discussion

In this paper we have proposed a deep variational reconstruction network for hand-held US sound-speed imaging. The method is able to reconstruct various inclusion geometries in both synthetic and phantom experiments. The VN demonstrated good generalization ability, which suggests that unrolling even more sophisticated numerical schemes may be possible. Improvements over conventional reconstruction algorithms are both qualitative and quantitative. The ability of the method to distinguish hard and soft inclusions has great diagnostic potential for characterizing lesions in real time.