1 Introduction

Reliable tracking of vascular structures or intravascular devices in dynamic X-ray images is essential for guidance during interventional procedures and postprocedural analysis [13, 8, 13, 14]. However, poor tissue contrast due to low radiation dose and the lack of depth information make detecting and tracking those curvilinear structures (CS) challenging. Traditional registration- and alignment-based trackers depend on local image intensity or gradient. Without high-level context information, they cannot efficiently discriminate a low-contrast target structure from a complex background. On the other hand, confounding irrelevant structures pose challenges for detection-based tracking. Recently, a new solution was proposed that exploits progress in multi-target tracking [2]. After initially detecting candidate points on a CS, the idea is to model CS tracking as a multi-dimensional assignment (MDA) problem and then apply a tensor approximation to search for a solution. The idea encodes high-order temporal information and hence gains robustness against local ambiguity. However, it lacks a mechanism to encode the structure prior in CS, and the random-forest features used in [2] lack discriminative power.

Fig. 1. Overview of the proposed method.

In this paper, we present a new method (see Fig. 1 for the flowchart) to detect and track CS in dynamic X-ray sequences. First, a convolutional neural network (CNN) is used to detect candidate landmarks on CS. A CNN automatically learns hierarchical representations of input images [6, 7] and has recently been used in medical image analysis (e.g. [9, 10]). With the detected CS candidates, CS tracking is converted to a multiple-target tracking problem and then to a multi-dimensional assignment (MDA) problem. In MDA, candidates are associated along motion trajectories across time, and the association is constructed according to the trajectory affinity. It has been shown in [11] that MDA can be efficiently solved via rank-1 tensor approximation (R1TA), in which the goal is to seek vectors that maximize the “joint projection” of an affinity tensor. Following a similar procedure, our solution adopts R1TA to estimate the CS motion. Specifically, a high-order tensor is first constructed from all trajectory candidates over a time span. Then, the model prior of CS is integrated into R1TA, encoding the spatial interaction between adjacent candidates in the model. Finally, CS tracking results are inferred from the model likelihood.

The main contribution of our work is twofold. (1) We propose a structure-aware tensor approximation framework for CS tracking that considers the spatial interaction between CS components. The combination of this spatial interaction and higher-order temporal information effectively reduces association ambiguity and hence improves tracking robustness. (2) We design a discriminative CNN detector for CS candidate detection. Compared with traditional hand-crafted features, the learned CNN features yield high detection quality in identifying CS in low-visibility dynamic X-ray images. As a result, the detector greatly reduces the number of hypothesis trajectories and improves tracking efficiency.

For evaluation, our method is tested on two sets of X-ray fluoroscopic sequences containing vascular structures and catheters, respectively. Our approach achieves a mean tracking error of 1.1 pixels on the vascular dataset and 0.8 pixels on the catheter dataset. Both results are clearly better than those of the other state-of-the-art solutions compared.

2 Candidate Detection with Hierarchical Features

Detecting CS in low-visibility dynamic X-ray images is challenging. Without color and depth information, CS closely resemble other anatomical structures and imaging noise. To address these problems, a four-layer CNN (Fig. 2) is designed to automatically learn hierarchical features for CS candidate detection. We employ 32 filters of size \(5 \times 5\) in the first convolution stage and 64 filters of the same size in the second stage. Max-pooling layers with a receptive window of \( 2 \times 2 \) pixels down-sample the feature maps. Finally, two fully connected layers are used as the classifier. Dropout is employed to reduce overfitting. The CNN framework used in our experiments is based on MatConvNet [12].
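To make the layer sizes concrete, the following sketch traces the spatial size of the feature maps through the two convolution stages for a \(28 \times 28\) input patch (the patch size used in the experiments). It is only an illustration under stated assumptions: unpadded convolutions with stride 1, and one \(2 \times 2\) max-pooling layer with stride 2 after each convolution stage. The text does not specify padding, strides, or the exact layer order, and the helper names `conv_out` and `feature_map_shapes` are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after sliding a square window (convolution or pooling)."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_map_shapes(input_size=28):
    """Trace (name, height, width, channels) through the assumed layer order."""
    shapes = []
    s = conv_out(input_size, kernel=5)      # stage 1: 32 filters of 5x5 (assumed unpadded)
    shapes.append(("conv1", s, s, 32))
    s = conv_out(s, kernel=2, stride=2)     # 2x2 max-pooling (assumed stride 2)
    shapes.append(("pool1", s, s, 32))
    s = conv_out(s, kernel=5)               # stage 2: 64 filters of 5x5 (assumed unpadded)
    shapes.append(("conv2", s, s, 64))
    s = conv_out(s, kernel=2, stride=2)     # 2x2 max-pooling (assumed stride 2)
    shapes.append(("pool2", s, s, 64))
    return shapes


shapes = feature_map_shapes()
# Number of features that would enter the first fully connected layer.
flattened = shapes[-1][1] * shapes[-1][2] * shapes[-1][3]
for name, h, w, ch in shapes:
    print(f"{name}: {h} x {w} x {ch}")
print("flattened features:", flattened)
```

Under these assumptions the spatial size shrinks as 28, 24, 12, 8, 4, so the classifier would see \(4 \times 4 \times 64\) features per patch. A different padding or stride choice changes these numbers.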

Fig. 2. The CNN architecture for CS candidate detection.

For each image in the sequence except the first one, whose groundtruth is annotated manually, a CS probability map is computed by the learned classifier. A threshold is set to eliminate most of the false alarms in the image. The resulting images are further processed by filtering and thinning. Typically, the binarized probability map is filtered by a distance mask that excludes locations too far from the model. Instead of using a groundtruth bounding box, we take the tracking results from previous image batches. Based on the previously tracked model, we calculate the speed and acceleration of the target to predict its position in the next image batch. Finally, after removing isolated pixels, CS candidates are generated from the thinning results. Examples of detection results are shown in Fig. 3. For comparison, probability maps obtained by a random forests classifier with hand-crafted features [2] are also shown. Our probability maps contain fewer false alarms, which leads to more accurate candidate locations after post-processing.
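The prediction and masking steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a unit time step between frames, a constant-acceleration extrapolation from the last three tracked positions of a model point, and a probability threshold of 0.5 (the text does not give the value). Thinning and isolated-pixel removal are omitted. The function names `predict_next` and `filter_candidates` are hypothetical.

```python
import math


def predict_next(p_prev2, p_prev1, p_curr):
    """Extrapolate the next position from three past positions (unit time step).

    Speed is the last displacement, acceleration the change in displacement;
    the predicted displacement is speed + acceleration (an assumed motion model).
    """
    v_prev = (p_prev1[0] - p_prev2[0], p_prev1[1] - p_prev2[1])
    v_curr = (p_curr[0] - p_prev1[0], p_curr[1] - p_prev1[1])
    acc = (v_curr[0] - v_prev[0], v_curr[1] - v_prev[1])
    return (p_curr[0] + v_curr[0] + acc[0], p_curr[1] + v_curr[1] + acc[1])


def filter_candidates(prob_map, predicted_model, prob_thresh=0.5, dist_thresh=60.0):
    """Keep pixels above the probability threshold that lie near the predicted model.

    prob_map is a list of rows; predicted_model is a list of (row, col) points.
    prob_thresh is a hypothetical value; 60 pixels is the distance threshold
    reported in the experiments.
    """
    kept = []
    for r, row in enumerate(prob_map):
        for c, p in enumerate(row):
            if p < prob_thresh:
                continue
            nearest = min(math.hypot(r - mr, c - mc) for mr, mc in predicted_model)
            if nearest <= dist_thresh:
                kept.append((r, c))
    return kept


# Toy usage with a 3x3 probability map and a one-point model.
toy_map = [[0.9, 0.2, 0.8],
           [0.1, 0.7, 0.0],
           [0.0, 0.0, 0.95]]
predicted = predict_next((0, 0), (1, 1), (3, 3))
candidates = filter_candidates(toy_map, [(0, 0)], prob_thresh=0.5, dist_thresh=1.5)
print(predicted, candidates)
```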

Fig. 3. Probability maps and detected candidates of a vessel (left) and a catheter (right). For each example, from left to right are the groundtruth, the random forests result, and the CNN result, respectively. Red indicates regions with high probability, while green dots show the resulting candidates.

3 Tracking with Model Prior

To encode the structure prior in a CS model, we use an energy maximization scheme that combines the temporal energy of individual candidates and the spatial interaction energy of multiple candidates into a unified optimization framework. Here, we consider the pairwise interactions of two candidates on neighboring frames. The assignment matrix between two consecutive sets \(O^{(k-1)}\) and \(O^{(k)}\) (i.e. detected candidate CS landmarks) can be written as \(\mathbf {X}^{(k)}=(x_{i_{k-1}i_k})^{(k)}\), where \(k=1,2,\dots ,K\), and \(\mathbf {o}_{i_k}^{(k)} \in O^{(k)}\) is the \(i_k\)-th landmark candidate of CS. For notational convenience, we use a single subscript \(j_k\) to represent the entry index (\(i_{k-1},i_k\)), such that \(x_{j_k}^{(k)} \doteq x_{i_{k-1}i_k}^{(k)}\), i.e., \(\mathrm {vec}(\mathbf {X}^{(k)})=(x_{j_k}^{(k)})\) for vectorized \(\mathbf {X}^{(k)}\). Then our objective function can be written as

$$\begin{aligned} f(\mathcal {X}) = \sum c_{j_1j_2 \dots j_K}x_{j_1}^{(1)}x_{j_2}^{(2)} \dots x_{j_K}^{(K)} + \sum _{k=1}^{K} \sum _{l_{k},j_{k}}w_{l_{k}j_{k}}^{(k)}e_{l_{k}j_{k}}^{(k)}x_{l_{k}}^{(k)}x_{j_{k}}^{(k)}, \end{aligned}$$
(1)

where \(c_{j_1j_2 \dots j_K}\) is the affinity measuring trajectory confidence; \(w_{l_{k}j_{k}}^{(k)}\) is the likelihood that candidates \(x_{j_k}^{(k)}\) and \(x_{l_k}^{(k)}\) are neighboring on the model; and \(e_{l_{k}j_{k}}^{(k)}\) is the spatial interaction of two candidates on two consecutive frames. The affinity is the product of two parts:

$$\begin{aligned} c_{i_0i_1,\dots i_K} = app_{i_0i_1,\dots i_K} \times kin_{i_0i_1,\dots i_K}, \end{aligned}$$
(2)

where \(app_{i_0i_1,\dots i_K}\) describes the appearance consistency of the trajectory, and \(kin_{i_0i_1,\dots i_K}\) the kinetic affinity modeling the higher order temporal affinity as detailed in [2].
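As a concrete reading of Eq. 1, the sketch below evaluates the objective for a toy case with \(K=2\) and two candidates per frame. Everything in it is illustrative: the row-major vectorization \(j = 2\,i_{k-1} + i_k\), the numeric values of \(c\), \(w\) and \(e\), and the helper names `zeros` and `objective` are assumptions, not values from the paper.

```python
def zeros(n, m):
    """Return an n-by-m list-of-lists filled with 0.0."""
    return [[0.0] * m for _ in range(n)]


def objective(c, x, w, e):
    """Evaluate Eq. 1 for K = 2.

    c[j1][j2]   : trajectory affinity of the entry pair (j1, j2)
    x[k][j]     : vectorized assignment variables of frame pair k
    w[k][l][j]  : model likelihood for entries l and j of frame pair k
    e[k][l][j]  : spatial interaction for entries l and j of frame pair k
    """
    temporal = sum(c[j1][j2] * x[0][j1] * x[1][j2]
                   for j1 in range(len(x[0]))
                   for j2 in range(len(x[1])))
    spatial = sum(w[k][l][j] * e[k][l][j] * x[k][l] * x[k][j]
                  for k in range(len(x))
                  for l in range(len(x[k]))
                  for j in range(len(x[k])))
    return temporal + spatial


# Two candidates per frame, so each vectorized 2x2 assignment has 4 entries.
# Entries 0 and 3 correspond to the identity assignment (0->0 and 1->1).
x = [[1, 0, 0, 1], [1, 0, 0, 1]]
c = zeros(4, 4)
c[0][0], c[3][3] = 0.9, 0.8           # hypothetical affinities of the two trajectories
w = [zeros(4, 4), zeros(4, 4)]
e = [zeros(4, 4), zeros(4, 4)]
for k in range(2):
    w[k][0][3] = w[k][3][0] = 1.0     # hypothetical: the two candidates are model neighbors
    e[k][0][3] = e[k][3][0] = 0.5     # hypothetical spatial interaction value
f_value = objective(c, x, w, e)
print(f_value)
```

The first sum rewards assignments that form confident trajectories, while the second adds a reward whenever two simultaneously selected assignments involve model neighbors that move consistently.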

Model Prior. CS candidates share two kinds of spatial constraints. First, the trajectories of two neighboring elements should have similar directions. Second, the relative order of two neighboring elements should not change, so that re-composition of CS is prohibited. Thus inspired, we formulate the spatial interaction of two candidates as

$$\begin{aligned} e_{l_kj_k} \doteq e_{m_{k-1}m_{k}i_{k-1}i_{k}} = E_{para} + E_{order}, \end{aligned}$$
(3)

where

$$ E_{para} = \frac{(\mathbf {o}_{i_{k-1}}^{(k-1)} - \mathbf {o}_{i_k}^{(k)})\cdot (\mathbf {o}_{m_{k-1}}^{(k-1)} - \mathbf {o}_{m_k}^{(k)}) }{\Vert (\mathbf {o}_{i_{k-1}}^{(k-1)} - \mathbf {o}_{i_k}^{(k)})\cdot (\mathbf {o}_{m_{k-1}}^{(k-1)} - \mathbf {o}_{m_k}^{(k)}) \Vert } ,~ E_{order} = \frac{(\mathbf {o}_{i_{k-1}}^{(k-1)} - \mathbf {o}_{m_{k-1}}^{(k-1)} )\cdot ( \mathbf {o}_{i_k}^{(k)}- \mathbf {o}_{m_k}^{(k)}) }{\Vert (\mathbf {o}_{i_{k-1}}^{(k-1)} - \mathbf {o}_{m_{k-1}}^{(k-1)} )\cdot ( \mathbf {o}_{i_k}^{(k)}- \mathbf {o}_{m_k}^{(k)}) \Vert }, $$

such that \(E_{para}\) models the angle between two neighboring trajectories, which also penalizes large distance change between them; and \(E_{order}\) models the relative order of two adjacent candidates by the inner product of the vectors between two neighboring candidates.
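The two terms of Eq. 3 can be illustrated with a small sketch. Note the assumption: each term is computed here as a cosine, i.e. the inner product divided by the product of the two vector norms, which is one plausible reading of the normalization and matches the description of \(E_{para}\) as modeling an angle. This reading is not confirmed by the text, and the names `_cos` and `spatial_interaction` are hypothetical.

```python
import math


def _cos(u, v):
    """Cosine of the angle between 2D vectors u and v (0.0 if either is zero)."""
    nu, nv = math.hypot(u[0], u[1]), math.hypot(v[0], v[1])
    if nu == 0 or nv == 0:
        return 0.0
    return (u[0] * v[0] + u[1] * v[1]) / (nu * nv)


def spatial_interaction(o_i_prev, o_i_curr, o_m_prev, o_m_curr):
    """Return E_para + E_order for candidates i and m on frames k-1 and k.

    Assumed reading: both terms are cosines of the angle between two vectors.
    """
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1])

    # E_para: do the two trajectories point in the same direction?
    e_para = _cos(sub(o_i_prev, o_i_curr), sub(o_m_prev, o_m_curr))
    # E_order: does the vector from m to i keep its direction across frames?
    e_order = _cos(sub(o_i_prev, o_m_prev), sub(o_i_curr, o_m_curr))
    return e_para + e_order


# Parallel motion with preserved order gives the maximum value of 2.
e_consistent = spatial_interaction((0, 0), (1, 0), (0, 1), (1, 1))
# Two candidates that swap places give the minimum value of -2.
e_swapped = spatial_interaction((0, 0), (0, 1), (0, 1), (0, 0))
print(e_consistent, e_swapped)
```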

Maximizing Eq. 1 is closely related to the rank-1 tensor approximation (R1TA) [4], which aims to approximate a tensor by the tensor product of unit vectors up to a scale factor. We relax the integer constraint on the assignment variables; once a real-valued solution of \(\mathbf {X}^{(k)}\) is obtained, it can be binarized using the Hungarian algorithm [5]. The key issue here is to accommodate the row/column \(\ell _1\) normalization of a general assignment problem, which differs from the commonly used \(\ell _2\) norm constraint in tensor factorization. We develop an approach similar to [11], which is a tensor power iteration solution with \(\ell _1\) row/column normalization.
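The sketch below illustrates only the two constraint-handling steps mentioned above, not the tensor power iteration of [11] itself. Row/column \(\ell _1\) normalization is shown as a simple alternating rescaling, which is an assumption about how such a constraint might be enforced. For binarization, an exhaustive search over permutations is swapped in for the Hungarian algorithm: it returns the same optimal one-to-one assignment on a small square matrix but does not scale. The names `l1_normalize` and `binarize` are hypothetical.

```python
from itertools import permutations


def l1_normalize(X, sweeps=50):
    """Alternately rescale rows and columns of a nonnegative matrix to sum to 1."""
    X = [row[:] for row in X]
    for _ in range(sweeps):
        for row in X:                              # row normalization
            s = sum(row)
            if s > 0:
                row[:] = [v / s for v in row]
        for col in range(len(X[0])):               # column normalization
            s = sum(row[col] for row in X)
            if s > 0:
                for row in X:
                    row[col] /= s
    return X


def binarize(X):
    """Best one-to-one assignment of a small square matrix by exhaustive search.

    Stand-in for the Hungarian algorithm: same optimum, factorial cost.
    """
    n = len(X)
    best = max(permutations(range(n)),
               key=lambda p: sum(X[i][p[i]] for i in range(n)))
    return [[1 if best[i] == j else 0 for j in range(n)] for i in range(n)]


relaxed = l1_normalize([[0.8, 0.3], [0.2, 0.9]])
hard = binarize(relaxed)
print(relaxed, hard)
```

In practice a polynomial-time assignment solver would replace `binarize`; the exhaustive version is used here only to keep the example free of external dependencies.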

Model Likelihood. Coefficient \(w_{l_{k}j_{k}}^{(k)} \doteq w_{m_{k-1}m_{k}i_{k-1}i_{k}}^{(k)}\) measures the likelihood that two candidates \(\mathbf {o}_{i_{k-1}}^{(k-1)}\) and \(\mathbf {o}_{m_{k-1}}^{(k-1)}\) are neighboring on the model. To obtain the association of each candidate pair in each frame, or in other words, to measure the likelihood that a candidate \(\mathbf {o}_{i_k}^{(k)}\) matches a model element \(\mathbf {o}_{i_0}^{(0)}\), we maintain a “soft assignment”. In particular, we use \(\theta _{i_0i_k}^{(k)}\) to indicate the likelihood that \(\mathbf {o}_{i_k}^{(k)}\) corresponds to \(\mathbf {o}_{i_0}^{(0)}\). It can be estimated by

$$\begin{aligned} \varTheta ^{(k)} = \varTheta ^{(k-1)} \mathbf {X}^{(k)}, k=1,2,\dots ,K, \end{aligned}$$
(4)

where \(\varTheta ^{(k)}=(\theta _{i_0i_k}^{(k)})\in \mathbb {R}^{I_0 \times I_k}\) and \(\varTheta ^{(0)}\) is fixed as the identity matrix.
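Eq. 4 is a chain of matrix products, which the following sketch makes explicit. The assignment matrices are made-up relaxed values chosen for illustration, and the helper names `identity`, `matmul` and `soft_assignments` are hypothetical.

```python
def identity(n):
    """n-by-n identity matrix as a list of lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def soft_assignments(X_list, n_model):
    """Apply Eq. 4: Theta^(k) = Theta^(k-1) X^(k), starting from the identity."""
    theta = identity(n_model)
    thetas = []
    for X in X_list:
        theta = matmul(theta, X)
        thetas.append(theta)
    return thetas


# Two model points; frame 1 has two candidates, frame 2 has three.
X1 = [[0.9, 0.1],
      [0.2, 0.8]]
X2 = [[1.0, 0.0, 0.0],
      [0.0, 0.5, 0.5]]
thetas = soft_assignments([X1, X2], n_model=2)
print(thetas[-1])
```

Each row of the final matrix gives, for one model point, its soft correspondence to every candidate in the latest frame, so the model-to-frame matching can be read off without retracing intermediate frames.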

The model likelihood is updated in each step of the power iteration. After the update of the first term in Eq. 1, a pre-likelihood \(\varTheta '^{(k)}\) is estimated for computing \(w_{l_{k}j_{k}}^{(k)}\). Since \(\varTheta ^{(k)}\) associates candidates directly with the model, the final tracking result, i.e. the matching between \(\mathbf {o}^{(0)}\) and \(\mathbf {o}^{(k)}\), can be derived from \(\varTheta ^{(k)}\).

With \(\varTheta '^{(k)}\), the approximate distance on the model between \(\mathbf {o}_{i_{k-1}}^{(k-1)}\) and \(\mathbf {o}_{m_{k-1}}^{(k-1)}\) can be calculated as follows:

$$\begin{aligned} d_{i_km_k}^{(k)} = \frac{\sum _{i_0} \Vert (\mathbf {o}_{i_{0}}^{(0)} - \mathbf {o}_{i_{0}+1}^{(0)} )\Vert \theta _{i_0i_k}^{(k)}\theta _{i_0+1m_k}^{(k)}}{\sum _{i_0}\theta _{i_0i_k}^{(k)}\theta _{i_0+1m_k}^{(k)}}. \end{aligned}$$
(5)

\(w_{l_{k}j_{k}}^{(k)}\) can then be calculated as

$$\begin{aligned} w_{l_{k}j_{k}}^{(k)} \doteq w_{m_{k-1}m_{k}i_{k-1}i_{k}}^{(k)} = \frac{2d_{i_{k-1}m_{k-1}}^{(k-1)}\bar{d}}{(d_{i_{k-1}m_{k-1}}^{(k-1)})^2 + (\bar{d})^2}, \end{aligned}$$
(6)

where \(\bar{d}\) is the average distance between two neighboring elements on the model \(O^{(0)}\). The proposed tracking method is summarized in Algorithm 1.
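Eqs. 5 and 6 can be checked numerically with a short sketch. The model points and likelihood matrix below are made up for illustration. Returning `None` from Eq. 5 when the denominator vanishes, and mapping that case to a likelihood of 0, is an assumption the text does not cover. The function names are hypothetical.

```python
import math


def model_distance(model, theta, i, m):
    """Eq. 5: likelihood-weighted length of consecutive model segments.

    model[i0] is the 2D position of model point i0; theta[i0][i] is the soft
    assignment between model point i0 and candidate i.
    Returns None when no model segment supports the pair (assumed handling).
    """
    num = den = 0.0
    for i0 in range(len(model) - 1):
        seg = math.hypot(model[i0][0] - model[i0 + 1][0],
                         model[i0][1] - model[i0 + 1][1])
        weight = theta[i0][i] * theta[i0 + 1][m]
        num += seg * weight
        den += weight
    return num / den if den > 0 else None


def neighbor_likelihood(d, d_bar):
    """Eq. 6: equals 1 when d == d_bar and decays as d departs from d_bar."""
    if d is None:            # assumed handling of an unsupported pair
        return 0.0
    return 2.0 * d * d_bar / (d * d + d_bar * d_bar)


model = [(0.0, 0.0), (0.0, 2.0), (0.0, 4.0)]
theta = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
d_bar = 2.0                                   # average neighbor spacing on this toy model
d_01 = model_distance(model, theta, 0, 1)     # candidates matching adjacent model points
w_01 = neighbor_likelihood(d_01, d_bar)
w_far = neighbor_likelihood(4.0, d_bar)       # a pair twice as far apart as the average
print(d_01, w_01, w_far)
```

The ratio in Eq. 6 is symmetric in \(d\) and \(\bar{d}\) and bounded by 1, so pairs whose estimated model distance is much larger or much smaller than the average spacing both receive a low neighbor likelihood.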

[Algorithm 1: the proposed tracking method]

4 Experiments

We evaluate the proposed CS tracking algorithm on two groups of clinical X-ray data collected from liver and cardiac interventions. The first group consists of six sequences of liver vessel images and the second of 11 sequences of catheter images, each with around 20 frames. The images have \(512 \times 512\) pixels and a physical resolution of 0.345 or 0.366 mm. The groundtruth of each image is manually annotated (Fig. 4(a)).

Vascular Structure Tracking. We first evaluate the proposed algorithm on the vascular sequences. The first frame of each sequence is used to generate training samples for the CNN. Specifically, 800 vascular structure patches and 1500 negative patches are generated from each image. From the six images, a total of \(2300 \times 6 = 13,800\) samples are extracted and split into 75% training and 25% validation. All patches have the same size of \(28 \times 28\) pixels. The distance threshold of the predictive bounding box is set to 60 pixels to allow enough error tolerance. After filtering, around 200 vascular structure candidates are left in each frame. The number of points on the model is around 50 for each sequence.

In our work, \(K=3\) is used so that every four consecutive frames are associated. During tracking, the tensor kernel takes around 10 s and 100 MB (peak value) of RAM to process one frame with 200 candidates in our setting, running on a single Intel Xeon 2.3 GHz core. The tracking error is defined as the shortest distance between tracked pixels and the groundtruth annotation. For each performance metric, we compute its mean and standard deviation. For comparison, the registration-based (RG) approach [14], bipartite graph matching (BM) [2], and the pure tensor-based method (TB) [2] are applied to the same sequences. For BM and TB, the same tracking algorithms combined with the CNN detector are also tested and reported. The first block of Fig. 4 illustrates the tracking results for vascular structures. A B-spline is used to connect all tracked candidates to represent the tracked vascular structure. A zoomed-in view of a selected region (blue rectangle) in each tracking result is presented below it, where portions with large errors are colored red. Quantitative evaluation for each sequence is listed in Table 1.
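The error metric defined above can be sketched as follows. The sketch treats the groundtruth annotation as a discrete set of points, uses the population standard deviation, and converts pixels to millimetres by assuming the stated physical resolution is a per-pixel spacing. These are all assumptions, the coordinates are made up, and the names `tracking_errors` and `summarize` are hypothetical.

```python
import math
from statistics import mean, pstdev


def tracking_errors(tracked, groundtruth):
    """Shortest distance from each tracked point to the groundtruth point set."""
    return [min(math.hypot(tx - gx, ty - gy) for gx, gy in groundtruth)
            for tx, ty in tracked]


def summarize(errors, mm_per_pixel=0.345):
    """Mean and standard deviation in pixels, plus the mean in millimetres.

    mm_per_pixel is assumed to be the per-pixel spacing (0.345 or 0.366 mm).
    """
    mu = mean(errors)
    return {"mean_px": mu, "std_px": pstdev(errors), "mean_mm": mu * mm_per_pixel}


# Made-up coordinates for illustration only.
tracked = [(0.0, 0.0), (3.0, 4.0)]
groundtruth = [(0.0, 0.0), (3.0, 1.0)]
errors = tracking_errors(tracked, groundtruth)
summary = summarize(errors)
print(errors, summary)
```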

Table 1. Curvilinear structure tracking errors (in pixels)

Fig. 4. Curvilinear structure tracking results. (a) groundtruth, (b) registration, (c) bipartite matching, (d) tensor based, and (e) proposed method. Red indicates regions with large errors, while green indicates small errors.

Catheter Tracking. Similar procedures and parameters are applied to the 11 sequences of catheter images. The second block of Fig. 4 shows examples of catheter tracking results. The numerical comparisons are listed in Table 1.

The results show that our method clearly outperforms the other three approaches. Candidates in our approach are detected by a highly accurate CNN detector, which ensures that most extracted candidates lie on CS, whereas the registration-based method depends on the first frame as a reference to identify targets. Our approach also performs better than bipartite graph matching, where \(K=1\). The reason is that our proposed method incorporates higher-order temporal information from multiple frames; by contrast, bipartite matching is computed from only two frames. Compared with the pure tensor-based algorithm, the proposed method incorporates the model prior, which provides more powerful cues for tracking the whole CS. As the zoomed-in views confirm, with the model prior our proposed method is less affected by neighboring confounding structures.

5 Conclusion

We presented a new method that combines hierarchical features learned by a CNN with an encoded model prior to estimate the motion of CS in X-ray image sequences. Experiments on two groups of CS demonstrate the effectiveness of the proposed approach. With a tracking error of around one pixel (less than 0.5 mm), it clearly outperforms the other state-of-the-art algorithms compared. For future work, we plan to adopt a pyramid detection strategy to accelerate the pixel-wise probability map calculation in our current approach.