1 Introduction

Anatomical landmark localisation is a key challenge for many medical image analysis tasks. Accurate landmark identification can be used for (a) extracting biometric measurements of anatomical structures, (b) landmark-based registration of 3D volumes, (c) extracting 2D clinical standard planes from 3D volumes and (d) initialisation of tasks such as image segmentation. However, manual landmark detection is time-consuming and suffers from high observer variability. Thus, there is a need for automatic methods for fast and accurate landmark localisation. Recently, deep learning approaches have been proposed for this purpose [1, 3, 5–8], but major challenges remain: (a) typically only a limited number of annotated medical images is available, (b) model training and inference for 3D medical images are computationally intensive, making real-time applications challenging, and (c) when multiple landmarks are detected jointly, their spatial relationships should be taken into account.

Fig. 1. Overall framework of PIN for single landmark localisation.

Related Work: Deep learning methods for landmark localisation can be divided into two categories. The first category adopts an end-to-end learning strategy in which the entire image is taken as input to a convolutional neural network (CNN) and the output is a map from which the landmark coordinates can be inferred directly. Payer et al. [5] and Laina et al. [3] output a heatmap in which Gaussians are located at the landmark positions. Xu et al. [6] train a supervised action classifier (SAC) that outputs an action map whose classification labels denote the direction towards the true landmark location. However, end-to-end learning methods are typically applied to 2D images, since landmark tasks on 3D volumes require networks with large receptive fields. Such 3D networks are computationally intensive, which inhibits real-time performance, and their training requires an amount of memory that exceeds the capabilities of current hardware.

The second category uses image patches as training samples to learn a classification or regression model. Zheng et al. [8] extract a patch around each voxel in the image and use a neural network to classify if a landmark is present at the patch centre. Zhang et al. [7] and Aubert et al. [1] use a CNN-based regression model that learns the association between an image patch and its 3D displacement to the true landmark. Ghesu et al. [2] propose a deep reinforcement learning (DRL) approach that also operates on patches. Most patch-based methods require dense sampling of many image patches during prediction which is computationally intensive. Furthermore, most methods require the training of separate models to detect each landmark. This is time-consuming and neglects the spatial relationships among multiple landmarks.

Contribution: In this paper, we propose a novel landmark localisation approach that uses a patch-based CNN to predict multiple landmarks efficiently in an iterative manner. We term this approach Patch-based Iterative Network (PIN). PIN has distinct advantages that address the key challenges of landmark localisation in 3D medical images: (1) During inference, PIN guides the patch towards the true landmark location using iterative sparse sampling. This approach reduces the computational cost by avoiding dense sampling at every voxel of the volume. (2) PIN uses a 2.5D representation to approximate the 3D patch as network input. This accelerates computation as only 2D convolutions are required. (3) PIN treats landmark localisation as a combined regression and classification problem for which a joint network is learned via multi-task learning. This prevents model overfitting, improves generalisation ability of the learned features and increases localisation accuracy. (4) PIN detects multiple landmarks jointly using a single model and takes the global anatomical spatial relationships among landmarks into account. We evaluate the landmark localisation accuracy of PIN using 3D ultrasound images of the fetal brain. In addition, clinically useful scan planes can be extracted from the predicted landmarks which visually resemble the anatomical standard planes as defined by fetal screening standards, e.g., [4].

2 Method

Overall Framework: Figure 1 illustrates the overall PIN framework for single landmark localisation. We show the 2D case for clarity but the method works similarly in 3D. Given an image, the goal is to predict the true landmark coordinates (red dot in Fig. 1). A position \(\varvec{x}_0\) is first initialised at instant \(t=0\) and a patch centred around \(\varvec{x}_0\) is extracted (solid green box in Fig. 1). The CNN takes the patch as input and predicts regression and classification outputs that are used to compute a new position \(\varvec{x}_{t+1}\) from the previous position \(\varvec{x}_t\), bringing the patch closer to the true landmark location. The patch at \(\varvec{x}_{t+1}\) (dashed green box in Fig. 1) is then given as input to the CNN and the process is repeated until the patch reaches the true landmark position.

Network Input: For 3D data, the CNN input can be a 3D volume patch. However, 3D convolution operations on volume patches are computationally expensive. To this end, we use a 2.5D representation to approximate the full 3D patch. Specifically, given a particular position \(\varvec{x}={(x,y,z)^{T}}\) in a volume V, we extract three 2D image patches centred around \(\varvec{x}\) at the three orthogonal planes (Fig. 2a). The patch extraction function is denoted as \(I(V,\varvec{x},s)\) where s is the length of the square patch. The three 2D patches are then concatenated together as a 3-channel 2D patch which is passed as input to the CNN. Such a representation is computationally efficient since it requires only 2D convolutions and still provides a good approximation of the full 3D volume patch.
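As a concrete illustration, the following NumPy sketch extracts the three orthogonal patches and stacks them into a 3-channel input; the function name and the border handling are our assumptions, not the authors' implementation.

```python
import numpy as np

def extract_patch_25d(volume: np.ndarray, x: int, y: int, z: int, s: int) -> np.ndarray:
    """Return an s x s x 3 stack of the three orthogonal planes centred at (x, y, z)."""
    r = s // 2  # s is assumed odd, e.g. s = 101
    # Assumes the point lies at least r voxels from every border; a real
    # implementation would pad the volume or clamp the indices.
    axial    = volume[x - r:x + r + 1, y - r:y + r + 1, z]  # xy-plane
    coronal  = volume[x - r:x + r + 1, y, z - r:z + r + 1]  # xz-plane
    sagittal = volume[x, y - r:y + r + 1, z - r:z + r + 1]  # yz-plane
    return np.stack([axial, coronal, sagittal], axis=-1)    # 3-channel 2D patch
```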

Joint Regression and Classification: PIN jointly predicts the magnitude and direction of movement of a current point towards the true landmark by combining a regression and a classification task together in a multi-task learning framework. This joint framework shares model parameters in the convolutional layers and is experimentally shown to learn more generalisable features, which improves overall performance.

The regression task estimates how much the point at the current position should move to get to the true landmark location. The regression output \(\varvec{d}={(d_1,d_2,\ldots ,d_{n_o})}^{T}\) is a displacement vector that predicts the relative distance between the current and true landmark positions. In single landmark localisation, \(\varvec{d}\) has \(n_o=3\) elements which give the displacement along each coordinate axis.

Fig. 2. (a) Patch extraction of a single landmark. (b) Patch extraction of multiple landmarks. (c) CNN architecture combining regression and classification. Output size of each layer is represented as \(\text {width}\,\times \,\text {height}\,\times \,(\# \text { feature maps})\). (d) Landmarks defined on the TV and TC planes for fetal sonographic examination.

The classification task estimates the direction of movement of the current point towards the true landmark by dividing direction into 6 discrete categories: positive and negative directions along each coordinate axis [6]. Denoting c as the classification label, we have \(c\in \{ { c }_{ 1 }^{ + },{ c }_{ 1 }^{ - },{ c }_{ 2 }^{ + },{ c }_{ 2 }^{ - },{ c }_{ 3 }^{ + },{ c }_{ 3 }^{ - }\} \). For instance, \({ c }_{ 1 }^{+}\) is the category representing movement along the positive x-axis. The classification output \(\varvec{P}\) is then a vector with \(2n_o=6\) elements, each representing the probability/confidence of movement in that direction. Mathematically, \(\varvec{P}=(P_{ { c }_{ 1 }^{ + } },P_{ { c }_{ 1 }^{ - } },\ldots ,P_{ { c }_{ { n }_{ o } }^{ + } },P_{ { c }_{ { n }_{ o } }^{ - } })^{T}\) where \(P_{ { c }_{ 1 }^{ + } }=\text {Prob}(c={ c }_{ 1 }^{ + })\).

Given a volume V and its ground truth landmark point \(\varvec{x}^{GT}\), a training sample is represented by \((I(V,\varvec{x},s), \varvec{d}^{GT}, \varvec{P}^{GT})\) where \(\varvec{x}\) is a point randomly sampled from V and \(I(V,\varvec{x},s)\) is its associated patch. The ground truth displacement vector is given by \(\varvec{d}^{GT}=\varvec{x}^{GT}-\varvec{x}\). To obtain \(\varvec{P}^{GT}\), we first determine the ground truth classification label \(c^{GT}\) by selecting the component of \(\varvec{d}^{GT}\) with the maximum absolute value and taking into account its sign,

$$\begin{aligned} c^{GT}={\left\{ \begin{array}{ll} {c}_{i}^{+},\qquad \text {if}\quad {d}_{i}^{GT}>0, \\ {c}_{i}^{-},\qquad \text {otherwise}, \end{array}\right. } \end{aligned}$$
(1)

where \(i=\text {argmax}(\text {abs}(\varvec{d}^{GT}))\). For a vector \(\varvec{a}\), \(\text {argmax}(\varvec{a})\) returns the index of the vector component with maximum value. During training, a hard classification label is used. As such, the probability vector \(\varvec{P}^{GT}\) is obtained as a one-hot vector where component \(P_{c^{GT}}\) is set to 1 and all others set to 0. The CNN is trained by minimising the following combined loss function:

$$\begin{aligned} L = (1-\alpha )\frac{ 1 }{ { n }_{ o }{ n }_{ batch } } \sum _{ n=1 }^{ n_{ batch } }{ { \left\| { \varvec{d} }_{ n }^{ GT }-{ \varvec{d} }_{ n } \right\| }_{ 2 }^{ 2 } } -\alpha \frac{ 1 }{ { n }_{ batch } } \sum _{ n=1 }^{ n_{ batch } }{ \log { ({ P }_{ { c }^{ GT },n } )} } \end{aligned}$$
(2)

The first term is the Euclidean loss of the regression task and the second term is the cross-entropy loss of the classification task. \(\alpha \) is the weighting between the two losses. \(n_{batch}\) is the number of training samples in a mini-batch. \({ \varvec{d} }_{ n }\) and \({ P }_{ { c }^{ GT },n }\) denote respectively the regression and classification outputs predicted by the CNN on the \(n\)-th sample.
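To make Eqs. 1 and 2 concrete, the sketch below builds the ground-truth targets for one sample and evaluates the combined loss on a mini-batch. The class ordering \((c_1^+, c_1^-, c_2^+, \ldots)\) and all function names are our assumptions.

```python
import numpy as np
import tensorflow as tf

def make_targets(x: np.ndarray, x_gt: np.ndarray):
    """Build the ground-truth displacement d^GT and one-hot P^GT (Eq. 1)."""
    d_gt = x_gt - x                                   # displacement to the landmark
    i = int(np.argmax(np.abs(d_gt)))                  # dominant axis
    c_gt = 2 * i + (0 if d_gt[i] > 0 else 1)          # axis i -> class 2i or 2i+1
    p_gt = np.zeros(2 * d_gt.size, dtype=np.float32)
    p_gt[c_gt] = 1.0                                  # hard (one-hot) label
    return d_gt.astype(np.float32), p_gt

def pin_loss(d_gt, d_pred, p_gt, logits, alpha=0.5, n_o=3):
    """Combined loss of Eq. 2: scaled Euclidean plus cross-entropy."""
    reg = tf.reduce_mean(tf.reduce_sum(tf.square(d_gt - d_pred), axis=1)) / n_o
    cls = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=p_gt, logits=logits))
    return (1.0 - alpha) * reg + alpha * cls
```

With a one-hot \(\varvec{P}^{GT}\), the cross-entropy term reduces to \(-\log P_{c^{GT}}\), matching the second term of Eq. 2.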

Fig. 3. Overall framework of PIN for multiple landmark localisation.

CNN Architecture: Figure 2c shows the PIN CNN architecture combining the classification and regression tasks. The network comprises five convolution (C) layers, each followed by a max-pooling (P) layer. These layers are shared by both tasks. After the \(5^{th}\) pooling layer, each task has three separate fully-connected (FC) layers to learn task-specific features. All convolution layers use \(3\times 3\) kernels with stride 1 and all pooling layers use \(2\times 2\) kernels with stride 2. ReLU activation is applied after all convolution and FC layers except for the last FC layer of each task. Dropout is added after each FC layer.
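A minimal tf.keras sketch of this architecture follows. The numbers of feature maps and FC units are specified in Fig. 2c, which we cannot reproduce here, so the values below (and the dropout rate) are placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_pin(s=101, n_o=3, filters=(32, 32, 64, 64, 128), fc_units=512):
    inp = layers.Input(shape=(s, s, 3))
    x = inp
    for f in filters:  # five shared conv + max-pool blocks
        x = layers.Conv2D(f, 3, strides=1, padding='same', activation='relu')(x)
        x = layers.MaxPool2D(2, strides=2)(x)
    x = layers.Flatten()(x)

    def head(h, n_out, name):  # three task-specific FC layers per task
        for _ in range(2):
            h = layers.Dense(fc_units, activation='relu')(h)
            h = layers.Dropout(0.5)(h)                # placeholder dropout rate
        return layers.Dense(n_out, name=name)(h)      # no ReLU on the last FC layer

    d = head(x, n_o, 'regression')                    # displacement output
    logits = head(x, 2 * n_o, 'classification')       # direction logits
    return tf.keras.Model(inp, [d, logits])
```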

PIN Inference: Given an unseen 3D volume, we initialise 19 points in the volume (one at the volume centre and 18 others at a fixed distance of one-quarter image size around it). The patch extracted at each point is forward-passed through the CNN and the point is moved to its new position based on the CNN outputs (\(\varvec{d}\) and \(\varvec{P}\)) and a chosen update rule (see the sketches below). This process is repeated for T iterations, until there is no significant change in the displacement of the point. The final positions of the 19 points at iteration T are averaged and taken as the final landmark prediction. Multiple initialisations average out errors and improve the overall localisation accuracy.
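The following sketch implements this inference loop, reusing extract_patch_25d from above and taking the update rule as a function (the rules are sketched after the next paragraph). The exact layout of the 18 surrounding points is not given in the text; a \(3\times 3\times 3\) stencil without its 8 corners (6 axis plus 12 edge directions) is one plausible reading.

```python
import itertools
import numpy as np

def init_points(shape, frac=0.25):
    """One point at the volume centre plus 18 points at one-quarter image size."""
    centre = np.array(shape) / 2.0
    pts = [centre]
    for off in itertools.product((-1, 0, 1), repeat=3):
        if 1 <= sum(abs(o) for o in off) <= 2:        # 6 + 12 = 18 directions
            pts.append(centre + frac * np.array(off) * np.array(shape))
    return np.stack(pts)                              # (19, 3) starting positions

def pin_infer(volume, predict_fn, update_rule, T=10, s=101):
    """Run T update iterations per initialisation and average the final points."""
    pts = init_points(volume.shape)
    for _ in range(T):
        for k, p in enumerate(pts):
            patch = extract_patch_25d(volume, *np.round(p).astype(int), s)
            d, prob = predict_fn(patch[None])         # CNN forward pass
            pts[k] = update_rule(p, d[0], prob[0])
    return pts.mean(axis=0)                           # final landmark prediction
```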

PIN Update Rules: We propose three update rules (A–C). Let \(\varvec{x}_{t}\) be the position of a point at iteration t and \(\varvec{x}_{t+1}\) the updated position. Rule A is based only on the classification output \(\varvec{P}\): it moves the current point by one pixel in the direction of the category with the highest predicted probability. Rule B is based only on the regression output \(\varvec{d}\) and is given by \(\varvec{x}_{t+1}=\varvec{x}_{t}+\varvec{d}\). Rule C uses both outputs and is given by \(\varvec{x}_{t+1}=\varvec{x}_{t}+{\varvec{P}_{max}}\odot \varvec{d}\), where \(\odot \) is the element-wise multiplication operator and \(\varvec{P}_{max}=(\max {(P_{{ c }_{ 1 }^{ + }}, P_{{ c }_{ 1 }^{ - }})}, \max {(P_{{ c }_{ 2 }^{ + }}, P_{{ c }_{ 2 }^{ - }})}, \ldots , \max {(P_{{ c }_{ n_o }^{ + }}, P_{{ c }_{ n_o }^{ - }})})^{T}\). Intuitively, Rule C moves the point by the amount specified by the regression output, weighted per axis by the confidence given by the classification output. This ensures smaller movement along less confident directions and vice versa.
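Hedged implementations of the three rules, matching the class ordering assumed earlier, so they can be passed directly to pin_infer above:

```python
import numpy as np

def rule_a(x, d, p):
    """Rule A: move one pixel along the most confident direction category."""
    i = int(np.argmax(p))                  # classes ordered (c1+, c1-, c2+, ...)
    step = np.zeros_like(x)
    step[i // 2] = 1.0 if i % 2 == 0 else -1.0
    return x + step

def rule_b(x, d, p):
    """Rule B: jump by the full predicted displacement."""
    return x + d

def rule_c(x, d, p):
    """Rule C: displacement weighted per axis by the classifier's confidence."""
    p_max = np.maximum(p[0::2], p[1::2])   # max(P_ci+, P_ci-) for each axis
    return x + p_max * d
```

For example, pin_infer(volume, predict_fn, rule_c) runs Rule C inference, where predict_fn wraps the CNN and returns the displacement and softmaxed probabilities.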

Multiple Landmarks Localisation: The above approach for single landmark localisation has two drawbacks: (1) a separate CNN model is required for each landmark, which increases the number of parameters significantly and thus the computational cost of training and inference; (2) predicting each landmark individually ignores the anatomical relationships between landmarks. To overcome these problems, we extend our approach to localise multiple landmarks simultaneously using a single CNN model that also accounts for inter-landmark relationships by working in a reduced dimensional space.

Let \(\varvec{X}={(x_1,y_1,z_1,\ldots ,x_{n_l},y_{n_l},z_{n_l})}^{T}\) be the 3D coordinates of all \(n_l\) landmarks of one volume. Given a training set of \(\varvec{X}\), we use PCA to transform \(\varvec{X}\) into a lower dimensional space. The transformations between the original and reduced dimensional spaces are given by,

$$\begin{aligned} \varvec{X}=\bar{ \varvec{X} } + \varvec{W}\varvec{b} \end{aligned}$$
(3)
$$\begin{aligned} \varvec{b}={\varvec{W}}^{T}(\varvec{X}-\bar{ \varvec{X} }), \end{aligned}$$
(4)

where \(\bar{\varvec{X} }\) is the mean of the training set, \(\varvec{b}\) is an \(n_b\)-element vector with \(n_{b}<3n_l\), and the columns of matrix \(\varvec{W}\) are the first \(n_{b}\) eigenvectors. In our case, \(n_l=10\) and we set \(n_b=15\), which explains 99.5% of the total variation in the training set.
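A NumPy sketch of fitting this PCA model from the training landmarks; the function name is illustrative.

```python
import numpy as np

def fit_landmark_pca(X_train, var_kept=0.995):
    """X_train: (n_subjects, 3*n_l) stacked landmark coordinates.
    Returns the mean X_bar and eigenvector matrix W of Eqs. 3 and 4."""
    X_bar = X_train.mean(axis=0)
    cov = np.cov(X_train - X_bar, rowvar=False)       # (3*n_l, 3*n_l) covariance
    evals, evecs = np.linalg.eigh(cov)                # ascending eigenvalues
    order = np.argsort(evals)[::-1]                   # re-sort descending
    evals, evecs = evals[order], evecs[:, order]
    ratio = np.cumsum(evals) / np.sum(evals)
    n_b = int(np.searchsorted(ratio, var_kept)) + 1   # e.g. n_b = 15 in our data
    return X_bar, evecs[:, :n_b]

# Eq. 4: b = W.T @ (X - X_bar);   Eq. 3: X = X_bar + W @ b
```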

We can directly apply our PIN approach in the reduced dimensional space by replacing all occurrences of \(\varvec{x}\) with \(\varvec{b}\). Figure 3 illustrates the PIN approach for multiple landmarks. Specifically, 3 orthogonal patches are extracted for every landmark and concatenated so that an \(s\times s\times 3n_l\) block is passed as CNN input (Fig. 2b). \(\varvec{d}\) becomes the displacement vector in the reduced dimensional space with \(n_o=n_b\) elements. The number of classification categories becomes \(2n_b\), comprising the positive and negative directions along each dimension of the reduced space; hence, \(\varvec{P}\) is a \(2n_b\)-element vector. Training is carried out as in Eq. 2, with the only difference that \(\varvec{d}^{GT}=\varvec{b}^{GT}-\varvec{b}\), where \(\varvec{b}^{GT}\) is obtained from \(\varvec{x}^{GT}\) using Eq. 4 and \(\varvec{b}\) is randomly sampled. During inference, we update \(\varvec{b}\) iteratively using \(\varvec{b}_{t+1}=\varvec{b}_{t}+{\varvec{P}_{max}}\odot \varvec{d}\) (Rule C) and use Eq. 3 to convert \(\varvec{b}_{t+1}\) back to \(\varvec{X}_{t+1}\) for patch extraction in the next iteration, as sketched below. We use multiple initialisations of \(\varvec{b}_0\) (one with \(\varvec{b}_0=\varvec{0}\) and five random initialisations) and take the mean of their results as the final landmark prediction.
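A sketch of one multi-landmark inference step in the reduced space, combining Eqs. 3 and 4 with Rule C; it reuses extract_patch_25d and the fitted X_bar and W from above, and the function name is our own.

```python
import numpy as np

def multi_landmark_step(b, volume, X_bar, W, predict_fn, s=101):
    """One Rule C update of the shape coefficients b."""
    X = X_bar + W @ b                                 # Eq. 3: back to coordinates
    pts = X.reshape(-1, 3)                            # (n_l, 3) landmark positions
    patches = [extract_patch_25d(volume, *np.round(p).astype(int), s)
               for p in pts]
    block = np.concatenate(patches, axis=-1)          # s x s x 3*n_l CNN input
    d, prob = predict_fn(block[None])                 # outputs live in b-space
    p_max = np.maximum(prob[0, 0::2], prob[0, 1::2])  # per-dimension confidence
    return b + p_max * d[0]                           # Rule C applied to b
```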

3 Experiments and Results

Data: PIN is evaluated on 3D ultrasound volumes of the fetal head from 72 subjects. Each volume is annotated by a clinical expert with 10 anatomical landmarks that lie on two standard planes (transventricular (TV) and transcerebellar (TC)) commonly used for fetal sonographic examination as defined in the UK FASP handbook [4] (Fig. 2d). 70% of the dataset is randomly selected for training and the remaining 30% is used for testing. All volumes are resampled to be isotropic and resized to \(324\times 207\times 279\) voxels with a voxel size of \(0.5\times 0.5\times 0.5\) mm\(^3\).

Experiment Setup: PIN is implemented in TensorFlow running on a machine with an Intel Xeon E5-1630 CPU at 3.70 GHz and an NVIDIA Titan Xp 12 GB GPU. The patch size s is set to 101. During training, we set \(n_{batch}=64\). Weights are initialised randomly from a distribution with zero mean and a standard deviation of 0.1. Optimisation is carried out for 100,000 iterations using the Adam algorithm with learning rate = 0.001, \(\beta _1=0.9\) and \(\beta _2=0.999\) (see the configuration sketch below). We choose \(\alpha =0.5\) empirically unless otherwise stated. During inference, \(T=350\) for Rule A and \(T=10\) for Rules B and C.
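For reference, the quoted hyperparameters expressed as a TensorFlow configuration; the choice of a normal distribution for the weight initialiser is our assumption, since the text only states the mean and standard deviation.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
initializer = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.1)  # distribution assumed normal
```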

Results: Table 1 compares the landmark localisation errors for a single landmark, the cavum septum pellucidum (CSP), across several PIN variants which differ in the CNN training losses and the inference update rule. Given the same inference rule, the model trained using both classification and regression losses (\(\alpha =0.5\)) achieves lower error than the models trained using either loss alone (\(\alpha =1\) or 0) (PIN1 vs PIN3, PIN2 vs PIN4). This illustrates the benefits of multi-task learning. Using the model trained with the joint loss, we then compare the effect of the different inference rules. PIN3 uses only the classification output, which can result in landmarks getting stuck and oscillating between two opposing classification categories during iterative testing (e.g., \(c_{1}^{+}\) and \(c_{1}^{-}\)). PIN3 also takes longer during inference since the patch moves by one pixel per iteration and thus requires more iterations to converge. PIN4 uses only the regression output, which improves the localisation accuracy and runtime as the patch ‘jumps’ towards the true landmark position at each iteration, requiring far fewer iterations to converge. PIN5 achieves the best localisation accuracy by combining the classification and regression outputs: the regression output gives the magnitude of movement, weighted by the classification output giving the probability of movement in each direction. Our proposed PIN approach also outperforms a recent state-of-the-art landmark localisation approach using DRL [2].

Table 1. Localisation error (mm) and runtime (s) of different approaches for single landmark (CSP) localisation. C and R denote classification and regression training loss respectively. Results presented as (Mean ± Standard Deviation).

Table 2 shows the localisation errors for all ten landmarks. PIN-Single trains a separate model for each landmark while PIN-Multiple trains one joint model that predicts all landmarks simultaneously. Since PIN-Multiple accounts for the anatomical relationships among landmarks, it has a lower overall localisation error than PIN-Single. PIN-Single needs a total of 0.94 s to predict all ten landmarks in sequence, while PIN-Multiple needs 0.44 s to predict all ten landmarks simultaneously. Figure 4 shows the TV and TC planes containing the ground truth landmarks as red dots; the landmarks predicted by PIN-Multiple are projected onto these planes as green dots. The supplementary materials provide a visual comparison of the standard planes obtained from ground truth and predicted landmarks, as well as videos showing several initialisations converging towards the true landmark positions (and standard planes) after ten inference updates.

Table 2. Localisation error (mm) of PIN for single and multiple landmark localisation. Results presented as (Mean ± Standard Deviation).
Fig. 4. Visualisation of landmarks predicted by PIN-Multiple (green dots) vs. ground truth landmarks (red dots).

4 Conclusion

We have presented PIN, a new approach for anatomical landmark localisation. Its patch-based and iterative nature enables training on limited data and fast prediction on large 3D volumes. A joint regression and classification model is trained by multi-task learning to improve localisation accuracy. PIN is capable of multiple landmark localisation and uses PCA to impose anatomical constraints among landmarks. PIN is generic and, as future work, we are extending it to other medical applications. It would also be worthwhile to replace PCA with an autoencoder to model non-linear correlations among landmarks.