1 Introduction

Active Appearance Models (AAMs) [2] and Constrained Local Models (CLMs; see footnote 1) [8] are widely used in medical image analysis for robust model-based segmentation (see [1, 4, 6] for examples). Both approaches rely on the classical point- and PCA-based statistical shape modeling framework [3] and represent the shape space observed in a training population by a mean shape and a variability model. In addition to this global shape prior, AAMs also learn a global statistical model of the typical (gray value) appearance/texture of those objects. During the segmentation process, an AAM, therefore, not only provides global a priori information about plausible shape instances but also about global appearance variations. These properties are favourable for robust segmentation, but lead to complicated exact or heuristic fitting algorithms [2, 7] and to the limited generalization abilities of AAMs in case of small training populations [12]. CLMs, therefore, nowadays typically replace the global appearance model with (independent) local detectors for each landmark (e.g., random forests in [6]) based on small image patches. During model fitting, the shape model is used to globally regularize the local detector responses to guarantee plausibility of the resulting shape. This leads to efficient optimization strategies, but ambiguities cannot always be resolved, which negatively impacts the segmentation performance [8].

Methodological advances and the availability of large annotated training data sets have led to a recent successful revival of AAM-based face tracking methods in computer vision (CV; see [12] for an overview). The main goal of this work is to show that this recent success in CV can be transferred to medical problems when those approaches are adequately adapted and the lack of training data is accounted for. Therefore, the contributions of this paper are threefold: (1) We adapt the recent patch-based facial shape and appearance modeling approach from [11] and the Fast-SIC fitting algorithm from [12] to multi-object segmentation in medical image data. This approach elegantly combines strengths of AAMs (global shape and appearance models) and CLMs (efficient optimization, use of small patches). (2) We incorporate our recent approach for learning of representative statistical shape models from few training data [13] into this framework and extend it to patch-based appearance modeling. (3) We show that this novel combination of methods leads to competitive results on publicly available chest radiographs and cardiac MRI data and outperforms traditional AAM methods.

2 Methods

Although our methods are independent of the image/data dimensionality (2D or 3D), we describe them in a 2D scenario for ease of understanding. We start with a description of our patch-based AAM framework in Sect. 2.1, which is followed by an explanation of the method we adapted and extended for learning representative AAMs from small training populations in Sect. 2.2. See Fig. 1 for a graphical overview of the proposed framework.

Fig. 1. Graphical overview of the proposed framework for patch-based active appearance modeling with few training samples. See text for details.

2.1 Patch-based AAM Framework: Definition and Optimization

In the following, we assume a set \(\{\mathbf {s}_i\}_{i=1}^{N_S}\) of \(N_S\) 2D training shapes/contours \(\mathbf {s}_i\) and a set \(\{I_i\}_{i=1}^{N_S}\) of corresponding images \(I_i:\mathbb {R}^{2} \rightarrow \mathbb {R}\) to be given. Each contour is defined by \(N_M\) points \(\mathbf {s}_i=[x_{i,1},y_{i,1},\ldots ,x_{i,N_M},y_{i,N_M}]^T\in \mathbb {R}^{2N_M}\) distributed along it. We also assume that the landmarks are in correspondence across the training samples and that shape differences due to similarity transformations have been removed from the data (contours and images).

According to [12] an AAM consists of a statistical shape model (SSM), a statistical appearance model (SAM), and a suitable motion/transformation model used to map shape-free textures onto a new shape instance. For our patch-based AAM framework, which largely follows [11], we start by defining the SSM as a standard point distribution model [3]:

$$\begin{aligned} \mathbf {s}(\mathbf {b})=\overline{\mathbf {s}}+\mathbf {Pb}\ ,\ \text {with}\ \overline{\mathbf {s}}=\frac{1}{N_S}\sum _{i=1}^{N_S}\mathbf {s}_i\ . \end{aligned}$$
(1)

Here, \(\mathbf {P}\in \mathbb {R}^{2N_M\times N_b}\) denotes an orthonormal matrix whose \(N_b\) columns are eigenvectors of the covariance matrix \(\mathbf {C}_s=\frac{1}{N_S}\sum _{i=1}^{N_S}(\mathbf {s}_i-\overline{\mathbf {s}})(\mathbf {s}_i-\overline{\mathbf {s}})^T\), and which compactly represents the subspace of plausible shapes. New shapes can be generated from Eq. (1) by varying the parameter vector \(\mathbf {b}\in \mathbb {R}^{N_b}\). As proposed in [7], \(\mathbf {P}\) also includes 4 orthonormal vectors to describe similarity transforms of the shapes generated (included in \(N_b\)).
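As a minimal sketch of Eq. (1), the following NumPy code builds a point distribution model from aligned shape vectors. The function names `build_shape_model` and `generate_shape` are hypothetical, and the sketch omits the 4 additional similarity-transform vectors mentioned above; it only illustrates the PCA step under these assumptions.

```python
import numpy as np

def build_shape_model(shapes, n_modes):
    """Build a PCA shape model from aligned shapes.

    shapes: array of shape (N_S, 2*N_M), one shape vector per row.
    Returns the mean shape, the basis P (2*N_M, n_modes) and the eigenvalues.
    """
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    # The right singular vectors of the centered data are the eigenvectors
    # of C_s = (1/N_S) * centered^T centered.
    _, sing, vt = np.linalg.svd(centered, full_matrices=False)
    eigvals = sing**2 / shapes.shape[0]
    P = vt[:n_modes].T
    return mean, P, eigvals[:n_modes]

def generate_shape(mean, P, b):
    """Evaluate s(b) = mean + P b, cf. Eq. (1)."""
    return mean + P @ b
```

Using the SVD of the centered data avoids forming the \(2N_M\times 2N_M\) covariance matrix explicitly, which is convenient when \(N_S \ll 2N_M\).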

To build a SAM, we first define that we are only interested in modeling the object appearance at small square patches centered around each landmark. Furthermore, each patch defines an \(N_e \times N_e\) regular grid with \(N_p=N_e^2\) sampling locations. For model training, appearance information for each image \(I_i\) is obtained by simply sampling the image information at patches placed at the corresponding landmark locations. For statistical analysis, the appearance information of each image is concatenated to form an appearance vector \(\mathbf {a}_i\in \mathbb {R}^{N_a}\), with \(N_a=N_MN_pN_f\), where \(N_f\) denotes the number of features extracted (gray values, descriptors, ...) at each sampling location (e.g., for raw gray values \(N_f=1\)). After applying an eigenvalue decomposition to the covariance matrix \(\mathbf {C}_a=\frac{1}{N_S}\sum _{i=1}^{N_S}(\mathbf {a}_i-\overline{\mathbf {a}})(\mathbf {a}_i-\overline{\mathbf {a}})^T\), we end up with a SAM similar to Eq. (1):

$$\begin{aligned} \mathbf {a}(\mathbf {c})=\overline{\mathbf {a}}+\mathbf {Qc}\ ,\ \text {with}\ \overline{\mathbf {a}}=\frac{1}{N_S}\sum _{i=1}^{N_S}\mathbf {a}_i\ ,\ \mathbf {Q}\in \mathbb {R}^{N_a\times N_c}\ ,\ \text {and}\ \mathbf {c}\in \mathbb {R}^{N_c}\ . \end{aligned}$$
(2)

The SSM in Eq. (1) and the SAM in Eq. (2) define our patch-based AAM (Patch-AAM). Please note that for computational efficiency, we refrain from explicitly coupling both models as done, e.g., in [2], and that our Patch-AAM implicitly defines a simple translational motion model instead of the complicated, traditional piecewise-affine warp [7]. Instances of Eqs. (1) and (2) are simply combined by translating the generated patches \(\mathbf {a}(\mathbf {c})\) to the landmark locations given by \(\mathbf {s}(\mathbf {b})\). Global scale changes and rotations can be handled by applying the associated similarity transform to the patches/image. Moreover, multiple objects can be easily handled by merging the landmarks/appearance information of all objects into one vector.
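The patch sampling that produces an appearance vector can be sketched as follows for raw gray values (\(N_f=1\)). The function name `sample_patches`, the nearest-neighbor sampling, and the clamping at the image border are assumptions made for illustration only; the original implementation may interpolate or treat borders differently.

```python
import numpy as np

def sample_patches(image, shape, n_e):
    """Sample n_e x n_e gray-value patches centered at each landmark.

    image: 2D array indexed as [row (y), column (x)].
    shape: vector [x_1, y_1, ..., x_M, y_M].
    Returns the concatenated appearance vector of length N_M * n_e**2.
    """
    pts = np.asarray(shape, dtype=float).reshape(-1, 2)
    offsets = np.arange(n_e) - (n_e - 1) / 2.0
    dy, dx = np.meshgrid(offsets, offsets, indexing="ij")
    patches = []
    for x, y in pts:
        # Nearest-neighbor lookup; coordinates are clamped to the image (assumption).
        rows = np.clip(np.rint(y + dy).astype(int), 0, image.shape[0] - 1)
        cols = np.clip(np.rint(x + dx).astype(int), 0, image.shape[1] - 1)
        patches.append(image[rows, cols].ravel())
    return np.concatenate(patches)
```

Because the patches are only translated to the landmark positions, no warp has to be computed, which is the simplification the translational motion model provides.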

Fitting the Patch-AAM to an unseen image \(I:\mathbb {R}^{2} \rightarrow \mathbb {R}\) is now (with a slight abuse of notation and a patch-sampling/vectorization function \(\phi (\cdot ,\cdot )\)) defined as a joint non-linear least-squares problem:

$$\begin{aligned} \underset{\mathbf {b},\mathbf {c}}{\arg \min }\Vert \phi (I,\mathbf {s}(\mathbf {b}))-\mathbf {a}(\mathbf {c})\Vert ^2\ . \end{aligned}$$
(3)

Parameters \(\mathbf {b}\) and \(\mathbf {c}\) that optimally explain the image content in a least-squares sense are sought. Optimizing Eq. (3) is hard due to the non-linearity in \(\mathbf {b}\). We follow the computationally efficient Gauss-Newton-like Fast-SIC optimization strategy presented in [12] to iteratively minimize Eq. (3). Fast-SIC was chosen due to its demonstrated ability to produce state-of-the-art results in CV [12].

After linearizing Eq. (3) with respect to the model and omitting second-order terms, we arrive at

$$\begin{aligned} \underset{\varDelta \mathbf {b},\varDelta \mathbf {c}}{\arg \min }\Vert \phi (I,\mathbf {s}(\mathbf {b}))-\mathbf {a}(\mathbf {c})-\mathbf {Q}\varDelta \mathbf {c}-\mathbf {J}\varDelta \mathbf {b}\Vert ^2 \end{aligned}$$
(4)

to compute updates \(\varDelta \mathbf {b}\) and \(\varDelta \mathbf {c}\), where \(\mathbf {J}\in \mathbb {R}^{N_a\times N_b}\) is the Jacobian of \(\mathbf {Q}\) with respect to \(\mathbf {s}(\mathbf {0})\) (see [11] for details). With \(\mathbf {J}_{\text {F}}=\mathbf {J}-\mathbf {QQ}^T\mathbf {J}\), closed-form solutions \(\varDelta \mathbf {b}=(\mathbf {J}_{\text {F}}^T\mathbf {J}_{\text {F}})^{-1}\mathbf {J}_{\text {F}}^T(\phi (I,\mathbf {s}(\mathbf {b}))-\overline{\mathbf {a}})\) and \(\varDelta \mathbf {c}=\mathbf {Q}^T(\phi (I,\mathbf {s}(\mathbf {b}))-\mathbf {a}(\mathbf {c})-\mathbf {J}\varDelta \mathbf {b})\) for both updates can be obtained in an alternating fashion. Because of the simple translational motion model, the shape and appearance parameters can finally be updated by \(\mathbf {b}\leftarrow \mathbf {b}-\varDelta \mathbf {b}\) and \(\mathbf {c}\leftarrow \mathbf {c}+\varDelta \mathbf {c}\).
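The closed-form updates for Eq. (4) can be sketched as a single linear-algebra step. The function name `fast_sic_step` is hypothetical, and \(\mathbf {J}\) is assumed to be given. The sketch takes the residual \(\phi (I,\mathbf {s}(\mathbf {b}))-\mathbf {a}(\mathbf {c})\) as input; since \(\mathbf {J}_{\text {F}}^T\mathbf {Q}=\mathbf {0}\) for a \(\mathbf {Q}\) with orthonormal columns, this yields the same \(\varDelta \mathbf {b}\) as using \(\phi (I,\mathbf {s}(\mathbf {b}))-\overline{\mathbf {a}}\).

```python
import numpy as np

def fast_sic_step(residual, J, Q):
    """Solve the linearized problem of Eq. (4) for one iteration.

    residual: phi(I, s(b)) - a(c), shape (N_a,)
    J: Jacobian of shape (N_a, N_b)
    Q: appearance basis of shape (N_a, N_c) with orthonormal columns
    """
    # Project the appearance subspace out of the Jacobian: J_F = J - Q Q^T J.
    J_F = J - Q @ (Q.T @ J)
    # Least-squares solve, equivalent to (J_F^T J_F)^{-1} J_F^T residual
    # when J_F has full column rank, but without an explicit inverse.
    delta_b, *_ = np.linalg.lstsq(J_F, residual, rcond=None)
    delta_c = Q.T @ (residual - J @ delta_b)
    return delta_b, delta_c
```

The two updates obtained this way coincide with the joint least-squares minimizer of Eq. (4) over \(\varDelta \mathbf {b}\) and \(\varDelta \mathbf {c}\), while only a system of size \(N_b\) has to be solved.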

2.2 Building Representative Patch-AAMs from Few Training Samples

The generalization capabilities and the segmentation performance of the Patch-AAMs presented in Sect. 2.1 will be mainly influenced by the quality and quantity of the training samples used to build the models. In medical image analysis, building AAMs often results in high-dimension-low-sample-size (HDLSS) problems, because the number of training samples is typically much smaller than the dimensionality of the data to be modeled (e.g., \(N_S \ll 2N_M\) and \(N_S\ll N_a\)). In practice, this limits the dimension of the subspaces defined by \(\mathbf {P}\) and \(\mathbf {Q}\) to \(N_S-1\), the rank of \(\mathbf {C}_s/\mathbf {C}_a\). It is unlikely that for very small sample sizes (e.g., \(N_S=10\)), the space of plausible shapes/appearances is appropriately approximated by such low-dimensional subspaces. We, therefore, aim to extend the dimension of those subspaces to \(N_b>N_S-1\) and \(N_c>N_S-1\) in a plausible way to improve the generalization ability of the Patch-AAM in HDLSS scenarios.
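The rank limitation can be checked numerically. The following sketch uses synthetic random data (an assumption made purely for illustration, not data from the experiments) with \(N_S=10\) samples in a 60-dimensional space.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, dim = 10, 60              # N_S = 10 samples, 2*N_M = 60 coordinates
shapes = rng.normal(size=(n_samples, dim))
centered = shapes - shapes.mean(axis=0)
cov = centered.T @ centered / n_samples

# Centering removes one degree of freedom, so the rank is at most N_S - 1.
rank = np.linalg.matrix_rank(cov)
print(rank, "<=", n_samples - 1)
```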

Here, we utilize and extend the recent approach from [13] for building representative SSMs from few data, which is based on manipulations of the covariance matrices and has several major advantages: It generates a single consistent shape model with global and local variability, seamlessly integrates with existing SSM-based frameworks, naturally handles multi-object scenarios, and was shown to outperform competing methods in [13].

In [13] it is assumed that in HDLSS scenarios some covariances (esp. those between distant points) in \(\mathbf {C}_s\) are overestimated. Therefore, a principle of locality (= interaction between distant landmarks is limited) is applied by defining a distance measure \(d(s_i,s_j)\) on the set of landmarks and a cascade of thresholds \(\tau _1>\tau _2>\ldots >\tau _{N_l}\). For each \(\tau \), a manipulated covariance matrix \(\mathbf {C}_{s_\tau }\) is computed by enforcing the correlation between landmarks with \(d(s_i,s_j)>\tau \) to be 0. The eigenvectors of \(\mathbf {C}_{s_\tau }\) define a subspace \(span(\mathbf {P}_\tau )\subset \mathbb {R}^{2N_M}\) of dimension \(N_{b_\tau }\ge N_S-1\) because \(rank(\mathbf {C}_{s_\tau })\ge rank(\mathbf {C}_{s})\), where the exact value of \(N_{b_\tau }\) depends on \(\tau \). Now, starting with \(\mathbf {P}^1_*=\mathbf {P}_{\tau _1}\), the \(N_l\) subspaces are combined into a single multi-level shape model by successively searching for orthonormal bases \(\mathbf {P}^{l}_*\) of increasing locality in a way that preserves global information [13]:

$$\begin{aligned} \mathbf {P}^{l}_*=\underset{\mathbf {P}}{\arg \min }\ d_{\mathcal {G}}(\mathbf {P},\mathbf {P}_{\tau _l})\quad \text {s.t. }span(\mathbf {P}^{l-1}_*)\subseteq span(\mathbf {P})\ . \end{aligned}$$
(5)

Here, \(d_{\mathcal {G}}(\cdot ,\cdot )\) is a geodesic distance between subspaces and \(\mathbf {P}^{l}_*\) can be efficiently computed with an algorithm given in [13]. Finally, the orthonormal basis \(\mathbf {P}_*=\mathbf {P}^{N_l}_*\), which represents global and local variability, can be plugged into Eq. (1).
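The covariance manipulation for a single threshold \(\tau \) can be illustrated with a simplified sketch. Several assumptions are made here: the Euclidean distance between reference landmark positions stands in for \(d(\cdot ,\cdot )\), the function names are hypothetical, and eigenvalues that are not clearly positive are simply discarded, because plain masking does not guarantee a positive semidefinite matrix. This is not necessarily how [13] handles that issue, and the subspace combination of Eq. (5) is not covered.

```python
import numpy as np

def localized_covariance(cov, landmarks, tau):
    """Zero all covariances between landmarks that are farther apart than tau.

    cov: (2*N_M, 2*N_M) covariance of shape vectors [x_1, y_1, ..., x_M, y_M].
    landmarks: (N_M, 2) reference positions, e.g. taken from the mean shape.
    """
    diff = landmarks[:, None, :] - landmarks[None, :, :]
    dist = np.linalg.norm(diff, axis=2)        # Euclidean stand-in for d(.,.)
    keep = (dist <= tau).astype(float)         # (N_M, N_M) landmark-level mask
    mask = np.kron(keep, np.ones((2, 2)))      # expand to x/y coordinate pairs
    return cov * mask

def subspace_basis(cov, tol=1e-10):
    """Return eigenvectors with clearly positive eigenvalues, largest first."""
    vals, vecs = np.linalg.eigh((cov + cov.T) / 2)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    sel = vals > tol * vals.max()
    return vecs[:, sel], vals[sel]
```

For a small \(\tau \), the masked matrix becomes nearly block diagonal, and its rank typically exceeds \(N_S-1\), which is the effect exploited to enlarge the subspace.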

In [13], this locality-based approach is only defined/used to enhance SSMs. We, however, strongly believe that in an AAM framework, the fitting algorithm can only make full use of the additional flexibility of \(\mathbf {P}_{*}\) when the SAM is enhanced in a similar way. We, therefore, propose to apply the same method to the covariance matrix \(\mathbf {C}_a\) of the SAM. Due to the patch-based definition of our AAM, we can simply use a distance d on the shape landmarks to manipulate \(\mathbf {C}_a\), if we define that all points of a patch have the same distance to all points of another patch. Hence, sampling points of one patch are not separated by the manipulations. As for the SSM, the resulting orthonormal matrix \(\mathbf {Q}_{*}\) can be used to replace \(\mathbf {Q}\) in Sect. 2.1. This extension leads to a flexible Patch-AAM framework to build representative shape and appearance models from few training data.
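The patch-level distance definition translates into a block-structured mask for \(\mathbf {C}_a\). The following sketch assumes that the appearance vector is concatenated patch by patch, so that the \(N_pN_f\) entries of each patch are contiguous; the function name `appearance_mask` is hypothetical.

```python
import numpy as np

def appearance_mask(landmark_keep, n_p, n_f=1):
    """Expand an (N_M, N_M) landmark mask to an (N_a, N_a) mask for C_a.

    All n_p * n_f entries of a patch inherit the distance of their landmark,
    so entries belonging to the same patch are never separated.
    """
    block = np.ones((n_p * n_f, n_p * n_f))
    return np.kron(landmark_keep, block)
```

Multiplying \(\mathbf {C}_a\) element-wise with this mask zeroes complete patch-to-patch blocks, while each within-patch block is left untouched.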

Table 1. Mean symmetric surface distances to the ground-truth segmentations obtained for the different patch-based AAM approaches on both data sets and for different numbers of training samples. See text for explanations. Results are given as mean±std. dev. in mm over all available test cases and repetitions. Italic font indicates a statistically significant difference to P-AAM. Bold font indicates a statistically significant difference to L-SSM. Significance is assessed by paired t-tests with \(p<0.05\).

3 Experiments and Results

The objectives of our evaluation are as follows: (1) Analysis of the segmentation performance of the proposed Patch-AAM approach on medical data. We focus on multi-object problems and HDLSS scenarios. (2) Analysis of the hypothesis that the fitting algorithm can only make full use of the flexibility of the locality-based SSM when the approach is also employed to build the SAM.

Data: Two publicly available 2D data bases are used for the experiments. (1) The JSRT/SCR data base [4, 10] that consists of 247 chest radiographs (2048 \(\times \) 2048 pixels; 0.175 mm pixel spacing) and provides ground-truth segmentations for 5 structures (right/left lung, heart, right/left clavicle; represented by in total 166 corresponding landmarks; see [4]) for all cases. In [4], the data was divided into two disjoint folds of 124 (fold1) and 123 cases (fold2), respectively. Here, fold2 is employed to train the models, while fold1 is used as test data. (2) 32 mid-ventricular slices (256 \(\times \) 256 pixels; avg. pixel spacing: 1.40 mm) taken from different end-diastolic short axis cardiac MRI scans from [1]. For each case, ground-truth contours for the endo- and epicardium of the left ventricle (LV) are provided. We additionally segmented the right ventricle (RV). Each case consists of 104 landmarks (manually placed at corresponding locations). Random subsets of the data are employed for model training while the remaining cases are used for testing. Both data bases contain challenging cases due to, e.g., fuzzy boundaries, the projective nature of the data (JSRT), and large anatomical variability.

Experimental design: We compare 3 different variants of the Patch-AAM approach: (1) Patch-AAMs directly learned on the training samples (P-AAM, see Sect. 2.1), (2) Patch-AAMs where the SSM is learned using the locality-based approach (L-SSM, see Sect. 2.2), and (3) Patch-AAMs where the SSM and the SAM are learned using the locality-based approach (L-AAM, see Sect. 2.2). For the locality-based variants, we use the multi-object distance defined in [13] and build 5 locality levels for the SSM and 3 for the SAM. Each variant is once learned on the raw gray values (\(N_f=1\) and \(N_e=11\)) and once by using the well-known SSC descriptor [5] with \(N_f=6\) and \(N_e=5\) to show the flexibility of the framework. To mimic HDLSS scenarios, models are generated for varying numbers of available training samples \(N_S\) randomly sampled from the training data (JSRT: 15, 30, 40, 70, all 123; Cardiac: 5, 10, 15, 20). Those models are then used to segment the objects in the test images. The experiments are repeated 10 (JSRT)/20 (Cardiac) times to reduce the bias introduced via random sampling. For P-AAM, the fitting algorithm is initialized with the mean shape and a multi-resolution scheme with 3 levels is employed. The locality-based variants performed best with 2 multi-resolution levels. See our MATLAB code (footnote 2) for additional parameter settings. The segmentation accuracy is quantitatively assessed by computing mean symmetric contour distances to the ground-truth contours weighted by the number of landmarks of each object.
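One possible reading of the evaluation measure is sketched below. The exact definition used in the experiments is not given in the text, so this is an assumption: distances are approximated point-to-point between landmark sets instead of point-to-contour, and both function names are hypothetical.

```python
import numpy as np

def mean_symmetric_distance(contour_a, contour_b):
    """Mean symmetric distance between two point sets of shape (n, 2) and (m, 2)."""
    d = np.linalg.norm(contour_a[:, None, :] - contour_b[None, :, :], axis=2)
    a_to_b = d.min(axis=1)   # closest point on b for every point of a
    b_to_a = d.min(axis=0)   # closest point on a for every point of b
    return (a_to_b.sum() + b_to_a.sum()) / (len(a_to_b) + len(b_to_a))

def weighted_mean_distance(results, truths):
    """Average the per-object distances, weighted by each object's landmark count."""
    dists = [mean_symmetric_distance(r, t) for r, t in zip(results, truths)]
    weights = [len(t) for t in truths]
    return float(np.average(dists, weights=weights))
```

With this weighting, objects described by many landmarks (e.g., the lungs) contribute more to the overall score than small objects such as the clavicles.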

Results: Our results are summarized in Table 1 and exemplarily illustrated in Fig. 2. All 3 variants lead to competitive results for \(N_S=123\) (JSRT)/\(N_S=20\) (Cardiac) when compared to the literature. For the JSRT data, all variants outperform the model-based approaches tested in [4, 13] (e.g., ASMs and AAMs; best in [4]: Hybrid ASM with 2.77 mm) and the SSC-based models achieve results comparable to [9] who obtain a mean distance of 2.10 mm on the same data. For the cardiac data, the AAM in [1] achieves a mean error of \(\approx \)1.5 mm for LV segmentation in 3D on the same data. We think our results obtained in 2D are at least comparable to theirs, given the fact that we also segment the RV.

Fig. 2. Illustration of segmentation results (colored contours) for Patch-AAM variants on both data sets (left: JSRT; right: Cardiac). Black/White contours: ground-truth. Please note the improved coverage of local details by the L-AAM approach.

Regarding the specific performances of the locality-based variants (L-SSM/L-AAM) for \(N_S<123\) (JSRT)/\(N_S<20\) (Cardiac), we can see effects comparable to those observed in [13] for locality-based ASMs: In most cases, L-SSMs significantly outperform P-AAMs (see Table 1; paired t-tests with \(p<0.05\)). The improvements tend to be larger for the JSRT data, for which our results are also at least comparable to those achieved in [13] for the same sample sizes and data with a locality-based ASM. Most of our results are clearly better (e.g., \(N_S=40\); L-SSM w/SSC: 2.28 mm; ASM in [13]: 2.82 mm). The results also confirm our initial hypothesis that the Patch-AAM fitting algorithm performs better with a locality-based SSM and SAM (L-AAM in Table 1). Nearly all results reported for L-AAM are significantly better (see Table 1; paired t-tests with \(p<0.05\)) than those obtained by L-SSM. However, improvements when using the SSC descriptor are less prominent or even nonexistent. The exact reason for this behavior remains unclear and is subject to further research. Computationally, Patch-AAMs are efficient (0.5–6 s to process an image on a six-core Xeon CPU).

4 Conclusion

In this paper, a flexible framework for patch-based active appearance modeling that elegantly combines strengths of AAMs and CLMs is presented. Patch-AAMs consist of global shape and appearance models whose parameters are jointly optimized during the efficient segmentation procedure. The often insufficient generalization abilities of those global models are tackled by incorporating and extending a recent approach for learning representative SSMs from small training populations to patch-based appearance modeling. Our experiments on publicly available data show that our framework leads to competitive segmentation results for challenging multi-object problems even when only few training samples are available. Furthermore, the evaluation shows that our framework is able to make use of structural image representations like image descriptors in addition to raw gray values. Although it was only applied to 2D data in this work, the approach is not limited to 2D and readily generalizes to 3D cases.