
1 Introduction

Mild cognitive impairment (MCI) is a condition of cognitive deterioration that is difficult to classify as either normal aging or a prodromal stage of dementia. Neuropsychological tests alone are highly valuable but not sufficient to determine MCI or its early stages, since they are not sensitive enough for patients with subjective complaints but no significant, clinically detectable deficits [33].

Further, the diagnosis of temporal lobe epilepsy (TLE) has been, and still is, based on clinical assessment and electroencephalographic (EEG) examination, which is sometimes inconclusive [31].

However, both diseases, i.e., MCI as well as TLE, need to be treated and managed adequately in order to prevent massive memory decline or risks due to seizures [7]. MCI and TLE also occur as co-morbid conditions, when patients with MCI encounter seizures; moreover, the presentation of cognitive disorders in the elderly warrants a differential diagnosis of MCI or TLE, since seizures can mimic confusional episodes of MCI or Alzheimer's disease without visible seizure activity on scalp electroencephalography [23]. Treating these seizures can restore the cognitive functioning of these patients. However, diagnostics based on invasive electroencephalography bear significant risks and costs.

From a structural point of view, the hippocampus is an area of the brain that links MCI and TLE, as both diseases have been found to affect the hippocampal structure in some way [14]. It is therefore worth evaluating techniques for the diagnosis of these conditions that are based on distinctive features of this brain structure. Segmentation of the hippocampi is, of course, a prerequisite for such approaches. The hippocampus is atrophic in mild cognitive impairment and dementia [11], and it is sclerotic in specific subtypes of epilepsy [28], in particular in TLE. Thus, volumetry alone is not sufficient for discrimination, which calls for shape-based features to be investigated.

For structural characterisation, the amount of time it takes an expert to segment the hippocampus is a significant obstacle. In a high-resolution magnetic resonance image, a specialist has to trace the contour of the formation in each slice and review the result in several dimensions. Even after a certain amount of training, this may still take up to one hour per hippocampus, i.e. two hours per patient. This motivates the use of automated segmentation techniques. A large variety of techniques and algorithms for automated hippocampus segmentation have been published over the last years, some of them targeted at specific disease or deformation classes (see e.g. [2, 10, 17, 20, 29, 30, 34, 35, 39, 40]). The classical state-of-the-art algorithms for automated hippocampus segmentation [5, 24] are based on multi-atlas segmentation (MAS [15]), with deep learning-based techniques emerging recently (e.g. [36]; see also [1] for a review of deep learning-based brain MRI segmentation techniques).

Interestingly, in a recent large-scale study on algorithms for computer-aided diagnosis of dementia based on structural MRI [4], 6 out of the 15 considered techniques (including the best performing algorithms) still relied on FreeSurfer segmentations (see Sect. 3), which constitutes the majority of the segmentation-based algorithms in this comparison. This widespread employment of FreeSurfer in large-scale studies, although it is no longer considered state-of-the-art, underpins the need for publicly available and easy-to-use segmentation tools. This also holds true for the upcoming deep learning-based tools. Recent work [26] also demonstrates that publicly available, cost-free segmentation tools were not able to reproduce MCI vs. control group classification results obtained with a custom (private) hippocampus segmentation tool [13].

In this paper, we closely follow a shape-based approach originally used to distinguish hippocampi affected by MCI from those of a healthy control group by employing spherical harmonics coefficients (SPHARM) as potentially discriminating features [13, 26, 32]. However, as a major original contribution, we investigate whether this approach is suited to differentiate hippocampi affected by TLE (i) from those affected by MCI (which has never been investigated with any technique) and (ii) from those of a healthy control group (which has been done relying on manual segmentations only [9, 18, 19, 21]). In this context, we investigate the impact of using different hippocampus segmentation approaches: three cost-free, pre-compiled, out-of-the-box hippocampus segmentation software packages as well as three segmentations independently conducted by human raters with different qualifications (the availability of which can be considered an extremely rare asset). This work differs from our earlier work [27] by not treating the two hippocampi separately, thus not looking into lateralisation effects, which leads to larger datasets and thus increased statistical significance of the results.

Section 2 briefly explains how we obtain SPHARM coefficients used to compose feature vectors subject to subsequent classification. In Sect. 3, we first explain the experimental setup in detail, specifically including the hippocampus segmentation variants and SPHARM coefficient selection strategy employed. Subsequently, classification results are shown and described. Section 4 concludes the paper by discussing the observed results.

2 Spherical Harmonics Descriptors in Structural MCI Characterisation

The features used for classification are based on spherical harmonics (SPHARM), a series of basis functions used to represent functions defined on the surface of a sphere. Once a 3D object has been mapped onto a unit sphere, it is possible to describe that object in terms of coefficients for the SPHARM basis functions. In other words, the SPHARM coefficients can be used as shape descriptors. In this work we follow the approach described in [3] in order to obtain coefficients for the hippocampus voxel volumes.
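To illustrate the idea, the following Python sketch estimates complex SPHARM coefficients for a spherically parameterised surface by a least-squares fit against the basis functions. The actual computation in this work follows [3] and the SPHARM-PDM pipeline; the function name, the use of scipy.special.sph_harm, and the least-squares formulation here are illustrative assumptions only.

```python
import numpy as np
from scipy.special import sph_harm


def spharm_coefficients(points_xyz, theta, phi, degree):
    """Least-squares estimate of complex SPHARM coefficients (illustrative).

    points_xyz : (P, 3) surface point coordinates of the parameterised object
    theta      : (P,) azimuthal angles of the spherical parameterisation, in [0, 2*pi)
    phi        : (P,) polar angles, in [0, pi]
    degree     : maximum SPHARM degree J
    """
    # Evaluate all basis functions Y_l^m at the parameter locations.
    basis = []
    for l in range(degree + 1):
        for m in range(-l, l + 1):
            basis.append(sph_harm(m, l, theta, phi))  # SciPy order: (m, l, azimuth, polar)
    B = np.stack(basis, axis=1)                       # shape (P, (J+1)**2)

    # Solve B @ C ~= points_xyz in the least-squares sense, one column per
    # coordinate, yielding 3*(J+1)**2 complex coefficients in total.
    coeffs, *_ = np.linalg.lstsq(B, points_xyz.astype(complex), rcond=None)
    return coeffs                                     # shape ((J+1)**2, 3)
```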

Once a voxel volume for a hippocampus has been obtained, either by automatic or manual segmentation, we first fix the topology of the voxel object. This is necessary since, in order to map a 3D object to a sphere, the respective voxel object must exhibit a spherical topology. Moreover, the voxel dimensions are adjusted to obtain isotropic voxels. Since the voxel volumes resulting from MRI scans have voxel sizes that depend on the scan parameters (i.e. they are often non-isotropic), we resample the data such that we end up with voxel cubes, each having a side length of 1 mm. Sometimes the segmented data consists of one large voxel compound and one or more disconnected smaller ones. In such a case we determine all voxel compounds and remove all but the largest one. This step removes small spurious voxel masses, which occur quite frequently, especially in manual segmentation, and hinder a mapping of the voxel object to a sphere.
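A minimal sketch of this preprocessing (resampling to isotropic 1 mm voxels and keeping only the largest connected voxel compound) could look as follows; the topology fixing step is omitted, and the helper name and the use of scipy.ndimage are assumptions, not the actual implementation used in this work.

```python
import numpy as np
from scipy import ndimage


def preprocess_mask(mask, voxel_size_mm):
    """Resample a binary hippocampus mask to isotropic 1 mm voxels and keep
    only the largest connected voxel compound (topology fixing not shown)."""
    # Zoom factors equal the original voxel edge lengths in mm, so the
    # resampled grid has 1 mm voxel cubes; order=0 keeps the mask binary.
    mask_iso = ndimage.zoom(mask.astype(np.uint8),
                            zoom=np.asarray(voxel_size_mm, dtype=float), order=0)

    # Label all connected voxel compounds and keep only the largest one.
    labels, n = ndimage.label(mask_iso)
    if n > 1:
        sizes = ndimage.sum(mask_iso, labels, index=range(1, n + 1))
        mask_iso = (labels == (np.argmax(sizes) + 1)).astype(np.uint8)
    return mask_iso
```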

Based on the resampled and fixed voxel volumes, we generate 3D objects. While other implementations create objects based on triangular faces, we decided to use quadrilaterals, since these correspond more naturally to voxels (see Fig. 1(a)). The 3D objects are then mapped onto a unit sphere during the initial parameterisation, which is followed by a constrained optimisation (described in more detail in [3]). The optimised parameterisation is then used to compute the SPHARM coefficients (see Fig. 1(b) and (c)).

Fig. 1. Basic principle visualised.

Fig. 2. The process of re-aligning the hippocampus: (a) original orientation (SPHARM reconstruction up to degree 15), (b) the same reconstruction but only up to degree 1, (c) same as (b) but re-aligned to the axes, and (d) the final object, same as (a) but re-aligned to the axes.

We are mainly interested in shape differences and thus want to ignore orientation differences, which are found e.g. in malrotated hippocampi. Instead of using e.g. rotation-invariant SPHARM representations [16], we resort to a classical alignment procedure, as invariant representations sometimes suffer from a lower degree of discriminativeness. For alignment, we compute a reconstruction of the hippocampus object up to SPHARM degree 1 (based on a triangulated sphere). This results in an ellipsoid which is aligned with the main orientation of the 3D object. Using PCA, we determine the principal axes of the ellipsoid and rotate the object such that all hippocampal volumes are in the coordinate system of the principal axes (i.e. co-aligned). After the re-alignment we recompute the SPHARM coefficients up to degree 15 for each re-aligned object and use them for the subsequent classification process. Figure 2 shows the process of re-aligning a hippocampal 3D object.
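The re-alignment step could be sketched as follows: the principal axes of the degree-1 ellipsoid reconstruction are obtained via PCA and used to rotate the object into a common coordinate system, after which the SPHARM coefficients are recomputed up to degree 15 on the re-aligned object. The function and variable names are illustrative and do not reflect the actual MATLAB implementation.

```python
import numpy as np


def realign_to_principal_axes(surface_points, ellipsoid_points):
    """Rotate an object into the coordinate system of the principal axes of
    its degree-1 SPHARM reconstruction (the ellipsoid); illustrative sketch."""
    # Principal axes of the ellipsoid via PCA, i.e. eigenvectors of the covariance.
    centered = ellipsoid_points - ellipsoid_points.mean(axis=0)
    cov = centered.T @ centered / len(centered)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    axes = eigvecs[:, ::-1].copy()               # column of the largest axis first

    # Keep a right-handed coordinate system to avoid an unintended reflection.
    if np.linalg.det(axes) < 0:
        axes[:, -1] *= -1

    # Express all surface points in the principal-axis coordinate system;
    # SPHARM coefficients up to degree 15 are then recomputed on this output.
    center = surface_points.mean(axis=0)
    return (surface_points - center) @ axes
```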

Computing SPHARM coefficients up to degree J, we obtain a total of N complex coefficients per hippocampus, where N is computed as

$$\begin{aligned} N = 3\,(J+1)^2. \end{aligned}$$
(1)

Extracting SPHARM coefficients up to degree 15 we end up with 768 coefficients per hippocampus. We have used a custom MATLAB implementation following the SPHARM-PDM code (https://www.nitrc.org/projects/spharm-pdm).

The final feature vectors \(F_i\) available for the feature selection process are composed of the absolute coefficient values for the left and right hippocampi:

$$\begin{aligned} F_i = (|C^{l}_{i, 1}|, \ldots , |C^{l}_{i, N}|, |C^{r}_{i, 1}|, \ldots , |C^{r}_{i, N}|), \end{aligned}$$
(2)

where \(C^{l}_{i, n}\) and \(C^{r}_{i, n}\) denote the n-th coefficient for the left and right hippocampus of subject i, respectively. These feature vectors contain 1536 coefficients in total, out of which subsets can be selected for the actual classification.
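For illustration, composing such a feature vector from the complex coefficient arrays of the left and right hippocampus (as in Eq. 2) could look as follows; the helper is hypothetical.

```python
import numpy as np


def feature_vector(coeffs_left, coeffs_right):
    """Compose the per-subject feature vector of Eq. (2): absolute values of the
    complex SPHARM coefficients of the left and right hippocampus, concatenated."""
    return np.concatenate([np.abs(coeffs_left).ravel(),
                           np.abs(coeffs_right).ravel()])

# For degree 15 each hippocampus contributes 3 * (15 + 1) ** 2 = 768 values,
# i.e. 1536 features per subject.
```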

3 Experiments

3.1 Experimental Settings

Data. In this work we use 58 T1-weighted MRI volumes, a data set acquired at the Department of Neurology, Paracelsus Medical University Salzburg, including patients with mild cognitive impairment (MCI, 20 subjects), temporal lobe epilepsy (TLE, 17 subjects), and a healthy control group (CG, 21 subjects). These data are a subset of a larger study [25]. We defined patients with amnestic MCI according to level three of the global deterioration scale for aging and dementia described in [12]. Diagnosis/ground truth w.r.t. MCI and TLE was based on multimodal neurological assessment, including imaging (high-resolution 3 T magnetic resonance tomography and single photon emission computed tomography with hexamethylpropyleneamine oxime), electroencephalography, and neuropsychological testing.

Hippocampus Segmentations. Manual segmentations have been performed by three experienced raters (one senior neurosurgeon – Rater1 – and two junior neuroscientists supervised by a senior neuroradiologist – Rater2 & Rater3) on a Wacom Cintiq 22HD graphics tablet (resolution \(1920 \times 1200\)) using a DTK-2200 pen and employing the 32-bit 3DSlicer software for Windows (v. 4.2.2-1 r21513) to delineate hippocampus voxels for each slice separately. The raters worked independently but followed a consensus on anatomical landmarks/borders of the hippocampus based on Henri Duvernoy's hippocampal anatomy [22]. The procedure was to delineate the hippocampal outline in all planes in the following order: sagittal – coronal – axial, with subsequent cross-line control through all planes.

For automated hippocampus segmentation, in contrast to most of the algorithms presented in the literature, e.g. [40], all three employed hippocampus segmentation software packages are pre-compiled and available for free [25]:

\(\underline{FreeSurfer (FS)}\)Footnote 1 is a popular set of tools which allow an automated labelling of subcortical structures in the brain [10]. Such a subcortical labelling is obtained by using the volume-based stream, which consists of five stages [10]. The result is a label volume containing labels for various subcortical structures (e.g. hippocampus, amygdala, and cerebellum). FreeSurfer is a highly popular tool in hippocampal analysis, either to assess clinical hypotheses [6, 17, 20, 35, 38] or as a baseline for newly proposed segmentation techniques (e.g. [29, 30, 40]). The winning algorithm in a recent large-scale study on computer-aided diagnosis of dementia based on structural MRI [4] was based on FS segmentation, as were 5 of the other techniques among the 15 considered in that study.

\(\underline{AHEAD}\) (Automatic Hippocampal Estimator using Atlas-based DelineationFootnote 2) is specifically targeted at automated segmentation of the hippocampi [34] and employs multiple atlases and a statistical learning method. After an initial rigid registration step, a deformable registration is carried out using the Symmetric Normalisation algorithm. Based on the result of these steps, the volume is normalised to the atlas, and the hippocampus segmentation from the atlas is warped back to the input volume. The final segmentation is then obtained by combining the multiple atlases with the statistical learning method.

Although \(\underline{BrainParser (BP)}\)Footnote 3 is usually able to label various subcortical structures, we use a version of BrainParser which is specifically tailored to hippocampus segmentation. After re-orienting the input volume to the coordinate system of the included, pre-trained atlas, skull stripping is performed. This is followed by computing an affine transform between the input volume and the reference brain volume, and subsequently a deformable registration between the input and the reference volume. Finally, the input volume is labelled according to the trained atlas.

We also fuse the segmentation results using voxel-based majority voting (abbreviated M.V. in the results – a voxel is active in the fused volume if at least two raters or segmentation tools marked that voxel as belonging to a hippocampus) and STAPLE [37] (abbreviated as STA). Since for human raters there is hardly a difference between majority voting and STAPLE, the latter results are not shown.
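Voxel-based majority voting over the three binary segmentations can be sketched as follows; the function name is illustrative, and STAPLE is not shown.

```python
import numpy as np


def majority_vote(masks):
    """Voxel-based majority voting: a voxel is active in the fused volume if at
    least two of the (three) input segmentations mark it as hippocampus."""
    stacked = np.stack([np.asarray(m, dtype=np.uint8) for m in masks])
    return (stacked.sum(axis=0) >= 2).astype(np.uint8)
```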

Feature Selection, Classification, and Evaluation Protocol. Features used for the actual classification are selected from the feature vectors \(F_i\) according to the degree of their coefficients. The strategy "CumulaJ" selects all coefficients with degree \(\le J\). Thus, for example, according to Eq. 1 for \(J=5\), Cumula5 employs 108 coefficients of degrees \(0 \dots 5\).
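A sketch of the CumulaJ selection is given below, assuming that within each hippocampus block of the feature vector the coefficients are stored ordered by degree (degree l contributing 3(2l+1) values); the function name and this ordering assumption are illustrative.

```python
import numpy as np


def cumulaJ(features, J_select, J_max=15):
    """Select the 'CumulaJ' subset from a concatenated left/right feature vector
    (Eq. 2): all coefficients of degree <= J_select, i.e. 3*(J_select+1)**2 values
    per hippocampus, assuming the coefficients are stored ordered by degree."""
    n_total = 3 * (J_max + 1) ** 2      # 768 values per hippocampus for J_max = 15
    n_keep = 3 * (J_select + 1) ** 2    # e.g. 108 values for Cumula5
    left, right = features[:n_total], features[n_total:]
    return np.concatenate([left[:n_keep], right[:n_keep]])
```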

For the classification of the features we use a Support Vector Machine (SVM) classifier [8] with a linear kernel. This classifier was chosen since it is known to cope very well with high-dimensional features. However, it does not guarantee that a larger feature set containing a smaller one leads to better classification results than training on the smaller set alone.

To estimate classification accuracy, we apply leave-one-out cross-validation (LOOCV) to the feature vectors.
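A corresponding evaluation sketch combining the linear-kernel SVM with LOOCV is shown below; the use of scikit-learn is an assumption (the text does not state the classification software), and SVM hyper-parameters are left at their defaults.

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC


def loocv_accuracy(features, labels):
    """Leave-one-out cross-validation accuracy of a linear-kernel SVM."""
    clf = SVC(kernel="linear")
    scores = cross_val_score(clf, features, labels, cv=LeaveOneOut())
    return scores.mean()

# Hypothetical usage: X holds one CumulaJ feature vector per subject,
# y the class labels (e.g. 0 = CG, 1 = TLE).
# accuracy = loocv_accuracy(X, y)
```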

3.2 Experimental Results

Tables 1 and 2 display the overall classification accuracy in percent for our test data set. Results are shown up to \(J=7\). Depending on the underlying segmentations, for higher values of J, classification accuracy either decreases (for the automated techniques and lower quality human segmentations) or does not increase any further (for high quality human segmentations).

The first impression of the results for discriminating TLE from CG (Table 1) is that there are no bold numbers in the left half of the table, i.e. employing automated segmentation tools does not lead to any decent classification results using the considered SPHARM approach, and most results are hardly superior to random guessing.

Table 1. Classification results: temporal lobe epilepsy (TLE) vs. healthy control group (CG) (overall accuracy \(\ge \) 75% in bold).

Looking at the results of applying individual automated segmentation tools, BP is clearly the best technique, with classification rates increasing for increasing J (which indicates that coefficients representing finer detail are still useful), while FS- and AHEAD-based results are on a comparable level and worse than BP. Applying majority voting (M.V.) or STAPLE (STA) to the individual segmentations does not really help, indicating that these segmentations are too different to benefit from fusion strategies.

The situation is different for the results when basing SPHARM classification on human rater segmentations. There is a clear trend that Rater1, being the most qualified human rater, provides segmentations that lead to sensible classification results (increasing up to 82% for \(J=7\)). Rater3 achieves some useful results as well, but lower (\(\le \)77%) and only for \(J=6,7\), while Rater2 stays below 67%. Overall, the differences among the human raters in terms of achieved classification accuracy are considerable, and applying fusion techniques does not lead to useful results (i.e. better than Rater2 and Rater3 in most settings, but clearly inferior to the Rater1 results).

Table 2. Classification results: temporal lobe epilepsy (TLE) vs. mild cognitive impairment (MCI) (overall accuracy \(\ge \) 75% in bold).

Considering the results of discriminating TLE-affected hippocampi from those affected by MCI (see Table 2), we see a different picture. The automated segmentation tools at least deliver segmentations useful for classification when fusion techniques are applied (for small values of J only, again indicating that only coarser details contribute useful information to the classification). Also, the relation among the individual tools is different, with a clear ranking: AHEAD best, followed by FS, with BP worst (note that this is almost the inverse of the previous classification task).

Concerning manual segmentations, we notice a behaviour somewhat similar to before. Again, Rater1 (with the highest qualification) is the best, while for this classification task segmentations of Rater2 also lead to quite decent classification results for higher values of J. Except for \(J=2\), Rater3 segmentations are the worst. Again, fusion of the human segmentation results does not reach the classification results of the individual segmentations, except for \(J=1\) (but on a moderate level).

4 Discussion and Conclusion

The obtained results indicate that discriminating TLE patients from healthy controls is far more difficult than discriminating TLE patients from MCI patients. In the former case, only human segmentation (and only that provided by the most qualified human rater) leads to SPHARM coefficients that can be used to discriminate the two classes with reasonable accuracy. In this case, in contrast to the case of discriminating MCI-affected patients from healthy controls [26], applying fusion to the human segmentations does not work properly either (obviously the segmentations are too different to benefit from fusion [25]). For discriminating TLE patients from healthy controls, the considered automated segmentation tools do not lead to sensible classification results under the employed SPHARM methodology, neither applied individually nor under fusion techniques (which might also explain the absence of corresponding publications).

The situation is different in the second case (i.e. discriminating TLE patients from MCI patients). While employing the individual automated tools does not work, fusing the corresponding segmentations and using the lower-order SPHARM coefficients for classification is successful. Obviously, the shape difference is so fundamental that it is reflected in the coarse-grained SPHARM details of the automated tools' segmentations in some complementary manner. The best qualified human rater's segmentations again lead to the best classification results, but the two other human raters also achieve useful accuracy for some settings. Thus, the fundamental differences are also captured by the less qualified human raters' segmentations. As in the first case, the human segmentations seem to be too different to provide useful results under fusion (see also [25] for corresponding segmentation result analysis confirming this assumption).