1 Introduction

The estimation of the age of a speaker from an analysis of his or her voice has forensic applications – for example the profiling of perpetrators of crimes [1], commercial applications – for example targeted advertising, and technological applications – for example adaptation of a speech recognition system to a speaker [2].

Many previous studies have looked at the performance of both human listeners and machine-learning systems for the estimation of age from speech. Unfortunately, variations in the data set, task and performance metric make these studies hard to compare. In our work we take the view that the natural task should be numerical estimation of the age of the speaker, and the natural performance metric should be the mean absolute error (MAE) of estimation. The MAE answers the question “on average, how far is an estimate from the actual age?”
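As a concrete reading of the metric, the MAE can be sketched as below. This is a minimal illustration only: the function name and the example ages are our own assumptions and are not taken from any of the studies discussed.

```python
def mean_absolute_error(true_ages, estimated_ages):
    """Average absolute difference, in years, between estimates and true ages."""
    if len(true_ages) != len(estimated_ages) or not true_ages:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(t - e) for t, e in zip(true_ages, estimated_ages))
    return total / len(true_ages)


# Illustrative values only; these are not data from the study.
true_ages = [20, 35, 50, 65]
estimates = [25, 30, 45, 50]
print(mean_absolute_error(true_ages, estimates))  # errors 5, 5, 5, 15 -> 7.5
```

Because each error is taken as an absolute value before averaging, over- and under-estimates do not cancel out.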

A recent review of previous studies on human listener judgments of speaker age may be found in [3]. Of the studies reported which used numerical age estimation and MAE, most seem to suggest human performance has an MAE of about 10 years. Table 1 provides a summary.

Table 1. Previous studies on human listener age estimation

There have also been many studies of the machine prediction of speaker age from speech; see [8] for a review of the state of the art. These studies also vary greatly in terms of the data set, audio quality, audio duration, audio feature set, recognition task and machine learning approach taken. Machine learning methods have included support vector machines, Gaussian mixture models (GMM), GMM supervectors, i-vectors and phoneme recognisers. Reference [9] provides system design and performance figures for a range of contemporary approaches together with a fusion of systems for age estimation of very short speech excerpts (<2 s). The different approaches differed by only a few percentage points (43.1–47.5 % age categories correctly identified), suggesting that the choice of machine learning algorithm is not a critical factor.

The results of some machine studies that addressed the problem of numerical age estimation evaluated with MAE are shown in Table 2.

Table 2. Previous studies on machine age estimation

The best performing system on adult speech described in [8] used the i-vector approach followed by support vector regression and demonstrated an MAE of 6.1 years. While at first glance this looks considerably better than the MAE figures quoted for human performance, it is important to note that the test data used in this study had a non-uniform age distribution, with significantly more speakers in the 20–29 age band than in other bands. This uneven distribution means that even a null model which always predicted the mean age of the training speakers would show an MAE of 10.6 years for female speakers and 10.1 years for male speakers. The superiority of the machine system might therefore have arisen from the unfair knowledge it had of the prior age distribution. Since all machine systems in Table 2 may have exploited a prior on the test speaker age, this makes it impossible to compare any of them fairly with the human listeners, who were not given that information.

The goal of this study is to make a fair comparison between human and machine speaker age estimation. This will be done by: (i) comparing humans and machines on the same test data, (ii) comparing them on the same task – numerical age estimation, (iii) evaluating both using the same performance metric – MAE, and (iv) removing any advantage of knowing a prior on the test set by using a uniform test age distribution.

We describe the data set used for the task, results of a human listening task and results of a machine age estimation system constructed to be similar to the best performing systems in Table 2.

2 Speech Corpus

The work described here uses the Accents of the British Isles corpus (version 2) available from The Speech Ark [15]. The ABI-2 corpus consists of recordings of 262 speakers covering 13 accent areas of the British Isles. Each speaker is recorded reading a range of English language materials, although for this work we used only the first part of the “accent diagnostic” passage, which has a median duration of 39.2 s. The recordings are supplied as wide bandwidth audio of good quality, recorded using a close-talking microphone at 22050 samples/s.

The corpus was divided into a test set containing 52 speakers, and a training set of the remaining 210 speakers. The test set was chosen to have equal representation of men and women for all 5-year age bands between 15 and 80. Figure 1 shows the age distribution by gender for the test and training sets.

Fig. 1. Age and gender distribution for the train and test sets.

The mean age of the training set was 42.6 years. Used as a null prediction for the test set this value would score a mean absolute error of 16.7 years.
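The null prediction described above can be sketched as follows. The function name and the example ages are illustrative assumptions; reproducing the reported figures of 42.6 and 16.7 years would require the actual speaker ages from the corpus.

```python
def null_model_mae(train_ages, test_ages):
    """MAE obtained by predicting the training-set mean age for every test speaker."""
    mean_age = sum(train_ages) / len(train_ages)
    return sum(abs(age - mean_age) for age in test_ages) / len(test_ages)


# Illustrative values only: a training set skewed towards younger speakers
# and a test set spread evenly over a wide age range.
train_ages = [20, 25, 30, 45]   # mean is 30
test_ages = [20, 40, 60, 80]
print(null_model_mae(train_ages, test_ages))  # errors 10, 10, 30, 50 -> 25.0
```

The wider and more uniform the test age distribution, the worse this constant prediction scores, which is why a uniform test set removes the benefit of knowing the prior.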

3 Human Prediction Performance

To obtain human age prediction performance a web-based data collection protocol was used. Listeners were able to listen to each test recording then make an age estimation using a sliding scale between 15 and 80. Estimates were recorded as whole numbers of years. Recordings were presented in a random order different for each listener. Listeners could make their age estimate at any time while the recording was playing, or could listen to the audio multiple times. Listeners conducted the test in their own homes, but were asked to listen over headphones. The web interface may be seen in Fig. 2.

Fig. 2. Web experiment interface.

An attempt was made to recruit listeners over a range of ages and genders, although the balance was not perfect. In all, 36 native English listeners completed the test; Table 3 shows their distribution by age and gender.

Table 3. Distribution of listeners by age and gender

The raw age predictions are plotted against the true speaker ages in Fig. 3. The line of best fit has a slope of 0.68 and an intercept of 12.7 years. The correlation coefficient is 0.759 and the mean absolute error (MAE) of prediction is 9.79 years (male speakers only 10.1, female speakers only 9.51).
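The summary statistics quoted above (slope, intercept and correlation coefficient) can be computed with ordinary least squares, as in the sketch below. We assume a simple least-squares fit of estimated age on true age; the function names and the example values are our own and are not the study data.

```python
import math


def fit_line_and_correlation(true_ages, estimates):
    """Least-squares slope and intercept of estimates on true ages, plus Pearson r."""
    n = len(true_ages)
    mean_x = sum(true_ages) / n
    mean_y = sum(estimates) / n
    sxx = sum((x - mean_x) ** 2 for x in true_ages)
    syy = sum((y - mean_y) ** 2 for y in estimates)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(true_ages, estimates))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


def crossover_age(slope, intercept):
    """Age at which the fitted line meets the identity line (estimate == true age)."""
    return intercept / (1.0 - slope)


# Illustrative values only: estimates compressed towards the middle of the range.
true_ages = [20, 40, 60, 80]
estimates = [30, 40, 50, 60]
slope, intercept, r = fit_line_and_correlation(true_ages, estimates)
print(slope, intercept, r)
print(crossover_age(slope, intercept))
```

A slope below one means that, under the fitted line, speakers younger than the crossover age tend to be overestimated and older speakers underestimated. For the fit reported above, 12.7 / (1 − 0.68) places that crossover at roughly 39.7 years.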

Fig. 3. Age predictions of 52 test speakers by 36 listeners.

The MAE as a function of the age and sex of the speaker is shown in Table 4, and MAE as a function of the age and sex of the listener is shown in Table 5.

Table 4. Mean absolute error of prediction as a function of age and sex of the speaker
Table 5. Mean absolute error of prediction as a function of age and sex of the listener

Generalised linear mixed-effects models of the predictions were estimated using Markov chain Monte Carlo techniques with the MCMCglmm package [16]. The models were used to determine if the absolute error in age prediction was affected by the sex or age of the speaker, or the sex or age of the listener.

The speaker model was trained with the identity of the listener as a random factor. The sex of the speaker was found not to have a significant effect. The age band of the speaker did have a significant effect, with the ages of speakers in the 20–29 age band being significantly better estimated than those in the other bands.

The listener model was trained with the identity of the speaker as a random factor. The sex of the listener was found not to have a significant effect. The age band of the listener did have a significant effect, with listeners in age bands 40–49 and 60–69 giving significantly worse predictions than listeners in the 20–29 age band.

The fact that the line of best fit of the estimates does not have a gradient of one might be due to the limited range of the age slider in the web task creating floor and ceiling effects. Listeners were unable to estimate ages lower than 15 years or greater than 80 years even if these would in fact have been in error.

A distribution of the age prediction errors is shown in Fig. 4. It may be seen that errors are approximately symmetric about zero. This suggests that an averaging of age predictions over listeners would provide a better age estimate.

Fig. 4. Distribution of age prediction errors by human listeners.

To estimate the benefit of averaging across listeners, panels of size 2 to 12 were built post hoc from random selections of listeners. The average MAE calculated over 50 random panels of each size is plotted in Fig. 5. It is seen that considerable advantage may be had by consulting a listener panel, with a panel of 10 listeners for example having an MAE of 7.41 years.
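The post hoc panel procedure can be sketched as below. Several details are assumptions on our part: listeners are drawn without replacement within a panel, the panel estimate is the arithmetic mean of its members' estimates, and the data layout (`estimates[l][s]` for listener `l` and speaker `s`) and function name are hypothetical.

```python
import random


def panel_mae(estimates, true_ages, panel_size, n_panels=50, seed=0):
    """Mean MAE over random listener panels.

    estimates[l][s] is listener l's age estimate for speaker s.
    """
    rng = random.Random(seed)
    listeners = range(len(estimates))
    panel_maes = []
    for _ in range(n_panels):
        # Assumption: panel members are distinct listeners.
        panel = rng.sample(listeners, panel_size)
        errors = []
        for s, true_age in enumerate(true_ages):
            panel_estimate = sum(estimates[l][s] for l in panel) / panel_size
            errors.append(abs(panel_estimate - true_age))
        panel_maes.append(sum(errors) / len(errors))
    return sum(panel_maes) / len(panel_maes)


# Illustrative values only: one listener always 10 years high, one 10 years low.
true_ages = [30, 50]
estimates = [[40, 60], [20, 40]]
print(panel_mae(estimates, true_ages, panel_size=1))  # each listener alone: 10.0
print(panel_mae(estimates, true_ages, panel_size=2))  # errors cancel: 0.0
```

The toy example is deliberately extreme, but it shows the mechanism: when individual errors are roughly symmetric about zero, averaging cancels part of them.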

Fig. 5. Distribution of mean absolute error of age prediction by listener panel size. Average over 50 random panels. Bars show 1 s.d.

4 Machine Prediction Performance

4.1 Feature Analysis

Following on from the acoustic feature analysis used in the Interspeech Computational Paralinguistics challenges, we have used the OpenSMILE toolkit [17] to generate a large feature vector for each audio recording. The specific parameter set was that used for the 2014 challenge [18]. This feature set comprises 65 low-level descriptors which are extracted from short-term windows on the signal. These describe speech signal properties such as energy, spectral envelope, pitch and voice quality. The descriptors are then summarized over each file using a large number of statistical measures such as means, medians, quantiles, differences, and so on. The output is a vector of 6373 features for each file.
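The idea of summarizing a frame-level descriptor into file-level features can be illustrated with the sketch below. This is not the OpenSMILE API and not the challenge feature set: the function name, the handful of statistics chosen, and the example contour are all assumptions made for illustration.

```python
import statistics


def summarise_contour(values):
    """Reduce one frame-level descriptor contour to a few file-level statistics."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.pstdev(values),
        # Mean frame-to-frame difference, a simple "difference" functional.
        "delta_mean": statistics.mean(deltas),
    }


# Illustrative pitch-like contour (Hz); not taken from the corpus.
contour = [100.0, 110.0, 120.0, 130.0]
print(summarise_contour(contour))
```

Applying many such statistics to each of the low-level descriptors, and to their frame-to-frame differences, is what turns a variable-length recording into one fixed-length vector.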

4.2 Machine Learning

The method chosen for learning the prediction model was Support Vector Regression (SVR) [19] as used by previous authors [10–13]. The “e1071” package for the “R” statistics library was the chosen implementation [20]. In support vector regression a subset of the training vectors is chosen to represent the optimal regression hyperplane.

To reduce the training complexity, a feature selection process was implemented. Only features whose absolute correlation with the age of the speaker exceeded an arbitrary threshold of 0.1 were passed to SVR. This selection was made on the training set only and left 2538 features. Performance was not strongly affected by the choice of this threshold, provided that enough features were included.
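The correlation filter can be sketched as below. We assume the Pearson correlation coefficient and treat constant features as uncorrelated; the function names, data layout and example values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either sequence has zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def select_features(feature_rows, ages, threshold=0.1):
    """Indices of features whose |correlation| with age exceeds the threshold.

    feature_rows[i][j] is feature j of training speaker i.
    """
    selected = []
    for j in range(len(feature_rows[0])):
        column = [row[j] for row in feature_rows]
        if abs(pearson(column, ages)) > threshold:
            selected.append(j)
    return selected


# Illustrative training data only: four speakers, four features.
rows = [[1, 5, 4, 3],
        [2, 5, 3, 1],
        [3, 5, 2, 4],
        [4, 5, 1, 2]]
ages = [20, 30, 40, 50]
print(select_features(rows, ages))  # features 0 and 2 are kept
```

Using the absolute value keeps features that fall with age as well as those that rise with it. The indices must be computed on the training speakers only and then applied unchanged to the test vectors.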

At the front end of the SVR, a radial-basis function kernel is applied – this provides an additional tunable non-linearity applied to the feature values. Also the SVR algorithm applies a feature normalization step to ensure all features have a similar dynamic range.

Optimal control parameters were found using a cross-validation procedure on subsets of the training data only. The optimal parameters were: C = 8, gamma = 0.25/number-of-features, epsilon = 0.1.
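To make the gamma setting concrete, the sketch below evaluates a radial-basis function kernel with gamma scaled by the number of features. We assume the LIBSVM convention k(u, v) = exp(−gamma·‖u − v‖²), which is the library that e1071 wraps; the function name and example vectors are our own.

```python
import math


def rbf_kernel(u, v, gamma):
    """Radial-basis function kernel, exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


# Gamma as reported: 0.25 divided by the number of selected features.
n_features = 2538
gamma = 0.25 / n_features
print(gamma)

# Illustrative vectors only: identical vectors give 1.0, distant ones approach 0.
print(rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma))
print(rbf_kernel([1.0, 0.0], [0.0, 0.0], 1.0))
```

Dividing by the number of features keeps the kernel value in a useful range as the dimensionality grows, since the squared distance between normalized vectors tends to scale with the number of features.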

Separate SVR systems were trained for male and female speakers as in [8].

4.3 Raw Prediction Performance

The raw prediction performance of SVR is shown in Fig. 6. The line of best fit has a slope of 0.53 and an intercept of 18.9 years. The correlation coefficient is 0.82, and the mean absolute error is 9.13 years (male speakers only 7.98, female speakers only 10.29). A gender independent model gave a correlation of 0.81 and an MAE of 9.18 years. As mentioned previously, a null model has an MAE of 16.7 years.

Fig. 6. Machine prediction of age using SVR.

Like the human listener predictions, the machine predictions also overestimate the ages of younger speakers and underestimate the ages of older speakers. Table 6 shows the MAE as a function of the age and sex of the speaker. It is noticeable that the greatest estimation errors are with the speakers older than 50. In the next section we try to rebalance the training set to investigate whether this bias is just a reflection of the uneven age distribution in the training data.

Table 6. Mean absolute error of prediction as a function of age and sex of the speaker

4.4 Effect of Balancing the Training Set

Since our original motivation was to make a fair comparison with human listeners on a balanced test set, it may be that we have now disadvantaged the machine system by providing only an unbalanced training set. The machine predictions are worse for the older speakers (Table 6) who are under-represented in the training data (Fig. 1). The training of predictive models under circumstances of imbalanced data is an ongoing area of research both for classification and regression tasks [21]. To examine the effect of imbalanced training data in this task, we explore the synthetic creation of training data samples using a variation of the SMOTE algorithm [22] designed for regression [23].

Here we present results in which we artificially generate additional training samples from linear interpolations between existing vectors. We even out the number of samples for male and female speakers and boost the number of training samples for speakers of ages >50 years. Each new sample is generated from two randomly-chosen instances of the same sex and age band by choosing a random point along the interpolation joining the two vectors. The new age value is interpolated from the ages of the two samples at the same fraction. In total a further 271 vectors were added.
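The interpolation step can be sketched as below. This is a simplified illustration rather than the exact procedure of [23]: we assume the two instances are distinct and drawn uniformly at random, and that the interpolation fraction is uniform on [0, 1). Function names and example values are hypothetical.

```python
import random


def interpolate_sample(vec_a, age_a, vec_b, age_b, fraction):
    """New feature vector and age at the same fraction along the line from a to b."""
    new_vec = [a + fraction * (b - a) for a, b in zip(vec_a, vec_b)]
    new_age = age_a + fraction * (age_b - age_a)
    return new_vec, new_age


def synthesize(samples, n_new, seed=0):
    """Generate n_new synthetic (vector, age) pairs.

    samples must all come from ONE sex and age band, as (vector, age) pairs.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        (vec_a, age_a), (vec_b, age_b) = rng.sample(samples, 2)
        synthetic.append(
            interpolate_sample(vec_a, age_a, vec_b, age_b, rng.random())
        )
    return synthetic


# Illustrative values only: two speakers in the same sex and age band.
band = [([0.0, 10.0], 50.0), ([2.0, 20.0], 60.0)]
for vec, age in synthesize(band, n_new=3):
    print(vec, age)
```

Every synthetic sample lies on the line segment between two real samples, so this adds weight to the under-represented bands without adding feature values outside the range already observed.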

Figure 7 shows the distribution of training samples by decade before and after balancing.

Fig. 7. Results of boosting the frequency of the older speakers in the training data.

A new SVR model was trained on the re-balanced data. 3378 features were selected using the same correlation threshold. Cross validation on the training data suggested the best control parameters were now C = 32, gamma = 0.125/number of features, epsilon = 0.001.

Figure 8 shows the age predictions for the test set after training with the re-balanced training data. The line of best fit had a slope of 0.554 and an intercept of 17.8 years. The correlation was 0.852 and the MAE 8.64 years (male speakers only 7.87, female speakers only 9.42). A gender independent model gave a correlation of 0.81 and an MAE of 9.49 years. Table 7 shows the mean absolute error of prediction as function of the age and sex of the speaker.

Fig. 8. Machine prediction of age using SVR trained on rebalanced data.

Table 7. Mean absolute error of prediction as a function of age and sex of the speaker using SVR trained on rebalanced data

While some performance improvements are seen in comparison to results with the original training set, overall the improvement is small. It may be that in this task, the SVR model does not gain any useful information from the synthetic samples.

5 Discussion

In this study we have made a direct comparison between human listeners and machine learning on the problem of speaker age estimation. By nullifying any advantage a machine system may have from knowing the prior distribution of test speakers, we have shown that humans and machines are more similar in estimation performance than results published in previous studies would suggest.

Nevertheless, the machine system showed a slight advantage. The best machine performance had an MAE of 8.64 years, while the human listeners had an MAE of 9.79 years. The machine system was able to outperform roughly two-thirds (25/36) of the human listeners. However, even a panel of 2 listeners had better average performance than the machine system in this particular experiment.

Interestingly, both human and machine had problems with the extremes of the age range, both showing lines of best fit with slopes significantly less than unity. We showed that boosting the number of older speakers in the training set had very little effect, perhaps because the SVR model did not extract any more information from the interpolated samples than it could extract from the original samples. The difficulty of predicting the ages of older speakers may be due to some inherent characteristics of the data – perhaps the voice characteristics of older speakers are more variable for a given age. This would fit with other research [24] that has shown how cognitive abilities become increasingly heterogeneous with advancing age. Further research into this issue, and improved machine performance, is likely to come from data sets with a larger number of speakers and a larger range of ages.