1 Introduction

The power grid is an enabling infrastructure that modern society depends on. The efficiency and safety of the power system have major consequences for maintaining a stable electricity supply, supporting economic growth, and ensuring national security. With the rapid development of technology, sophisticated automation has been introduced into power system operation. As this equipment becomes more complex and more tightly coupled, the risk of malicious intrusion or attack, and the potential loss it could cause, also increase. There is therefore a growing need to verify that the person operating a particular machine is authorized to do so.

In this situation, conventional human-based authentication such as passwords, tokens, and manual checks is no longer considered sufficient on its own, because human operators are one of the biggest sources of errors in complex systems [1]. For example, passwords and PINs are easily forgotten or forged, and even the most highly trained and alert operators are prone to fatigue and boredom after a long period of continuous work. Biometric identification technology can therefore be a useful supplement to existing authentication techniques.

One of the most promising biometric identification technologies is automatic speaker verification (ASV), which is the task of verifying an individual’s identity from their voice samples using machine learning algorithms, without any human intervention. Because voice is one of the most natural means of interaction between humans and machines, voice-based systems are easy and intuitive for human operators to use. Further, voice is inherent to the individual and can be neither lost nor stolen, which makes it a reliable basis for authentication. The availability of low-cost, portable microphones makes ASV easy to integrate. ASV has advanced significantly over the past few decades and has been successfully introduced in sectors such as health care, finance, and manufacturing.

Although state-of-the-art i-vector/PLDA based systems perform satisfactorily when given adequate speech data [2], a major challenge in ASV is to maintain performance with limited voice segments. On the one hand, to achieve fair performance, ASV systems need sufficiently long utterances (two or three minutes) for extracting enrollment and test i-vectors [3, 4]. In practical ASV systems it is often difficult to acquire such long speech because of background noise, overlapping voices, or faulty recording devices. There are also difficulties related to the speakers themselves: an unwilling speaker, or a speaker's state of health or character, can reduce the amount of speech data available. On the other hand, the systems require a large amount of development data to estimate reliable hyper-parameters. In particular, the success of PLDA modeling depends on the availability of a large set of labeled in-domain data. In most real-life applications, collecting that much development data from the target domain is infeasible. Hence, it is crucial to maintain ASV performance when only limited voice data are available.

Over the years, considerable research effort has been devoted to these challenges. In [5], duration variability is mitigated by propagating the posterior covariance of the i-vectors into PLDA; however, scoring is computationally expensive in this method. The work in [6] proposed full posterior distribution PLDA to address the short duration issue. The work in [7] attempted to improve short utterance performance by adapting the i-vector estimation. Many techniques have also been proposed to deal with inadequate target domain data in PLDA modeling. The work in [8] proposes Bayesian adaptation of PLDA models. In [9], unsupervised clustering of i-vectors is used to adapt the covariance matrices of PLDA models. The work in [10] proposes inter-dataset variability compensation (IDVC) to find a feature space that is more domain independent. In this paper, we propose a new method that incorporates historical test information into short utterance i-vector extraction. In addition, we modify the conventional LDA projection to compensate for domain mismatch before PLDA modeling. In contrast to existing works, which address limited utterance length and limited in-domain development data separately, we integrate the proposed methods in one system and validate it in a real-life power grid dispatching room scenario.

The rest of the paper is organized as follows. Section 2 describes the i-vector/PLDA framework used as our baseline ASV system. The proposed i-vector extraction method and the LDA modification are detailed in Sect. 3. Section 4 presents the experimental setup. Section 5 discusses system implementation and evaluation. Section 6 concludes the paper and outlines future studies.

2 Baseline ASV System Description

2.1 I-Vector Extraction

As mentioned earlier, the i-vector based system has become the de facto choice for speaker verification and related tasks. An i-vector is essentially a low-dimensional representation of the Gaussian mixture model (GMM) super-vector, found through a factor analysis process. Specifically, the speaker- and channel-dependent GMM super-vector M is modeled as

$$ {\text{M}} = {\text{m}} + {\text{Tw}} $$
(1)

where m is the speaker- and channel-independent super-vector, formed by concatenating the means of the universal background model (UBM), T is a low-rank total variability (TV) matrix, and w is a random latent variable with a standard normal distribution. In the i-vector approach, the UBM and the TV matrix are trained on a large amount of speech data gathered from different speakers. The i-vector x is the maximum a posteriori (MAP) point estimate of the hidden variable w, which equals the mean of the posterior distribution of w conditioned on the input utterance:

$$ {\text{x}} = \left( {I + T^{T}\Sigma ^{ - 1} NT} \right)^{ - 1} T^{T}\Sigma ^{ - 1} N\left( {E - m} \right) $$
(2)

where Σ is a block-diagonal matrix whose diagonal blocks are the covariance matrices of the corresponding Gaussian components of the UBM, and N and E are the zeroth- and first-order Baum-Welch (BW) statistics matrices, respectively. Given an utterance \( {\text{X}} = \left\{ {x_{1} ,x_{2} , \ldots ,x_{F} } \right\} \) of F frames, the zeroth- and first-order BW statistics of the i-th UBM component are computed as

$$ N_{i} = \sum\nolimits_{j = 1}^{F} {Pr\left( {i|x_{j} } \right)} $$
(3)
$$ E_{i} \left( X \right) = \frac{1}{{N_{i} }}\sum\nolimits_{j = 1}^{F} {Pr\left( {i|x_{j} } \right)} x_{j} $$
(4)

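where \( Pr\left( {i|x_{j} } \right) \) is the posterior probability that frame \( x_{j} \) was generated by the i-th Gaussian component, with \( \omega_{i} \) and \( p_{i} \) the mixture weight and density of component i and C the number of UBM components: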

$$ Pr\left( {i|x_{j} } \right) = \frac{{\omega_{i} p_{i} \left( {x_{j} } \right)}}{{\mathop\Sigma \nolimits_{k = 1}^{C} \omega_{k} p_{k} \left( {x_{j} } \right)}} $$
(5)
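
The statistics in Eqs. (3)–(5) and the point estimate in Eq. (2) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes diagonal UBM covariances, a TV matrix `T` of shape (C·D, R), and toy random parameters; the function names `bw_statistics` and `extract_ivector` are hypothetical.

```python
import numpy as np


def bw_statistics(X, weights, means, variances):
    """Return zeroth-order counts N (C,) and normalized first-order stats E (C, D)."""
    # Log of w_i * p_i(x_j) for every frame j and diagonal-covariance component i.
    diff = X[:, None, :] - means[None, :, :]                        # (F, C, D)
    log_gauss = -0.5 * (np.sum(np.log(2.0 * np.pi * variances), axis=1)
                        + np.sum(diff * diff / variances, axis=2))  # (F, C)
    log_joint = np.log(weights) + log_gauss
    # Eq. (5): normalize over components (row maximum subtracted for stability).
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    post /= post.sum(axis=1, keepdims=True)
    N = post.sum(axis=0)                                            # Eq. (3)
    E = (post.T @ X) / np.maximum(N, 1e-10)[:, None]                # Eq. (4)
    return N, E


def extract_ivector(N, E, means, variances, T):
    """Posterior mean of w, Eq. (2), assuming diagonal UBM covariances."""
    D = means.shape[1]
    R = T.shape[1]
    n_diag = np.repeat(N, D)                  # diagonal of N expanded to length C*D
    sigma_inv = 1.0 / variances.reshape(-1)   # diagonal of Sigma^{-1}
    centered = (E - means).reshape(-1)        # (E - m) as a super-vector
    Tt_sigma_inv = T.T * sigma_inv            # T^T Sigma^{-1}, shape (R, C*D)
    precision = np.eye(R) + (Tt_sigma_inv * n_diag) @ T
    return np.linalg.solve(precision, Tt_sigma_inv @ (n_diag * centered))


# Toy usage with random parameters (all values are illustrative assumptions).
rng = np.random.default_rng(0)
C, D, R, F = 4, 3, 2, 50
weights = np.full(C, 1.0 / C)
means = rng.normal(size=(C, D))
variances = np.ones((C, D))
T = 0.1 * rng.normal(size=(C * D, R))
X = rng.normal(size=(F, D))

N, E = bw_statistics(X, weights, means, variances)
w = extract_ivector(N, E, means, variances, T)
```

Because each frame's posteriors sum to one, the entries of N sum to the number of frames F, which is why short utterances yield small zeroth-order statistics.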

2.2 Linear Discriminant Analysis (LDA)

After i-vector extraction, linear discriminant analysis (LDA) is used to compensate for within-class variations and to reduce the dimensionality prior to probabilistic linear discriminant analysis (PLDA) modeling. LDA simultaneously maximizes the between-class variability and minimizes the within-class variability by maximizing the following objective function:

$$ {\text{J}}\left( v \right) = \frac{{v^{T}\Sigma _{b} v}}{{v^{T}\Sigma _{w} v}} $$
(6)

where v is a projection direction, and \( \Sigma _{b} \) and \( \Sigma _{w} \) are the between-class and within-class scatter matrices, respectively, which are given by

$$ \Sigma _{b} = \sum\nolimits_{s = 1}^{S} {n_{s} \left( {{\bar{\text{x}}}_{s} - {\bar{\text{x}}}} \right)\left( {{\bar{\text{x}}}_{s} - {\bar{\text{x}}}} \right)^{T} } $$
(7)
$$ \Sigma _{w} = \sum\nolimits_{s = 1}^{S} {\sum\nolimits_{i = 1}^{{n_{s} }} {\left( {{\text{x}}_{i}^{s} - {\bar{\text{x}}}_{s} } \right)\left( {{\text{x}}_{i}^{s} - {\bar{\text{x}}}_{s} } \right)^{T} } } $$
(8)

where \( {\text{S}} \) is the total number of speakers, \( n_{s} \) is the number of utterances from speaker s, \( {\bar{\text{x}}}_{s} \) is the average of the i-vectors from speaker s, and \( {\bar{\text{x}}} \) is the average of all i-vectors, defined as follows

$$ {\bar{\text{x}}}_{s} = \frac{1}{{n_{s} }}\sum\nolimits_{i = 1}^{{n_{s} }} {x_{i}^{s} } $$
(9)
$$ {\bar{\text{x}}} = \frac{1}{N}\sum\nolimits_{s = 1}^{S} {\sum\nolimits_{i = 1}^{{n_{s} }} {x_{i}^{s} } } $$
(10)

where N is the total number of utterances.

The LDA projection matrix is found by solving the following eigenvalue problem:

$$ \Sigma _{b} v = \lambda \Sigma _{w} v $$
(11)

where λ is the eigenvalue associated with eigenvector v. The projection matrix A is formed from the k eigenvectors corresponding to the k largest eigenvalues:

$$ {\text{A}} = \left[ {v_{1} ,v_{2} \ldots v_{k} } \right] $$
(12)

Finally, the LDA compensated i-vectors are calculated as

$$ x_{LDA} = A^{T} x $$
(13)
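
The LDA training steps in Eqs. (7)–(13) can be sketched as below. This is an illustrative NumPy version on synthetic data, assuming the within-class scatter matrix is positive definite; the generalized eigenproblem is solved by Cholesky whitening, which is one of several equivalent choices, and the function names are hypothetical.

```python
import numpy as np


def scatter_matrices(X, labels):
    """Between-class and within-class scatter matrices, Eqs. (7) and (8)."""
    d = X.shape[1]
    mean_all = X.mean(axis=0)                     # Eq. (10)
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for s in np.unique(labels):
        X_s = X[labels == s]
        mean_s = X_s.mean(axis=0)                 # Eq. (9)
        diff = (mean_s - mean_all)[:, None]
        S_b += len(X_s) * (diff @ diff.T)
        centered = X_s - mean_s
        S_w += centered.T @ centered
    return S_b, S_w


def lda_projection(S_b, S_w, k):
    """Top-k generalized eigenvectors of S_b v = lambda S_w v, Eqs. (11) and (12)."""
    L = np.linalg.cholesky(S_w)                   # requires S_w positive definite
    L_inv = np.linalg.inv(L)
    M = L_inv @ S_b @ L_inv.T                     # symmetric whitened problem
    eigvals, U = np.linalg.eigh((M + M.T) / 2.0)
    order = np.argsort(eigvals)[::-1][:k]         # largest eigenvalues first
    return L_inv.T @ U[:, order]                  # map back to the original space


# Toy usage: 3 synthetic "speakers" with 20 five-dimensional vectors each.
rng = np.random.default_rng(0)
speaker_means = 3.0 * rng.normal(size=(3, 5))
labels = np.repeat(np.arange(3), 20)
X = speaker_means[labels] + rng.normal(size=(60, 5))

S_b, S_w = scatter_matrices(X, labels)
A = lda_projection(S_b, S_w, k=2)
X_lda = X @ A                                     # Eq. (13) applied to every row
```

With S speakers the between-class scatter has rank at most S − 1, so k should not exceed that value for the projection to be informative.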

2.3 Probabilistic Linear Discriminant Analysis (PLDA)

Apart from compensating for within-class variations in the i-vector space by subspace transformation, probabilistic linear discriminant analysis (PLDA) is widely used to remove redundant information, such as channel effects, from i-vectors. Here, the generative model for the length-normalized i-vector from the j-th session of speaker i can be expressed as

$$ {\text{x}}_{i,j} = \mu + {\text{V}}z_{i} + \varepsilon_{i,j} $$
(14)

where μ is the mean of i-vectors, V defines the eigen-voice subspace, \( z_{i} \) is the speaker factor, and \( \varepsilon_{i,j} \) is the residual term.

The verification score of the PLDA system is given by the batch likelihood ratio. For projected enrollment and test i-vectors, \( z_{target} \) and \( z_{test} \), the batch likelihood ratio is computed as

$$ \Lambda \left( {z_{target} ,z_{test} } \right) = \log \frac{{p\left( {z_{target} ,z_{test} |H_{1} } \right)}}{{p\left( {z_{target} |H_{0} } \right)p\left( {z_{test} |H_{0} } \right)}} $$
(15)

where \( H_{1} \) denotes the hypothesis that i-vectors belong to the same speaker and \( H_{0} \) denotes the hypothesis that they are from different speakers. Figure 1 shows the process of calculating scores from the enrollment and test utterance in our i-vector/PLDA ASV system.

Fig. 1. Block diagram of i-vector/PLDA ASV system
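
To make the likelihood ratio in Eq. (15) concrete, the sketch below scores a pair of vectors under the model of Eq. (14). It assumes the Gaussian form of PLDA, with a standard normal speaker factor and a zero-mean Gaussian residual of covariance `Sigma_eps`; these distributional assumptions, the toy parameters, and the function names are illustrative and are not taken from the paper.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (x.shape[0] * np.log(2.0 * np.pi) + logdet + quad)


def plda_llr(x_target, x_test, mu, V, Sigma_eps):
    """Log-likelihood ratio of same-speaker (H1) versus different-speaker (H0)."""
    B = V @ V.T                 # between-speaker covariance implied by z ~ N(0, I)
    total = B + Sigma_eps       # marginal covariance of a single i-vector
    # Under H1 the two vectors share z, so their cross-covariance is B.
    joint_cov = np.block([[total, B], [B, total]])
    joint_mean = np.concatenate([mu, mu])
    log_h1 = gaussian_logpdf(np.concatenate([x_target, x_test]), joint_mean, joint_cov)
    # Under H0 the two vectors are independent draws from the marginal.
    log_h0 = gaussian_logpdf(x_target, mu, total) + gaussian_logpdf(x_test, mu, total)
    return log_h1 - log_h0


# Toy usage with hand-picked parameters (illustrative only).
mu = np.zeros(3)
V = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
Sigma_eps = 0.1 * np.eye(3)
u = V @ np.array([1.0, -1.0])

score_same = plda_llr(mu + u, mu + u, mu, V, Sigma_eps)
score_diff = plda_llr(mu + u, mu - u, mu, V, Sigma_eps)
```

A pair of vectors that agree along the speaker subspace receives a higher score than a pair that points in opposite directions, which is the behavior a verification threshold relies on.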

3 Proposed System Modification

3.1 Analysis of I-Vector Estimation for Short Utterance

In i-vector systems, the test utterance and enrollment utterance(s) are represented by test and enrollment i-vectors extracted with the pre-trained UBM and TV matrix. ASV is then performed by comparing the test i-vector with the enrollment i-vector(s) of the claimed speaker to produce an accept or reject decision. Although the speech duration requirement can usually be met at the enrollment stage, it may not be possible to meet it at the verification stage. This seriously limits the deployment of ASV systems in real-world applications.

To better understand the effect of test duration on system performance, we analyze the i-vector extraction pipeline in detail. With a short utterance, the BW statistics are estimated from too few frames, which increases their uncertainty and in turn makes the i-vector estimate uncertain. In i-vector systems, the BW statistics carry all of the information extracted from a test segment [7, 11]. In particular, the zeroth-order BW statistics determine the covariance matrix of the posterior distribution of w given the utterance:

$$ w_{\Sigma } = \left( {I + T^{T}\Sigma ^{ - 1} NT} \right)^{ - 1} $$
(16)

where \( w_{\Sigma } \) is the covariance of the estimated i-vector, T is the TV matrix, Σ is the UBM covariance, and N is a block-diagonal matrix whose diagonal blocks hold the zeroth-order BW statistics of the corresponding Gaussian components of the UBM. Since the UBM and TV matrix are pre-trained on a large quantity of data from different speakers, the uncertainty in the i-vector estimate for a short test segment comes from the BW statistics: the fewer frames there are, the smaller the entries of N and the larger the posterior covariance.
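
The dependence of Eq. (16) on utterance length can be checked numerically. The sketch below is a toy illustration under stated assumptions: diagonal UBM covariances, a random TV matrix, frames spread evenly over the components, and 100 frames per second (matching a 10 ms frame shift); the function name `posterior_covariance` is hypothetical.

```python
import numpy as np


def posterior_covariance(N, variances, T):
    """Posterior covariance of w, Eq. (16), for diagonal UBM covariances."""
    D = variances.shape[1]
    R = T.shape[1]
    n_diag = np.repeat(N, D)                    # diagonal of N, length C*D
    sigma_inv = 1.0 / variances.reshape(-1)     # diagonal of Sigma^{-1}
    precision = np.eye(R) + (T.T * (sigma_inv * n_diag)) @ T
    return np.linalg.inv(precision)


rng = np.random.default_rng(0)
C, D, R = 8, 3, 4
variances = np.ones((C, D))
T = 0.1 * rng.normal(size=(C * D, R))
occupancy = np.full(C, 1.0 / C)                 # assumed even split of frames

cov_2s = posterior_covariance(200 * occupancy, variances, T)     # about 2 s
cov_10s = posterior_covariance(1000 * occupancy, variances, T)   # about 10 s
print(np.trace(cov_2s), np.trace(cov_10s))
```

The trace of the posterior covariance is larger for the 2 s case than for the 10 s case, reflecting the greater uncertainty of i-vectors estimated from short segments.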

3.2 Incorporating Historical Test Information in I-Vector Extraction

To improve the i-vector estimate, we propose a new method that adds historical test information to the BW statistics computation. Rather than using only the current test utterance to compute the BW statistics, we also exploit weighted statistics from historical test utterances to provide additional information. We define the weight \( \gamma_{i} \) as the estimated probability that the current test utterance and historical test utterance i belong to the same speaker. The BW statistics used to extract the current test i-vector are then given by

$$ N = N_{c} + \sum\nolimits_{i} {\gamma_{i} N_{i} } $$
(17)
$$ E = E_{c} + \sum\nolimits_{i} {\gamma_{i} E_{i} } $$
(18)

where \( N_{c} \) and \( E_{c} \) are the BW statistics computed from the current test utterance, \( N_{i} \) and \( E_{i} \) are the BW statistics computed from the i-th historical test utterance, and \( \gamma_{i} \) is the weight assigned to that historical test.
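
Eqs. (17) and (18) amount to a weighted accumulation, which can be sketched as follows. The function name `combine_bw_statistics`, the tuple layout of `history`, and the toy values are assumptions made for illustration; the first-order statistics are added exactly as the equations are written.

```python
import numpy as np


def combine_bw_statistics(N_c, E_c, history):
    """Add weighted historical statistics to the current ones, Eqs. (17) and (18).

    `history` is a list of (gamma_i, N_i, E_i) tuples, one per stored past test.
    """
    N = np.array(N_c, dtype=float)
    E = np.array(E_c, dtype=float)
    for gamma_i, N_i, E_i in history:
        N += gamma_i * np.asarray(N_i, dtype=float)
        E += gamma_i * np.asarray(E_i, dtype=float)
    return N, E


# Toy usage: two UBM components, one-dimensional features.
N_c = [1.0, 2.0]
E_c = [[1.0], [2.0]]
history = [
    (0.5, [2.0, 2.0], [[2.0], [4.0]]),   # likely the same speaker as the current test
    (0.0, [9.0, 9.0], [[9.0], [9.0]]),   # judged a different speaker, so ignored
]
N, E = combine_bw_statistics(N_c, E_c, history)
```

A historical test with weight zero leaves the current statistics untouched, while a weight of one contributes its statistics in full.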

To compute the weight \( \gamma_{i} \) for a historical test utterance, we use the world MAP estimator, which was proposed in [12] and subsequently applied to unsupervised GMM adaptation in [13, 14]. We first train a two-class Bayesian classifier based on two score models, one for target scores and one for non-target scores, learned from a development set [14]. Each score distribution is modeled by a 12-component GMM. Given the target and non-target score distributions and their priors, we can compute the posterior probability that a test is a target. Specifically, the ASV system outputs a raw score for every test utterance it encounters. Given the raw score of the current test, \( s_{0} \), the posterior probability that this test belongs to the target speaker is defined as

$$ P\left( {tar|s_{0} } \right) = \frac{{P\left( {s_{0} |tar} \right)P_{tar} }}{{P\left( {s_{0} |tar} \right)P_{tar} + P\left( {s_{0} |non} \right)P_{non} }} $$
(19)

where \( P\left( {s_{0} |tar} \right) \) and \( P\left( {s_{0} |non} \right) \) are the likelihoods of the score under the target and non-target score distributions, and \( P_{tar} \) and \( P_{non} \) are the prior probabilities of target and non-target tests, respectively. Then, for historical test utterance i with raw score \( s_{i} \), we compute the weight \( \gamma_{i} \) as follows:

$$ \gamma_{i} = P\left( {tar|s_{0} } \right)P\left( {tar|s_{i} } \right) + \left[ {1 - P\left( {tar|s_{0} } \right)} \right]\left[ {1 - P\left( {tar|s_{i} } \right)} \right] $$
(20)

Note that all scores used are normalized. The proposed method does not require access to the historical test utterances or their i-vectors. Only the raw score and the corresponding BW statistics of each historical test are needed, which does not place a heavy burden on real-life applications. Figure 2 shows the flow diagram of the proposed method.

Fig. 2. Flow diagram of the proposed i-vector extraction method
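
The posterior in Eq. (19) and the weight in Eq. (20) can be sketched with the standard library alone. The score models below are single-component stand-ins with made-up parameters (the paper uses 12-component GMMs learned from development scores), and the function names are hypothetical; the fallback to the prior when both likelihoods vanish is an assumption of this sketch.

```python
import math


def gmm_pdf(s, components):
    """Density of a 1-D GMM; `components` is a list of (weight, mean, std) tuples."""
    return sum(
        w * math.exp(-0.5 * ((s - m) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        for w, m, sd in components
    )


def target_posterior(s, tar_gmm, non_gmm, p_tar=0.1):
    """Posterior probability of a target trial given its score, Eq. (19)."""
    num = gmm_pdf(s, tar_gmm) * p_tar
    den = num + gmm_pdf(s, non_gmm) * (1.0 - p_tar)
    # Assumption: fall back to the prior if the score is far outside both models.
    return num / den if den > 0.0 else p_tar


def history_weight(s_current, s_hist, tar_gmm, non_gmm, p_tar=0.1):
    """Weight of a historical test relative to the current one, Eq. (20)."""
    p0 = target_posterior(s_current, tar_gmm, non_gmm, p_tar)
    pi = target_posterior(s_hist, tar_gmm, non_gmm, p_tar)
    return p0 * pi + (1.0 - p0) * (1.0 - pi)


# Hypothetical score models: targets score around +2, non-targets around -2.
tar_gmm = [(1.0, 2.0, 1.0)]
non_gmm = [(1.0, -2.0, 1.0)]

w_agree = history_weight(3.0, 3.0, tar_gmm, non_gmm)      # both look like targets
w_disagree = history_weight(3.0, -3.0, tar_gmm, non_gmm)  # one target, one non-target
```

The weight is close to one when the two posteriors agree and close to zero when they disagree, so only consistent historical tests contribute appreciably to Eqs. (17) and (18).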

3.3 Modified LDA for Domain Mismatch Compensation

One of the keys to the success of the i-vector/PLDA framework is the use of a large quantity of previously collected speech data to characterize and model speaker and channel variability. However, it is unrealistic to assume that such a large set of development data exists for every domain of interest. This is especially true for PLDA modeling, which needs labeled speech data, whereas the training of the UBM and TV matrix needs only unlabeled data. Studies have found that when PLDA is trained using out-domain data, ASV system performance degrades rapidly due to the mismatch between development and evaluation data [15].

The conventional LDA projection fails to compensate for this domain variability because it captures the domain variability in the between-class scatter matrix. Instead of minimizing the domain mismatch in the projected i-vectors, LDA maximizes domain variability when training the projection matrix. To address this problem, we modify LDA training to separate domain variability from the scatter matrix estimation. For simplicity, we assume that speakers do not overlap across domains. In our method, the new between-class and within-class scatter matrices are defined as

$$ \Sigma_{b}^{\prime } = \sum\nolimits_{s = 1}^{{S_{out} }} {n_{s} \left( {{\bar{\text{x}}}_{s} - {\bar{\text{x}}}_{out} } \right)\left( {{\bar{\text{x}}}_{s} - {\bar{\text{x}}}_{out} } \right)^{T} } + \sum\nolimits_{s = 1}^{{S_{in} }} {n_{s} \left( {{\bar{\text{x}}}_{s} - {\bar{\text{x}}}_{in} } \right)\left( {{\bar{\text{x}}}_{s} - {\bar{\text{x}}}_{in} } \right)^{T} } $$
(21)
$$ \Sigma_{w}^{\prime } = \sum\nolimits_{s = 1}^{{S_{out} }} {\sum\nolimits_{i = 1}^{{n_{s} }} {\left( {{\text{x}}_{i}^{s} - {\bar{\text{x}}}_{s} } \right)\left( {{\text{x}}_{i}^{s} - {\bar{\text{x}}}_{s} } \right)^{T} } } + \sum\nolimits_{s = 1}^{{S_{in} }} {\sum\nolimits_{i = 1}^{{n_{s} }} {\left( {{\text{x}}_{i}^{s} - {\bar{\text{x}}}_{s} } \right)\left( {{\text{x}}_{i}^{s} - {\bar{\text{x}}}_{s} } \right)^{T} } } $$
(22)

where \( S_{out} \) and \( S_{in} \) are the numbers of out-domain and in-domain speakers, and \( {\bar{\text{x}}}_{out} \) and \( {\bar{\text{x}}}_{in} \) are the averages of the out-domain and in-domain i-vectors, respectively. In each equation, the first sum runs over the out-domain speakers and the second over the in-domain speakers. We also define the inter-domain variability matrix as

$$ \Sigma_{d} = S_{out} \left( {{\bar{\text{x}}}_{out} - \bar{x}} \right)\left( {{\bar{\text{x}}}_{out} - \bar{x}} \right)^{T} + S_{in} \left( {{\bar{\text{x}}}_{in} - \bar{x}} \right)\left( {{\bar{\text{x}}}_{in} - \bar{x}} \right)^{T} $$
(23)

Finally, the modified LDA projection matrix can be calculated by maximizing the following objective function,

$$ {\text{J}}\left( v \right) = \frac{{v^{T} \Sigma_{b}^{\prime } v}}{{v^{T} \Sigma_{wd} v}} $$
(24)

where v is a projection direction and \( \Sigma_{wd} = \Sigma_{w}^{\prime } + \Sigma_{d} \). By maximizing this objective function, we simultaneously maximize the between-class variability and minimize both the within-class variability and the domain variability.
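
A possible NumPy sketch of the modified LDA is given below. It rests on two assumptions that should be checked against the intended method: speaker means in each domain are measured from that domain's own mean when building the between-class scatter, and the denominator matrix is taken as the sum of the within-class scatter and the inter-domain variability matrix. The synthetic data, the function names, and the Cholesky-based eigen-solver are illustrative choices.

```python
import numpy as np


def domain_scatter(X, labels, center):
    """Between- and within-class scatter of one domain, measured from `center`."""
    d = X.shape[1]
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for s in np.unique(labels):
        X_s = X[labels == s]
        mean_s = X_s.mean(axis=0)
        diff = (mean_s - center)[:, None]
        S_b += len(X_s) * (diff @ diff.T)
        centered = X_s - mean_s
        S_w += centered.T @ centered
    return S_b, S_w


def modified_lda(X_out, y_out, X_in, y_in, k):
    """Projection maximizing Eq. (24) under the assumptions stated above."""
    mean_out, mean_in = X_out.mean(axis=0), X_in.mean(axis=0)
    mean_all = np.vstack([X_out, X_in]).mean(axis=0)
    Sb_out, Sw_out = domain_scatter(X_out, y_out, mean_out)
    Sb_in, Sw_in = domain_scatter(X_in, y_in, mean_in)
    S_b = Sb_out + Sb_in                           # Eq. (21)
    S_w = Sw_out + Sw_in                           # Eq. (22)
    d_out = (mean_out - mean_all)[:, None]
    d_in = (mean_in - mean_all)[:, None]
    S_d = (len(np.unique(y_out)) * (d_out @ d_out.T)
           + len(np.unique(y_in)) * (d_in @ d_in.T))   # Eq. (23)
    S_wd = S_w + S_d                               # assumed combination
    # Solve S_b v = lambda S_wd v by whitening with the Cholesky factor of S_wd.
    L_inv = np.linalg.inv(np.linalg.cholesky(S_wd))
    M = L_inv @ S_b @ L_inv.T
    eigvals, U = np.linalg.eigh((M + M.T) / 2.0)
    order = np.argsort(eigvals)[::-1][:k]
    return L_inv.T @ U[:, order], S_b, S_wd


# Toy usage: the in-domain data are shifted along the first axis.
rng = np.random.default_rng(0)
d = 4


def make_domain(n_speakers, shift):
    y = np.repeat(np.arange(n_speakers), 15)
    means = 2.0 * rng.normal(size=(n_speakers, d))
    return means[y] + rng.normal(size=(len(y), d)) + shift, y


X_out, y_out = make_domain(4, np.zeros(d))
X_in, y_in = make_domain(2, np.array([3.0, 0.0, 0.0, 0.0]))
A, S_b, S_wd = modified_lda(X_out, y_out, X_in, y_in, k=2)
```

Because the domain offset enters only the denominator through the inter-domain term, directions that mainly separate the two domains are penalized rather than rewarded.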

4 Experimental Setups

4.1 Speech Data and Acoustic Features

Audio data are collected with an integrated microphone in the power grid dispatching hall and in the dispatcher training simulator (DTS) room. All speakers are male. The raw data are automatically saved to a memory card every 3 min. The two locations differ in room size, background noise, telephone channels, and so on. Figure 3 shows the different environmental settings of audio data collection. From the raw audio data, 19-dimensional Mel-frequency cepstral coefficients (MFCCs) together with an energy coefficient are extracted and appended with delta and delta-delta features to form a 60-dimensional vector. The vector is extracted every 10 ms using a Hamming window of 20 ms. Silence frames are detected and discarded by an energy-based voice activity detector (VAD).

Fig. 3. Different locations of audio data collection. (a) Power grid dispatching hall. (b) DTS room

Unless stated otherwise, we partition the data gathered from the DTS room into two subsets, using one as development data and the other as evaluation data. To carry out experiments under short utterance conditions, the original speech utterances are cut into short test segments containing 2 s, 5 s, or 10 s of active frames. For each duration, we randomly select the initial frame and create 500 truncated segments. To test the effectiveness of the modified LDA in ASV tasks with limited target domain data, we frame domain mismatch compensation as reducing the mismatch between data collected at different locations. We regard speech utterances collected from the power grid dispatching hall as in-domain data and utterances collected from the DTS room as out-domain data. In this case, the speech files from the DTS room are used as development data and the speech files from the power grid dispatching hall are used as evaluation data.
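
The segment truncation described above can be sketched as follows, assuming that durations are counted in VAD-retained frames at the 10 ms frame shift of Sect. 4.1 (so 2 s corresponds to 200 frames). The function name `truncate_active_frames` and the integer stand-ins for feature vectors are hypothetical.

```python
import random

FRAME_SHIFT_S = 0.010  # 10 ms frame shift, as in the feature extraction setup


def truncate_active_frames(active_frames, duration_s, rng):
    """Cut a contiguous run of active frames of the requested duration.

    `active_frames` holds the frames kept by the VAD; the start is chosen at random.
    """
    n_keep = int(round(duration_s / FRAME_SHIFT_S))
    if len(active_frames) < n_keep:
        raise ValueError("utterance has too few active frames for this duration")
    start = rng.randrange(len(active_frames) - n_keep + 1)
    return active_frames[start:start + n_keep]


# Toy usage: integers stand in for 60-dimensional feature vectors.
rng = random.Random(0)
active_frames = list(range(18000))   # e.g. 3 min of speech if every frame were active
segments = {
    duration: truncate_active_frames(active_frames, duration, rng)
    for duration in (2, 5, 10)
}
```

Repeating the call with fresh random starts would yield the pool of truncated segments for each duration.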

4.2 I-Vector Extraction and PLDA Modeling

To extract i-vectors, we train a UBM with 512 Gaussian components on the development data and use it to estimate the BW statistics. The TV subspace has a dimension of 400 and is trained on the same development data. For both LDA and modified LDA, the reduced dimension is set to 200. Length normalization is applied to the LDA-projected i-vectors to make their distribution more Gaussian. A PLDA model with 150 latent variables is then trained. We train the World MAP estimator on the development data. The prior probabilities used are 0.1 for target and 0.9 for non-target.

4.3 Evaluation Criteria

There are two kinds of errors in an ASV system: a false rejection occurs when a genuine speaker is incorrectly rejected, and a false alarm occurs when an impostor is accepted. In our experiments, system performance is evaluated using the equal error rate (EER), the error rate at the operating point where the false rejection rate and the false alarm rate are equal. We also report results in terms of the minimum detection cost function (minDCF).
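
Both criteria can be computed from lists of target and non-target scores, as in the sketch below. It sweeps every observed score as a threshold, which is a simple approximation of the EER on finite data. The detection cost parameters are not given in the text, so the values used in the example (`p_target=0.01`, unit costs) are assumptions, as are the function names and the toy scores.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """False rejection and false alarm rates when accepting scores >= threshold."""
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return frr, far


def candidate_thresholds(target_scores, nontarget_scores):
    """Every observed score, plus +inf so that rejecting everything is considered."""
    return sorted(set(target_scores) | set(nontarget_scores)) + [float("inf")]


def equal_error_rate(target_scores, nontarget_scores):
    """Error rate at the threshold where FRR and FAR are closest."""
    rates = [error_rates(target_scores, nontarget_scores, t)
             for t in candidate_thresholds(target_scores, nontarget_scores)]
    frr, far = min(rates, key=lambda r: abs(r[0] - r[1]))
    return 0.5 * (frr + far)


def min_dcf(target_scores, nontarget_scores, p_target, c_miss, c_fa):
    """Minimum over thresholds of the (unnormalized) detection cost."""
    return min(
        c_miss * frr * p_target + c_fa * far * (1.0 - p_target)
        for frr, far in (error_rates(target_scores, nontarget_scores, t)
                         for t in candidate_thresholds(target_scores, nontarget_scores))
    )


# Toy scores (illustrative only).
targets = [1.0, 3.0, 4.0, 5.0]
nontargets = [0.0, 1.5, 2.0, 3.5]
eer = equal_error_rate(targets, nontargets)
dcf = min_dcf(targets, nontargets, p_target=0.01, c_miss=1.0, c_fa=1.0)
```

On these toy scores the two error rates meet at a threshold of 3.0, where one of four targets is rejected and one of four non-targets is accepted.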

5 Results and Discussions

5.1 Baseline ASV System Performance

In the first series of experiments, we compare the performance of the baseline ASV system for different test durations. The experiments are conducted on speech files collected from the DTS room. We use 3 min of raw speech for enrollment and three types of truncated segments (containing 2 s, 5 s, and 10 s of active frames, respectively) for test i-vector extraction. The results are presented in Fig. 4.

Fig. 4. Baseline ASV system performance for different test duration conditions

It can be observed that system performance in terms of both EER and minDCF degrades monotonically as the speech duration decreases. When the ASV system is presented with 2 s short utterances, the EER and minDCF increase by 182% and 142%, respectively, compared with 10 s test utterances. This illustrates the need for the proposed i-vector extraction method.

Next, we use speech files from the power grid dispatching hall as evaluation data. As before, 3 min of raw speech is used for enrollment and 2 s, 5 s, and 10 s truncated speech segments are used for testing. This series of experiments aims to show the effect of in-domain development data on the performance of the baseline system. The results are presented in Fig. 5.

Fig. 5. Performance comparison using in-domain and out-domain development data

As shown in Fig. 5, there is a gap in performance on power grid dispatching hall enroll/test set when hyper-parameters are trained with development data gathered from DTS room. In Sect. 5.3, we employ modified LDA to reduce this performance degradation.

5.2 Proposed Method for I-Vector Extraction

In this section, we conduct experiments to test the effectiveness of incorporating historical test information in short utterance i-vector extraction. We use speech files collected from DTS room as both development and evaluation data. The results are presented in Table 1.

Table 1. Performance comparison of baseline system and system incorporating historical test information (proposed-1) in i-vector extraction

The experimental results reported in Table 1 show that, when enough historical information is available, the proposed method achieves a noticeable improvement in EER and minDCF over the baseline i-vector system under the different short duration conditions. We observe that the relative improvement increases as the test utterance duration decreases. This suggests that incorporating historical information is particularly useful for short utterances.

To analyze the behavior of our method more precisely, we investigate the system performance in terms of EER for each newly added test utterance. We run the experiment on 10 random draws from the entire pool of truncated speech segments and evaluate each draw individually. The results are averaged over the 10 draws to reduce sampling variability. We notice that a minimum amount of historical data must be available before the proposed system obtains a stable gain. The average EER for the 10 s test utterance condition is presented in Fig. 6. The patterns for the 2 s and 5 s utterance conditions are similar.

Fig. 6. Average EER of the 10 s test utterance condition

5.3 Modified LDA

As shown in Sect. 5.1, when the ASV system is developed using data from outside the target domain, performance suffers significantly because of the mismatch between development and evaluation data. To investigate this situation, we use speech files collected from the DTS room as development data and speech files collected from the power grid dispatching hall as evaluation data. We replace the conventional LDA in the baseline system with the modified LDA projection. System performance in terms of EER and minDCF is presented in Table 2.

Table 2. Performance comparison of baseline system and system with modified LDA (proposed-2)

From Table 2, a relative gain of at least 16.4% in EER and 18.3% in minDCF is observed after applying the modified LDA. In terms of bridging the performance gap between a matched baseline (DTS room data for both development and evaluation) and a mismatched baseline (DTS room data for development, power grid dispatching hall data for evaluation), we are able to recover at least 63% of the gap under the different duration conditions. This demonstrates that the modified LDA is quite successful in reducing the amount of in-domain development data required.
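
One common way to quantify the recovered performance gap is shown below. The definition and the symbols are assumptions introduced here for illustration: \( {\text{EER}}_{mis} \) denotes the mismatched baseline, \( {\text{EER}}_{mat} \) the matched baseline, and \( {\text{EER}}_{mod} \) the mismatched system with modified LDA.

$$ {\text{gap recovered}} = \frac{{{\text{EER}}_{mis} - {\text{EER}}_{mod} }}{{{\text{EER}}_{mis} - {\text{EER}}_{mat} }} \times 100\% $$

Under this reading, a value of 100% would mean that the modified LDA brings the mismatched system all the way back to matched performance.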

Finally, we conduct experiments on a system that integrates the proposed i-vector extraction method and the modified LDA. We develop the system on speech data collected from the DTS room and evaluate its performance on data collected from the power grid dispatching hall. From Table 3, it can be observed that the combined approach achieves a further improvement. Compared with the baseline, it shows at least 20% improvement across the different test segment durations.

Table 3. Performance of system using combined approach (proposed-3)

6 Conclusions and Future Work

The performance of i-vector/PLDA ASV systems depends on a large quantity of in-domain development data for PLDA training. During evaluation, it is also critical that the speech duration is long enough to reduce the uncertainty in i-vector estimation. In many practical applications, speaker verification performance suffers because it is difficult to collect a sufficient amount of speech data. In this study, we propose modifications to the i-vector ASV system to address performance degradation with limited voice data. With the aid of historical test information, we observe a relative improvement of 9.4% in EER for the 2 s test duration condition. When the system is trained on a mismatched development dataset, we are able to recover at least 63% of the performance gap using the modified LDA projection. The best performance is achieved with the combined method, where we obtain a relative improvement in the range of 20–29% over the baseline system.

Despite the promising results, some problems remain for future study. For example, the world MAP estimator currently falls back on the prior probabilities when a score is not covered by the score GMM training data. While we expect this situation to be rare, we intend to investigate its effect on system performance. In addition, speakers can overlap across domains, and the data in one domain can be multi-modal. Such multi-modality can lead to misrepresentation of the speaker and non-speaker information [16]. We intend to extend our modified LDA method to compensate for speaker population differences among different portions of the training data. We also intend to investigate the relationship between system performance and the amount of in-domain data used for LDA training. Finally, we intend to explore applying the proposed methods to deep neural network (DNN) based systems. Using a DNN instead of a GMM to derive speaker-specific information is a very promising direction.