
Multi-factor authentication model based on multipurpose speech watermarking and online speaker recognition

Multimedia Tools and Applications


In this paper, a Multi-Factor Authentication (MFA) method is developed by combining a Personal Identification Number (PIN), a One Time Password (OTP), and a speaker biometric through speech watermarks. For this purpose, multipurpose digital speech watermarking is applied to embed semi-fragile and robust watermarks simultaneously in the speech signal, providing tamper detection and proof of ownership, respectively. For the blind semi-fragile speech watermarking technique, the Discrete Wavelet Packet Transform (DWPT) and Quantization Index Modulation (QIM) are used to embed the watermark in the angle of the wavelet sub-bands where more speaker-specific information is available. For copyright protection of the speech, blind and robust speech watermarking is applied using DWPT and multiplication: the robust watermark is embedded by manipulating the amplitude of the wavelet sub-bands where less speaker-specific information is available. Experimental results on TIMIT, MIT, and MOBIO demonstrate a trade-off among the recognition performance of speaker recognition systems, robustness, and capacity, which is presented by various triangles. Furthermore, a threat model and attack analysis are used to evaluate the feasibility of the developed MFA model. Accordingly, the developed MFA model is able to enhance the security of systems against spoofing and communication attacks while improving recognition performance by addressing known problems and limitations.



References


  1. Akhaee MA, Kalantari NK, Marvasti F (2009) Robust multiplicative audio and speech watermarking using statistical modeling. In IEEE International Conference on Communications, ICC’09. 2009. IEEE

  2. Akhaee MA, Kalantari NK, Marvasti F (2010) Robust audio and speech watermarking using Gaussian and Laplacian modeling. Signal Process 90(8):2487–2497


  3. Al-Nuaimy W et al (2011) An SVD audio watermarking approach using chaotic encrypted images. Digit Sig Process 21(6):764–779


  4. Baroughi AF, Craver S (2014) Additive attacks on speaker recognition. In IS&T/SPIE Electronic imaging. International Society for Optics and Photonics

  5. Besacier L, Bonastre J-F, Fredouille C (2000) Localization and selection of speaker-specific information with statistical modeling. Speech Comm 31(2):89–106


  6. Bimbot F et al (2004) A tutorial on text-independent speaker verification. EURASIP J Appl Sig Process 2004:430–451


  7. Bolten JB (2003) E-authentication guidance for federal agencies. Office of Management and Budget, 2003

  8. Brookes M (2006) VOICEBOX: a speech processing toolbox for MATLAB

  9. Chaturvedi A, Mishra D, Mukhopadhyay S (2013) Improved biometric-based three-factor remote user authentication scheme with key agreement using smart card. In Information systems security, Springer, p 63–77

  10. Dehak N et al (2011) Front-end factor analysis for speaker verification. Audio Speech Lang Process IEEE Trans 19(4):788–798


  11. Faundez-Zanuy M, Hagmüller M, Kubin G (2006) Speaker verification security improvement by means of speech watermarking. Speech Comm 48(12):1608–1619


  12. Faundez-Zanuy M, Hagmüller M, Kubin G (2007) Speaker identification security improvement by means of speech watermarking. Pattern Recogn 40(11):3027–3034


  13. Garofolo JS, L.D. Consortium (1993) TIMIT: acoustic-phonetic continuous speech corpus, Linguistic Data Consortium

  14. Hinkley DV (1969) On the ratio of two correlated normal random variables. Biometrika 56(3):635–639


  15. Huber R, Stögner H, Uhl A (2011) Two-factor biometric recognition with integrated tamper-protection watermarking. In Communications and multimedia security, Springer

  16. Hyon S (2012) An investigation of dependencies between frequency components and speaker characteristics based on phoneme mean F-ratio contribution. In Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012. Asia-Pacific: IEEE

  17. Kenny P (2012) A small foot-print i-vector extractor. In Proc. Odyssey

  18. Khitrov M (2013) Talking passwords: voice biometrics for data access and security. Biom Technol Today 2013(2):9–11


  19. Kim J-J, Hong S-P (2011) A method of risk assessment for multi-factor authentication. J Inf Process Syst (JIPS) 7(1):187–198


  20. Kumar A, Lee HJ (2013) Multi-factor authentication process using more than one token with watermark security. In Future information communication technology and applications, Springer, p 579–587

  21. Li C-T, Hwang M-S (2010) An efficient biometrics-based remote user authentication scheme using smart cards. J Netw Comput Appl 33(1):1–5


  22. Li Q, Memon N, Sencar HT (2006) Security issues in watermarking applications-A deeper look. In Proceedings of the 4th ACM international workshop on Contents protection and security. ACM

  23. Lu X, Dang J (2008) An investigation of dependencies between frequency components and speaker characteristics for text-independent speaker identification. Speech Comm 50(4):312–322


  24. Mallat S (2008) A wavelet tour of signal processing: the sparse way. Academic press

  25. McCool C et al (2012) Bi-modal person recognition on a mobile phone: using mobile phone data. In Multimedia and Expo Workshops (ICMEW), 2012 IEEE International Conference on, IEEE

  26. Mohamed S et al (2013) A method for speech watermarking in speaker verification

  27. Nematollahi MA, Akhaee MA, Al-Haddad SAR, Gamboa-Rosales H (2015) Semi-fragile digital speech watermarking for online speaker recognition. EURASIP J Audio Speech Music Process 2015(1):1–15


  28. Nematollahi MA, Al-Haddad S (2015) Distant speaker recognition: an overview. Int J Humanoid Robot 12(03):1–45


  29. Nematollahi MA, Gamboa-Rosales H, Akhaee MA, Al-Haddad SAR (2015) Robust digital speech watermarking for online speaker recognition. Mathematical Problems in Engineering, 2015

  30. O’Gorman L (2003) Comparing passwords, tokens, and biometrics for user authentication. Proc IEEE 91(12):2021–2040


  31. Pathak MA, Raj B (2013) Privacy-preserving speaker verification and identification using gaussian mixture models. Audio Speech Lang Process IEEE Trans 21(2):397–406


  32. Reynolds DA (1995) Speaker identification and verification using Gaussian mixture speaker models. Speech Comm 17(1):91–108


  33. Reynolds DA, Quatieri TF, Dunn RB (2000) Speaker verification using adapted Gaussian mixture models. Digit Sig Process 10(1):19–41


  34. Roberts C (2007) Biometric attack vectors and defences. Comput Secur 26(1):14–25


  35. Sadjadi SO, Slaney M, Heck L (2013) MSR Identity toolbox v1.0: A MATLAB toolbox for speaker recognition research, IEEE

  36. Simon J (2012) DataHash

  37. Woo RH, Park A, Hazen TJ (2006) The MIT mobile device speaker verification corpus: data collection and preliminary experiments. In Speaker and Language Recognition Workshop, IEEE Odyssey 2006: The. 2006. IEEE

  38. Wu Z et al (2015) Spoofing and countermeasures for speaker verification: a survey. Speech Comm 66:130–153




The authors would like to thank the anonymous reviewers for their helpful comments on drafts of this paper.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Mohammad Ali Nematollahi.

Appendix A


The Discrete Fourier Transform (DFT) coefficients are assumed to follow a Weibull distribution, whereas the DWPT sub-band coefficients are assumed to follow a Generalized Gaussian Distribution (GGD) [2]. The GGD is defined as in Eq. (14), assuming a zero mean, \( \mu = 0 \), and a variance of \( \sigma_s^2 \).

$$ f_s\left(s;\mu,\sigma_s,v\right)=\frac{1}{2\,\Gamma\left(1+\frac{1}{v}\right)A\left(\sigma_s,v\right)}\exp\left\{-\left|\frac{s-\mu}{A\left(\sigma_s,v\right)}\right|^{v}\right\} \tag{14} $$

where Γ(·) is the Gamma function, \( \Gamma(x)=\int_0^{\infty}t^{x-1}e^{-t}\,dt\cong\sqrt{2\pi}\,x^{x-\frac{1}{2}}e^{-x} \), and v is the shape parameter of the distribution, which can be estimated from the statistical moments of the signal.
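The density in Eq. (14) can be evaluated numerically with a few lines of Python. The following is a minimal sketch, not the authors' implementation. It assumes the standard GGD scale factor \( A(\sigma_s,v)=\sqrt{\sigma_s^2\,\Gamma(1/v)/\Gamma(3/v)} \), which the text does not state explicitly but which is consistent with the moments used later in this appendix. The function names `ggd_scale`, `ggd_pdf`, and `gaussian_pdf` are hypothetical.

```python
import math


def ggd_scale(sigma_s: float, v: float) -> float:
    """Scale factor A(sigma_s, v); the standard GGD form is assumed here."""
    return math.sqrt(sigma_s ** 2 * math.gamma(1.0 / v) / math.gamma(3.0 / v))


def ggd_pdf(s: float, mu: float, sigma_s: float, v: float) -> float:
    """Evaluate the GGD density of Eq. (14) at the point s."""
    a = ggd_scale(sigma_s, v)
    normalizer = 2.0 * math.gamma(1.0 + 1.0 / v) * a
    return math.exp(-abs((s - mu) / a) ** v) / normalizer


def gaussian_pdf(s: float, mu: float, sigma: float) -> float:
    """Ordinary Normal density, used only as a sanity check for v = 2."""
    return math.exp(-0.5 * ((s - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
```

Under this assumption, a shape of v = 2 reduces the GGD to the Normal density and v = 1 gives the Laplacian density, which is a convenient check when fitting v to real sub-band data.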

If the watermarked speech signal passes through an AWGN channel, the signal at the receiver can be formulated as in Eqs. (15) and (16).

$$ r_i=\alpha\times s_i+n_i\quad \text{if}\ m_i=1 \tag{15} $$
$$ r_i=\frac{1}{\alpha}\times s_i+n_i\quad \text{if}\ m_i=0 \tag{16} $$

where \( n_i \) is the noise that contaminates the watermarked speech signal. To estimate the probability of the watermark bit when it is 1, Eq. (17) is expressed:

$$ \left.R\right|1=\frac{\sum_A\left(\alpha\times s_i+n_i\right)^4}{\sum_B\left(s_i+n_i\right)^4}\Rightarrow \left.R\right|1=\frac{\alpha^4\sum_A s_i^4+4\alpha^3\sum_A s_i^3n_i+6\alpha^2\sum_A s_i^2n_i^2+4\alpha\sum_A s_in_i^3+\sum_A n_i^4}{\sum_B s_i^4+4\sum_B s_i^3n_i+6\sum_B s_i^2n_i^2+4\sum_B s_in_i^3+\sum_B n_i^4} \tag{17} $$
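The channel model of Eqs. (15) and (16) and the ratio of Eq. (17) can be illustrated with a small simulation. This is a hedged sketch under stated assumptions: following Eq. (17), only the samples of set A are scaled (by α for bit 1, and by 1/α for bit 0 by analogy), while set B is left unscaled. The function name `detection_ratio` and the synthetic Gaussian frame are hypothetical stand-ins for real sub-band coefficients.

```python
import random


def detection_ratio(s_a, s_b, alpha, bit, sigma_n, rng):
    """Ratio of fourth-power sums over sets A and B after an AWGN channel.

    Set A is scaled by alpha (bit 1) or 1/alpha (bit 0); set B is unscaled.
    Zero-mean Gaussian noise with standard deviation sigma_n is added to both.
    """
    gain = alpha if bit == 1 else 1.0 / alpha
    numerator = sum((gain * s + rng.gauss(0.0, sigma_n)) ** 4 for s in s_a)
    denominator = sum((s + rng.gauss(0.0, sigma_n)) ** 4 for s in s_b)
    return numerator / denominator


rng = random.Random(0)
# Synthetic frame; reusing it for A and B makes the noise-free ratio exact.
frame = [rng.gauss(0.0, 1.0) for _ in range(512)]
ratio_bit1 = detection_ratio(frame, frame, alpha=1.1, bit=1, sigma_n=0.0, rng=rng)
ratio_bit0 = detection_ratio(frame, frame, alpha=1.1, bit=0, sigma_n=0.0, rng=rng)
```

With no noise and identical sets, the ratio equals α⁴ for bit 1 and α⁻⁴ for bit 0. Raising `sigma_n` spreads the two cases towards each other, which is the effect that the rest of the derivation quantifies.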

As seen, the sums in Eq. (17) affect the detection threshold. By the Central Limit Theorem (CLT), each sum in the numerator and denominator can be approximated by a Normal distribution. Because the mean μ is large and the speech frames are long, the Normal distribution generates almost only positive values, so it can also model terms such as \( \sum_A n_i^4 \), which are always positive. Equations (18) and (19) give the mean and variance of \( \sum s_i^4 \), respectively.

$$ E\left\{\sum s_i^4\right\}=\sum E\left\{s_i^4\right\}=M\mu_4 \tag{18} $$
$$ \begin{array}{l} \operatorname{var}\left(\sum s_i^4\right)=E\left\{\left(\sum s_i^4-M\mu_4\right)^2\right\}=E\left\{\left(\sum\left(s_i^4-\mu_4\right)\right)^2\right\}= \\ \sum E\left\{\left(s_i^4-\mu_4\right)^2\right\}=\sum\left(E\left\{s_i^8\right\}-\mu_4^2\right)=M\mu_8-M\mu_4^2 \end{array} \tag{19} $$

where M is the number of samples in each of the sets A and B. Applying the GGD moments for r = 4 and r = 8 gives Eqs. (20) and (21).

$$ \mu_4=\frac{\sigma_s^4\ \Gamma\left(\frac{1}{v}\right)\ \Gamma\left(\frac{5}{v}\right)}{\Gamma^2\left(\frac{3}{v}\right)} \tag{20} $$
$$ \mu_8=\frac{\sigma_s^8\ \Gamma^3\left(\frac{1}{v}\right)\ \Gamma\left(\frac{9}{v}\right)}{\Gamma^4\left(\frac{3}{v}\right)} \tag{21} $$

Combining Eqs. (18) and (19) yields Eq. (22).

$$ \sum s_i^4\sim\mathcal{N}\left(M\mu_4,\ M\mu_8-M\mu_4^2\right) \tag{22} $$
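The moments in Eqs. (20) and (21) and the Normal parameters of Eq. (22) are straightforward to compute. The sketch below is illustrative only. The helper `ggd_moment` uses a general even-order expression that is an assumption of this example; it reduces to Eqs. (20) and (21) for orders 4 and 8. The name `sum_s4_normal_params` is hypothetical.

```python
import math


def ggd_moment(order: int, sigma_s: float, v: float) -> float:
    """Even-order moment of a zero-mean GGD with variance sigma_s**2.

    For order 4 and order 8 this matches Eqs. (20) and (21).
    """
    half = order // 2
    return (sigma_s ** order
            * math.gamma(1.0 / v) ** (half - 1)
            * math.gamma((order + 1) / v)
            / math.gamma(3.0 / v) ** half)


def sum_s4_normal_params(m: int, sigma_s: float, v: float):
    """Mean and variance of the sum of s_i**4 over M samples, as in Eq. (22)."""
    mu4 = ggd_moment(4, sigma_s, v)
    mu8 = ggd_moment(8, sigma_s, v)
    return m * mu4, m * mu8 - m * mu4 ** 2
```

For a Gaussian shape (v = 2) and unit variance, the helper returns the familiar values 3 and 105 for the fourth and eighth moments, so a frame of M = 100 samples gives a mean of 300 and a variance of 9600.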

If the noise is assumed to have zero mean, Eq. (23) can be expressed.

$$ n_i\sim\mathcal{N}\left(0,\sigma_n^2\right)\ \Rightarrow\ E\left\{n_i^m\right\}=\left\{\begin{array}{ll}0 & \text{for}\ m=2k+1 \\ \left(m-1\right)\left(m-3\right)\dots\times 1\times\sigma_n^m & \text{for}\ m=2k\end{array}\right. \tag{23} $$

Then, the Normal distribution of the fourth-power noise term can be estimated as in Eq. (24).

$$ \sum n_i^4\sim\mathcal{N}\left(3M\sigma_n^4,\ 96M\sigma_n^8\right) \tag{24} $$
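The constants 3 and 96 in Eq. (24) follow directly from the moment rule in Eq. (23): the fourth moment is 3σₙ⁴ and the variance of nᵢ⁴ is 105σₙ⁸ − 9σₙ⁸ = 96σₙ⁸. A minimal sketch of this arithmetic is given below; the function names are hypothetical.

```python
def gaussian_moment(m: int, sigma_n: float) -> float:
    """m-th moment of a zero-mean Normal variable, following Eq. (23)."""
    if m % 2 == 1:
        return 0.0
    double_factorial = 1
    for k in range(m - 1, 0, -2):  # (m-1)(m-3)...1
        double_factorial *= k
    return double_factorial * sigma_n ** m


def sum_n4_normal_params(count: int, sigma_n: float):
    """Mean and variance of the sum of n_i**4 over `count` samples, Eq. (24)."""
    fourth = gaussian_moment(4, sigma_n)
    eighth = gaussian_moment(8, sigma_n)
    return count * fourth, count * (eighth - fourth ** 2)
```

The same helper also gives the sixth moment, 15σₙ⁶, which is the factor that appears in the variance of the term involving nᵢ³.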

The remaining terms in Eq. (17) are given by Eqs. (25) to (27).

$$ \sum s_i^3n_i\sim\mathcal{N}\left(0,\ M\mu_6\sigma_n^2\right)\quad\text{with}\quad \mu_6=\frac{\sigma_s^6\ \Gamma^2\left(\frac{1}{v}\right)\ \Gamma\left(\frac{7}{v}\right)}{\Gamma^3\left(\frac{3}{v}\right)} \tag{25} $$
$$ \sum s_i^2n_i^2\sim\mathcal{N}\left(M\sigma_s^2\sigma_n^2,\ 3M\mu_4\sigma_n^4-M\sigma_s^4\sigma_n^4\right) \tag{26} $$
$$ \sum s_in_i^3\sim\mathcal{N}\left(0,\ 15M\sigma_s^2\sigma_n^6\right) \tag{27} $$

To simplify the computation, two auxiliary parameters p and q are defined in Eq. (28). R|1,p,q can then be formulated as in Eq. (29).

$$ p=\sum_B s_i^4\quad\text{and}\quad q=\frac{\sum_A s_i^4}{\sum_B s_i^4} \tag{28} $$
$$ \left.R\right|1,p,q=\frac{\alpha^4pq+4\alpha^3\sum_A s_i^3n_i+6\alpha^2\sum_A s_i^2n_i^2+4\alpha\sum_A s_in_i^3+\sum_A n_i^4}{p+4\sum_B s_i^3n_i+6\sum_B s_i^2n_i^2+4\sum_B s_in_i^3+\sum_B n_i^4}=\frac{u}{w} \tag{29} $$

where the distributions of u and w are given by Eqs. (30) and (31).

$$ \begin{array}{l} f_U(u)\sim\mathcal{N}\left(\alpha^4pq+6\alpha^2M\sigma_s^2\sigma_n^2+3M\sigma_n^4,\ 16\alpha^6M\mu_6\sigma_n^2+36\alpha^4\left(3M\mu_4\sigma_n^4-M\sigma_s^4\sigma_n^4\right)\right. \\ \left.+16\alpha^2\times 15M\sigma_s^2\sigma_n^6+96M\sigma_n^8\right) \end{array} \tag{30} $$
$$ f_W(w)\sim\mathcal{N}\left(p+6M\sigma_s^2\sigma_n^2+3M\sigma_n^4,\ 16M\mu_6\sigma_n^2+36\left(3M\mu_4\sigma_n^4-M\sigma_s^4\sigma_n^4\right)+16\times 15M\sigma_s^2\sigma_n^6+96M\sigma_n^8\right) \tag{31} $$
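The Normal parameters of u and w in Eqs. (30) and (31) are weighted sums of the term-wise means and variances in Eqs. (24) to (27). The sketch below assembles them in Python. It is an illustration under the same independence assumptions as the text, not the authors' code. The helper `ggd_moment` repeats the assumed even-order GGD moment expression, and all function names are hypothetical.

```python
import math


def ggd_moment(order: int, sigma_s: float, v: float) -> float:
    """Even-order moment of a zero-mean GGD (matches Eqs. (20), (21), (25))."""
    half = order // 2
    return (sigma_s ** order * math.gamma(1.0 / v) ** (half - 1)
            * math.gamma((order + 1) / v) / math.gamma(3.0 / v) ** half)


def u_w_normal_params(p, q, alpha, m, sigma_s, sigma_n, v):
    """Return ((mean_u, var_u), (mean_w, var_w)) following Eqs. (30) and (31)."""
    mu4 = ggd_moment(4, sigma_s, v)
    mu6 = ggd_moment(6, sigma_s, v)
    var_s3n = m * mu6 * sigma_n ** 2                        # Eq. (25)
    mean_s2n2 = m * sigma_s ** 2 * sigma_n ** 2             # Eq. (26)
    var_s2n2 = 3 * m * mu4 * sigma_n ** 4 - m * sigma_s ** 4 * sigma_n ** 4
    var_sn3 = 15 * m * sigma_s ** 2 * sigma_n ** 6          # Eq. (27)
    mean_n4 = 3 * m * sigma_n ** 4                          # Eq. (24)
    var_n4 = 96 * m * sigma_n ** 8

    def combine(gain, signal_term):
        # Binomial weights 4, 6, 4 scale the means; their squares scale the variances.
        mean = signal_term + 6 * gain ** 2 * mean_s2n2 + mean_n4
        var = (16 * gain ** 6 * var_s3n + 36 * gain ** 4 * var_s2n2
               + 16 * gain ** 2 * var_sn3 + var_n4)
        return mean, var

    return combine(alpha, alpha ** 4 * p * q), combine(1.0, p)
```

Setting α = 1 and q = 1 makes u and w identically distributed, and setting σₙ = 0 collapses both variances to zero, which are two quick consistency checks on the coefficients.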

The density of \( \frac{u}{w} \) is computed to estimate the pdf of R|1,p,q. Assuming that u and w are independent and normally distributed, Eq. (32) can be expressed:

$$ f_{R|1,p,q}(r)=\int_{-\infty}^{\infty}\left|w\right|f_{U,W}\left(wr,w\right)\,dw \tag{32} $$

Also, since U and W are assumed to be independent and normally distributed, \( f_{U,W}(u,w) \) factorizes as in Eq. (33):

$$ f_{U,W}\left(u,w\right)=f_U(u)\times f_W(w) \tag{33} $$

Equation (34) is the closed-form solution of Eq. (32), which has already been discussed in the literature [14].

$$ D(r)=\frac{b(r)c(r)}{a^3(r)}\ \frac{1}{\sqrt{2\pi}\,\sigma_u\sigma_w}\left[2\Phi\left(\frac{b(r)}{a(r)}\right)-1\right]+\frac{1}{a^2(r)\,\pi\,\sigma_u\sigma_w}\,e^{-\frac{1}{2}\left(\frac{\mu_u^2}{\sigma_u^2}+\frac{\mu_w^2}{\sigma_w^2}\right)} \tag{34} $$

Each parameter in Eq. (34) is defined based on Eqs. (35) to (38):

$$ a(r)=\sqrt{\frac{r^2}{\sigma_u^2}+\frac{1}{\sigma_w^2}} \tag{35} $$
$$ b(r)=\frac{\mu_u}{\sigma_u^2}r+\frac{\mu_w}{\sigma_w^2} \tag{36} $$
$$ c(r)=\exp\left\{\frac{1}{2}\frac{b^2(r)}{a^2(r)}-\frac{1}{2}\left(\frac{\mu_u^2}{\sigma_u^2}+\frac{\mu_w^2}{\sigma_w^2}\right)\right\} \tag{37} $$
$$ \Phi(r)=\int_{-\infty}^{r}\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}u^2}\,du \tag{38} $$
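The closed-form density of the ratio of two independent Normal variables can be checked numerically. The sketch below follows the independent-case form of the result in [14], in which the first term carries 1/a³(r) and the second term carries 1/a²(r); with zero means and unit variances it reduces to the standard Cauchy density. The function names are hypothetical, and Φ is computed from `math.erf`.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard Normal cumulative distribution function, as in Eq. (38)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ratio_density(r, mu_u, sigma_u, mu_w, sigma_w):
    """Density of U/W for independent U ~ N(mu_u, sigma_u^2), W ~ N(mu_w, sigma_w^2)."""
    a = math.sqrt(r ** 2 / sigma_u ** 2 + 1.0 / sigma_w ** 2)       # Eq. (35)
    b = mu_u * r / sigma_u ** 2 + mu_w / sigma_w ** 2               # Eq. (36)
    offset = mu_u ** 2 / sigma_u ** 2 + mu_w ** 2 / sigma_w ** 2
    c = math.exp(0.5 * b ** 2 / a ** 2 - 0.5 * offset)              # Eq. (37)
    first = (b * c / a ** 3) / (math.sqrt(2.0 * math.pi) * sigma_u * sigma_w)
    first *= 2.0 * std_normal_cdf(b / a) - 1.0
    second = math.exp(-0.5 * offset) / (a ** 2 * math.pi * sigma_u * sigma_w)
    return first + second


# Riemann-sum check that the density integrates to about 1 for one parameter set.
step = 0.005
total_mass = step * sum(ratio_density(-20.0 + step * k, 2.0, 1.0, 5.0, 1.0)
                        for k in range(8001))
```

In the setting of this appendix, the means and standard deviations would be taken from the Normal parameters of u and w; the values used above are arbitrary and serve only as a sanity check.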

As a result, Eq. (39) formulates the density of R|1:

$$ f_{R|1}\left(r|1\right)=\int_L^U\int_{-\infty}^{\infty}f_{R|1,p,q}\left(r|1,p,q\right)\ f_P(p)\ f_Q(q)\,dp\,dq \tag{39} $$

Lower and upper bounds are applied to restrict the energy ratio between the two sets A and B to the interval (L, U), as stated in Eq. (40):

$$ L<\frac{\sum_A r_i^4}{\sum_B r_i^4}<U \tag{40} $$

While Eq. (22) expresses the density of the parameter p, Eq. (41) formulates the density of the parameter q based on the ratio of two independent, normally distributed variables.

$$ f_Q(q)=\frac{D(q)}{\int_L^UD(q)\ dq} \tag{41} $$
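The renormalization in Eq. (41) restricts a density to the interval (L, U) and rescales it so that it integrates to one there. A minimal numerical sketch is shown below. It accepts any callable density; the standard Cauchy density is used here only as a simple stand-in for D(q), since it is the ratio density for zero means and unit variances. The names `truncate_density` and `cauchy_density` are hypothetical, and the trapezoidal rule with 2000 steps is an arbitrary choice.

```python
import math


def truncate_density(density, lower, upper, steps=2000):
    """Return the density restricted to (lower, upper) and renormalized, as in Eq. (41)."""
    h = (upper - lower) / steps
    # Trapezoidal estimate of the integral of the density over [lower, upper].
    area = 0.5 * (density(lower) + density(upper))
    area += sum(density(lower + k * h) for k in range(1, steps))
    area *= h

    def truncated(q):
        return density(q) / area if lower < q < upper else 0.0

    return truncated


def cauchy_density(q):
    """Stand-in for D(q): ratio of two independent standard Normal variables."""
    return 1.0 / (math.pi * (1.0 + q * q))


f_q = truncate_density(cauchy_density, -1.0, 1.0)
```

For this stand-in, half of the probability mass lies in (−1, 1), so the truncated density at zero is twice the original value, that is 2/π.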

In the same manner as for Eq. (17), the density of r|0 can also be computed. Therefore, Eq. (42) estimates the probability of detection error:

$$ P_e=\frac{1}{2}\int_T^{\infty}f\left(r|0\right)\,dr+\frac{1}{2}\int_{-\infty}^{T}f\left(r|1\right)\,dr \tag{42} $$

The threshold is estimated by minimizing the error as in Eq. (43):

$$ \frac{\partial P_e}{\partial T}=0\ \Rightarrow\ f_r\left(T|0\right)=f_r\left(T|1\right) \tag{43} $$
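The condition in Eq. (43) says that the optimal threshold lies where the two conditional densities cross. The sketch below locates such a crossing by bisection. It is an illustration only: two Normal densities with hypothetical parameters stand in for f(r|0) and f(r|1), which in the appendix are obtained from Eq. (39) and its counterpart for bit 0. The function names are hypothetical.

```python
import math


def gaussian_pdf(x: float, mean: float, sigma: float) -> float:
    """Normal density, used here only as a stand-in for f(r|0) and f(r|1)."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def find_threshold(f0, f1, lower, upper, tol=1e-10):
    """Bisection search for T in [lower, upper] with f0(T) == f1(T), as in Eq. (43)."""
    def gap(t):
        return f1(t) - f0(t)

    if gap(lower) * gap(upper) > 0:
        raise ValueError("the densities do not cross inside the given interval")
    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        if gap(lower) * gap(mid) <= 0:
            upper = mid
        else:
            lower = mid
    return 0.5 * (lower + upper)


threshold = find_threshold(lambda r: gaussian_pdf(r, 0.7, 0.2),
                           lambda r: gaussian_pdf(r, 1.5, 0.2),
                           0.7, 1.5)
```

For two equal-variance stand-in densities the crossing is the midpoint of the means, here 1.1. With the actual densities of R|0 and R|1 the crossing is generally not symmetric, which is why a numerical search is a natural choice.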


Cite this article

Nematollahi, M.A., Gamboa-Rosales, H., Martinez-Ruiz, F.J. et al. Multi-factor authentication model based on multipurpose speech watermarking and online speaker recognition. Multimed Tools Appl 76, 7251–7281 (2017).
