A corroborative study on improving pitch determination by time–frequency cepstrum decomposition using wavelets
Abstract
A new wavelet-based method for estimating and tracking the pitch period is presented in this work. The main idea of the proposed approach is to extract the cepstrum excitation signal and to apply to it a wavelet transform whose resulting approximation coefficients are smoothed, for a better pitch determination. Although the principle of the proposed algorithms has been considered previously, the novelty of our methods lies in the use of powerful wavelet transforms well adapted to pitch determination: the discrete wavelet transform (DWT) and the dual-tree complex wavelet transform (DTCWT). Through the provided experimental results, this article corroborates the idea that decomposing the cepstrum excitation using wavelet transforms improves pitch detection. Another interesting point of this article lies in the use of a simple but efficient voicing decision (which improves a similar voicing criterion we proposed in a previously published study) that, on the one hand, respects real-time processing with low latency and, on the other hand, yields low classification errors. The accuracy of the proposed pitch tracking algorithms has been evaluated on the international Bagshaw and Keele databases, which include male and female speakers. Our experimental results demonstrate that the proposed methods provide important performance improvements when compared with previously published pitch determination algorithms.
Keywords
Wavelet transforms, Approximation coefficients, Cepstrum signal, Pitch estimation, Pitch tracking, Voicing decision
Background

The extraction of the cepstrum excitation signal and its wavelet decomposition over 3 levels in order to obtain the approximation signals, which we enhance using the VisuShrink method (Donoho and Johnstone 1995) followed by hard thresholding;

An exhaustive search for the maximum peaks in the smoothed approximation signals in order to estimate the pitch period;

Minimization of the voiced/unvoiced classification errors using a simple voicing decision in order to track the pitch.
Background on WTs
DWT

Multiresolution representation using a subband filter bank.

The use of wavelets for iterating the filtering process at each level of decomposition.
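The iterated filter-bank view of the DWT described above can be sketched in a few lines of NumPy. The Haar filter pair and the function name are ours, chosen purely for illustration (the article does not fix a mother wavelet at this point):

```python
import numpy as np

def haar_dwt(x, levels=3):
    """Iterated two-channel filter bank: at each level, low-pass and
    high-pass filter the running approximation, then downsample by 2.
    Haar filters are used here purely for illustration."""
    h = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass (scaling) filter
    g = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass (wavelet) filter
    approx, details = np.asarray(x, dtype=float), []
    for _ in range(levels):
        lo = np.convolve(approx, h)[1::2]     # approximation coefficients cA
        hi = np.convolve(approx, g)[1::2]     # detail coefficients cD
        details.append(hi)
        approx = lo                           # iterate on the low-pass branch
    return approx, details
```

The orthonormal filter pair preserves signal energy across levels, which is why thresholding the coefficients (as done later with VisuShrink) directly controls the energy removed from the signal.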
DTCWT

The transform is nearly shift-invariant;

The decomposition is directionally selective in two and higher dimensions;

Its multidimensional version is non-separable.
The cepstrum signal

The first one concerns the excitation cepstrum located in the high quefrencies.

The second one concerns the vocal tract cepstrum located in the low quefrencies.
1. A Hamming window is applied to the short-time input signal in order to reduce the discontinuities at the boundaries;
2. An FFT (Fast Fourier Transform) and a modulus operator are then applied to this windowed signal in order to obtain an amplitude spectrum;
3. A log operator is then applied;
4. Finally, an IFFT (Inverse Fast Fourier Transform) is applied to this log amplitude spectrum in order to obtain the cepstrum signal.
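The four steps above can be sketched directly in NumPy; the function name and the small epsilon guarding the log are our additions:

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum of a short-time frame:
    window -> |FFT| -> log -> IFFT (steps 1-4 above)."""
    windowed = frame * np.hamming(len(frame))   # step 1: reduce edge discontinuities
    spectrum = np.abs(np.fft.fft(windowed))     # step 2: amplitude spectrum
    log_spec = np.log(spectrum + 1e-12)         # step 3: log (epsilon avoids log(0))
    return np.fft.ifft(log_spec).real           # step 4: cepstrum (quefrency domain)
```

For a periodic excitation with period P samples, the excitation part of the cepstrum shows a peak near quefrency P, well separated from the low-quefrency vocal tract part.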
The ideas of the proposed approach in comparison with the Advanced Cepstrum (ACEP) method

Step 1: From the cepstrum signal, we can easily separate the vocal tract cepstrum (located in the low quefrencies) from the excitation cepstrum (located in the high quefrencies). We then apply a wavelet transform to the excitation part of the cepstrum signal in order to obtain the approximation coefficients, which we enhance.

Step 2: From the enhanced approximations, we estimate the pitch period;

Step 3: Simple but efficient voiced/unvoiced decisions are carried out in order to track the pitch period.
Extraction of the cepstrum excitation
Enhancement of the approximation coefficients

cA represents the approximation coefficients for the DWT and DTCWT decompositions. The cA coefficients are used to obtain the maximum peak index in the real low-pass subband;

The factor 0.6745 in the denominator rescales the numerator in order to make \(\sigma\) a suitable estimator for the standard deviation.
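As a sketch of this enhancement step, the following assumes the universal threshold \(\sigma\sqrt{2\log N}\) of Donoho and Johnstone (1995), with \(\sigma\) estimated from the median absolute deviation rescaled by 0.6745; the function name is ours:

```python
import numpy as np

def visu_hard_threshold(coeffs):
    """VisuShrink universal threshold followed by hard thresholding.
    The 0.6745 factor rescales the median absolute deviation so that
    sigma is a consistent estimator of the noise standard deviation
    for Gaussian noise."""
    coeffs = np.asarray(coeffs, dtype=float)
    sigma = np.median(np.abs(coeffs)) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(coeffs)))  # universal threshold
    out = coeffs.copy()
    out[np.abs(out) < thr] = 0.0                      # hard: keep or kill
    return out
```

Hard thresholding keeps the surviving peaks at their original amplitude, which matters here because the peak amplitudes drive the subsequent maximum-peak search.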
The pitch period estimation

Fs is the sampling frequency;

\(I_{max}(j)\) is the maximum peak index given by the highest signal amplitude of the 3 decompositions levels related to the jth analyzed frame.
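As an illustrative sketch (the article's exact index-to-lag mapping is not reproduced here), if the approximation coefficients at level L are downsampled by \(2^L\), then a peak index \(I_{max}(j)\) corresponds to a lag of \(I_{max}(j)\cdot 2^L\) samples, giving the pitch estimate below; the helper name is ours:

```python
import numpy as np

def pitch_from_peak(approx, fs, level=3, min_lag=1):
    """Estimate F0 from the highest peak of a level-`level`
    approximation signal.  Assumption: coefficients at level L are
    downsampled by 2**L, so index i maps to a lag of i * 2**L
    samples in the original cepstrum."""
    i_max = min_lag + int(np.argmax(approx[min_lag:]))
    lag = i_max * (2 ** level)  # lag in original samples
    return fs / lag             # F0 in Hz
```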
Voicing decisions and pitch tracking
In this section, we present a simple and efficient technique for voicing decision, which respects real-time constraints and uses only the preceding frames in order to track the pitch. In a speech signal, most of the voiced regions contain speech- or speaker-specific attributes, while silences or background noises are completely undesirable. We therefore have to determine whether the cepstrum excitation signal exhibits periodic peaks (voiced regions) or random ones (unvoiced regions). The role of the pitch tracking algorithm is to correctly detect the voiced/unvoiced speech components.
The voicing decision
Corrections

Concerning isolated peaks: a pitch peak is eliminated if its duration is below 13.5 ms (which corresponds to 9 frames). The shift used between two consecutive frames in our study is 1.5 ms (30 samples).

Concerning valleys: we rebuild the pitch contour linearly if the duration of the valley is also below 13.5 ms. We thus regularize the pitch tracking while respecting the real-time process; emphasis is placed on the very low latency obtained, which is 13.5 ms.
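The two corrections above can be sketched as a post-processing pass over a frame-wise F0 track, with 0 marking unvoiced frames; the function name and this run-length representation are our own illustration:

```python
import numpy as np

def smooth_pitch_track(f0, min_run=9):
    """Post-process a frame-wise F0 track (0 = unvoiced):
    - voiced runs shorter than `min_run` frames (9 frames = 13.5 ms
      at a 1.5 ms frame shift) are eliminated as isolated peaks;
    - unvoiced valleys shorter than `min_run` frames are bridged by
      linear interpolation between the surrounding voiced values."""
    f0 = np.asarray(f0, dtype=float).copy()
    n = len(f0)
    i = 0                                   # pass 1: delete short voiced runs
    while i < n:
        if f0[i] > 0:
            j = i
            while j < n and f0[j] > 0:
                j += 1
            if j - i < min_run:
                f0[i:j] = 0.0
            i = j
        else:
            i += 1
    i = 0                                   # pass 2: bridge short valleys
    while i < n:
        if f0[i] == 0:
            j = i
            while j < n and f0[j] == 0:
                j += 1
            if 0 < i and j < n and j - i < min_run:
                f0[i:j] = np.linspace(f0[i - 1], f0[j], j - i + 2)[1:-1]
            i = j
        else:
            i += 1
    return f0
```

Because both passes look only at already-seen frames plus a 9-frame horizon, the decision for each frame is final after 13.5 ms, which is the latency quoted above.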
Experimental results
The proposed approach and the voicing decision used were evaluated over the Bagshaw (Bagshaw et al. 1993) and Keele (Plante et al. 1995) databases.
The databases
The Bagshaw database Paul Bagshaw's database was recorded at the Centre for Speech Technology Research, University of Edinburgh. The speech and laryngograph signals of this database were sampled at 20 kHz. It contains 0.12 h of speech: 50 English sentences, each pronounced by one male and one female speaker. The fundamental frequency was computed by estimating the location of the glottal pulses in the laryngograph data and taking the inverse of the distance between each pair of consecutive pulses. Each fundamental frequency estimate is associated with the time instant in the middle of the pair of pulses used to derive it.
The Keele database The Keele Pitch Database was recorded at Keele University. Data were collected from five male and five female English speakers, each of whom read a phonetically balanced text: the “north wind story”. The speech and laryngograph signals were sampled at 20 kHz. The fundamental frequency was estimated by applying an autocorrelation on windows of 25.6 ms shifted by intervals of 10 ms.
Tracking the pitch period
Results

Classification Error (CE) is the percentage of unvoiced frames classified as voiced plus the percentage of voiced frames classified as unvoiced (Chu and Alwan 2009).

\(N_{UV\rightarrow V}\) is the number of unvoiced frames classified as voiced;

\(N_{V\rightarrow UV}\) is the number of voiced frames classified as unvoiced;

N is the total number of frames in the utterances.

Gross Error Rate (GER): the percentage of voiced frames with an estimated F0 value that deviates from the reference value by more than 20 %. When the error is below −20 %, it is counted as a gross error low; errors exceeding +20 % are counted as gross errors high.

Mean is the mean of the absolute differences between the reference and the estimated fundamental frequency values.

Standard Deviation (SD) is the standard deviation of the absolute differences between the estimated and reference pitch values.

CEP is the cepstrum-based pitch reference estimation algorithm (Noll 1967), which extracts the pitch as the frequency whose inverse maximizes the cepstrum signal. CEP suffers from harmonic errors and is limited in the maximum F0 value it can detect.

MCEP is the Modified CEP (Kobayashu and Shimamura 1998), which introduces a “clipping” method for removing the high frequencies in order to mitigate the noise problem. The pitch period is then extracted from the cepstrum signal using an IFFT.

ACEP is the Advanced CEP (Weiping et al. 2004), which carries out a 3-level wavelet transform.

WCEPD for Wavelet and Cepstrum Excitation for Pitch Determination (Bahja et al. 2012) is a pitch tracking method based on a wavelet transform in the temporal domain. It is designed to estimate the pitch period of the speech signal from the cepstrum excitation signal processed by a wavelet transform.

eCATE++ for enhanced Circular Autocorrelation of the Temporal Excitation (Bahja et al. 2013) is an algorithm for pitch detection based on an implicit circular autocorrelation of the glottal excitation signal.

DWT and DTCWT concern the two wavelet algorithms used in our approach under the voicing decision presented above.
CE, GER and Abs-deviation for the male corpus of the Bagshaw database

Method     CE (%)   GER low (%)   GER high (%)   Mean (Hz)   SD (Hz)
CEP        0.27     1.11          2.96           3.51        3.76
MCEP       0.23     0.65          0.88           2.41        2.98
ACEP       0.14     1.16          0.25           2.31        3.01
WCEPD      0.11     0.41          0.06           3.15        2.84
eCATE++    0.08     0.27          0.71           1.82        2.91
DWT        0.13     0.31          0.01           3.01        2.56
DTCWT      0.16     0.24          0.00           2.06        2.29
CE, GER and Abs-deviation for the female corpus of the Bagshaw database

Method     CE (%)   GER low (%)   GER high (%)   Mean (Hz)   SD (Hz)
CEP        0.23     1.46          3.07           10.68       9.39
MCEP       0.17     0.99          1.94           8.45        7.89
ACEP       0.10     1.04          0.54           8.38        7.63
WCEPD      0.17     0.54          0.22           10.86       7.29
eCATE++    0.06     0.31          0.39           4.27        5.50
DWT        0.15     0.38          0.31           10.37       6.37
DTCWT      0.14     0.39          0.22           6.48        5.42
The Gross Pitch Error (GPE) (Nakatani et al. 2008):
$$\begin{aligned} GPE=\frac{N_{GE}}{N_{vv}}\times 100\,\% \end{aligned}$$(8)
where \(N_{vv}\) is the number of frames considered as voiced both by the pitch tracker and the reference pitch contour (vv means both voiced), and \(N_{GE}\) is the number of voiced frames for which \(\left|\frac{F0_{i,estimated}}{F0_{i,reference}}-1\right| > 0.2\), where i is the frame number.

The F0 Frame Error (FFE) metric (Nakatani et al. 2008) sums the three types of errors mentioned above:
$$\begin{aligned} FFE=\frac{N_{vv}}{N}\times GPE+CE \end{aligned}$$(9)
$$\begin{aligned} FFE=\frac{N_{V\rightarrow UV}+ N_{UV\rightarrow V}+N_{GE}}{N}\times 100\,\% \end{aligned}$$(10)
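The metrics of Eqs. (8)-(10) can be computed jointly from two frame-wise F0 tracks; this sketch uses 0 to mark unvoiced frames (our convention) and the function name is ours:

```python
import numpy as np

def f0_metrics(ref, est, tol=0.2):
    """CE, GPE and FFE from reference and estimated F0 tracks
    (0 marks unvoiced frames)."""
    ref, est = np.asarray(ref, float), np.asarray(est, float)
    n = len(ref)
    ref_v, est_v = ref > 0, est > 0
    n_uv_v = np.sum(~ref_v & est_v)     # unvoiced classified as voiced
    n_v_uv = np.sum(ref_v & ~est_v)     # voiced classified as unvoiced
    ce = (n_uv_v + n_v_uv) / n * 100.0
    vv = ref_v & est_v                  # frames voiced in both tracks
    n_ge = np.sum(np.abs(est[vv] / ref[vv] - 1.0) > tol)  # gross errors
    gpe = n_ge / max(np.sum(vv), 1) * 100.0               # Eq. (8)
    ffe = (n_v_uv + n_uv_v + n_ge) / n * 100.0            # Eq. (10)
    return ce, gpe, ffe
```

Note that Eq. (9) is recovered from these quantities: \(\frac{N_{vv}}{N}\times GPE\) equals \(\frac{N_{GE}}{N}\times 100\,\%\), so adding CE gives Eq. (10).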
GPE rates for pitch estimation using the Keele University database

PDA        Male speakers (%)   Female speakers (%)   Mean (%)
CEP        3.7                 4.2                   3.95
PRAAT      2.9                 3.3                   3.1
YIN        3.5                 1.2                   2.35
eCATE++    0.48                0.40                  0.44
DWT        0.38                0.34                  0.36
DTCWT      0.37                0.30                  0.33
Performance of PDAs using the Keele database

PDA        GPE (%)   CE (%)   FFE (%)
YIN        2.28      6.28     7.23
SWIPE      0.62      3.92     4.19
SPM        0.75      3.02     3.31
CSAPM      0.67      2.27     2.59
eCATE++    0.44      0.65     1.55
DWT        0.36      0.78     1.41
DTCWT      0.33      0.81     1.39
GPE and MFPE for algorithms using the Keele and the Bagshaw corpora

           Keele database         Bagshaw database
PDA        GPE (%)   MFPE (Hz)   GPE (%)   MFPE (Hz)
CPD        3.95      –           4.65      –
eSRPD      3.90      –           1.40      –
PRAAT      3.10      0.19        2.27      −0.77
YIN        2.35      0.55        2.25      −0.39
RAPT       2.62      0.79        2.45      −0.06
SAFE       2.98      −0.36       2.45      −1.39
eCATE++    0.44      −0.03       0.81      −1.67
DWT        0.36      −0.26       0.25      −2.39
DTCWT      0.33      −0.11       0.25      −0.52
Conclusion
The presented work focuses on the estimation of the pitch period, the pitch tracking algorithm and the voiced/unvoiced decision in real-time. This study corroborates the idea of decomposing the cepstrum excitation signal using powerful wavelet transforms such as the DWT or the DTCWT for improving pitch determination. The main contributions of the presented algorithms consist in obtaining a very low latency (13.5 ms), to be compared with the latency obtained by the eCATE++ algorithm (20.25 ms), and low classification errors for both the Bagshaw and Keele databases.
Notes
Authors' contributions
FB and JDM conceived and designed the study with the help of EIE and DA, who initially proposed the use of wavelets for decomposing the cepstrum. All the experiments were carried out by FB. FB and JDM drafted the initial manuscript, and all the authors contributed significantly to its revision. All authors read and approved the final manuscript.
Acknowledgements
The authors would like to thank Mohammed V University for having partly supported this study.
Competing interests
The authors declare that they have no competing interests.
References
Bagshaw PC, Hiller SM, Jack MA (1993) Enhanced pitch tracking and the processing of F0 contours for computer aided intonation teaching. Proc Eur Conf Speech Technol 2:1000–1003
Bahja F, Di Martino J, Ibn Elhaj E (2012) On the use of wavelets and cepstrum excitation for pitch determination in real-time. In: ICMCS conference, pp 150–153
Bahja F, Di Martino J, Ibn Elhaj E, Aboutajdine D (2013) An overview of the CATE algorithms for real-time pitch determination. J Signal Image Video Process. doi:10.1007/s11760-013-0488-4
Ben Messaoud MA, Bouzid A, Ellouze N (2009) A new method for pitch tracking and voicing decision based on spectral multi-scale analysis. Signal Process Int J 3(5):144–152
Ben Messaoud MA, Bouzid A, Ellouze N (2011) Using multi-scale product spectrum for single and multi-pitch estimation. IET Signal Process J 5(3):344–355
Ben Messaoud MA, Bouzid A, Ellouze N (2012) Pitch estimation and voiced decision by spectral autocorrelation compression of multiscale product. JEP-TALN-RECITAL Conf 1:201–208
Chang G, Yu B, Vetterli M (2000) Adaptive wavelet thresholding for image denoising and compression. IEEE Trans Image Process 9(9):1532–1546
Chu W, Alwan A (2009) Reducing F0 frame error of F0 tracking algorithms under noisy conditions with an unvoiced/voiced classification frontend. In: ICASSP
Chu W, Alwan A (2012) SAFE: a statistical approach to F0 estimation under clean and noisy conditions. IEEE Trans Audio Speech Lang Process 20(3):933–967
De Cheveigné A, Kawahara H (2002) YIN, a fundamental frequency estimator for speech and music. J Acoust Soc Am 111(4):1917–1930
Donoho DL, Johnstone IM (1995) Adapting to unknown smoothness via wavelet shrinkage. J Am Stat Assoc 90(432):1200–1224
Donoho DL, Johnstone IM, Kerkyacharian G, Picard D (1995) Wavelet shrinkage: asymptopia? J R Stat Soc Ser B 57:301–369
Ghosh PK, Ortega A, Narayanan S (2007) Pitch period estimation using multipulse model and wavelet transform. In: Proceedings of InterSpeech, pp 2761–2764
Hermes DJ (1993) Pitch analysis. In: Cooke M, Beet S, Crawford M (eds) Visual representation of speech signals. Wiley, Amsterdam, pp 1–25
Hess W (1983) Pitch determination of speech signals: algorithms and devices. Springer, Berlin
Kadambe S, Boudreaux-Bartels GF (1992) Application of the wavelet transform for pitch detection of speech signals. IEEE Trans Inf Theory 38(2):917–924
Kingsbury N (1998a) The dual-tree complex wavelet transform: a new efficient tool for image restoration and enhancement. In: Proceedings of EUSIPCO, pp 319–322
Kingsbury N (1998b) The dual-tree complex wavelet transform: a new technique for shift invariance and directional filters. In: 8th IEEE DSP workshop
Kingsbury NG, Zymnis A, Pena A (2004) DT-MRI data visualisation using the dual-tree complex wavelet transform. In: Proceedings of the IEEE symposium on biomedical imaging, pp 328–331
Kobayashu H, Shimamura T (1998) A modified cepstrum method for pitch extraction. In: Proceedings of the IEEE Asia-Pacific conference on circuits and systems, pp 299–302
Krusback D, Niederjohn R (1991) An autocorrelation pitch detector and voicing decision with confidence measures developed for noise-corrupted speech. IEEE Trans Signal Process 39(2):319–329
Kwitt R, Meerwald P, Uhl A (2010) Blind detection of additive spread-spectrum watermarking in the dual-tree complex wavelet domain. Int J Digit Crime Forensics 2(2):34–46
Miller MA, Kingsbury NG (2008) Image modeling using interscale phase properties of complex wavelet coefficients. IEEE Trans Image Process 17(9):1491–1499
Miller MA, Kingsbury NG, Hobbs RW (2005) Seismic imaging using complex wavelets. In: Proceedings of the ICASSP conference, pp 557–560
Nakatani T, Amano S, Irino T, Ishizuka K, Kondo T (2008) A method for fundamental frequency estimation and voicing decision: application to infant utterances recorded in real acoustical environments. Speech Commun 50(3):203–214
Nelson JDB, Pang SK, Kingsbury NG, Godsill SJ (2008) Tracking ground based targets in aerial video with dual-tree complex wavelet polar matching and particle filtering. In: 11th international conference on information fusion, pp 1–7
Noll AM (1967) Cepstrum pitch determination. J Acoust Soc Am 41:293–309
Obaidat MS, Brodzik A, Sadoun B (1998) A performance evaluation study of four wavelet algorithms for the pitch period estimation of speech signals. Inf Sci 112:213–221
Plante F, Meyer F, Ainsworth WA (1995) A pitch extraction reference database. In: Proceedings of Eurospeech, pp 837–840
Rabiner LR, Sambur MR (1977) Voiced-unvoiced-silence detection using the Itakura LPC distance measure. In: Proceedings of ICASSP, pp 323–326
Weiping H, Xiuxin W, Gomez P (2004) Robust pitch extraction in pathological voice based on wavelet and cepstrum. In: Proceedings of EUSIPCO, pp 297–300
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.