Part of the book series: Springer Theses

Abstract

This chapter gives an overview of the methods for speech and music analysis implemented by the author in the openSMILE toolkit. The methods described include all relevant processing steps from an audio signal to a classification result: pre-processing and segmentation of the input, feature extraction (i.e., computation of acoustic Low-level Descriptors (LLDs) and summarisation of these descriptors over high-level segments), and modelling (e.g., classification).


Notes

  1.

    In openSMILE the FFT with complex-valued output (and also the inverse FFT) is implemented by the cTransformFFT component. Magnitude and phase can be computed with the cFFTmagphase component.

  2.

    In openSMILE windowing of audio samples (i.e., short-time analysis) can be performed with the cFramer component.
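
    A minimal NumPy sketch of this short-time analysis chain (framing, windowing, FFT, magnitude), mirroring what cFramer, cTransformFFT, and cFFTmagphase do; the frame and hop sizes are illustrative values (25 ms and 10 ms at 16 kHz), not openSMILE defaults:

      import numpy as np

      def stft_magnitude(x, frame_len=400, hop=160):
          # Slice the signal into overlapping frames and apply a Hamming window.
          n_frames = 1 + max(0, (len(x) - frame_len) // hop)
          win = np.hamming(frame_len)
          frames = np.stack([x[i * hop:i * hop + frame_len] * win
                             for i in range(n_frames)])
          spec = np.fft.rfft(frames, axis=1)  # complex spectrum per frame
          return np.abs(spec)                 # magnitude; phase via np.angle(spec)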

  3.

    http://opensmile.audeering.com.

  4.

    In openSMILE pre-emphasis can be implemented with the cPreemphasis component on a continuous signal, or with the cVectorPreemphasis component on a per-frame basis (behaviour compatible with the Hidden Markov Model Toolkit (HTK; Young et al. 2006)).
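
    A sketch of the underlying filter \(y[n] = x[n] - k\,x[n-1]\); \(k = 0.97\) is a common choice, not necessarily the openSMILE default:

      import numpy as np

      def preemphasis(x, k=0.97):
          # First-order high-pass emphasis; the first sample is passed through.
          # cVectorPreemphasis applies the same filter within each frame
          # (HTK-style), which differs only at the frame boundaries.
          return np.append(x[0], x[1:] - k * x[:-1])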

  5.

    RMS and logarithmic energy can be computed in openSMILE with the cEnergy component.

  6.

    openSMILE defines \(8.674676 \times 10^{-19}\) as a floor value for the argument of the log for samples scaled to the range \(-1\) to \(+1\). For a sample value range from \(-32767\) to \(+32767\) (HTK-compatible mode), the floor value for the argument of the log is 1.
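
    A sketch of RMS and floored logarithmic energy; whether the mean or the sum of squares enters the log is a configuration detail of cEnergy, the mean is used here:

      import numpy as np

      LOG_FLOOR = 8.674676e-19  # for samples in [-1, +1]; 1 in HTK-compatible mode

      def frame_energy(frame):
          e = np.mean(frame ** 2)            # mean squared amplitude
          rms = np.sqrt(e)                   # root-mean-square energy
          log_e = np.log(max(e, LOG_FLOOR))  # floored logarithmic energy
          return rms, log_e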

  7.

    The loudness approximation and the signal intensity as defined here can be extracted in openSMILE with the cIntensity component.

  8.

    In openSMILE the option dBpsd must be enabled in the cFFTmagphase component in order to compute logarithmic power spectral densities.

  9.

    In openSMILE these spectral scale transformations and spline interpolation can be applied with the cSpecScale component.

  10.

    http://www.speex.org/.

  11.

    The Speex version of the Bark transformation is implemented in openSMILE as a forward transformation only; components that require a backward scale transformation will therefore not work with it.

  12.

    For an implementation, see the cMelspec component in openSMILE and scale transformation functions in the smileUtil library.

  13.

    Band spectra can be computed in openSMILE with the cMelspec component, which—despite the name Melspec—can compute general band spectra for all supported frequency scales from a linear magnitude or power spectrum.

  14.

    In openSMILE the cMelspec component implements these filterbanks for various frequency scales (not only Mel).
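
    A sketch of the standard triangular filterbank construction for the Mel scale; cMelspec generalises the same idea to the other supported frequency scales:

      import numpy as np

      def hz_to_mel(f):
          return 2595.0 * np.log10(1.0 + f / 700.0)

      def mel_to_hz(m):
          return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

      def mel_filterbank(n_bands, n_fft, sr):
          # Band edges equally spaced on the Mel scale, mapped back to FFT bins.
          mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_bands + 2)
          bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
          fb = np.zeros((n_bands, n_fft // 2 + 1))
          for b in range(1, n_bands + 1):
              lo, ce, hi = bins[b - 1], bins[b], bins[b + 1]
              for k in range(lo, ce):   # rising edge of triangle b
                  fb[b - 1, k] = (k - lo) / max(ce - lo, 1)
              for k in range(ce, hi):   # falling edge of triangle b
                  fb[b - 1, k] = (hi - k) / max(hi - ce, 1)
          return fb  # apply as fb @ magnitude_or_power_spectrum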

  15.

    In openSMILE the FIR filterbanks with Gabor, gammatone, high- and low-pass filters can be applied with the cFirFilterbank component.

  16.

    In openSMILE these spectral descriptors can be extracted with the cSpectral component.

  17.

    In openSMILE, this is implemented in the cSpectral component.

  18.

    This is the current default in all openSMILE feature sets up to version 2.0. An option for normalisation might appear in later versions.

  19.

    In the cSpectral component.

  20.

    Enabled by the option normBandEnergies of the cSpectral component of openSMILE.

  21.

    ACF according to this equation is implemented in openSMILE in the cAcf component.

  22.

    In openSMILE linear predictive coding is supported via the cLpc component.
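
    A sketch of the autocorrelation method; cLpc solves the same Toeplitz (Yule-Walker) system with the Levinson-Durbin recursion:

      import numpy as np
      from scipy.linalg import solve_toeplitz

      def lpc(frame, order):
          # Autocorrelation sequence up to the required lag.
          r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
          # Solve the symmetric Toeplitz system R a = r for the predictor.
          a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
          return a  # predictor coefficients a_1 ... a_p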

  23.

    As implemented in openSMILE in the cLpc component.

  24.

    In openSMILE the cLsp component implements LSP computation based on code from the Speex codec library (www.speex.org).

  25.

    In openSMILE formant extraction is implemented via this method in the cFormant component, which processes the AR LP coefficients from the cLpc component.

  26.

    PLP via this method is implemented in openSMILE via the cPlp component.

  27.

    In openSMILE this Bark scale can be selected in the cMelspec component by setting the specScale option to ‘bark_schroed’.

  28.

    openSMILE allows for this flexibility because the PLP procedure builds on a chain of components: cTransformFFT, cFFTmagphase, cMelspec (for the non-linear band spectrum), and cPlp (for equal-loudness weighting, the intensity power law, autoregressive modelling, and cepstral coefficients).

  29.

    In openSMILE it is enabled by setting htkcompatible to 1 in the cPlp component.

  30.

    Configurable via the option compression in the openSMILE component cPlp.

  31.

    In openSMILE MFCC are computed via cMelspec (taking FFT magnitude spectrum from cFFTmagphase as input) and cMfcc.

  32.

    In openSMILE the floor value is also \(10^{-8}\) by default, and 1 when htkcompatible=1 in cMfcc.

  33.

    Please note that the DCT equations given in Young et al. (2006) and here differ because Young et al. (2006) start the summation at \(b=1\) for the first Mel-spectrum band, while here the first band is set at \(b=0\).
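
    For reference, the two indexing conventions describe the same DCT-II of the \(B\) log Mel-band values \(m_b\) (a sketch, with each side numbered according to its own convention):

    \[ c_i = \sqrt{\frac{2}{B}} \sum_{b=0}^{B-1} m_b \cos\left(\frac{\pi i}{B}\left(b + \frac{1}{2}\right)\right) = \sqrt{\frac{2}{B}} \sum_{b=1}^{B} m_b \cos\left(\frac{\pi i}{B}\left(b - \frac{1}{2}\right)\right) \]

    Substituting \(b \rightarrow b + 1\) in the left-hand sum yields the right-hand (HTK) form.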

  34.

    PLP-CC can be computed in openSMILE by creating a chain of cFFTmagphase, cMelspec, and cPlp and setting the appropriate options for cepstral coefficients in the cPlp component.

  35.

    In openSMILE this behaviour is implemented in the pitch smoother components and in the cPitchACF component; the F0final output contains \(F_0\) values forced to 0 for unvoiced regions. See the documentation for more details.

  36.

    In the cPitchACF component, which requires combined ACF and Cepstrum input from two instances of the cAcf component.

  37.

    The method is implemented in openSMILE in two components: cSpecScale, which performs spectral peak enhancement, smoothing, octave-scale interpolation, and auditory weighting; and cPitchShs, which expects the spectrum produced by cSpecScale and performs the shifting, compression, and summation as well as pitch candidate estimation by peak picking.
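
    A sketch of the summation step: on an octave (log-frequency) axis the \(n\)-th harmonic of any candidate \(F_0\) lies a constant \(\log_2 n\) octaves higher, so the subharmonic summation reduces to adding shifted, compressed copies of the spectrum (the number of harmonics and the compression factor are illustrative values):

      import numpy as np

      def shs(oct_spec, points_per_octave, n_harmonics=5, gamma=0.85):
          # oct_spec: magnitude spectrum resampled onto a log2-frequency axis,
          # as produced by cSpecScale.
          s = np.zeros_like(oct_spec)
          for n in range(1, n_harmonics + 1):
              shift = int(round(np.log2(n) * points_per_octave))
              if shift >= len(s):
                  break
              s[:len(s) - shift] += gamma ** (n - 1) * oct_spec[shift:]
          return s  # pitch candidates are the peaks of s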

  38.

    \(\gamma \) can be changed in openSMILE via the compressionFactor option of the cPitchShs component.

  39.

    The greedy peak picking behaviour is enabled in openSMILE by setting the greedyPeakAlgo option to 1. The old (non-greedy) version of the algorithm searched through the peaks from lowest to highest frequency and considered the first peak found as the first candidate. Another candidate was only added if its magnitude was higher than that of the previous first candidate. This behaviour was sub-optimal for Viterbi-based smoothing, which requires multiple candidates to evaluate the best path among them.
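
    A sketch contrasting the two candidate-selection strategies (the greedy variant is rendered here simply as "keep the strongest peaks", an assumption about its effect rather than a transcription of the component):

      def local_peaks(mag):
          return [i for i in range(1, len(mag) - 1)
                  if mag[i] > mag[i - 1] and mag[i] > mag[i + 1]]

      def candidates_old(mag):
          # Old behaviour: scan peaks from low to high frequency; the first
          # peak becomes the first candidate, and further peaks are added
          # only if they exceed the first candidate's magnitude.
          out = []
          for i in local_peaks(mag):
              if not out or mag[i] > mag[out[0]]:
                  out.append(i)
          return out

      def candidates_greedy(mag, n=4):
          # Greedy behaviour: keep the n highest peaks, giving the Viterbi
          # smoother multiple candidates to evaluate the best path over.
          return sorted(local_peaks(mag), key=lambda i: mag[i], reverse=True)[:n]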

  40.

    In openSMILE this behaviour is not implemented in the cPitchShs component, but rather via the configuration, e.g., in the smileF0_base.conf and IS13_ComParE.conf configurations. There, the cValbasedSelector component is used to force \(F_0\) values to 0 (indicating unvoiced parts) if the energy falls below the threshold.

  41.

    Available in openSMILE via the cPitchSmoother component.

  42.

    In openSMILE the Viterbi-based pitch smoothing is implemented in the cPitchSmootherViterbi component.

  43.

    In openSMILE version 2.0 and above, these parameters are implemented by the cHarmonics component.

  44.

    This definition of Jitter is implemented in openSMILE in the cPitchJitter component. It can be enabled via the jitterLocal option.

  45.

    This definition of delta Jitter is implemented in openSMILE in the cPitchJitter component. It can be enabled via the jitterDDP option.
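
    A sketch of this and the preceding Jitter definition over a sequence of pitch period lengths \(T_i\), using the usual definitions; consult the cPitchJitter documentation for the exact normalisation openSMILE applies:

      import numpy as np

      def jitter_local(periods):
          # Mean absolute difference of consecutive periods / mean period.
          T = np.asarray(periods, dtype=float)
          return np.mean(np.abs(np.diff(T))) / np.mean(T)

      def jitter_ddp(periods):
          # Mean absolute second-order difference of periods / mean period.
          T = np.asarray(periods, dtype=float)
          return np.mean(np.abs(np.diff(T, n=2))) / np.mean(T)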

  46.

    searchRangeRel option of the cPitchJitter component in openSMILE.

  47.

    minCC option in openSMILE.

  48.

    sourceQualityMean and sourceQualityRange options in cPitchJitter of openSMILE.

  49.

    In openSMILE CHROMA features are supported by the cChroma component, which requires as input a semi-tone band spectrum; this can be generated by the cTonespec component (preferred) or by the (more general) cMelspec component.

  50.

    In openSMILE CENS features can be computed from CHROMA (PCP) features with the cCens component.

  51.

    In openSMILE the simple difference function can be applied with the cDeltaRegression component with the delta window size set to 0 (option deltaWin = 0).

  52.

    In openSMILE these delta regression coefficients can be computed with the cDeltaRegression component.
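
    A sketch of the usual delta regression formula \(d_t = \sum_{w=1}^{W} w\,(x_{t+w} - x_{t-w}) / (2 \sum_{w=1}^{W} w^2)\), with \(W\) playing the role of deltaWin; replicating the edge frames is one common boundary convention, not necessarily the one openSMILE uses:

      import numpy as np

      def delta(x, W=2):
          if W == 0:
              # Fallback to the simple difference (deltaWin = 0), cf. note 51.
              return np.append(0.0, np.diff(x))
          xp = np.pad(np.asarray(x, dtype=float), W, mode='edge')
          denom = 2.0 * sum(w * w for w in range(1, W + 1))
          n = len(xp)
          return sum(w * (xp[W + w:n - W + w] - xp[W - w:n - W - w])
                     for w in range(1, W + 1)) / denom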

  53.

    Option deltaWin in openSMILE component cDeltaRegression.

  54.

    In openSMILE the smoothing via a moving average window is implemented in the cContourSmoother component. Feature names often carry the suffix _sma, which stands for ‘smoothed (with) moving average (filtering)’.
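
    A sketch of the smoothing filter (the window length is illustrative):

      import numpy as np

      def sma(x, win=3):
          # Centred moving average; mode='same' keeps the contour length.
          return np.convolve(x, np.ones(win) / win, mode='same')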

  55.

    In openSMILE univariate functionals are accessible via the cFunctionals component.

  56.

    Implementations of mean value related functionals are contained in the cFunctionalMeans component in openSMILE, which can be activated by setting functionalsEnabled = Means in the configuration of cFunctionals.

  57.

    And is the implementation used in openSMILE.

  58.

    And also implemented in the cFunctionalMeans component.

  59.

    In openSMILE the norm option of cFunctionalMeans can be set to segment to normalise counts and times etc. by N.

  60.

    Implemented in openSMILE in the cFunctionalMoments component.

  61.

    In openSMILE extreme values can be extracted with the cFunctionalExtremes component.

  62.

    Percentiles are implemented in openSMILE in the cFunctionalPercentiles component.

  63.

    In openSMILE the temporal centroid is implemented by the cFunctionalRegression component; as the sums are shared with the regression equations, computing both descriptors in the same component increases efficiency.

  64.

    In openSMILE the cFunctionalRegression component computes linear and quadratic regression coefficients.

  65.

    As used in this thesis, in order to avoid a name conflict with the quadratic regression coefficients a and b and time t.

  66.

    In openSMILE, the time scaling feature is enabled by the normRegCoeff option in the cFunctionalRegression component. Setting it to 1 enables the relative time scale \(g=1/N\), and setting it to 2 enables the absolute time scale in seconds.

  67.

    Option normInputs in openSMILE component cFunctionalRegression—also affects linear and quadratic error.

  68.

    Option normInputs in the openSMILE component cFunctionalRegression—note that this option also affects the regression coefficients as it effectively normalises the input range.

  69.

    In openSMILE these functionals are implemented in the component cFunctionalTimes.

  70.

    Configurable with the norm option in openSMILE.

  71.

    In openSMILE these functionals can be applied with the cFunctionalPeaks2 component; the cFunctionalPeaks component contains an older, obsolete peak picking algorithm.

  72.

    In openSMILE in cFunctionalPeaks2 norm=second has to be set for this behaviour (default).

  73.

    norm=frame in openSMILE.

  74.

    norm=segment in openSMILE.

  75.

    In openSMILE the norm option controls this behaviour (frames, seconds, segment, respectively).

  76.

    See the absThresh and relThresh options in the openSMILE component cFunctionalPeaks2.

  77.

    In openSMILE segment-based temporal functionals can be computed with the component cFunctionalSegments.

  78.

    Use the ravgLng option of the cFunctionalSegments component in openSMILE.

  79.

    This length can be changed via the pauseMinLng option of the cFunctionalSegments component.

  80.

    Computed in openSMILE by the cFunctionalOnset component.

  81.

    Provided by the cFunctionalCrossings component in openSMILE.

  82.

    Sample-based functionals are provided by the cFunctionalSamples component in openSMILE.

  83.

    In openSMILE the cFunctionalDCT component computes DCT coefficient functionals.

  84.

    In openSMILE the cFunctionalLpc component computes LP-analysis functionals.

  85.

    In openSMILE the cFunctionalModulation component computes modulation spectrum functionals.

  86.

    In openSMILE, the statistics can be applied to the modulation spectrum with the cSpectral component. Other components that expect magnitude spectra (e.g., the ACF in cAcf) can also read from the output of cFunctionalModulation.

  87.

    These features are not part of openSMILE (yet). It is planned to include them in future releases. C code is available from the author of this thesis upon request.

  88.

    E.g., as also implemented in the CURRENNT toolkit (http://sourceforge.net/projects/currennt) and the RNNLIB (http://sourceforge.net/projects/rnnl/).

References

  • R.G. Bachu, S. Kopparthi, B. Adapa, B.D. Barkana, Voiced/unvoiced decision for speech signals based on zero-crossing rate and energy, in Advanced Techniques in Computing Sciences and Software Engineering, ed. by K. Elleithy (Springer, Netherlands, 2010), pp. 279–282. doi:10.1007/978-90-481-3660-5_47. ISBN 978-90-481-3659-9

  • A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, L. Devillers, L. Vidrascu, N. Amir, L. Kessous, V. Aharonson, The impact of F0 extraction errors on the classification of prominence and emotion, in Proceedings of 16-th ICPhS (Saarbrücken, Germany, 2007), pp. 2201–2204

  • L.L. Beranek, Acoustic Measurements (Wiley, New York, 1949)

  • C.M. Bishop, Neural Networks for Pattern Recognition (Oxford University Press, New York, 1995)

  • R.B. Blackman, J. Tukey, Particular pairs of windows, The Measurement of Power Spectra, from the Point of View of Communications Engineering (Dover, New York, 1959)

  • S. Böck, M. Schedl, Polyphonic piano note transcription with recurrent neural networks, in Proceedings of ICASSP 2012 (Kyoto, 2012), pp. 121–124

  • P. Boersma, Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. IFA Proc. 17, 97–110 (1993)

  • P. Boersma, Praat, a system for doing phonetics by computer. Glot Int. 5(9/10), 341–345 (2001)

  • B.P. Bogert, M.J.R. Healy, J.W. Tukey, The quefrency alanysis of time series for echoes: cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe cracking, in Proceedings of the Symposium on Time Series Analysis, chapter 15, ed. by M. Rosenblatt (Wiley, New York, 1963), pp. 209–243

  • C.H. Chen, Signal Processing Handbook. Electrical Computer Engineering, vol. 51 (CRC Press, New York, 1988), 840 p. ISBN 978-0824779566

  • A. de Cheveigné, H. Kawahara, YIN, a fundamental frequency estimator for speech and music. J. Acoust. Soc. Am. (JASA) 111(4), 1917–1930 (2002)

  • T.-S. Chi, L.-Y. Yeh, C.-C. Hsu, Robust emotion recognition by spectro-temporal modulation statistic features. J. Ambient Intell. Humaniz. Comput. 3, 47–60 (2012). doi:10.1007/s12652-011-0088-5

  • J. Cooley, P. Lewis, P. Welch, The finite Fourier transform. IEEE Trans. Audio Electroacoust. 17(2), 77–85 (1969)

  • C. Cortes, V. Vapnik, Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)

  • R. Cowie, E. Douglas-Cowie, S. Savvidou, E. McMahon, M. Sawey, M. Schröder, Feeltrace: an instrument for recording perceived emotion in real time, in Proceedings of the ISCA Workshop on Speech and Emotion (Newcastle, Northern Ireland, 2000), pp. 19–24

  • G. Dahl, T. Sainath, G. Hinton, Improving deep neural networks for LVCSR using rectified linear units and dropout, in Proceedings of ICASSP 2013 (IEEE, Vancouver, 2013), pp. 8609–8613

  • G. Dahlquist, Å. Björck, N. Anderson, Numerical Methods (Prentice Hall, Englewood Cliffs, 1974)

  • S. Damelin, W. Miller, The Mathematics of Signal Processing (Cambridge University Press, Cambridge, 2011). ISBN 978-1107601048

  • G. de Krom, A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals. J. Speech Hear. Res. 36, 254–266 (1993)

  • J.R. Deller, J.G. Proakis, J.H.L. Hansen, Discrete-Time Processing of Speech Signals, University of Michigan, Macmillan Publishing Company (1993)

  • P. Deuflhard, Newton Methods For Nonlinear Problems: Affine Invariance and Adaptive Algorithms. Springer Series in Computational Mathematics, vol. 35 (Springer, Berlin, 2011), 440 p

  • E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. McRorie, J.C. Martin, L. Devillers, S. Abrilian, A. Batliner, N. Amir, K. Karpouzis, The HUMAINE Database. Lecture Notes in Computer Science, vol. 4738 (Springer, Berlin, 2007), pp. 488–500

  • J. Durbin, The fitting of time series models. Revue de l’Institut International de Statistique (Review of the International Statistical Institute) 28(3), 233–243 (1960)

  • C. Duxbury, M. Sandler, M. Davies, A hybrid approach to musical note onset detection, in Proceedings of the Digital Audio Effect Conference (DAFX’02) (Hamburg, Germany, 2002), pp. 33–38

  • L.D. Enochson, R.K. Otnes, Programming and Analysis for Digital Time Series Data, 1st edn. U.S. Department of Defense, Shock and Vibration Information Center (1968)

  • F. Eyben, B. Schuller, Music classification with the Munich openSMILE toolkit, in Proceedings of the Annual Meeting of the MIREX 2010 community as part of the 11th International Conference on Music Information Retrieval (ISMIR) (ISMIR, Utrecht, 2010a). http://www.music-ir.org/mirex/abstracts/2010/FE1.pdf

  • F. Eyben, B. Schuller, Tempo estimation from tatum and meter vectors, in Proceedings of the Annual Meeting of the MIREX 2010 community as part of the 11th International Conference on Music Information Retrieval (ISMIR) (ISMIR, Utrecht, 2010b). www.music-ir.org/mirex/abstracts/2010/ES1.pdf

  • F. Eyben, M. Wöllmer, B. Schuller, openEAR—introducing the Munich open-source emotion and affect recognition toolkit, in Proceedings of the 3rd International Conference on Affective Computing and Intelligent Interaction (ACII 2009), vol. I (IEEE, Amsterdam, 2009a), pp. 576–581

  • F. Eyben, M. Wöllmer, B. Schuller, A. Graves, From speech to letters—using a novel neural network architecture for grapheme based ASR, in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2009 (IEEE, Merano, 2009b), pp. 376–380

  • F. Eyben, M. Wöllmer, B. Schuller, openSMILE—The Munich versatile and fast open-source audio feature extractor, in Proceedings of ACM Multimedia 2010 (ACM, Florence, 2010a), pp. 1459–1462

  • F. Eyben, S. Böck, B. Schuller, A. Graves, Universal onset detection with bidirectional long-short term memory neural networks, in Proceedings of ISMIR 2010 (ISMIR, Utrecht, The Netherlands, 2010b), pp. 589–594

  • F. Eyben, M. Wöllmer, A. Graves, B. Schuller, E. Douglas-Cowie, R. Cowie, On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues. J. Multimodal User Interfaces (JMUI) 3(1–2), 7–19 (2010c). doi:10.1007/s12193-009-0032-6

  • F. Eyben, M. Wöllmer, B. Schuller, A multi-task approach to continuous five-dimensional affect sensing in natural speech, ACM Trans. Interact. Intell. Syst. 2(1), Article No. 6, 29 p. Special Issue on Affective Interaction in Natural Environments (2012)

  • G. Fant, Speech Sounds and Features (MIT press, Cambridge, 1973), p. 227

  • H.G. Feichtinger, T. Strohmer, Gabor Analysis and Algorithms (Birkhäuser, Boston, 1998). ISBN 0-8176-3959-4

  • J.-B.-J. Fourier, Théorie analytique de la chaleur, University of Lausanne, Switzerland (1822)

  • T. Fujishima, Realtime chord recognition of musical sound: a system using common lisp music, in Proceedings of the International Computer Music Conference (ICMC) 1999 (Bejing, China, 1999), pp. 464–467

  • S. Furui, Digital Speech Processing, Synthesis, and Recognition. Signal Processing and Communications, 2nd edn. (Marcel Dekker Inc., New York, 1996)

  • C. Glaser, M. Heckmann, F. Joublin, C. Goerick, Combining auditory preprocessing and Bayesian estimation for robust formant tracking. IEEE Trans. Audio Speech Lang. Process. 18(2), 224–236 (2010)

  • E. Gómez, Tonal description of polyphonic audio for music content processing. INFORMS J. Comput. 18(3), 294–304 (2006). doi:10.1287/ijoc.1040.0126

  • F. Gouyon, F. Pachet, O. Delerue, Classifying percussive sounds: a matter of zero-crossing rate? in Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-00) (Verona, Italy, 2000)

  • A. Graves, Supervised sequence labelling with recurrent neural networks. Doctoral thesis, Technische Universität München, Munich, Germany (2008)

  • A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18(5–6), 602–610 (2005)

  • W.D. Gregg, Analog & Digital Communication (Wiley, New York, 1977). ISBN 978-0-471-32661-8

  • M. Grimm, K. Kroschel, S. Narayanan, Support vector regression for automatic recognition of spontaneous emotions in speech, in Proceedings of ICASSP 2007, vol. 4 (IEEE, Honolulu, 2007), pp. 1085–1088

  • B. Hammarberg, B. Fritzell, J. Gauffin, J. Sundberg, L. Wedin, Perceptual and acoustic correlates of abnormal voice qualities. Acta Otolaryngol. 90, 441–451 (1980)

  • H. Hanson, Glottal characteristics of female speakers: acoustic correlates. J. Acoust. Soc. Am. (JASA) 101, 466–481 (1997)

  • H. Hanson, E.S. Chuang, Glottal characteristics of male speakers: acoustic correlates and comparison with female data. J. Acoust. Soc. Am. (JASA) 106, 1064–1077 (1999)

  • F.J. Harris, On the use of windows for harmonic analysis with the discrete Fourier transform. Proc. IEEE 66, 51–83 (1978)

  • H. Hermansky, Perceptual linear predictive (PLP) analysis for speech. J. Acoust. Soc. Am. (JASA) 87, 1738–1752 (1990)

  • H. Hermansky, N. Morgan, A. Bayya, P. Kohn, RASTA-PLP speech analysis technique, in Proceedings of ICASSP 1992, vol. 1 (IEEE, San Francisco, 1992), pp. 121–124

  • D.J. Hermes, Measurement of pitch by subharmonic summation. J. Acoust. Soc. Am. (JASA) 83(1), 257–264 (1988)

  • W. Hess, Pitch Determination of Speech Signals: Algorithms and Devices (Springer, Berlin, 1983)

  • S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

  • S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, in A Field Guide to Dynamical Recurrent Neural Networks, ed. by S.C. Kremer, J.F. Kolen (IEEE Press, New York, 2001)

  • ISO16:1975. ISO Standard 16:1975 Acoustics: Standard tuning frequency (Standard musical pitch). International Organization for Standardization (ISO) (1975)

  • T. Joachims, Text categorization with support vector machines: learning with many relevant features, in Proceedings of the 10th European Conference on Machine Learning (ECML-98), ed. by C. Nédellec, C. Rouveirol (Springer, Chemnitz, 1998), pp. 137–142

  • J.D. Johnston, Transform coding of audio signals using perceptual noise criteria. IEEE J. Sel. Areas Commun. 6(2), 314–332 (1988)

  • P. Kabal, R.P. Ramachandran, The computation of line spectral frequencies using Chebyshev polynomials. IEEE Trans. Acoust. Speech Signal Process. 34(6), 1419–1426 (1986)

  • J.F. Kaiser, Some useful properties of Teager's energy operators, in Proceedings of ICASSP 1993, vol. 3, pp. 149–152 (IEEE, Minneapolis, 1993). doi:10.1109/ICASSP.1993.319457

  • G.S. Kang, L.J. Fransen, Application of line spectrum pairs to low bit rate speech encoders, in Proceedings of ICASSP 1985, vol.10 (IEEE, Tampa, 1985), pp. 244–247. doi:10.1109/ICASSP.1985.1168526

  • R. Kendall, E. Carterette, Difference thresholds for timbre related to spectral centroid, in Proceedings of the 4-th International Conference on Music Perception and Cognition (ICMPC) (Montreal, Canada, 1996), pp. 91–95

  • J.F. Kenney, E.S. Keeping, Root mean square, Mathematics of Statistics, vol. 1, 3rd edn. (Van Nostrand, Princeton, 1962), pp. 59–60

  • A. Khintchine, Korrelationstheorie der stationären stochastischen Prozesse. Math. Ann. 109, 604–615 (1934)

  • A. Kießling, Extraktion und Klassifikation prosodischer Merkmale in der automatischen Sprachverarbeitung (Shaker, Aachen, 1997). ISBN 978-3-8265-2245-1

  • A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems, vol. 25, ed. by F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (Curran Associates, Inc., 2012), pp. 1097–1105

  • K. Kroschel, G. Rigoll, B. Schuller, Statistische Informationstechnik, 5th edn. (Springer, Berlin, 2011)

  • K. Lee, M. Slaney, Acoustic chord transcription and key extraction from audio using key-dependent HMMs trained on synthesized audio. IEEE Trans. Audio Speech Lang. Process. 16(2), 291–301 (2008). doi:10.1109/TASL.2007.914399. ISSN 1558-7916

  • P. Lejeune-Dirichlet, Sur la convergence des séries trigonométriques qui servent à représenter une fonction arbitraire entre des limites données. Journal für die reine und angewandte Mathematik 4, 157–169 (1829)

  • N. Levinson, A heuristic exposition of Wiener's mathematical theory of prediction and filtering. J. Math. Phys. 25, 110–119 (1947a)

  • N. Levinson, The Wiener RMS error criterion in filter design and prediction. J. Math. Phys. 25(4), 261–278 (1947b)

  • P.I. Lizorkin, Fourier transform, in Encyclopaedia of Mathematics, ed. by M. Hazewinkel (Springer, Berlin, 2002). ISBN 1-4020-0609-8

  • I. Luengo, Evaluation of pitch detection algorithms under real conditions, in Proceedings of ICASSP 2007, vol. 4 (IEEE, Honolulu, 2007), pp. 1057–1060

  • J. Makhoul, Linear prediction: a tutorial review. Proc. IEEE 63(5), 561–580 (1975)

  • J. Makhoul, L. Cosell, LPCW: an LPC vocoder with linear predictive spectral warping, in Proceedings of ICASSP 1976 (IEEE, Philadelphia, 1976), pp. 466–469

  • B.S. Manjunath, P. Salembier, T. Sikoraa (eds.), Introduction to MPEG-7: Multimedia Content Description Interface (Wiley, Berlin, 2002), 396 p. ISBN 978-0-471-48678-7

  • P. Martin, Détection de \(f_0\) par intercorrelation avec une fonction peigne. J. Etude Parole 12, 221–232 (1981)

  • P. Martin, Comparison of pitch detection by cepstrum and spectral comb analysis, in Proceedings of ICASSP 1982 (IEEE, Paris, 1982), pp. 180–183

  • J. Martinez, H. Perez, E. Escamilla, M.M. Suzuki, Speaker recognition using mel frequency cepstral coefficients (MFCC) and vector quantization (VQ) techniques, in Proceedings of the 22nd International Conference on Electrical Communications and Computers (CONIELECOMP) (Cholula, Puebla, 2012), pp. 248–251. doi:10.1109/CONIELECOMP.2012.6189918

  • P. Masri, Computer modelling of sound for transformation and synthesis of musical signal. Doctoral thesis, University of Bristol, Bristol (1996)

  • S. McCandless, An algorithm for automatic formant extraction using linear prediction spectra. IEEE Trans. Acoust. Speech Signal Process. 22, 134–141 (1974)

  • D.D. Mehta, D. Rudoy, P.K. Wolfe, Kalman-based autoregressive moving average modeling and inference for formant and antiformant tracking. J. Acoust. Soc. Am. (JASA) 132(3), 1732–1746 (2012)

  • H. Misra, S. Ikbal, H. Bourlard, H. Hermansky, Spectral entropy based feature for robust ASR, in Proceedings of ICASSP 2004, vol. 1 (IEEE, Montreal, Canada, 2004), pp. I–193–6. doi:10.1109/ICASSP.2004.1325955

  • O. Mubarak, E. Ambikairajah, J. Epps, T. Gunawan, Modulation features for speech and music classification, in Proceedings of the 10th IEEE Singapore International Conference on Communication systems (ICCS) 2006 (IEEE, 2006), pp. 1–5. doi:10.1109/ICCS.2006.301515

  • M. Müller, Information Retrieval for Music and Motion (Springer, Berlin, 2007)

  • M. Müller, F. Kurth, M. Clausen, Audio matching via chroma-based statistical features, in Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR) (London, 2005a), pp. 288–295

  • M. Müller, F. Kurth, M. Clausen, Chroma-based statistical audio features for audio matching, in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE, 2005b), pp. 275–278

  • N.J. Nalini, S. Palanivel, Emotion recognition in music signal using AANN and SVM. Int. J. Comput. Appl. 77(2), 7–14 (2013)

  • A.M. Noll, Cepstrum pitch determination. J. Acoust. Soc. Am. (JASA) 41(2), 293–309 (1967)

  • A.M. Noll, Pitch determination of human speech by the harmonic product spectrum, the harmonic sum spectrum, and a maximum likelihood estimate, in Symposium on Computer Processing in Communication, vol. 19 (University of Brooklyn, New York, 1970), pp. 779–797, edited by the Microwave Institute

  • A.H. Nuttall, Some windows with very good sidelobe behavior. IEEE Trans. Acoust. Speech Signal Process. ASSP-29, 84–91 (1981)

  • A.V. Oppenheim, R.W. Schafer, Digital Signal Processing (Prentice-Hall, Englewood Cliffs, 1975)

  • A.V. Oppenheim, A.S. Willsky, S. Hamid, Signals and Systems, 2nd edn. (Prentice Hall, Upper Saddle River, 1996)

  • A.V. Oppenheim, R.W. Schafer, J.R. Buck, Discrete-Time Signal Processing (Prentice Hall, Upper Saddle River, 1999)

  • T.W. Parsons, Voice and Speech Processing. Electrical and Computer Engineering (University of Michigan, McGraw-Hill, 1987)

  • S. Patel, K.R. Scherer, J. Sundberg, E. Björkner, Acoustic markers of emotions based on voice physiology, in Proceedings of Speech Prosody 2010 (ISCA, Chicago, 2010), pp. 100865:1–4

  • G. Peeters, A large set of audio features for sound description. Technical report, IRCAM, Paris, France (2004). http://recherche.ircam.fr/equipes/analyse-synthese/peeters/ARTICLES/Peeters_2003_cuidadoaudiofeatures.pdf. Accessed 3 Sept. 2013

  • V. Pham, C. Kermorvant, J. Louradour, Dropout improves recurrent neural networks for handwriting recognition, in CoRR (2013) (online), arXiv:1312.4569

  • J. Platt, Sequential minimal optimization: a fast algorithm for training support vector machines, Technical report MSR-98-14, Microsoft Research (1998)

  • L.R. Rabiner, On the use of autocorrelation analysis for pitch detection. IEEE Trans. Acoust. Speech Signal Process. 25(1), 24–33 (1977). doi:10.1109/TASSP.1977.1162905

  • L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)

  • L.R. Rabiner, B.H. Juang, An introduction to hidden Markov models. IEEE ASSP Mag. 3(1), 4–16 (1986)

  • L. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition, 1st edn. (Prentice Hall, Englewood Cliffs, 1993)

  • L. Rade, B. Westergren, Springers Mathematische Formeln (German translation by P. Vachenauer), 3rd edn. (Springer, Berlin, 2000). ISBN 3-540-67505-1

  • J.F. Reed, F. Lynn, B.D. Meade, Use of coefficient of variation in assessing variability of quantitative assays. Clin. Diagn. Lab. Immunol. 9(6), 1235–1239 (2002)

  • M. Riedmiller, H. Braun, A direct adaptive method for faster backpropagation learning: the RPROP algorithm, in Proceedings of the IEEE International Conference on Neural Networks, vol. 1 (IEEE, San Francisco, 1993), pp. 586–591. doi:10.1109/icnn.1993.298623

  • F. Ringeval, A. Sonderegger, J. Sauer, D. Lalanne, Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions, in Proceedings of the 2nd International Workshop on Emotion Representation, Analysis and Synthesis in Continuous Time and Space (EmoSPACE), held in conjunction with FG 2013 (IEEE, Shanghai, 2013), pp. 1–8

  • S. Rosen, P. Howell, The vocal tract as a linear system, Signals and Systems for Speech and Hearing, 1st edn. (Emerald Group, 1991), pp. 92–99. ISBN 978-0125972314

  • G. Ruske, Automatische Spracherkennung. Methoden der Klassifikation und Merkmalsextraktion, 2nd edn. (Oldenbourg, Munich, 1993)

  • K.R. Scherer, J. Sundberg, L. Tamarit, G.L. Salomão, Comparing the acoustic expression of emotion in the speaking and the singing voice. Comput. Speech Lang. 29(1), 218–235 (2015). doi:10.1016/j.csl.2013.10.002

  • B. Schölkopf, A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning) (MIT Press, Cambridge, 2002)

  • M. Schröder, E. Bevacqua, R. Cowie, F. Eyben, H. Gunes, D. Heylen, M. ter Maat, G. McKeown, S. Pammi, M. Pantic, C. Pelachaud, B. Schuller, E. de Sevin, M. Valstar, M. Wöllmer, Building autonomous sensitive artificial listeners. IEEE Trans. Affect. Comput. 3(2), 165–183 (2012)

  • M.R. Schroeder, Period histogram and product spectrum: new methods for fundamental-frequency measurement. J. Acoust. Soc. Am. (JASA) 43, 829–834 (1968)

  • M.R. Schroeder, Recognition of complex acoustic signals, in Life Sciences Research Reports, vol. 5, ed. by T.H. Bullock (Abakon Verlag, Berlin, 1977), 324 p

  • B. Schuller, Automatische Emotionserkennung aus sprachlicher und manueller Interaktion. Doctoral thesis, Technische Universität München, Munich, Germany (2006)

  • B. Schuller, Intelligent Audio Analysis. Signals and Communication Technology (Springer, Berlin, 2013)

  • B. Schuller, A. Batliner, Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing (Wiley, Hoboken, 2013), 344 p. ISBN 978-1119971368

  • B. Schuller, G. Rigoll, M. Lang, Hidden Markov model-based speech emotion recognition, in Proceedings of ICASSP 2003, vol. 2 (IEEE, Hong Kong, 2003), pp. II 1–4

  • B. Schuller, D. Arsić, F. Wallhoff, G. Rigoll, Emotion recognition in the noise applying large acoustic feature sets, in Proceedings of the 3rd International Conference on Speech Prosody (SP) 2006 (ISCA, Dresden, 2006), pp. 276–289

  • B. Schuller, F. Eyben, G. Rigoll, Fast and robust meter and tempo recognition for the automatic discrimination of ballroom dance styles, in Proceedings of ICASSP 2007, vol. I (IEEE, Honolulu, 2007), pp. 217–220

  • B. Schuller, F. Eyben, G. Rigoll, Beat-synchronous data-driven automatic chord labeling, in Proceedings 34. Jahrestagung für Akustik (DAGA) 2008 (DEGA, Dresden, 2008), pp. 555–556

  • B. Schuller, S. Steidl, A. Batliner, F. Jurcicek, The INTERSPEECH 2009 emotion challenge, in Proceedings of INTERSPEECH 2009 (Brighton, 2009a), pp. 312–315

  • B. Schuller, B. Vlasenko, F. Eyben, G. Rigoll, A. Wendemuth, Acoustic emotion recognition: A benchmark comparison of performances, in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2009 (IEEE, Merano, 2009b), pp. 552–557

  • B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, S. Narayanan, The INTERSPEECH 2010 paralinguistic challenge, in Proceedings of INTERSPEECH 2010 (ISCA, Makuhari, 2010), pp. 2794–2797

  • B. Schuller, A. Batliner, S. Steidl, F. Schiel, J. Krajewski, The INTERSPEECH 2011 speaker state challenge, in Proceedings of INTERSPEECH 2011 (ISCA, Florence, 2011), pp. 3201–3204

  • B. Schuller, S. Steidl, A. Batliner, E. Nöth, A. Vinciarelli, F. Burkhardt, R. van Son, F. Weninger, F. Eyben, T. Bocklet, G. Mohammadi, B. Weiss, The INTERSPEECH 2012 speaker trait challenge, in Proceedings of INTERSPEECH 2012 (ISCA, Portland, 2012a)

  • B. Schuller, M. Valstar, R. Cowie, M. Pantic, AVEC 2012: the continuous audio/visual emotion challenge—an introduction, in Proceedings of the 14th ACM International Conference on Multimodal Interaction (ICMI) 2012, ed. by L.-P. Morency, D. Bohus, H.K. Aghajan, J. Cassell, A. Nijholt, J. Epps (ACM, Santa Monica, 2012b), pp. 361–362

  • B. Schuller, F. Pokorny, S. Ladstätter, M. Fellner, F. Graf, L. Paletta. Acoustic geo-sensing: recognising cyclists’ route, route direction, and route progress from cell-phone audio, in Proceedings of ICASSP 2013 (IEEE, Vancouver, 2013a), pp. 453–457

  • B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, et al., The INTERSPEECH 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism, in Proceedings of INTERSPEECH 2013 (ISCA, Lyon, 2013b), pp. 148–152

  • M. Schuster, K.K. Paliwal, Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)

  • C.E. Shannon, A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423, 623–656 (1948). (Reprint with corrections in: ACM SIGMOBILE Mobile Computing and Communications Review 5(1), 3–55 (2001))

  • M. Slaney, An efficient implementation of the Patterson-Holdsworth auditory filter bank. Technical Report 35, Apple Computer Inc. (1993)

  • M. Soleymani, M.N. Caro, E.M. Schmidt, Y.-H. Yang, The MediaEval 2013 brave new task: emotion in music, in Proceedings of the MediaEval 2013 Workshop (CEUR-WS.org, Barcelona, 2013)

  • F.K. Soong, B.-W. Juang, Line spectrum pair (LSP) and speech data compression, in Proceedings of ICASSP 1984 (IEEE, San Diego, 1984), pp. 1.10.1–1.10.4

  • A. Spanias, T. Painter, V. Atti, Audio Signal Processing and Coding (Wiley, Hoboken, 2007), 464 p. ISBN 978-0-471-79147-8

  • J. Stadermann, G. Rigoll, A hybrid SVM/HMM acoustic modeling approach to automatic speech recognition, in Proceedings of INTERSPEECH 2004 (ISCA, Jeju, 2004), pp. 661–664

  • J. Stadermann, G. Rigoll, Hybrid NN/HMM acoustic modeling techniques for distributed speech recognition. Speech Commun. 48(8), 1037–1046 (2006)

  • J.F. Steffensen, Interpolation, 2nd edn. (Dover Publications, New York, 2012), 256 p. ISBN 978-0486154831

  • P. Suman, S. Karan, V. Singh, R. Maringanti, Algorithm for gunshot detection using mel-frequency cepstrum coefficients (MFCC), in Proceedings of the Ninth International Conference on Wireless Communication and Sensor Networks, ed. by R. Maringanti, M. Tiwari, A. Arora. Lecture Notes in Electrical Engineering, vol. 299 (Springer, India, 2014), pp. 155–166. doi:10.1007/978-81-322-1823-4_15. ISBN 978-81-322-1822-7

  • J. Sundberg, The Science of the Singing Voice (Northern Illinois University Press, Dekalb, 1987), p. 226. ISBN 978-0-87580-542-9

  • D. Talkin, A robust algorithm for pitch tracking (RAPT), in Speech Coding and Synthesis, ed. by W.B. Kleijn, K.K. Paliwal (Elsevier, New York, 1995), pp. 495–518. ISBN 0444821694

  • L. Tamarit, M. Goudbeek, K.R. Scherer, Spectral slope measurements in emotionally expressive speech, in Proceedings of SPKD-2008 (ISCA, 2008), paper 007

  • H.M. Teager, S.M. Teager, Evidence for nonlinear sound production mechanisms in the vocal tract, in Proceedings of Speech Production and Speech Modelling, Bonas, France, ed. by W.J. Hardcastle, A. Marchal. NATO Advanced Study Institute Series D, vol. 55 (Kluwer Academic Publishers, Boston, 1990), pp. 241–261

  • E. Terhardt, Pitch, consonance, and harmony. J. Acoust. Soc. Am. (JASA) 55, 1061–1069 (1974)

  • E. Terhardt, Calculating virtual pitch. Hear. Res. 1, 155–182 (1979)

  • H. Traunmueller, Analytical expressions for the tonotopic sensory scale. J. Acoust. Soc. Am. (JASA) 88, 97–100 (1990)

  • K. Turkowski, S. Gabriel, Filters for common resampling tasks, in Graphics Gems, ed. by A.S. Glassner (Academic Press, New York, 1990), pp. 147–165. ISBN 978-0-12-286165-9

  • G. Tzanetakis, P. Cook, Musical genre classification of audio signals. IEEE Trans. Speech Audio Process. 10(5), 293–302 (2002). doi:10.1109/TSA.2002.800560. ISSN 1063-6676

  • P.-F. Verhulst, Recherches mathématiques sur la loi d'accroissement de la population (mathematical researches into the law of population growth increase). Nouveaux Mémoires de l'Académie Royale des Sciences et Belles-Lettres de Bruxelles 18, 1–42 (1845)

  • D. Ververidis, C. Kotropoulos, Emotional speech recognition: resources, features, and methods. Speech Commun. 48(9), 1162–1181 (2006)

  • A.J. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13(2), 260–269 (1967)

  • B. Vlasenko, B. Schuller, A. Wendemuth, G. Rigoll, Frame vs. turn-level: emotion recognition from speech considering static and dynamic processing, in Proceedings of the 2nd International Conference on Affective Computing and Intelligent Interaction (ACII) 2007, ed. by A. Paiva, R. Prada, R.W. Picard. Lecture Notes in Computer Science, Lisbon, Portugal, vol. 4738 (Springer, Berlin, 2007), pp. 139–147

  • A.L. Wang, An industrial-strength audio search algorithm, in Proceedings of ISMIR (Baltimore, 2003)

  • F. Weninger, F. Eyben, B. Schuller, The TUM approach to the mediaeval music emotion task using generic affective audio features, in Proceedings of the MediaEval 2013 Workshop (CEUR-WS.org, Barcelona, 2013)

  • F. Weninger, F. Eyben, B. Schuller, On-line continuous-time music mood regression with deep recurrent neural networks, in Proceedings of ICASSP 2014 (IEEE, Florence, 2014), pp. 5449–5453

  • P. Werbos, Backpropagation through time: what it does and how to do it. Proc. IEEE 78, 1550–1560 (1990)

  • N. Wiener, Generalized harmonic analysis. Acta Math. 55(1), 117–258 (1930)

  • N. Wiener, Extrapolation, Interpolation and Smoothing of Stationary Time Series, M.I.T. Press Paperback Series (Book 9) (MIT Press, Cambridge, 1964), 163 p

  • M. Wöllmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas-Cowie, R. Cowie, Abandoning emotion classes—towards continuous emotion recognition with modelling of long-range dependencies, in Proceedings of INTERSPEECH 2008 (ISCA, Brisbane, 2008), pp. 597–600

  • M. Wöllmer, F. Eyben, A. Graves, B. Schuller, G. Rigoll, Improving keyword spotting with a tandem BLSTM-DBN architecture, in Advances in Non-linear Speech Processing: Revised selected papers of the International Conference on Nonlinear Speech Processing (NOLISP) 2009, ed. by J. Sole-Casals, V. Zaiats. Lecture Notes on Computer Science (LNCS), vol. 5933/2010 (Springer, Vic, 2010), pp. 68–75

  • M. Wöllmer, M. Kaiser, F. Eyben, B. Schuller, G. Rigoll, LSTM-Modeling of Continuous Emotions in an Audiovisual Affect Recognition Framework. Image Vis. Comput. (IMAVIS) 31(2), 153–163. Special Issue on Affect Analysis in Continuous Input (2013)

  • S. Wu, T.H. Falk, W.-Y. Chan, Automatic speech emotion recognition using modulation spectral features. Speech Commun. 53(5), 768–785 (2011). doi:10.1016/j.specom.2010.08.013. ISSN 0167-6393 (Perceptual and Statistical Audition)

  • Q. Yan, S. Vaseghi, E. Zavarehei, B. Milner, J. Darch, P. White, I. Andrianakis, Formant-tracking linear prediction model using HMMs and Kalman filters for noisy speech processing. Comput. Speech Lang. 21(3), 543–561 (2007). doi:10.1016/j.csl.2006.11.001

  • S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland, The HTK Book, Cambridge University Engineering Department, for HTK version 3.4 edition (2006)

  • E. Yumoto, W.J. Gould, Harmonics-to-noise ratio as an index of the degree of hoarseness. J. Acoust. Soc. Am. (JASA) 71(6), 1544–1549 (1981)

  • G. Zhou, J.H.L. Hansen, J.F. Kaiser, Nonlinear feature based classification of speech under stress. IEEE Trans. Speech Audio Process. 9(3), 201–216 (2001). doi:10.1109/89.905995

  • X. Zuo, P. Fung, A cross gender and cross lingual study of stress recognition in speech without linguistic features, in Proceedings of the 17th ICPhS (Hong Kong, China, 2011)

  • E. Zwicker, Subdivision of the audible frequency range into critical bands. J. Acoust. Soc. Am. (JASA) 33(2), 248–248 (1961)

  • E. Zwicker, Masking and psychological excitation as consequences of ear’s frequency analysis, in Frequency Analysis and Periodicity Detection in Hearing, ed. by R. Plomp, G.F. Smoorenburg (Sijthoff, Leyden, 1970)

  • E. Zwicker, E. Terhardt, Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. J. Acoust. Soc. Am. (JASA) 68, 1523–1525 (1980)

  • E. Zwicker, H. Fastl, Psychoacoustics—Facts and Models, 2nd edn. (Springer, Berlin, 1999), 417 p. ISBN 978-3540650638

Author information

Correspondence to Florian Eyben.

Copyright information

© 2016 Springer International Publishing Switzerland

Cite this chapter

Eyben, F. (2016). Acoustic Features and Modelling. In: Real-time Speech and Music Classification by Large Audio Feature Space Extraction. Springer Theses. Springer, Cham. https://doi.org/10.1007/978-3-319-27299-3_2

  • DOI: https://doi.org/10.1007/978-3-319-27299-3_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27298-6

  • Online ISBN: 978-3-319-27299-3

  • eBook Packages: Engineering, Engineering (R0)
