Study on Speech Representation Based on Spikegram for Speech Fingerprints

  • Dung Kim Tran
  • Masashi Unoki
Conference paper
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 82)


This paper investigates how well spikegrams represent the speech content and voice identities of speech signals. Current speech representation models employ block-based coding techniques to transform speech signals into spectrograms, from which features are extracted for further analysis. One issue with this approach is that a speaker produces different speech signals for the same speech content; processing these signals in a piecewise manner therefore yields different spectrograms and, consequently, different fingerprints for the same words spoken by the same speaker. For this reason, a speech representation model must remain consistent under variations in speech to yield accurate and reliable speech fingerprints. It has been reported that sparse coding surpasses block-based coding in representing speech signals because it can capture their underlying structures. An over-complete representation model, known as a spikegram, can be created by using a matching pursuit algorithm with a Gammatone dictionary to provide a better alternative to a spectrogram. This paper reports on the ability of spikegrams to represent the speech content and voice identities of speakers, which can be used to improve the robustness of speech fingerprints.
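The idea of building a spikegram with matching pursuit over a Gammatone dictionary can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `gammatone_kernel` and `matching_pursuit`, the 25 ms kernel duration, the filter order, and the ERB-based bandwidth are all assumptions chosen for illustration. Each iteration greedily picks the kernel and time shift that correlate most strongly with the residual, records that choice as a spike, and subtracts its contribution.

```python
import numpy as np

def gammatone_kernel(fc, fs, duration=0.025, order=4):
    """Unit-norm gammatone kernel at centre frequency fc (Hz).

    Bandwidth follows the common ERB approximation (an assumption here,
    not a value taken from the paper).
    """
    t = np.arange(int(round(duration * fs))) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)
    b = 1.019 * erb
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.linalg.norm(g)

def matching_pursuit(signal, kernels, n_spikes):
    """Greedy matching pursuit; returns (spikes, residual).

    Each spike is (kernel_index, sample_offset, amplitude). The signal
    must be at least as long as every kernel.
    """
    residual = np.array(signal, dtype=float)
    spikes = []
    for _ in range(n_spikes):
        best = None
        for k, g in enumerate(kernels):
            # Inner product of the residual with g at every valid shift.
            corr = np.correlate(residual, g, mode="valid")
            pos = int(np.argmax(np.abs(corr)))
            if best is None or abs(corr[pos]) > abs(best[2]):
                best = (k, pos, float(corr[pos]))
        k, pos, amp = best
        # Kernels are unit-norm, so amp is the projection coefficient.
        residual[pos:pos + len(kernels[k])] -= amp * kernels[k]
        spikes.append((k, pos, amp))
    return spikes, residual
```

The returned list of (kernel index, time offset, amplitude) triples is the sparse, spike-based description of the signal; plotting kernel index against time offset gives the spikegram picture. Because every kernel is unit-norm, each iteration removes the energy of the selected spike from the residual, so the residual energy never increases.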


Speech fingerprint, Spikegram, Matching pursuit algorithm, Gammatone filterbank, Non-negative matrix factorization



This work was supported by a Grant-in-Aid for Scientific Research (B) (No. 17H01761).



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. School of Information Science, Japan Advanced Institute of Science and Technology, Nomi, Japan
