A Multimedia Application: Spatial Perceptual Entropy of Multichannel Audio Signals
Multimedia data usually have to be compressed before transmission; a higher compression rate, or equivalently a lower bitrate, relieves the load on communication channels but negatively impacts quality. We investigate the bitrate lower bound for perceptually lossless compression of a major type of multimedia: multichannel audio signals. This bound equals the perceptible information rate of the signals. Traditionally, Perceptual Entropy (PE), based primarily on monaural hearing, measures the perceptual information rate of individual channels. But PE cannot measure the spatial information captured by binaural hearing and is thus unsuitable for estimating the bitrate bound of Spatial Audio Coding (SAC). To measure this spatial information, we build a Binaural Cue Physiological Perception Model (BCPPM) on the ground of binaural hearing, which represents spatial information on the physical and physiological layers. This model enables computing Spatial Perceptual Entropy (SPE), the lower bitrate bound for SAC. For real-world stereo audio signals of various types, our experiments indicate that SPE reliably estimates their spatial information rate. Therefore, "SPE plus PE" gives lower bitrate bounds for communicating multichannel audio signals with transparent quality.
Keywords: Sound Source, Audio Signal, Interaural Time Difference, Lossless Compression, Interaural Level Difference
A central goal in multimedia communications is to deliver quality content at the lowest possible bitrate. By quality, we mean the perceived fidelity of the received content against the original. The lowest possible bitrate depends on two disparate concepts: entropy and perception. Entropy measures the quantity of information, but not all information is perceptible.
To pursue this goal, we want to know how many bits are sufficient to convey quality multimedia content. Lossless compression always ensures the highest possible quality: the objective redundancy in the content is the only source of compression, and there is a limit, the Shannon entropy, the lowest possible bitrate with perfect decompression. Nevertheless, this limit is very hard, if not impossible, to compute due to the diversity and complexity of the probability models of multimedia content. Using Huffman coding, run-length coding, arithmetic coding, and other entropy coding techniques, today's state-of-the-art lossless audio coders typically achieve a compression rate of 1/3 to 2/3, or 230–460 kbps per channel, for CD music.
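The entropy limit above can be illustrated with a minimal sketch: a histogram-based (zeroth-order) entropy estimate of integer audio samples. This ignores the temporal correlation that real lossless coders also exploit, so it only illustrates the per-symbol redundancy; the signal below is a hypothetical quantized tone, not one of the test sequences.

```python
import numpy as np

def empirical_entropy_bits(samples):
    """Estimate zeroth-order Shannon entropy (bits per sample) from a histogram."""
    _, counts = np.unique(np.asarray(samples), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# A quantized 440 Hz tone with 16-bit-style integer samples at 48 kHz:
n = np.arange(48000)
sine = np.round(3000 * np.sin(2 * np.pi * 440 * n / 48000)).astype(int)
print(empirical_entropy_bits(sine))  # well below 16 bits: redundancy to exploit
```

The gap between this estimate and the 16 bits of the PCM representation is exactly the redundancy an entropy coder removes.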
Lossless compression generally conveys higher than necessary quality in multimedia communications. Multimedia content abounds in subjective irrelevancy, objective information we cannot sense, so perceptually lossless compression suffices. For audio signals, this means lossless only to the extent that the distortion after decompression is imperceptible to normal human ears (usually called transparent coding); the bitrate can then be much lower than that of true lossless coding. Perceptual audio coding greatly reduces communication bandwidth or storage space by removing this irrelevancy. Psychoacoustics provides a quantitative theory of this irrelevancy [4, 5, 6, 7]: the limits of auditory perception, such as the audible frequency range (20–20000 Hz), the Absolute Threshold of Hearing (ATH), and the masking effect. In state-of-the-art perceptual audio coders, such as MPEG-2/4 Advanced Audio Coding (AAC [9, 10]), 64 kbps is enough for transparent coding. The Shannon entropy cannot measure the perceptible information or give the bitrate bound in this case.
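The ATH mentioned above has a well-known closed-form fit due to Terhardt (a standard psychoacoustic approximation, not a formula from this paper), which a sketch can evaluate directly:

```python
import numpy as np

def ath_db_spl(f_hz):
    """Terhardt's approximation of the Absolute Threshold of Hearing (dB SPL)."""
    f = np.asarray(f_hz, dtype=float) / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# Spectral components below this curve are inaudible and need not be coded.
for f in (100, 1000, 3300, 16000):
    print(f, float(ath_db_spl(f)))
```

The curve dips below 0 dB SPL near 3–4 kHz, where hearing is most sensitive, and rises steeply at both ends of the audible range; a perceptual coder can discard everything under it.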
We can see that if the masking threshold in (1) assumes conservative (smaller) values, PE will be larger. On the other hand, Adaptive Multirate (AMR) and Adaptive Multirate Wideband (AMR-WB) coders use a priori knowledge of human voicing, further reducing the bitrate. Apart from these two points, PE reliably predicts the lowest bitrate required for transparent audio coding. Since its formulation, PE has found widespread use in audio coding and has become a fundamental theory in the field. Mainstream perceptual audio coders, such as MP3 and AAC, all employ PE as an important psychoacoustic parameter, leading to various practical methods, not just theory.
Nevertheless, PE has a significant limitation in measuring perceptual information, which comes primarily from its underlying monaural hearing model. Humans have two ears to receive sound waves in 3-dimensional space: not only is time and frequency information perceived, which needs just individual ears, but also spatial, or localization, information, which needs both ears for spatial sampling. Being unaware of binaural hearing, the PE of multichannel audio signals is simplified to the sum of the PEs of the individual channels, which is significantly larger than the real quantity of information received because multichannel audio signals usually correlate. The purpose of this paper is to measure the perceptual information of binaural hearing.
We first analyze the localization principle of binaural hearing and give a spatial hearing model on the physical and physiological layers. Then we propose a Binaural Cue Physiological Perception Model (BCPPM) based on binaural hearing. Finally, using the binaural frequency-domain perception property, we give a formula to compute the quantity of spatial information, along with numerical results of spatial information estimation for real-world stereo audio signals.
With the left and right ears, human beings are able to detect spatial information: sound source localization and sound source spaciousness. The former comprises the range, azimuth, and elevation, in other words, the 3-dimensional spherical coordinates. The latter can be measured by the angle span of auditory images.
On the physical layer, sound waves propagate from sources along different paths to the ears, then into the ear canals, and finally to the cochlea, absorbed and reflected by walls, floors, torso, head, and other objects on the way. These sound waves carry objective localization information. On the physiological layer, sound waves are transformed into neural cell excitation and inhibition by the auditory system. There are different types of auditory neural cells responding to different types of sound stimuli, such as intensity, frequency, and delay. Thus physical quantities become physiological data.
In audio compression, irrelevancy removal happens mainly on the physical and physiological layers. In the following, we discuss the representation of binaural cues on these two layers, which constitutes the BCPPM.
1.1. Spatial Information on the Physical Layer
2. Physiological Perception Modeling of Binaural Hearing
Although a real head is far from a rigid ball, the above results are basically correct. In 2002, Macpherson and Middlebrooks demonstrated that the duplex theory is suitable for a variety of audio signals: pure tones, wideband signals, and high-pass as well as low-pass signals. The exception is high-frequency signals with envelope delay.
Unlike ILD and ITD, the spectral cue needs prior knowledge to provide elevation information. In principle, sounds may have arbitrary spectra, so a listener is not able to detect the elevation angle based solely on the spectra: any characteristic may come from the sound source itself or from the filtering effect of the pinnae, and the listener cannot tell which.
Blauert reported a very interesting auditory phenomenon for narrow-band sound sources on the medial plane: the elevation angles given by subjects are independent of the real elevation angles but depend on the signal frequencies. For wide-band signals of familiar types, it is easy for our auditory system to compare the pinna-filtered spectra (some frequencies amplified and some attenuated) to the spectra in memory and, based on the difference, give a reliable elevation angle estimate (Figure 3). But for narrow-band signals, pinna-filtered spectra do not have a detectable shape difference, just a level difference, so elevation angle detection is very unreliable. In fact, the elevation angles given by the subjects are the angles at which the narrow-band signals have the maximum gain due to pinna filtering. For example, the peak-gain frequency when sounds come from the front is 3 kHz for most people. So wherever a 3 kHz sound came from, most subjects pointed to the front.
From the perspective of signal processing, sound wave propagation is roughly a Linear Time-Invariant (LTI) system. To describe this LTI system in binaural hearing, we have the Head-Related Transfer Function (HRTF [27, 28, 29]) or, equivalently, the Head-Related Impulse Response (HRIR). In open space, the HRTF/HRIR is a function of source location, that is, of range, azimuth, and elevation.
Obviously, ILD and ITD are not only source location dependent, but also frequency dependent.
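As an illustration of the location dependence, the classical Woodworth rigid-sphere formula (a well-known approximation that predates HRTF measurement, not this paper's model) predicts ITD from azimuth; the head radius below is an assumed typical adult value:

```python
import numpy as np

def woodworth_itd(azimuth_deg, head_radius_m=0.0875, c=343.0):
    """Rigid-sphere (Woodworth) ITD approximation for a distant source.
    head_radius_m is an assumed typical value; c is the speed of sound (m/s)."""
    theta = np.radians(azimuth_deg)
    return (head_radius_m / c) * (theta + np.sin(theta))  # seconds

# ITD grows from 0 at the front to roughly 0.66 ms at the side:
for az in (0, 30, 60, 90):
    print(az, woodworth_itd(az) * 1000.0)  # in ms
```

The frequency dependence, by contrast, is not captured by this rigid-sphere sketch; it comes from diffraction around the real head and is what the measured HRTFs discussed next reveal.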
To obtain an accurate relationship between sound source locations and sound wave propagation, more realistic head models or real heads are needed. In 1994, the MIT Media Lab collected HRTFs at 710 locations in 3-dimensional space using the KEMAR head. In 2001, CIPIC at U.C. Davis examined the HRTFs of 45 subjects and 2 KEMAR heads. Individual differences are revealed in the experimentally obtained HRTFs. Nevertheless, there are common characteristics sufficient to derive subject-independent spatial information.
2.1. Spatial Information on the Physiological Layer
In the human auditory system, the ITD and ILD of external sound sources stimulate or inhibit specific neural cells across the full audible frequency range. This process comprises two steps: Frequency-to-Place Transform (FPT) [32, 33] and Binaural Processing (BP).
In 1960, Békésy reported that sounds of different frequencies generate surface waves on the basilar membrane in the cochlea with peak amplitudes at different places, determined by the frequencies. In other words, a specific frequency is mapped to a specific place on the basilar membrane (the FPT), and the frequency for a given place is called its Characteristic Frequency (CF). Hair cells at that place then transform the mechanical vibration into the electric signals of the auditory nerves.
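The FPT has a standard closed-form description in Greenwood's frequency-place map (a classical cochlear fit, offered here as background rather than as part of the BCPPM):

```python
def greenwood_cf(x):
    """Greenwood's human frequency-place map. x is the relative place on the
    basilar membrane, 0 at the apex to 1 at the base. Returns CF in Hz."""
    return 165.4 * (10.0 ** (2.1 * x) - 0.88)

# The map spans roughly the audible range, ~20 Hz at the apex to ~20 kHz at the base:
for x in (0.0, 0.5, 1.0):
    print(x, greenwood_cf(x))
```

Note how the map is exponential in place: equal distances along the membrane cover equal frequency ratios, which is why auditory frequency resolution is naturally described on a logarithmic-like (critical band) scale.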
The neural signals from the left and right ears corresponding to the same frequency meet in the brain, where our auditory system extracts the ITD and ILD information. Currently, there are two kinds of theories on this process: Excitation-Excitation (EE) and Excitation-Inhibition (EI). The former proposes that there are auditory nerve cells of EE-type located between the inferior colliculus and the medial superior olive, and that specific EE-type cells there have maximum excitation for signals with specific ITD and ILD; the latter proposes that there are auditory nerve cells of EI-type located between the inferior colliculus and the lateral superior olive, and that specific EI-type cells there have maximum inhibition for signals with specific ITD and ILD. The common ground of the two theories is that specific nerve cells are sensitive only to a specific ITD and ILD, called the characteristic ITD and characteristic ILD. In some of the literature, the characteristic ITD is also called the Best Delay (BD) or Characteristic Delay (CD). Both the EE-type and EI-type theories have support from physiological research, but the latter better explains the various binaural hearing phenomena.
In the Breebaart model, only when the internal delay and attenuation exactly compensate the external ITD and ILD do the corresponding EI-type elements have the largest inhibition. Thus, knowing the position of the EI-type element with the largest inhibition, the auditory system finds the ITD and ILD of the external audio signals.
The Breebaart model also implies the calculation of Interaural Coherence (IC), which manifests as the trough of the excitation surface, in accordance with the EI-type assumption. Nevertheless, there is no direct physiological quantity related to IC in this model.
In 2004, Faller and Merimaa reported that IC relates to the perceived sound image width and stability, as well as to sound field ambience [46, 47]. On the other hand, based on the precedence effect [48, 49] of spatial hearing (sound source localization depends primarily on the direct sounds reaching the ears and is essentially unaffected by the reflections and reverberation that lower IC), Faller proposed that our auditory system uses ITD and ILD to localize sound sources only when IC approaches 1. Since direct sounds to the ears have near-1 cross-correlation, this explains the precedence effect.
2.2. Binaural Cue Physiological Perception Model (BCPPM)
Since the wavelength (0.017–17 m) of sound in the audible range (20–20000 Hz) is much longer than that of light, and comparable to ordinary objects in our surroundings (leading to significant interference and diffraction), the spatial information available from hearing is limited to begin with. This limited information is first compromised by noises and interference from other sound sources, as indicated in Figure 7. Then, during the transformation from mechanical vibration to electric impulses, part of the information is lost again due to the limited frequency and dynamic ranges, the limited frequency and temporal resolution, and the physiological noises of our auditory system, as also indicated in Figure 7.
In Section 1.1, we saw that the physical data of sound source localization in binaural hearing are in the form of ITD and ILD. In Section 2.1, we saw that in the Breebaart model ITD and ILD are transformed into the maximum inhibition of specific EI-type auditory nerve cells, and that the physiological data take the form of coordinates on the delay-attenuation network.
When there are multiple sound sources, background noises, reflection, diffraction, and reverberation, IC becomes another type of physical data conveying the overall sound field information.
Since spatial hearing on the physiological layer is too complex and uncertain to be incorporated into a computational model for common listeners, we restrict the calculation of perceptible spatial information to that directly related to ITD, ILD, and IC and to the physiological data corresponding to these three cues. In fact, spatial coding systems use these cues to represent spatial information.
We first review the psychoacoustic foundation of PE: mainly the nonlinear frequency resolution (Critical Band, CB [50, 51]) of our hearing system, the spreading functions in the frequency domain for noises and tones, and tonality estimation.
The BCPPM consists of 3 modules.
Frequency-to-Place Transform in Cochlea.
This process separates sounds into a bank of subband signals, essentially the subband filtering in the MHM. The subband filter can be implemented by a DFT with spectral lines grouped into subbands according to the CBs, or by the Cochlear Filter Bank (CFB) proposed by Baumgarte in 2002.
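The DFT-based grouping can be sketched with Zwicker's Hz-to-Bark formula (a standard psychoacoustic fit; the paper's own CB boundaries are those of Table 2, so this is only an approximation of that table):

```python
import numpy as np

def hz_to_bark(f_hz):
    """Zwicker's critical-band-rate (Bark) approximation."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def cb_of_bins(n_fft=2048, fs=48000):
    """Map each DFT bin 0..n_fft/2 to a critical-band index."""
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    return np.floor(hz_to_bark(freqs)).astype(int)

cb = cb_of_bins()
print(cb.max() + 1)  # 25 critical bands cover the 0-24 kHz range
```

The resulting 25 bands for a 2048-point DFT at 48 kHz match the 25 CBs used later when counting the model's outputs.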
Effective Channel Noises.
The effective channel noises for ITD, ILD, and IC (shown in Figure 10) are a simplified way to model the limited precision, intrinsic noises, and inter-source interference in our hearing system. Part of the noise comes directly from the grains of the delay and attenuation of the EI-type elements (Figure 6), which are generally functions of frequency. A related concept is the Just Noticeable Difference (JND) in psychoacoustics, indicating the overall sensitivity of our auditory system. On the other hand, ITD, ILD, and IC are not independent; there are interactions among them, and the effective channel noises should also incorporate these interactions.
3. Computing Spatial Perceptual Entropy (SPE) Based on BCPPM
3.1. SPE Definition
From the information theory viewpoint, we see the BCPPM as a double-in-multiple-out system (Figure 10). The two inputs are the left-ear and right-ear entrance sounds. The outputs consist of 75 effective ITDs, ILDs, and ICs (25 CBs, each with a tuple of ITD, ILD, and IC).
As in computing PE, we view each path that leads to an output as a lossy subchannel, so there are 75 such subchannels. Unlike in PE, what a subchannel conveys is not a subband spectrum but one of the ITD, ILD, and IC of the subband corresponding to that subchannel.
In each subchannel there are intrinsic channel noises (the resolution of spatial hearing), and among subchannels there are interchannel interferences (the interaction of binaural cues). Together these give an effective noise for each subchannel.
For some probability distributions, say uniform distribution, (5), (6), and (7) can be readily calculated.
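For the uniform case the calculation reduces to counting quantization cells: a cue uniform over its perceptible range, quantized with JND step Δ, carries log2(range/Δ) bits. The ranges and steps below are hypothetical round numbers for illustration, not the paper's JND values:

```python
import math

def uniform_cue_entropy_bits(cue_range, step):
    """Entropy (bits) of a cue uniformly distributed over cue_range,
    quantized with resolution `step` (so ceil(cue_range/step) cells)."""
    cells = max(1, math.ceil(cue_range / step))
    return math.log2(cells)

# Hypothetical illustration: ITD uniform over +/-0.66 ms with a 0.02 ms step,
# ILD uniform over +/-18 dB with a 1 dB step.
print(uniform_cue_entropy_bits(2 * 0.66, 0.02))
print(uniform_cue_entropy_bits(2 * 18.0, 1.0))
```

A coarser JND (larger step) means fewer distinguishable cells and thus lower entropy, which is exactly why conservative resolution values give a conservative (larger) SPE.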
3.2. CB Filterbank
Critical bands for a 2048-point DFT, sampling frequency 48 kHz.
3.3. Binaural Cues Computation
where b is the index of the CB; k(b) and k(b+1) are the starting DFT spectral indexes of CB b and CB b+1 (Table 2); and XL(k) and XR(k) are the kth spectral lines from the left and right ear entrance signals.
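The standard binaural-cue definitions behind (9)-(11) can be sketched directly from the DFT lines of one CB. The ILD and IC below follow the usual power-ratio and normalized cross-spectrum forms; the ITD estimate (cross-spectrum phase at the band centre) is a simplification, not the paper's exact equation:

```python
import numpy as np

def subband_cues(XL, XR, bins, fs=44100, n_fft=2048):
    """ILD (dB), ITD (s), and IC for one critical band.
    XL, XR: complex DFT lines of the left/right ear entrance signals;
    bins: the DFT bin indexes those lines occupy."""
    pl = float(np.sum(np.abs(XL) ** 2))
    pr = float(np.sum(np.abs(XR) ** 2))
    cross = complex(np.sum(XL * np.conj(XR)))
    ild = 10.0 * np.log10(pl / pr)
    ic = abs(cross) / np.sqrt(pl * pr)
    fc = float(np.mean(bins)) * fs / n_fft      # band centre frequency (Hz)
    itd = np.angle(cross) / (2.0 * np.pi * fc)  # simplified phase-based delay
    return ild, itd, ic

# A pure delay between the ears gives ILD = 0 dB and IC = 1:
bins = np.array([50])
tau = 3e-4
fc = 50 * 44100 / 2048
XR = np.array([1.0 + 0.0j])
XL = XR * np.exp(-2j * np.pi * fc * tau)  # left signal delayed by tau
ild, itd, ic = subband_cues(XL, XR, bins)
print(ild, itd, ic)
```

For wideband signals the phase-based ITD is only unambiguous up to half a period of the band centre frequency, which is one reason low-frequency bands dominate ITD perception.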
3.4. Effective Spatial Perception Data
The resolutions, or quantization steps, of the binaural cues (Figure 12) can be determined by JND experiments. Denote the resolutions of ITD, ILD, and IC respectively. In general they are signal dependent and frequency dependent; for simplicity, we use constant values [44, 54] for the ITD (in ms), ILD (in dB), and IC resolutions.
Larger IC usually implies higher ITD perception precision or, equivalently, more spatial information. When IC approaches 1, the activity surface has a very sharp decrease toward the point with the lowest auditory nerve activity; in this case, the uncertainty of ITD is very small and ITD is determined precisely. When IC decreases toward 0, the surface becomes flatter, leading to larger uncertainty, or lower precision, of ITD. In the extreme case, when IC = 0, the gradient along the delay axis is constantly zero: there is no well-defined trough point and ITD is completely indeterminable.
From (13) we see that when IC(b) = 1, the ITD quantization step assumes its minimum and the auditory system has the highest resolution for ITD; when 0 < IC(b) < 1, the resolution of ITD is lower but there is still spatial information from ITD; and when IC(b) = 0, the ITD resolution vanishes and there is no spatial information in ITD.
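We do not reproduce (13) here, but its qualitative behavior can be sketched as a quantization step that is inversely dependent on IC; the base step below is a hypothetical minimum resolution, and the 1/IC form is illustrative only:

```python
def itd_step(ic, base_step=2e-5):
    """Illustrative ITD quantization step (s) that widens as IC drops:
    minimum at IC = 1, unbounded (no ITD information) at IC = 0.
    This mimics the qualitative behavior of (13), not its exact form."""
    if ic <= 0.0:
        return float("inf")
    return base_step / ic

for ic in (1.0, 0.5, 0.1, 0.0):
    print(ic, itd_step(ic))
```

An infinite step at IC = 0 means a single quantization cell and therefore zero ITD entropy, matching the "no spatial information in ITD" limit above.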
where ⌊·⌋ denotes the round-down (floor) function.
where N is the number of spectral lines in one transform, or 1024 in this case; ITD(b), ILD(b), and IC(b) can be found from (9), (10), and (11), respectively; the JNDs of ILD, ITD, and IC on CB b are obtained from subjective listening experiments; and the amplitude compression factor is assumed to be 0.6.
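Putting the pieces together, the overall SPE is the sum of the per-CB, per-cue entropies, and converting it to a bitrate assumes one cue set per transform hop. The hop of 1024 samples matches N above; the unit entropies fed in below are placeholders, not measured values:

```python
import numpy as np

def spe_bits_per_frame(h_itd, h_ild, h_ic):
    """Overall SPE per frame: entropies of the effective ITD, ILD, and IC
    perception data summed over the 25 critical bands."""
    return float(np.sum(h_itd) + np.sum(h_ild) + np.sum(h_ic))

def spe_kbps(bits_per_frame, hop=1024, fs=44100):
    """Bitrate assuming one cue set every `hop` samples at rate fs."""
    return bits_per_frame * fs / hop / 1000.0

# Placeholder: 1 bit per cue per band -> 75 bits/frame, about 3.2 kbps,
# the same order of magnitude as the measured 2.75-6.90 kbps means.
h = np.ones(25)
print(spe_kbps(spe_bits_per_frame(h, h, h)))
```

This back-of-the-envelope rate shows why SPE lands in the low-kbps range: 75 subchannels each carrying a fraction of a bit to a few bits per frame.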
We evaluate the SPE of 126 stereo sequences from 3GPP and MPEG, classified into speech, single instrument, simple mixture, and complex mixture, all sampled at 44.1 kHz. For comparison, we also evaluate the PE of these sequences.
In the following experiments, the resolutions of ITD, ILD, and IC assume constant and conservative values, and their frequency dependency is ignored. The overall SPE is the sum of the entropies of the effective IC, ILD, and ITD perception data, as shown in (4).
4.1. Perceptual Spatial Information of Stereo Sequences
From Figure 15 we find that speech sequences generally have the lowest spatial information rate (mean 2.75 kbps), in accordance with the recording practice that voices usually stay at the direct front of the sound field; single instrument and simple mixture sequences have similar spatial information rates (means 3.49 kbps and 3.66 kbps, respectively); and complex mixture sequences generally have the highest spatial information rate (mean 6.90 kbps), which can be explained by the multiple sound sources at diverse sound field locations in this type of sequence.
In Parametric Stereo (PS) coding, a spatial parameter bitrate of 7.7 kbps is reported to be sufficient for transparent spatial audio quality, agreeing very well with our SPE computation.
4.2. Temporal Variation of Spatial Information Rate in a Single Sequence
The test data show that for es02, with a stable voice from the front, SPE stays at 1–2 kbps; for sc03, with multiple instruments and a strong spatial impression, SPE stays at about 7 kbps. Within either sequence, the SPE changes little.
4.3. Overall Perceptual Information in Stereo Sequences
Using PE to evaluate the perceptual information, only intrachannel redundancy and irrelevancy are exploited, and the overall PE is simply the sum of the PEs of the left and right channels. Using SPE based on the BCPPM, interchannel redundancy and irrelevancy are also exploited; the overall perceptual information is about one normal audio channel plus some spatial parameters, which has a significantly lower bitrate.
We have developed the Binaural Cue Physiological Perception Model (BCPPM) to measure the perceptible information, or Spatial Perceptual Entropy (SPE), in multichannel audio signals, and have given a lower bitrate bound in multimedia communications for this type of content. The BCPPM models the physical and physiological processing of human spatial hearing as a parallel bank of lossy communication subchannels with inter-subchannel interference, and SPE is the overall channel capacity. Each subchannel carries ITD, ILD, or IC with additive noises resulting from the intrinsic noises of binaural cue perception and the interference among the cues within the same CB. Experiments on stereo signals of different types have confirmed that SPE is compatible with the spatial parameter bitrates and spatial impressions in SAC.
Nevertheless, SPE gives only the lower bitrate bound for transparent quality. In the future we will extend SPE to give the bound for a given subjective quality. Then, in mobile, internet, and other communication networks conveying multichannel audio signals, we can use the estimated bound to allocate bandwidth for a particular Quality of Service (QoS), transparent or degraded, and thus save bandwidth or improve the overall QoS. On the other hand, current SAC may benefit from SPE by dynamically allocating bitrate to accommodate varying spatial content, thus improving quality and reducing the overall bitrate.
This research is supported by the National Science Foundation of China under Grant no. 60832002.
- 2. Lossless comparison. http://wiki.hydrogenaudio.org/index.php?title=Lossless_comparison
- 4. Zwicker E, Fastl H: Psychoacoustics: Facts and Models. Springer, Berlin, Germany; 1990.
- 5. Moore BCJ: An Introduction to the Psychology of Hearing. 5th edition. Elsevier Academic Press, London, UK; 2003.
- 7. Hall JL: Auditory psychophysics for coding applications. In The Digital Signal Processing Handbook. Edited by: Madisetti V, Williams D. CRC Press, Boca Raton, Fla, USA; 1998:39.1-39.25.
- 8. Moore BCJ: Masking in the human auditory system. In Collected Papers on Digital Audio Bit-Rate Reduction. Edited by: Gilchrist N, Grewin C. Audio Engineering Society, New York, NY, USA; 1996:9-19.
- 9. ISO/IEC JTC1/SC29/WG11: Information Technology—Generic Coding of Moving Pictures and Associated Audio Information—Part 7: Advanced Audio Coding (AAC). ISO/IEC 13818-7; 2005.
- 10. ISO/IEC JTC1/SC29/WG11: Information Technology—Generic Coding of Moving Pictures and Associated Audio Information—Part 3: Audio, Subpart 4: General Audio Coding. ISO/IEC 14496-3; 2005.
- 13. Johnston JD: Estimation of perceptual entropy using noise masking criteria. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '88), May 1988; 2524-2527.
- 14. 3GPP: Mandatory speech CODEC speech processing functions; AMR speech codec; General description. 3GPP TS 26.071; 2008. http://www.3gpp.org/ftp/Specs/html-info/26071.htm
- 15. 3GPP: Speech codec speech processing functions; Adaptive Multi-Rate—Wideband (AMR-WB) speech codec; General description. 3GPP TS 26.171; 2008. http://www.3gpp.org/ftp/Specs/html-info/26171.htm
- 16. ISO/IEC JTC1/SC29/WG11 MPEG: Information technology—Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s—Part 3: Audio. ISO/IEC 11172-3; 1992.
- 17. Blauert J: Spatial Hearing: The Psychophysics of Human Sound Localization. MIT Press, Cambridge, Mass, USA; 1997.
- 21. Blauert J: Sound localization in the median plane. Acustica 1969-1970, 22(4):205-213.
- 27. Møller H, Sørensen MF, Hammershøi D, Jensen CB: Head-related transfer functions of human subjects. Journal of the Audio Engineering Society 1995, 43(5):300-321.
- 29. Huang Y, Benesty J (Eds): Spatial hearing. In Audio Signal Processing for Next-Generation Multimedia Communication Systems. Kluwer Academic Publishers, Norwell, Mass, USA; 2004:345-370.
- 31. Algazi VR, Duda RO, Thompson DM, Avendano C: The CIPIC HRTF database. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Electroacoustics, October 2001, New Paltz, NY, USA; 99-102.
- 34. von Békésy G: Experiments in Hearing. McGraw-Hill, New York, NY, USA; 1960.
- 35. Møller AR: Hearing: Anatomy, Physiology, and Disorders of the Auditory System. 2nd edition. Academic Press, Burlington, Vt, USA; 2006.
- 36. Rose JE, Gross NB, Geisler CD, Hind JE: Some neural mechanisms in the inferior colliculus of the cat which may be relevant to localization of a sound source. Journal of Neurophysiology 1966, 29(2):288-314.
- 37. Park TJ: IID sensitivity differs between two principal centers in the interaural intensity difference pathway: the LSO and the IC. Journal of Neurophysiology 1998, 79(5):2416-2431.
- 39. Stern RM, Wang DeL, Brown G: Binaural sound localization. In Computational Auditory Scene Analysis. Edited by: Brown G, Wang DeL. Wiley/IEEE Press, New York, NY, USA; 2006.
- 49. Litovsky RY, Rakerd B, Yin TCT, Hartmann WM: Psychophysical and physiological evidence for a precedence effect in the median sagittal plane. Journal of Neurophysiology 1997, 77(4):2223-2226.
- 51. Scharf B: Critical bands. In Foundations of Modern Auditory Theory. Academic Press, New York, NY, USA; 1970.
- 54. Breebaart J, Herre J, Faller C, et al.: MPEG spatial audio coding/MPEG surround: overview and current status. AES 119th Convention, October 2005, New York, NY, USA.
- 57. Rödén J, Breebaart J, Hilpert J, et al.: A study of the MPEG surround quality versus bit-rate curve. AES 123rd Convention, October 2007, New York, NY, USA.
- 58. Breebaart J, Hotho G, Koppens J, Schuijers E, Oomen W, van de Par S: Background, concept, and architecture for the recent MPEG surround standard on multichannel audio compression. Journal of the Audio Engineering Society 2007, 55(5):331-351.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.