A Multimedia Application: Spatial Perceptual Entropy of Multichannel Audio Signals

Open Access
Research Article
Part of the following topical collections:
  1. Multimedia Communications over Next Generation Wireless Networks


Multimedia data usually have to be compressed before transmission; a higher compression rate, or equivalently a lower bitrate, relieves the load on communication channels but negatively impacts quality. We investigate the lower bitrate bound for perceptually lossless compression of a major type of multimedia, multichannel audio signals. This bound equals the perceptible information rate of the signals. Traditionally, Perceptual Entropy (PE), based primarily on monaural hearing, measures the perceptual information rate of individual channels. But PE cannot measure the spatial information captured by binaural hearing and is therefore not suitable for estimating the bitrate bound of Spatial Audio Coding (SAC). To measure this spatial information, we build a Binaural Cue Physiological Perception Model (BCPPM) on the ground of binaural hearing, which represents spatial information on the physical and physiological layers. This model enables computing Spatial Perceptual Entropy (SPE), the lower bitrate bound for SAC. For real-world stereo audio signals of various types, our experiments indicate that SPE reliably estimates their spatial information rate. Therefore, "SPE plus PE" gives lower bitrate bounds for communicating multichannel audio signals with transparent quality.


Keywords: Sound Source; Audio Signal; Interaural Time Difference; Lossless Compression; Interaural Level Difference

1. Introduction

A central goal in multimedia communications is to deliver quality content at the lowest possible bitrate. By quality, we mean the perceived fidelity of the received content against the original content. The lowest possible bitrate depends on two disparate concepts: entropy and perception. Entropy measures the quantity of information [1], but not all information is perceptible.

To pursue this goal, we want to know how many bits are sufficient to convey quality multimedia content. Lossless compression always ensures the highest possible quality: the objective redundancy in the multimedia content is the only source of compression, and there is a limit, the Shannon entropy, the lowest possible bitrate that still allows perfect decompression. Nevertheless, this limit is very hard, if not impossible, to compute due to the diversity and complexity of the probability models of multimedia content. Using Huffman coding, run-length coding, arithmetic coding, and other entropy coding techniques, today's state-of-the-art lossless audio coders typically achieve a compression rate of 1/3–2/3, or 230–460 kbps per channel, for CD music [2].

Lossless compression generally conveys higher quality than necessary in multimedia communications. Multimedia content abounds in subjective irrelevancy, objective information we cannot sense, so perceptually lossless compression suffices. For audio signals, this means lossless only to the extent that the distortion after decompression is imperceptible to normal human ears (usually called transparent coding); the bitrate can then be much lower than for true lossless coding. Perceptual audio coding [3] greatly reduces communication bandwidth or storage space by removing this irrelevancy. Psychoacoustics provides a quantitative theory of this irrelevancy [4, 5, 6, 7]: the limits of auditory perception, such as the audible frequency range (20–20000 Hz), the Absolute Threshold of Hearing (ATH), and the masking effect [8]. For state-of-the-art perceptual audio coders, such as MPEG-2/4 Advanced Audio Coding (AAC [9, 10]), 64 kbps is enough for transparent coding [11]. The Shannon entropy cannot measure the perceptible information or give the bitrate bound in this case.

In 1988, Johnston proposed Perceptual Entropy (PE [12, 13]) for audio coding based on psychoacoustics. PE gives the lower bitrate bound for perceptual audio coding:

PE = \frac{1}{N}\sum_{b}\sum_{i=bl_b}^{bl_{b+1}-1}\Big[\log_2\!\big(2\,\mathrm{nint}\big(|\mathrm{Re}\,X(i)|/\sqrt{6T_b/k_b}\big)+1\big)+\log_2\!\big(2\,\mathrm{nint}\big(|\mathrm{Im}\,X(i)|/\sqrt{6T_b/k_b}\big)+1\big)\Big], (1)

where PE is measured in bits per sample, N is the length of the block transform (usually a DFT), nint() denotes rounding to the nearest integer, bl_b the starting bin index of subband b, X(i) the ith transform coefficient, T_b the undetectable distortion upper bound of subband b, and k_b the number of bins in subband b. Table 1 lists the PE of various mono audio signals. The last column gives the near-transparent bitrates of current coders, slightly lower than the upper bound given by PE.
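As an illustration, (1) can be evaluated directly from one block spectrum. The sketch below is a minimal reading of the formula; the band-edge layout and the helper name are ours, and the thresholds T_b are assumed to be supplied by the psychoacoustic model.

```python
import numpy as np

def perceptual_entropy(X, band_edges, T):
    """Sketch of eq. (1): bits per sample needed so that quantization
    error stays below the masked threshold T[b] of each subband.
    X          : complex DFT coefficients of one block (length N)
    band_edges : band_edges[b] is the first bin of subband b (len = bands + 1)
    T          : T[b] is the undetectable-distortion energy bound of subband b
    """
    N = len(X)
    pe_bits = 0.0
    for b in range(len(band_edges) - 1):
        lo, hi = band_edges[b], band_edges[b + 1]
        k = hi - lo                      # number of bins in subband b
        step = np.sqrt(6.0 * T[b] / k)   # quantizer step hiding the distortion
        re = np.rint(np.abs(X.real[lo:hi]) / step)
        im = np.rint(np.abs(X.imag[lo:hi]) / step)
        pe_bits += np.sum(np.log2(2 * re + 1) + np.log2(2 * im + 1))
    return pe_bits / N                   # bits per sample
```

With a very large threshold every coefficient quantizes to zero and PE vanishes, matching the intuition that fully masked content needs no bits.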
Table 1

PE and bitrate of various mono audio signals [13].

Sampling Rate (kHz) | Band Width (kHz) | PE (bits/sample) | Bitrate (kbps) | Near-Transparent Coding Bitrate (kbps)
... | ... | ... | ... | 12.2 (AMR [14])
... | ... | ... | ... | 23.85 (AMR-WB [15])
... | ... | ... | ... | 64 (AAC [9])

We can see that if T_b in (1) assumes conservative (smaller) values, PE will be larger. On the other hand, Adaptive Multirate (AMR [14]) and Adaptive Multirate Wideband (AMR-WB [15]) use a priori knowledge of human voicing, further reducing the bitrate. Apart from these two points, PE reliably predicts the lowest bitrate required for transparent audio coding. Since it was formulated, PE has found widespread use in audio coding and has become a fundamental theory in this field. Mainstream perceptual audio coders, such as MP3 [16] and AAC, all employ PE as an important psychoacoustic parameter, turning the theory into various practical methods.

Nevertheless, PE has a significant limitation in measuring perceptual information. This limitation primarily comes from the underlying monaural hearing model. Humans have two ears to receive sound waves in a 3-dimensional space: not only is time and frequency information perceived, which needs just individual ears, but also spatial or localization information, which needs both ears for spatial sampling. Because it is unaware of binaural hearing, the PE of multichannel audio signals reduces to the sum of the PEs of the individual channels, which is significantly larger than the real quantity of information received because multichannel audio signals usually correlate. The purpose of this paper is to measure the perceptual information of binaural hearing.

We first analyze the localization principle of binaural hearing and give a spatial hearing model on the physical and physiological layers. Then we propose a Binaural Cue Physiological Perception Model (BCPPM) based on binaural hearing. Finally using binaural frequency-domain perception property, we give a formula to compute the quantity of spatial information and numerical results of spatial information estimation of real-world stereo audio signals.

With the left and right ears, humans are able to detect spatial information: sound source localization and sound source spaciousness. The former comprises the range, azimuth, and elevation, in other words, the 3-dimensional spherical coordinates. The latter can be measured by the angle span of auditory images.

Human spatial hearing is a complex procedure of physics, physiology, and psychology (Figure 1). Psychology sits at the top of this procedure. On this layer, hearing is transformed into cognition, substantially influenced by the subject's psychological state, other senses (especially visual perception), and knowledge, implying that the same sound does not necessarily produce the same hearing perception. In Spatial Hearing, Blauert gives examples in which different subjects in the same sound environment describe the environment differently [17]. In 1998, Hofman et al. reported in Nature that subjects with modified pinna shapes initially lost the ability to detect elevation but gradually regained it in full [18]. This phenomenon demonstrates that the subjects were able to learn the correspondence between the frequency response characteristics of the modified pinnae and sounds from different elevations, and used this knowledge to guide elevation detection. For these reasons, spatial hearing on the psychological layer is too complicated to be exploited in audio compression systems, which cannot assume any specific states, senses, or knowledge of listeners.
Figure 1

3 Layers of auditory sound source localization.

On the physical layer, sound waves propagate from the sources along different paths to the ears, then through the ear canals, and finally to the cochleae, being absorbed and reflected by walls, floors, torso, head, and other objects on the way. These sound waves carry objective localization information. On the physiological layer, sound waves are transformed into neural cell excitation and inhibition by the auditory system. Different types of auditory neural cells respond to different types of sound stimuli, such as intensity, frequency, and delay. Thus physical quantities become physiological data.

In audio compression, irrelevancy removal operates mainly on the physical and physiological layers. In the following, we discuss the representation of binaural cues on these two layers, the BCPPM.

1.1. Spatial Information on the Physical Layer

As early as 1907, Rayleigh studied the physics of spatial hearing [19]: Interaural Time Difference (ITD) and Interaural Level Difference (ILD). Rayleigh also made two seminal discoveries: the famous duplex theory, that is, below 1.5 kHz ITD is the primary localization cue and above 1.5 kHz ILD takes over; and the head-shadow effect, that is, the blocking and reflection of sounds by the head produces an intensity difference of up to 20 dB. Both discoveries were derived from a rigid-ball model of the head (Figure 2).
Figure 2

The rigid ball model of human head used by Rayleigh.

2. Physiological Perception Modeling of Binaural Hearing

Although a real head is far from a rigid ball, the above results are basically correct. In 2002, Macpherson and Middlebrooks demonstrated that the duplex theory holds for a variety of audio signals: pure tones, wide-band signals, high-pass signals, as well as low-pass signals [20]. The exception is high-frequency signals with envelope delay [17].

ITD and ILD are not the only localization cues. On the medial plane (which cuts perpendicularly through the middle of the line connecting the left and right ears), all sound sources have ITD = 0 ms and ILD = 0 dB. But when they have different elevations, our auditory system can detect the difference by elevation-related spectral characteristics [21, 22, 23, 24]. Due to the asymmetric structure of the pinnae [25], the interference of sound waves is both wavelength related and elevation related (Figure 3). For example, the frequency of the lowest spectral amplitude (interference annihilation) is a function of the elevation [26]. This is the root of our elevation detection ability. This spectral cue does not depend on binaural hearing, so it is also called the monaural cue.
Figure 3

Elevation angle detection (modified from http://interface.cipic.ucdavis.edu/CIL_tutorial/3D_psych/elev.htm).

Unlike ILD and ITD, the spectral cue needs prior knowledge to provide elevation information. In principle, sounds may have arbitrary spectra. A listener is not able to detect the elevation angle based solely on the spectra: any characteristic may come from the sound sources themselves or from the filtering effect of the pinnae, and the listener cannot tell which.

Blauert reported a very interesting auditory phenomenon for narrow-band sound sources on the medial plane: the elevation angles given by subjects are independent of the real elevation angles but depend on the signal frequencies [17]. For wide-band signals of familiar types, it is easy for our auditory system to compare the pinna-filtered spectra (some frequencies amplified and some attenuated) with the spectra in memory and, based on the difference, give reliable elevation angle estimates (Figure 3). But for narrow-band signals, pinna-filtered spectra do not have a detectable shape difference, just a level difference; thus elevation angle detection becomes very unreliable. In fact, the elevation angles given by the subjects are the angles at which the narrow-band signals have the maximum gain due to the pinna filtering. For example, the peak gain frequency for sounds coming from the front is 3 kHz for most people [21]. So wherever a sound of 3 kHz came from, most subjects pointed at the front.

From the perspective of signal processing, sound wave propagation is roughly a Linear Time Invariant (LTI) system. To describe this LTI system in binaural hearing, we have Head-Related Transfer Function (HRTF [27, 28, 29]) or equivalently Head-Related Impulse Response (HRIR). In open space, HRTF/HRIR is the function of source location, that is, range, azimuth, and elevation.

Figure 4 shows the HRTFs in binaural hearing. The signal S(ω) travels from the source through the left and right paths to the left and right ears, respectively. Denote by H_L(ω) the left-path HRTF and by H_R(ω) the right-path HRTF. Then H_L(ω)S(ω) is the entrance signal of the left ear, and H_R(ω)S(ω) that of the right ear. Since the signal may have any spectrum, localization cannot be determined solely from H_L(ω)S(ω) or H_R(ω)S(ω).
Figure 4

Binaural hearing transfer functions.

Suppose that there are no strict zeros in the signal or the HRTFs. To exclude the effect of S(ω), we define the Binaural Difference Transfer Function (BDTF):

H_\Delta(\omega) = \frac{H_L(\omega)}{H_R(\omega)},

which is independent of S(ω) and location related. The BDTF contains the same spatial information as H_L(ω) and H_R(ω). In fact, we can recover ILD and ITD from it:

\mathrm{ILD}(\omega) = 20\log_{10}|H_\Delta(\omega)|, \quad \mathrm{ITD}(\omega) = -\frac{\arg H_\Delta(\omega)}{\omega}.

Obviously, ILD and ITD are not only source location dependent but also frequency dependent.
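To make the construction concrete, the sketch below computes the BDTF from a pair of head-related impulse responses and reads off the frequency-dependent ILD and ITD defined above; the FFT length, the phase-unwrapping step, and the function name are implementation choices of this sketch, not part of the model.

```python
import numpy as np

def binaural_cues_from_hrirs(h_left, h_right, fs, n_fft=512):
    """Sketch: form the BDTF H_delta = H_L / H_R, then read off the
    frequency-dependent ILD (magnitude ratio in dB) and ITD (phase delay)."""
    H_L = np.fft.rfft(h_left, n_fft)
    H_R = np.fft.rfft(h_right, n_fft)          # assumed zero-free, as in the text
    H_delta = H_L / H_R                        # source spectrum S(w) cancels out
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    ild_db = 20.0 * np.log10(np.abs(H_delta))
    omega = 2.0 * np.pi * freqs[1:]            # skip DC to avoid division by zero
    itd_s = -np.unwrap(np.angle(H_delta))[1:] / omega   # phase delay in seconds
    return freqs, ild_db, itd_s
```

For a pure 2-sample interaural delay the recovered ITD is 2/fs at every frequency and the ILD is 0 dB, as expected.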

To obtain an accurate relationship between sound source locations and sound wave propagation, more realistic head models or real heads are needed. In 1994, the MIT Media Lab collected HRTFs at 710 locations in 3-dimensional space using a KEMAR head [30]. In 2001, CIPIC of U.C. Davis measured the HRTFs of 45 subjects and 2 KEMAR heads [31]. These experiments reveal the individual differences of HRTFs. Nevertheless, there are common characteristics that are sufficient to derive subject-independent spatial information.

2.1. Spatial Information on the Physiological Layer

In the human auditory system, the ITD and ILD of external sound sources stimulate or inhibit specific neural cells across the full audible frequency range. This process comprises two steps: the Frequency-to-Place Transform (FPT) [32, 33] and Binaural Processing (BP).

In 1960, Békésy reported that sounds of different frequencies generate surface waves on the basilar membrane in the cochlea with peak amplitudes at different places, which are determined by the frequencies [34]. In other words, a specific frequency is mapped to a specific place on the basilar membrane (the FPT), and the specific frequency for a given place is called its Characteristic Frequency (CF [35]). Hair cells at that place then transform the mechanical vibration into the electric signals of the auditory nerves.

The neural signals from the left and right ears corresponding to the same frequency meet in the brain, where our auditory system extracts the ITD and ILD information they carry. Currently, there are two kinds of theories about this process: Excitation-Excitation (EE [36]) and Excitation-Inhibition (EI [37]). The former proposes that there are EE-type auditory nerve cells located between the inferior colliculus and the medial superior olive, and that specific EE-type cells there show maximum excitation for signals with a specific ITD and ILD; the latter proposes that there are EI-type auditory nerve cells located between the inferior colliculus and the lateral superior olive, and that specific EI-type cells there show maximum inhibition for signals with a specific ITD and ILD. The common ground of the two theories is that specific nerve cells are only sensitive to a specific ITD and ILD, called the characteristic ITD and characteristic ILD. In some literature, the characteristic ITD is also called the Best Delay (BD [38]) or Characteristic Delay (CD [39]). Both the EE-type and the EI-type theories have support from physiological research, but the latter better explains the various binaural hearing phenomena [40].

In 1948, Jeffress gave a physiological model for ITD perception [41, 42], the delay line model, a foundational contribution with lasting impact in the field (Figure 5). Neural signals in the form of spike trains from the left and right auditory pathways travel along the left and right delay lines, meet at some coincidence counter, and trigger it; the network is in fact a physiological cross-correlation calculator. The counter with the largest count is the one at which the delay difference between the left and right delay lines exactly compensates the ITD. For example, sounds from the medial plane (ITD = 0) generate the largest counts in the middle counter of the Jeffress network. The coincidence counters can be classified as EE-type auditory nerve cells.
Figure 5

Jeffress model: delay line network.
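The network can be simulated as a plain cross-correlator over a physiologically plausible lag range. The sketch below is illustrative: each lag plays the role of one internal delay pair, and the winning coincidence counter is the lag with the largest correlation.

```python
import numpy as np

def jeffress_itd_estimate(left, right, fs, max_itd_s=1e-3):
    """Sketch of the Jeffress delay-line network as a cross-correlation:
    the lag whose 'coincidence counter' fires most is the one that
    compensates the external ITD. A positive result means the left-ear
    signal arrives later."""
    max_lag = int(round(max_itd_s * fs))
    full = np.correlate(left, right, mode="full")   # lags -(N-1) .. (N-1)
    center = len(right) - 1                         # index of zero lag
    lags = np.arange(-max_lag, max_lag + 1)
    counts = full[center + lags]                    # plausible internal delays only
    return lags[np.argmax(counts)] / fs
```

Sounds from the medial plane peak at the middle counter, that is, at lag zero.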

In 2001, Breebaart et al. extended the Jeffress model by incorporating attenuators [43, 44, 45] (Figure 6). An important difference from the Jeffress model is the use of EI-type elements instead of EE-type elements. Thanks to the attenuators, the extended model can also extract ILD.
Figure 6

Breebaart model: delay-attenuation network.

In the Breebaart model, only when the internal delay and attenuation exactly compensate the external ITD and ILD does the corresponding EI-type element show the largest inhibition. Thus, knowing the position of the EI-type element with the largest inhibition, the auditory system finds the ITD and ILD of the external audio signals.

The Breebaart model also implies the calculation of Interaural Coherence (IC), which manifests as the trough of the excitation surface, in accordance with the EI-type assumption. Nevertheless, there is no direct physiological quantity related to IC in this model.

In 2004, Faller and Merimaa reported that IC relates to the perceived sound image width and stability, as well as the sound field ambience [46, 47]. On the other hand, reflection and reverberation lower IC, while by the precedence effect [48, 49] of spatial hearing, sound source localization depends primarily on the direct sounds reaching the ears and is essentially unaffected by reflection and reverberation. Faller therefore proposed that our auditory system uses ITD and ILD to localize sound sources only when IC approaches 1. Since direct sounds to the ears have a cross-correlation near 1, this explains the precedence effect.

2.2. Binaural Cue Physiological Perception Model (BCPPM)

From the viewpoint of information theory, the channel from the physical layer to the physiological layer is lossy, and less spatial information survives the passage (Figure 7).
Figure 7

Spatial information loss.

Since the wavelength (0.017–17 m) of sound in the audible range (20–20000 Hz) is much longer than that of light and comparable to ordinary objects in our surroundings, leading to significant interference and diffraction, the spatial information available to hearing is limited to begin with. This limited information is first compromised by noises and interference from other sound sources, indicated by the first noise term in Figure 7. Then, during the transformation from mechanical vibration to electric impulses, part of the information is lost again due to the limited frequency and dynamic range, the limited frequency and temporal resolution, and the physiological noises of our auditory system, indicated by the second noise term in Figure 7.

The loss of spatial information manifests as offset and dispersion, related to multi-source interference and the limited SNR of the physical and physiological systems. For example, a single source sometimes becomes multiple sources with mirrored sound images due to reflection by, say, walls and floors. These sources share the same frequency range, so auditory filtering cannot separate them, and the perceived ITD and ILD are determined by the combined effects of the BDTFs of those sources, typically leading to biased and vaguer location perception (Figure 8). A large sound source has similar localization effects. In the Breebaart model, the resolution of ITD and ILD is limited by the fineness of the delay and attenuation elements: no ITD smaller than the delay of one delay element can be detected, and no ILD smaller than the attenuation of one attenuation element can be detected. This is analogous to the ATH in monaural hearing. The limited ITD and ILD resolution translates into limited localization resolution.
Figure 8

Two types of spatial information loss.

In Section 1.1, we saw that the physical data of sound source localization in binaural hearing are in the form of ITD and ILD. In Section 2.1, we saw that ITD and ILD are transformed into the maximum inhibition of specific EI-type auditory nerve cells in the Breebaart model, so the physiological data are in the form of coordinates in the delay-attenuation network.

When there are multiple sound sources, background noises, reflection, diffraction, and reverberation, IC becomes another type of physical data conveying the overall sound field information.

Since spatial hearing on the physiological layer is too complex and uncertain to be incorporated into a computational model for common listeners, we restrict the calculation of perceptible spatial information to that directly related to ITD, ILD, and IC and to the physiological data corresponding to these three cues. In fact, spatial coding systems use these cues to represent spatial information.

We first review the psychoacoustic foundations of PE: mainly the nonlinear frequency resolution (Critical Band, CB [50, 51]) of our hearing system, the frequency-domain spreading functions for noises and tones, and tonality estimation.

To calculate PE, Johnston used a Monaural Hearing Model (MHM, Figure 9). In this model, a 25-subband filterbank filters the incoming audio signals. Each subband has a bandwidth of one CB at the corresponding frequency (CB1–CB25 in Figure 9), increasing from low to high frequency. Each subband also acts as a lossy subchannel, and the loss of audio information is due to the intrinsic noises of the hearing system (ATH) and interchannel interference (the masking effect). ATH is signal independent, usually given as a table or a fitting function of experimental data. Masking is signal dependent, usually obtained by convolving the tonality-dependent spreading functions with the signal spectra. Combining both, we obtain the effective channel noises shown in Figure 9.
Figure 9

Monaural Hearing Model (MHM) used to calculate PE.

There is no place for localization in the MHM. The critical limitation of the model is the lack of binaural processing: it covers only spectral-temporal information, not spatial information. The Breebaart delay-attenuation network models exactly this binaural processing. So we borrow the idea of the lossy multichannel structure in the MHM and combine the MHM with the Breebaart model into the Binaural Cue Physiological Perception Model (BCPPM, Figure 10).
Figure 10

Binaural Cue Physiological Perception Model (BCPPM).

The BCPPM consists of 3 modules.

Frequency-to-Place Transform in Cochlea.

This process separates sounds into a bank of subband signals, essentially the subband filtering in the MHM. The subband filter can be implemented by a DFT whose spectral lines are grouped into subbands according to the CBs, or by the Cochlear Filter Bank (CFB [52]) proposed by Baumgarte in 2002.

Delay-Attenuation Network.

This is the same as in Figure 6. After the Frequency-to-Place Transform, external audio signals become spike trains of auditory nerve signals, which arrive at the corresponding delay-attenuation networks. The networks then output ITD, ILD, and IC for each critical band. From the location of the maximum inhibition (lowest excitation, the trough of the neural excitation surface in Figure 11), we can derive ITD and ILD. From the gradient of the trough, we can derive IC: faster descent, that is, a larger gradient, implies larger IC (IC approaching 1); slower descent, that is, a smaller gradient, implies smaller IC (IC approaching 0).
Figure 11

An example of an auditory nerve excitation surface, adapted from [42].

Effective Channel Noises.

The effective channel noises for ITD, ILD, and IC in Figure 10 are a simplified way to model the limited precision, intrinsic noises, and inter-source interference in our hearing system. Part of the noise comes directly from the granularity of the delay and attenuation elements in Figure 6, which are in general functions of frequency. A related concept is the Just Noticeable Difference (JND) in psychoacoustics, indicating the overall sensitivity of our auditory system. On the other hand, ITD, ILD, and IC are not independent; there are interactions among them, and the effective channel noise should also incorporate these interactions.

3. Computing Spatial Perceptual Entropy (SPE) Based on BCPPM

In this section, we define SPE using the BCPPM and then discuss its computational implementation in detail, including 3 core components: the CB filterbank, binaural cue computation, and perceptible information computation (Figure 12).
Figure 12

SPE calculation.

3.1. SPE Definition

From the information theory viewpoint, we see the BCPPM as a double-input multiple-output system (Figure 10). The double input consists of the left-ear and right-ear entrance sounds. The multiple output consists of 75 effective values: one (ITD, ILD, IC) tuple for each of the 25 CBs.

As in computing PE, we view each path that leads to an output as a lossy subchannel, so there are 75 such subchannels. Unlike in PE, what a subchannel conveys is not a subband spectrum but one of the ITD, ILD, and IC of the subband corresponding to that subchannel.

In each subchannel there are intrinsic channel noises (the resolution limits of spatial hearing), and among subchannels there are interchannel interferences (the interactions of the binaural cues). Each subchannel therefore has an effective noise.

Under this setting, each subchannel has a channel capacity. We denote by SPE(c), SPE(t), and SPE(l) the capacities of the IC, ITD, and ILD subchannels, respectively. SPE is then defined as the overall capacity, that is, the sum of the capacities of all the subchannels:

\mathrm{SPE} = \sum_{b=1}^{25}\big[\mathrm{SPE}_b(t) + \mathrm{SPE}_b(l) + \mathrm{SPE}_b(c)\big]. (4)

To derive SPE(t), SPE(l), and SPE(c), we need probability models for ITD, ILD, and IC. Although the binaural cues are continuous, the effective noise quantizes them into discrete values. Let L, T, and C denote the discrete ILD, ITD, and IC source probability spaces

L = \{(l_i, p^l_i)\}, \quad T = \{(t_i, p^t_i)\}, \quad C = \{(c_i, p^c_i)\},

where l_i, t_i, and c_i are the ith discrete values of ILD, ITD, and IC, respectively, and p^l_i, p^t_i, and p^c_i the corresponding probabilities. Then we have

\mathrm{SPE}(l) = -\sum_i p^l_i \log_2 p^l_i, (5)

\mathrm{SPE}(t) = -\sum_i p^t_i \log_2 p^t_i, (6)

\mathrm{SPE}(c) = -\sum_i p^c_i \log_2 p^c_i. (7)

For some probability distributions, say the uniform distribution, (5), (6), and (7) can be readily calculated.
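Given estimated probabilities of the quantized cue values, (5)-(7) and the subchannel sum are direct to evaluate; the sketch below assumes the per-band cue histograms have already been collected, and the function names are ours.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of one discrete cue distribution, as in (5)-(7)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0.0]                    # terms with p = 0 contribute nothing
    return float(-np.sum(p * np.log2(p)))

def spe_total(cue_hists):
    """SPE as the sum of subchannel capacities: cue_hists holds one
    (p_ild, p_itd, p_ic) tuple of probability vectors per critical band."""
    return sum(entropy_bits(pl) + entropy_bits(pt) + entropy_bits(pc)
               for pl, pt, pc in cue_hists)
```

A cue uniform over 4 quantized values contributes exactly 2 bits, the readily calculated uniform case mentioned above.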

3.2. CB Filterbank

We use the same method as in PE to implement the CB filterbank. Audio signals are first transformed to the frequency domain by a 2048-point DFT with 50% overlap between adjacent transform blocks. Each DFT spectrum is then partitioned into 25 CBs according to Table 2 [41]. The basic processing unit is the subspectrum of each CB.
Table 2

Critical Bands for 2048-point DFT, sampling frequency 48 kHz [40].

CB Index | Frequency Range (Hz) | Spectral Index
1  | 0–100       | 0–3
2  | 100–200     | 4–8
3  | 200–300     | 9–12
4  | 300–400     | 13–16
5  | 400–510     | 17–21
6  | 510–630     | 22–26
7  | 630–770     | 27–32
8  | 770–920     | 33–38
9  | 920–1080    | 39–45
10 | 1080–1270   | 46–53
11 | 1270–1480   | 54–62
12 | 1480–1720   | 63–72
13 | 1720–2000   | 73–84
14 | 2000–2320   | 85–98
15 | 2320–2700   | 99–114
16 | 2700–3150   | 115–133
17 | 3150–3700   | 134–157
18 | 3700–4400   | 158–187
19 | 4400–5300   | 188–225
20 | 5300–6400   | 226–272
21 | 6400–7700   | 273–328
22 | 7700–9500   | 329–404
23 | 9500–12000  | 405–511
24 | 12000–15000 | 512–639
25 | 15000–24000 | 640–1023

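The framing and CB grouping above can be sketched as follows; the edge list transcribes the starting bins of Table 2 (with 1024 closing the last band), and the lack of an analysis window is a simplification of this sketch.

```python
import numpy as np

# Starting DFT bin of each critical band (Table 2); 1024 closes the last band.
CB_EDGES = [0, 4, 9, 13, 17, 22, 27, 33, 39, 46, 54, 63, 73, 85, 99, 115,
            134, 158, 188, 226, 273, 329, 405, 512, 640, 1024]

def cb_subspectra(x, n_fft=2048):
    """Sketch of the CB filterbank: 2048-point DFT blocks with 50% overlap,
    each positive-frequency spectrum split into 25 critical-band subspectra."""
    hop = n_fft // 2                                 # 50% overlap
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        X = np.fft.fft(x[start:start + n_fft])[:n_fft // 2]
        frames.append([X[CB_EDGES[b]:CB_EDGES[b + 1]] for b in range(25)])
    return frames
```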
3.3. Binaural Cues Computation

ILD is the ratio of the left-ear entrance signal intensity to the right-ear entrance signal intensity. Since the DFT preserves signal energy, we can use the energy ratio of the DFT subspectra to compute ILD on each CB [53]:

\mathrm{ILD}(b) = 10\log_{10}\Big(\sum_{i=bl_b}^{bl_{b+1}-1}|X_L(i)|^2 \Big/ \sum_{i=bl_b}^{bl_{b+1}-1}|X_R(i)|^2\Big), (8)

where b is the index of the CB, bl_b and bl_{b+1} are the starting DFT spectral indices of CB_b and CB_{b+1} (Table 2), and X_L(i) and X_R(i) are the ith spectral lines of the left- and right-ear entrance signals.

A time shift corresponds to a linear phase shift in the frequency domain. Therefore, we can use the group delay (the slope of the phase-frequency curve) of the subband signal to derive the ITD on each subband:

\mathrm{ITD}(b) = -\frac{\varphi(bl_{b+1}-1) - \varphi(bl_b)}{2\pi B_b}, \quad \varphi(i) = \arg\big(X_L(i)X_R^*(i)\big), (9)

where B_b is the bandwidth of CB_b and arg represents the phase of a complex number. A more reliable but also more complex method is to use a least-squares fit to find the group delay and then the ITD:

\mathrm{ITD}(b) = -\sum_i \omega_i\,\varphi(i) \Big/ \sum_i \omega_i^2. (10)

The summation range, bl_b to bl_{b+1}-1, is left out for simplicity.

Since the time-domain normalized correlation is equivalent to the real part of the frequency-domain cross-spectrum, the IC of each CB can be derived as follows:

\mathrm{IC}(b) = \sum_i \mathrm{Re}\big(X_L(i)X_R^*(i)\big) \Big/ \sqrt{\sum_i |X_L(i)|^2 \sum_i |X_R(i)|^2}, (11)

where the summation range is again bl_b to bl_{b+1}-1, and "*" represents complex conjugation.
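The three per-band computations can be combined in one sketch. Treating the least-squares step as a fit of the interaural phase through the origin is one plausible reading of the method described above; the function name and the bin indexing convention (following Table 2) are ours.

```python
import numpy as np

def subband_cues(XL, XR, lo, hi, fs, n_fft=2048):
    """Sketch: ILD from subspectra energies, ITD from a least-squares fit
    of the interaural phase difference, IC from the normalized real part
    of the cross-spectrum, all on the critical band covering bins [lo, hi)."""
    L, R = XL[lo:hi], XR[lo:hi]
    eL, eR = np.sum(np.abs(L) ** 2), np.sum(np.abs(R) ** 2)
    ild_db = 10.0 * np.log10(eL / eR)
    cross = L * np.conj(R)
    phi = np.unwrap(np.angle(cross))                   # interaural phase per bin
    omega = 2.0 * np.pi * np.arange(lo, hi) * fs / n_fft
    itd_s = -np.sum(omega * phi) / np.sum(omega ** 2)  # fit phi ~ -omega * ITD
    ic = np.sum(cross.real) / np.sqrt(eL * eR)
    return ild_db, itd_s, ic
```

A left channel that is simply twice the right channel yields ILD = 10 log10(4) dB, zero ITD, and IC = 1; a pure 2-sample delay yields ITD = 2/fs.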

3.4. Effective Spatial Perception Data

The resolutions, or quantization steps, of the binaural cues (Figure 12) can be determined by JND experiments. Denote by Δt, Δl, and Δc the resolutions of ITD, ILD, and IC, respectively. In general, they are signal dependent and frequency dependent. For simplicity, we use the constant values reported in [44, 54].

IC has different impacts on ITD and ILD perception. In 2001, Hartmann and Constan reported that the JND of ILD differs by only 0.5 dB between correlated and uncorrelated noises [55]. This can be explained by the fact that signal power is independent of phase, which influences correlation, and lower IC is partly the result of increasing phase noise. This is illustrated in Figure 13: when IC decreases, the gradient along the ILD axis stays almost unchanged, but the gradient along the ITD axis decreases significantly.
Figure 13

The different effects of IC on ITD and ILD perception.

Larger IC usually implies higher ITD perception precision, or equivalently more spatial information. When IC approaches 1, the activity surface has a sharp dip toward the point with the lowest auditory nerve activity. In this case, the uncertainty of ITD is very small, and ITD is determined precisely. When IC decreases toward 0, the surface becomes flatter, leading to larger uncertainty, or lower precision, of ITD. In the extreme case, when $\mathrm{IC}=0$, the gradient along the ITD axis is constantly 0; there is no well-defined trough point, and ITD is completely indeterminable.

By the above analysis, we ignore the effect of IC on ILD and consider only the effect of IC on ITD for SPE computation. Lower IC leads to lower resolution of ITD, which is equivalent to a higher JND of ITD. The effective JND on subband $b$, denoted $\Delta^{\mathrm{eff}}_{\mathrm{ITD}}(b)$, can then be formulated as

$$\Delta^{\mathrm{eff}}_{\mathrm{ITD}}(b) = \frac{\Delta_{\mathrm{ITD}}}{\mathrm{IC}(b)}. \tag{13}$$

From (13) we see that when $\mathrm{IC}(b)=1$, $\Delta^{\mathrm{eff}}_{\mathrm{ITD}}(b)$ assumes its minimum $\Delta_{\mathrm{ITD}}$ and the auditory system has the highest resolution for ITD; when $0<\mathrm{IC}(b)<1$, $\Delta_{\mathrm{ITD}}<\Delta^{\mathrm{eff}}_{\mathrm{ITD}}(b)<\infty$, the resolution of ITD is lower but there is still spatial information from ITD; and when $\mathrm{IC}(b)=0$, $\Delta^{\mathrm{eff}}_{\mathrm{ITD}}(b)=\infty$, the resolution of ITD is 0 and there is no spatial information in ITD.

We then obtain the effective perception data $\widehat{\mathrm{ILD}}(b)$, $\widehat{\mathrm{ITD}}(b)$, and $\widehat{\mathrm{IC}}(b)$ of ILD, ITD, and IC, respectively, by quantization:

$$\widehat{\mathrm{ILD}}(b)=\left\lfloor\frac{\mathrm{ILD}(b)}{\Delta_{\mathrm{ILD}}}\right\rfloor,\quad \widehat{\mathrm{ITD}}(b)=\left\lfloor\frac{\mathrm{ITD}(b)}{\Delta^{\mathrm{eff}}_{\mathrm{ITD}}(b)}\right\rfloor,\quad \widehat{\mathrm{IC}}(b)=\left\lfloor\frac{\mathrm{IC}(b)}{\Delta_{\mathrm{IC}}}\right\rfloor,$$

where $\lfloor\cdot\rfloor$ represents the round-down (floor) function.
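The effective ITD JND and the floor quantization can be sketched as below. The form $\Delta_{\mathrm{ITD}}/\mathrm{IC}$ is one simple formula consistent with the limiting cases stated in the text (IC = 1 gives the minimum JND, IC = 0 gives infinity); the paper's exact expression may differ, so treat it as an assumption:

```python
import numpy as np

def effective_itd_jnd(ic_b, d_itd):
    """Effective ITD JND on a band (assumed form d_itd / IC):
    IC = 1 -> d_itd (finest resolution), IC -> 0 -> infinity (no info)."""
    return np.inf if ic_b == 0 else d_itd / ic_b

def quantize(value, step):
    """Effective perception datum: round-down (floor) quantization."""
    return int(np.floor(value / step))
```

With an assumed base JND of 0.05 ms, IC = 0.5 doubles the effective JND to 0.1 ms, halving the ITD resolution.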

Suppose that $\widehat{\mathrm{IC}}$, $\widehat{\mathrm{ITD}}$, and $\widehat{\mathrm{ILD}}$ are uniformly distributed over the ranges given by (6), (7), and (8). The SPE of IC, ITD, and ILD are then

where $N$ is the number of spectral lines in one transform (1024 in this case); $\mathrm{ILD}(b)$, $\mathrm{ITD}(b)$, and $\mathrm{IC}(b)$ can be found from (9), (10), and (11), respectively; $\Delta_{\mathrm{ILD}}$, $\Delta_{\mathrm{ITD}}$, and $\Delta_{\mathrm{IC}}$ are the JNDs of ILD, ITD, and IC on CB $b$, respectively, obtained from subjective listening experiments; and the amplitude compression factor is assumed to be 0.6 [5].
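Under the uniform-distribution assumption, each cue on each band contributes $\log_2(\text{range}/\text{JND})$ bits, and dividing the per-frame total by the frame duration gives a rate in bits per second. The following is a sketch of that accounting only, with our own names; the paper's full expressions (including the compression factor) are in its equations:

```python
import numpy as np

def spe_uniform(ranges, steps, frame_dur):
    """Entropy rate of uniformly distributed cues quantized with the
    given JND steps: sum of log2(range/step) bits per frame, divided
    by the frame duration (seconds) to give bits/s."""
    bits = sum(np.log2(r / s) for r, s in zip(ranges, steps))
    return bits / frame_dur
```

For example, one cue with a range of 8 JND steps contributes 3 bits per frame.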

4. Experiments

We evaluate the SPE of 126 stereo sequences from 3GPP and MPEG, classified into speech, single instrument, simple mixture, and complex mixture, all sampled at 44.1 kHz. For comparison, we also evaluate the PE of these sequences.

Figure 14 gives the computational procedure of SPE: stereo audio signals are windowed and block-transformed to the frequency domain using a 2048-point DFT; then, on the 25 CBs, binaural cues are derived and transformed into effective spatial perception data, whose entropy is the SPE.
Figure 14

Flowchart of SPE Computation.

In the following experiments, $\Delta_{\mathrm{ITD}}$, $\Delta_{\mathrm{ILD}}$, and $\Delta_{\mathrm{IC}}$ assume constant and conservative values, and their frequency dependency is ignored. The overall SPE is the sum of the entropies of the effective IC, ILD, and ITD perception data, as shown in (4).
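The front end of this procedure can be sketched as follows. The Hann window is our assumption, since the text only states that the signals are windowed; `rfft` keeps the usable half of the 2048-point spectrum:

```python
import numpy as np

def frame_spectra(left, right, N=2048):
    """Window one N-sample frame of a stereo pair and take its DFT,
    as in the SPE flowchart. The Hann window is an assumption."""
    w = np.hanning(N)
    XL = np.fft.rfft(left[:N] * w)
    XR = np.fft.rfft(right[:N] * w)
    return XL, XR
```

The per-band cue functions are then applied to `XL` and `XR` on each of the 25 CBs.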

4.1. Perceptual Spatial Information of Stereo Sequences

In this experiment, we compute the perceptual spatial information by SPE for 4 classes of stereo sequences (Figure 15). Each class consists of 12 sequences sampled at 44.1 kHz; each data point is the average SPE over one sequence, measured in kbps.
Figure 15

Perceptual spatial information of stereo sequences sampled at 44.1 kHz.

From Figure 15 we find that speech sequences generally have the lowest spatial information rate (mean 2.75 kbps); this accords with the recording practice that voices usually stay at the front center of the sound field. Single instrument and simple mixture sequences have similar spatial information rates (means 3.49 kbps and 3.66 kbps, respectively). Complex mixture sequences generally have the highest spatial information rate (mean 6.90 kbps), which can be explained by the multiple sound sources at diverse sound field locations in this type of sequence.

In Parametric Stereo (PS) coding [56], a spatial parameter bitrate of 7.7 kbps is reported to be sufficient for transparent spatial audio quality, agreeing very well with our SPE computation.

4.2. Temporal Variation of Spatial Information Rate in a Single Sequence

In this experiment, we choose two MPEG sequences, es02 (German male speech) and sc03 (contemporary pop music), and compute their SPE frame by frame (Figure 16).
Figure 16

SPE of es02 (speech) and sc03 (pop). (a) waveform of es02; (b) SPE curve of es02; (c) waveform of sc03; (d) SPE curve of sc03.

The test data show that for es02, with a stable voice from the front, SPE stays at 1-2 kbps; for sc03, with multiple instruments and a strong spatial impression, SPE stays at about 7 kbps. Within either sequence, SPE changes little.

4.3. Overall Perceptual Information in Stereo Sequences

When PE is used to evaluate the perceptual information, only intrachannel redundancy and irrelevancy are exploited; the overall PE is simply the sum of the PEs of the left and right channels. When SPE based on BCPPM is used, interchannel redundancy and irrelevancy are also exploited; the overall perceptual information is about one normal audio channel plus some spatial parameters, which has a significantly lower bitrate.

For this reason, PE gives a much higher bitrate bound than SPE (Figure 17). PE is compatible with traditional perceptual coding schemes, such as MP3 and AAC, in which channels are basically processed individually (except for mid/side stereo and intensity stereo), so PE gives a meaningful bitrate bound for them. But in Spatial Audio Coding (SAC) [52, 54, 57-59], multichannel audio signals are processed as one or two core channels plus spatial parameters. SPE is necessary in this case and generally gives a much lower bitrate bound (about 1/2). This agrees with the sharp bitrate reduction of SAC.
Figure 17

Perceptual information of stereo sequences sampled at 44.1 kHz, evaluated using PE and SPE.

5. Conclusion

We have developed the Binaural Cue Physiological Perception Model (BCPPM) to measure the perceptible information, or Spatial Perceptual Entropy (SPE), in multichannel audio signals, and have given a lower bitrate bound in multimedia communications for this type of content. BCPPM models the physical and physiological processing of human spatial hearing as parallel lossy communication subchannels with inter-subchannel interference, and SPE is the overall channel capacity. Each of these subchannels carries ITD, ILD, or IC with additive noise resulting from the intrinsic noise of binaural cue perception and from interference among the cues within the same CB. Experiments on stereo signals of different types have confirmed that SPE is compatible with the spatial parameter bitrates and spatial impressions in SAC.

Nevertheless, SPE gives only the lower bitrate bound for transparent quality. We will extend SPE to give the bound for a given subjective quality in the future. Then, in mobile, Internet, and other communication networks conveying multichannel audio signals, the estimated bound can be used to allocate bandwidth for a particular Quality of Service (QoS), transparent or degraded, and thus save bandwidth or improve the overall QoS. On the other hand, current SAC may benefit from SPE by dynamically allocating bitrate to accommodate varying spatial content, thus improving quality and reducing overall bitrate.



This research is supported by the National Science Foundation of China Grant no. 60832002.


  1. Shannon CE: A mathematical theory of communication. Bell System Technical Journal 1948, 27: 379-423, 623-656.
  2.
  3. Painter T, Spanias A: Perceptual coding of digital audio. Proceedings of the IEEE 2000, 88(4):451-513.
  4. Zwicker E, Fastl H: Psychoacoustics: Facts and Models. Springer, Berlin, Germany; 1990.
  5. Moore BCJ: An Introduction to the Psychology of Hearing. 5th edition. Elsevier Academic Press, London, UK; 2003.
  6. Zwicker E, Zwicker UT: Audio engineering and psychoacoustics: matching signals to the final receiver, the human auditory system. Journal of the Audio Engineering Society 1991, 39(3):115-126.
  7. Hall JL: Auditory psychophysics for coding applications. In The Digital Signal Processing Handbook. Edited by Madisetti V, Williams D. CRC Press, Boca Raton, Fla, USA; 1998:39.1-39.25.
  8. Moore BCJ: Masking in the human auditory system. In Collected Papers on Digital Audio Bit-Rate Reduction. Edited by Gilchrist N, Grewin C. Audio Engineering Society, New York, NY, USA; 1996:9-19.
  9. ISO/IEC JTC1/SC29/WG11: Information Technology—Generic Coding of Moving Pictures and Associated Audio Information—Part 7: Advanced Audio Coding (AAC). ISO/IEC 13818-7, 2005.
  10. ISO/IEC JTC1/SC29/WG11: Information Technology—Generic Coding of Moving Pictures and Associated Audio Information—Part 3: Audio, Subpart 4: General Audio Coding. ISO/IEC 14496-3, 2005.
  11. Bosi M, Goldberg RE: Introduction to Digital Audio Coding and Standards. Kluwer Academic Publishers, Boston, Mass, USA; 2003.
  12. Johnston JD: Transform coding of audio signals using perceptual noise criteria. IEEE Journal on Selected Areas in Communications 1988, 6(2):314-323.
  13. Johnston JD: Estimation of perceptual entropy using noise masking criteria. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '88), May 1988, 2524-2527.
  14. 3GPP: Mandatory speech CODEC speech processing functions; AMR speech codec; General description. 3GPP TS 26.071, 2008, http://www.3gpp.org/ftp/Specs/html-info/26071.htm
  15. 3GPP: Speech codec speech processing functions; Adaptive Multi-Rate—Wideband (AMR-WB) speech codec; General description. 3GPP TS 26.171, 2008, http://www.3gpp.org/ftp/Specs/html-info/26171.htm
  16. ISO/IEC JTC1/SC29/WG11 MPEG: Information technology—coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s—part 3: audio. ISO/IEC 11172-3, 1992.
  17. Blauert J: Spatial Hearing: The Psychophysics of Human Sound Localization. MIT Press, Cambridge, Mass, USA; 1997.
  18. Hofman PM, Van Riswick JGA, Van Opstal AJ: Relearning sound localization with new ears. Nature Neuroscience 1998, 1(5):417-421.
  19. Strutt JW: On our perception of sound direction. Philosophical Magazine 1907, 13: 214-232.
  20. Macpherson EA, Middlebrooks JC: Listener weighting of cues for lateral angle: the duplex theory of sound localization revisited. Journal of the Acoustical Society of America 2002, 111(5):2219-2236.
  21. Blauert J: Sound localization in the median plane. Acustica 1969-1970, 22(4):205-213.
  22. Hebrank J, Wright D: Spectral cues used in the localization of sound sources on the median plane. Journal of the Acoustical Society of America 1974, 56(6):1829-1834.
  23. Butler RA, Belendiuk K: Spectral cues utilized in the localization of sound in the median sagittal plane. Journal of the Acoustical Society of America 1977, 61(5):1264-1269.
  24. Rakerd B, Hartmann WM, McCaskey TL: Identification and localization of sound sources in the median sagittal plane. Journal of the Acoustical Society of America 1999, 106(5):2812-2820.
  25. Musicant AD, Butler RA: The influence of pinnae-based spectral cues on sound localization. Journal of the Acoustical Society of America 1984, 75(4):1195-1200.
  26. Asano F, Suzuki Y, Sone T: Role of spectral cues in median plane localization. Journal of the Acoustical Society of America 1990, 88(1):159-168.
  27. Møller H, Sørensen MF, Hammershøi D, Jensen CB: Head-related transfer functions of human subjects. Journal of the Audio Engineering Society 1995, 43(5):300-321.
  28. Møller H: Fundamentals of binaural technology. Applied Acoustics 1992, 36(3-4):171-218.
  29. Huang Y, Benesty J (Eds): Spatial hearing. In Audio Signal Processing for Next-Generation Multimedia Communication Systems. Kluwer Academic Publishers, Norwell, Mass, USA; 2004:345-370.
  30. Gardner WG, Martin KD: HRTF measurements of a KEMAR. Journal of the Acoustical Society of America 1995, 97(6):3907-3908.
  31. Algazi VR, Duda RO, Thompson DM, Avendano C: The CIPIC HRTF database. Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Electroacoustics, October 2001, New Paltz, NY, USA, 99-102.
  32. Greenwood DD: A cochlear frequency-position function for several species: 29 years later. Journal of the Acoustical Society of America 1990, 87(6):2592-2605.
  33. Greenwood DD: Critical bandwidth and the frequency coordinates of the basilar membrane. Journal of the Acoustical Society of America 1961, 33(10):1344-1356.
  34. von Békésy G: Experiments in Hearing. McGraw-Hill, New York, NY, USA; 1960.
  35. Møller AR: Hearing: Anatomy, Physiology, and Disorders of the Auditory System. 2nd edition. Academic Press, Burlington, Vt, USA; 2006.
  36. Rose JE, Gross NB, Geisler CD, Hind JE: Some neural mechanisms in the inferior colliculus of the cat which may be relevant to localization of a sound source. Journal of Neurophysiology 1966, 29(2):288-314.
  37. Park TJ: IID sensitivity differs between two principal centers in the interaural intensity difference pathway: the LSO and the IC. Journal of Neurophysiology 1998, 79(5):2416-2431.
  38. Joris PX, Van de Sande B, Louage DH, van der Heijden M: Binaural and cochlear disparities. Proceedings of the National Academy of Sciences of the United States of America 2006, 103(34):12917-12922.
  39. Stern RM, Wang DL, Brown G: Binaural sound localization. In Computational Auditory Scene Analysis. Edited by Brown G, Wang DL. Wiley/IEEE Press, New York, NY, USA; 2006.
  40. Breebaart J, van de Par S, Kohlrausch A: The contribution of static and dynamically varying ITDs and IIDs to binaural detection. Journal of the Acoustical Society of America 1999, 106(2):979-992.
  41. Jeffress LA: A place theory of sound localization. Journal of Comparative and Physiological Psychology 1948, 41(1):35-39.
  42. Joris PX, Smith PH, Yin TCT: Coincidence detection in the auditory system: 50 years after Jeffress. Neuron 1998, 21(6):1235-1238.
  43. Breebaart J, van de Par S, Kohlrausch A: Binaural processing model based on contralateral inhibition. I. Model structure. Journal of the Acoustical Society of America 2001, 110(2):1074-1088.
  44. Breebaart J, van de Par S, Kohlrausch A: Binaural processing model based on contralateral inhibition. II. Dependence on spectral parameters. Journal of the Acoustical Society of America 2001, 110(2):1089-1104.
  45. Breebaart J, van de Par S, Kohlrausch A: Binaural processing model based on contralateral inhibition. III. Dependence on temporal parameters. Journal of the Acoustical Society of America 2001, 110(2):1105-1117.
  46. Faller C, Merimaa J: Source localization in complex listening situations: selection of binaural cues based on interaural coherence. Journal of the Acoustical Society of America 2004, 116(5):3075-3089.
  47. Goupell MJ, Hartmann WM: Interaural fluctuations and the detection of interaural incoherence: bandwidth effects. Journal of the Acoustical Society of America 2006, 119(6):3971-3986.
  48. Zurek PM: The precedence effect. In Directional Hearing. Edited by Yost WA, Gourevitch G. Springer, New York, NY, USA; 1987:85-105.
  49. Litovsky RY, Rakerd B, Yin TCT, Hartmann WM: Psychophysical and physiological evidence for a precedence effect in the median sagittal plane. Journal of Neurophysiology 1997, 77(4):2223-2226.
  50. Fletcher H: Auditory patterns. Reviews of Modern Physics 1940, 12(1):47-65.
  51. Scharf B: Critical bands. In Foundations of Modern Auditory Theory. Academic Press, New York, NY, USA; 1970.
  52. Faller C, Baumgarte F: Binaural cue coding—part II: schemes and applications. IEEE Transactions on Speech and Audio Processing 2003, 11(6):520-531.
  53. Baumgarte F: Improved audio coding using a psychoacoustic model based on a cochlear filter bank. IEEE Transactions on Speech and Audio Processing 2002, 10(7):495-503.
  54. Breebaart J, Herre J, Faller C, et al.: MPEG spatial audio coding/MPEG surround: overview and current status. AES 119th Convention, October 2005, New York, NY, USA.
  55. Hartmann WM, Constan ZA: Interaural coherence and the lateralization of noise by interaural level differences. Journal of the Acoustical Society of America 2001, 110(5):2680.
  56. Breebaart J, van de Par S, Kohlrausch A, Schuijers E: Parametric coding of stereo audio. EURASIP Journal on Applied Signal Processing 2005, 2005(9):1305-1322.
  57. Rödén J, Breebaart J, Hilpert J, et al.: A study of the MPEG surround quality versus bit-rate curve. AES 123rd Convention, October 2007, New York, NY, USA.
  58. Breebaart J, Hotho G, Koppens J, Schuijers E, Oomen W, van de Par S: Background, concept, and architecture for the recent MPEG surround standard on multichannel audio compression. Journal of the Audio Engineering Society 2007, 55(5):331-351.
  59. Hilpert J, Disch S: The MPEG surround audio coding standard. IEEE Signal Processing Magazine 2009, 26(1):148-152.

Copyright information

© Shuixian Chen et al. 2010

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Authors and Affiliations

  1. Computer School, Wuhan University, Wuhan, China
  2. National Engineering Research Center for Multimedia Software, Wuhan University, Wuhan, China
