1. Introduction

A central goal in multimedia communications is to deliver quality contents at the lowest possible bitrate. By quality, we mean the perceived fidelity of the received contents against the original contents. The lowest possible bitrate depends on two disparate concepts: entropy and perception. Entropy measures the quantity of information [1], but not all information is perceptible.

To pursue this goal, we want to know how many bits are sufficient to convey quality multimedia contents. Lossless compression always ensures the highest possible quality: the objective redundancy in the multimedia contents is the only source of compression, and there is a limit, the Shannon entropy, on the lowest possible bitrate with perfect decompression. Nevertheless, this limit is very hard, if not impossible, to compute due to the diversity and complexity of the probability models of multimedia contents. Using Huffman coding, run-length coding, arithmetic coding, and other entropy coding techniques, state-of-the-art lossless audio coders today typically achieve a compression ratio of 1/3–2/3, or 230–460 kbps per channel for CD music [2].

Lossless compression generally conveys higher than necessary quality in multimedia communications. Multimedia contents abound with subjective irrelevancy: objective information we cannot sense. Perceptually lossless compression suffices. For audio signals, this means lossless only to the extent that the distortion after decompression is imperceptible to normal human ears (usually called transparent coding); the bitrate can then be much lower than that of true lossless coding. Perceptual audio coding [3], by removing this irrelevancy, greatly reduces communication bandwidth or storage space. Psychoacoustics provides a quantitative theory of this irrelevancy [4–7]: the limits of auditory perception, such as the audible frequency range (20–20000 Hz), the Absolute Threshold of Hearing (ATH), and the masking effect [8]. For state-of-the-art perceptual audio coders, such as MPEG-2/4 Advanced Audio Coding (AAC [9, 10]), 64 kbps is enough for transparent coding [11]. The Shannon entropy cannot measure the perceptible information or give the bitrate bound in this case.

In 1988, Johnston proposed Perceptual Entropy (PE [12, 13]) for audio coding based on psychoacoustics. PE gives the lower bitrate bound for perceptual audio coding:

$$\mathrm{PE}=\frac{1}{N}\sum_{b=1}^{25}\sum_{k=i_b}^{i_b+k_b-1}\left[\log_2\!\left(2\left|\mathrm{nint}\!\left(\frac{\mathrm{Re}\,X(k)}{\sqrt{6T_b/k_b}}\right)\right|+1\right)+\log_2\!\left(2\left|\mathrm{nint}\!\left(\frac{\mathrm{Im}\,X(k)}{\sqrt{6T_b/k_b}}\right)\right|+1\right)\right] \qquad (1)$$

where PE is measured in bits per sample, N is the length of the block transform (usually a DFT), nint() denotes rounding to the nearest integer, i_b the index of the starting bin of subband b, X(k) the kth transform coefficient, T_b the undetectable-distortion upper bound of subband b, and k_b the number of bins in subband b. Table 1 lists PE for various mono audio signals. The last column gives the near-transparent bitrates of current coders, slightly lower than the upper bound of PE.

Table 1 PE and bitrate of various mono audio signals [13].
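
For illustration, the following is a minimal sketch of evaluating (1) on one transform block. The thresholds T[b], the band edges, and the test signal are made-up stand-ins consistent with the definitions above, not outputs of a real psychoacoustic model.

```python
# A minimal sketch of Johnston's PE estimate (1) for one block, assuming the
# masked-threshold values T[b] are already known (here: made-up constants).
import numpy as np

def perceptual_entropy(X, band_edges, T, N):
    """PE in bits per sample; X holds the DFT coefficients of one block.

    band_edges[b] is the starting bin i_b of subband b (last entry closes the
    final band), T[b] the undetectable-distortion bound of subband b, and N
    the transform length.
    """
    bits = 0.0
    for b in range(len(T)):
        lo, hi = band_edges[b], band_edges[b + 1]
        k_b = hi - lo                        # number of bins in subband b
        step = np.sqrt(6.0 * T[b] / k_b)     # quantizer step implied by T_b
        for k in range(lo, hi):
            bits += np.log2(2 * abs(np.rint(X[k].real / step)) + 1)
            bits += np.log2(2 * abs(np.rint(X[k].imag / step)) + 1)
    return bits / N

# Toy usage with random data and arbitrary thresholds:
rng = np.random.default_rng(0)
X = np.fft.rfft(rng.standard_normal(2048))
edges = np.linspace(0, len(X), 26, dtype=int)    # 25 pseudo-subbands
T = np.full(25, 10.0)                            # hypothetical thresholds
print(perceptual_entropy(X, edges, T, N=2048))
```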

We can see that if T_b in (1) assumes conservative (smaller) values, PE will be larger. On the other hand, Adaptive Multirate (AMR [14]) and Adaptive Multirate Wideband (AMR-WB [15]) coding use a priori knowledge of human voicing, further reducing bitrate. Apart from these two points, PE reliably predicts the lowest bitrate required for transparent audio coding. Since its formulation, PE has found widespread use in audio coding and has become a fundamental theory in this field. Mainstream perceptual audio coders, such as MP3 [16] and AAC, all employ PE as an important psychoacoustic parameter, so PE has led to various practical methods, not just theory.

Nevertheless, PE has a significant limitation in measuring perceptual information. This limitation primarily comes from the underlying monaural hearing model. Humans have two ears to receive sound waves in a 3-dimensional space: not only is time and frequency information perceived, which needs just individual ears, but also spatial (localization) information, which needs both ears for spatial sampling. Because it is unaware of binaural hearing, the PE of multichannel audio signals is simplified to the superposition of the PE of the individual channels, which is significantly larger than the real quantity of information received because multichannel audio signals usually correlate. The purpose of this paper is to measure the perceptual information of binaural hearing.

We first analyze the localization principle of binaural hearing and give a spatial hearing model on the physical and physiological layers. We then propose a Binaural Cue Physiological Perception Model (BCPPM) based on binaural hearing. Finally, using the binaural frequency-domain perception property, we give a formula to compute the quantity of spatial information, together with numerical estimates of the spatial information of real-world stereo audio signals.

With the left and right ears, human beings are able to detect spatial information: sound source localization and sound source spaciousness. The former comprises the range, azimuth, and elevation, in other words, the 3-dimensional spherical coordinates. The latter can be measured by the angle span of auditory images.

Human spatial hearing is a complex procedure of physics, physiology, and psychology (Figure 1). Psychology sits at the top of this procedure. On this layer, hearing is transformed into cognition, substantially influenced by the subject's psychological state, by other senses (especially visual perception), and by knowledge, implying that the same sound does not necessarily produce the same hearing perception. In Spatial Hearing, Blauert gives examples in which different subjects in the same sound environment give diverse descriptions of the environment [17]. In 1998, Hofman et al. reported in Nature that subjects with modified pinna shapes lost the elevation detection ability at first but gradually regained that full ability [18]. This phenomenon demonstrates that the subjects were able to learn the correspondence between the frequency response characteristics of the modified pinnae and sounds from different elevations and used that knowledge to guide elevation detection. For these reasons, spatial hearing on the psychological layer is too complicated to be exploited in audio compression systems, which cannot assume any specific states, senses, or knowledge of listeners.

Figure 1: Three layers of auditory sound source localization.

On the physical layer, sound waves propagate from the sources along different paths to the ears, then through the ear canals, and finally to the cochlea, being absorbed and reflected by walls, floors, torso, head, and other objects on the way. These sound waves carry objective localization information. On the physiological layer, sound waves are transformed into neural cell excitation and inhibition by the auditory system. There are different types of auditory neural cells responding to different types of sound stimuli, such as intensity, frequency, and delay. Thus physical quantities become physiological data.

In audio compression, irrelevancy removal operates mainly on the physical and physiological layers. In the following, we discuss the representation of binaural cues on these two layers: BCPPM.

1.1. Spatial Information on the Physical Layer

As early as 1907, Rayleigh studied the physics of spatial hearing [19]: Interaural Time Difference (ITD) and Interaural Level Difference (ILD). Rayleigh also made two seminal discoveries: the famous duplex theory, that is, below 1.5 kHz ITD is the primary localization cue and above 1.5 kHz ILD takes over; and the head-shadow effect, that is, the blocking and reflection of sounds by the head produce an intensity difference of up to 20 dB. Both discoveries were derived from the rigid-ball model of the head (Figure 2).

Figure 2: The rigid ball model of the human head used by Rayleigh.

2. Physiological Perception Modeling of Binaural Hearing

Although a real head is far from a rigid ball, the above results are basically correct. In 2002, Macpherson and Middlebrooks demonstrated that the duplex theory holds for a variety of audio signals: pure tones, wideband signals, high-pass signals, as well as low-pass signals [20]. An exception is high-frequency signals with envelope delays [17].

ITD and ILD are not the only localization cues. On the medial plane (which cuts perpendicularly through the middle of the line connecting the left and right ears), all sound sources have ITD = 0 ms and ILD = 0 dB. But when they have different elevations, our auditory system can detect the difference by elevation-related spectral characteristics [21–24]. Due to the asymmetric structure of the pinnae [25], the interference of sound waves is both wavelength related and elevation related (Figure 3). For example, the frequency of the lowest spectral amplitude (interference annihilation) is a function of the elevation [26]. This is the root of our elevation detection ability. This spectral cue does not depend on binaural hearing, so it is also called the monaural cue.

Figure 3: Elevation angle detection (modified from http://interface.cipic.ucdavis.edu/CIL_tutorial/3D_psych/elev.htm).

Unlike ILD and ITD, the spectral cue needs prior knowledge to provide elevation information. In principle, sounds may have arbitrary spectra. A listener is not able to detect the elevation angle based solely on the spectra: any spectral characteristic may come from the sound source itself or from the filtering effect of the pinnae, and the listener cannot tell which.

Blauert reported a very interesting auditory phenomenon for narrow-band sound sources on the medial plane: the elevation angles given by subjects are independent of the real elevation angles but depend on the signal frequencies [17]. For wide-band signals of familiar types, it is easy for our auditory system to compare the pinna-filtered spectra (some frequencies amplified and some attenuated) with the spectra in memory and, based on the difference, give reliable elevation angle estimates (Figure 3). But for narrow-band signals, pinna-filtered spectra do not have a detectable shape difference, just a level difference, so elevation angle detection becomes very unreliable. In fact, the elevation angles given by the subjects are the angles at which the narrow-band signals have the maximum gain due to pinna filtering. For example, the peak gain frequency for sounds coming from the front is 3 kHz for most people [21]. So wherever a sound of 3 kHz came from, most subjects pointed to the front.

From the perspective of signal processing, sound wave propagation is roughly a Linear Time-Invariant (LTI) system. To describe this LTI system in binaural hearing, we have the Head-Related Transfer Function (HRTF [27–29]) or, equivalently, the Head-Related Impulse Response (HRIR). In open space, the HRTF/HRIR is a function of the source location, that is, range, azimuth, and elevation.

Figure 4 shows the HRTFs in binaural hearing. The signal S(f) goes from the source through the left and right paths to the left and right ears, respectively. Denote by H_L(f) the left path HRTF and by H_R(f) the right path HRTF. Then H_L(f)S(f) is the entrance signal of the left ear, and H_R(f)S(f) that of the right ear. Since the signal may have any spectrum, localization cannot be determined solely by H_L(f)S(f) or H_R(f)S(f).

Figure 4: Binaural hearing transfer functions.

Suppose that there are no strict zeros in the signal and the HRTFs. To exclude the effect of S(f), we define the Binaural Difference Transfer Function (BDTF):

$$H_\Delta(f)=\frac{H_L(f)}{H_R(f)} \qquad (2)$$

which is independent of S(f) and location related. The BDTF contains the same spatial information as H_L(f) and H_R(f). In fact, we can find ILD and ITD from it:

$$\mathrm{ILD}(f)=20\log_{10}\big|H_\Delta(f)\big|,\qquad \mathrm{ITD}(f)=-\frac{1}{2\pi}\frac{d\,\arg H_\Delta(f)}{df} \qquad (3)$$

Obviously, ILD and ITD are not only source location dependent but also frequency dependent.
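
As a numerical illustration of (2) and (3), the sketch below uses synthetic HRIRs (a pure delay plus gain on each path), so the recovered ILD and ITD are known in advance; real HRTFs would come from measurements such as those discussed next.

```python
# BDTF from a pair of (synthetic) HRTFs, then frequency-dependent ILD and ITD.
import numpy as np

fs, n = 48000, 512
h_left = np.zeros(n);  h_left[10] = 1.0    # hypothetical left-path HRIR
h_right = np.zeros(n); h_right[34] = 0.5   # right path: later and weaker

H_L, H_R = np.fft.rfft(h_left), np.fft.rfft(h_right)
freqs = np.fft.rfftfreq(n, d=1.0 / fs)

H_D = H_L / H_R                            # BDTF, eq. (2); no strict zeros here
ild = 20.0 * np.log10(np.abs(H_D))         # ILD(f) in dB, eq. (3)
phase = np.unwrap(np.angle(H_D))
itd = -np.gradient(phase, 2.0 * np.pi * freqs)   # group delay -> ITD(f) in s

# Expect ILD ~ 6 dB and ITD ~ -0.5 ms (24 samples at 48 kHz) at all frequencies:
print(ild[1:4], itd[1:4] * 1e3)
```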

To obtain an accurate relationship between sound source locations and sound wave propagation, more realistic head models or real heads are needed. In 1994, the MIT Media Lab collected HRTFs at 710 locations in 3-dimensional space using the KEMAR head [30]. In 2001, CIPIC at U.C. Davis measured the HRTFs of 45 subjects and 2 KEMAR heads [31]. These experiments reveal individual differences among HRTFs. Nevertheless, there are common characteristics that are sufficient to derive subject-independent spatial information.

2.1. Spatial Information on the Physiological Layer

In the human auditory system, the ITD and ILD of external sound sources stimulate or inhibit specific neural cells across the full audible frequency range. This process comprises two steps: the Frequency-to-Place Transform (FPT) [32, 33] and Binaural Processing (BP).

In 1960, Békésy reported that sounds of different frequencies generate surface waves on the basilar membrane in the cochlea with peak amplitudes at different places, determined by the frequencies [34]. In other words, a specific frequency is mapped to a specific place on the basilar membrane (the FPT), and the frequency associated with a given place is called its Characteristic Frequency (CF [35]). Hair cells at that place then transform the mechanical swing into electric signals on the auditory nerves.

The neural signals from the left and right ears corresponding to the same frequency meet in the brain, where our auditory system extracts the ITD and ILD information they carry. Currently, there are two kinds of theories on this process: Excitation-Excitation (EE [36]) and Excitation-Inhibition (EI [37]). The former proposes that there are EE-type auditory nerve cells located between the inferior colliculus and the medial superior olive, and specific EE-type cells there have maximum excitation for signals with specific ITD and ILD; the latter proposes that there are EI-type auditory nerve cells located between the inferior colliculus and the lateral superior olive, and specific EI-type cells there have maximum inhibition for signals with specific ITD and ILD. The common ground of the two theories is that specific nerve cells are sensitive only to specific ITD and ILD values, called the characteristic ITD and characteristic ILD. In some of the literature, the characteristic ITD is also called the Best Delay (BD [38]) or Characteristic Delay (CD [39]). Both the EE-type and EI-type theories are supported by physiological research, but the latter better explains the various binaural hearing phenomena [40].

In 1948, Jeffress gave a physiological model for ITD perception [41, 42], the delay line model, a foundational contribution with lasting impact in the field (Figure 5). Neural signals, in the form of spike trains from the left and right auditory pathways, meet at a coincidence counter after traveling along the left and right delay lines and trigger the counter, which is in fact a physiological cross-correlation calculator. The counter with the largest count is the one at which the delay difference along the left and right delay lines exactly compensates the ITD. For example, sounds from the medial plane (ITD = 0) generate the largest counts at the middle counter of the Jeffress network. The coincidence counters can be classified as EE-type auditory nerve cells.

Figure 5: Jeffress model: delay-line network.
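
The coincidence-counting idea can be sketched numerically. The toy below assumes Bernoulli "spike trains" and a symmetric grid of internal taps; it only illustrates the delay-line principle and is not a physiological simulation.

```python
# Jeffress-style search: cross-correlate spike-train-like signals over a grid
# of internal delays; the tap with the maximum coincidence count is the one
# whose internal delay compensates the external ITD.
import numpy as np

rng = np.random.default_rng(1)
itd_samples = 12                              # external ITD (in samples)
left = rng.random(4096) < 0.02                # pseudo spike train, ~2% firing
right = np.roll(left, itd_samples)            # right ear receives it later

delays = np.arange(-20, 21)                   # internal delay-line taps
counts = [np.sum(left & np.roll(right, -d)) for d in delays]
print(delays[int(np.argmax(counts))])         # ~ itd_samples
```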

In 2001, Breebaart et al. extended the Jeffress model by incorporating attenuators [43–45] (Figure 6). An important difference from the Jeffress model is the use of EI-type elements instead of EE-type elements in the Breebaart model. Thanks to the attenuators, ILD can also be extracted by the extended model.

Figure 6: Breebaart model: delay-attenuation network.

In the Breebaart model, only when the internal delay and attenuation exactly compensate the external ITD and ILD does the corresponding EI-type element exhibit the largest inhibition. Thus, knowing the position of the EI-type element with the largest inhibition, the auditory system finds the ITD and ILD of the external audio signals.

The Breebaart model also implies the calculation of Interaural Coherence (IC), which manifests as the trough of the excitation surface, in accordance with the EI-type assumption. Nevertheless, there is no direct physiological quantity related to IC in this model.

In 2004, Faller and Merimaa reported that IC relates to the perceived sound image width and stability, as well as the sound field ambience [46, 47]. On the other hand, motivated by the precedence effect [48, 49] of spatial hearing (sound source localization depends primarily on the direct sounds to the ears and is essentially unaffected by reflection and reverberation, which lower IC), Faller proposed that our auditory system uses ITD and ILD to localize sound sources only when IC approaches 1. Since direct sounds to the ears have near-1 cross-correlation, this explains the precedence effect.

2.2. Binaural Cue Physiological Perception Model (BCPPM)

From the viewpoint of information theory, the channel from the physical layer to the physiological layer is lossy, and less spatial information survives the course (Figure 7).

Figure 7: Spatial information loss.

Since the wavelength (0.017–17 m) of sound in the audible range (20–20000 Hz) is much longer than that of light and comparable to ordinary objects in our surroundings, leading to significant interference and diffraction, the spatial information available to hearing is limited to begin with. This limited information is first compromised by noises and other interference from other sound sources, as indicated by the first loss stage in Figure 7. Then, during the transformation from mechanical swing to electric impulses, part of the information is lost again due to the limited frequency and dynamic ranges, the limited frequency and temporal resolution, and the physiological noises of our auditory system, as indicated by the second loss stage in Figure 7.

The loss of spatial information manifests as offset and dispersion, related to multisource interference and the limited SNR of the physical and physiological systems. For example, a single source sometimes becomes multiple sources with mirrored sound images due to reflection by, say, walls and floors. These sources have the same frequency range, so auditory filtering cannot separate them, and the perceived ITD and ILD are determined by the combined effects of the BDTFs of those sources, typically leading to biased and vaguer location perception (Figure 8). A large sound source has similar localization effects. In the Breebaart model, the resolution of ITD and ILD is limited by the fineness of the delay and attenuation elements: no ITD smaller than the delay offered by one delay element can be detected, and no ILD smaller than the attenuation offered by one attenuation element can be detected. This is analogous to the ATH in monaural hearing. The limited ITD and ILD resolution translates into limited localization resolution.

Figure 8: Two types of spatial information loss.

In Section 1.1, we saw that the physical data of sound source localization in binaural hearing are in the form of ITD and ILD. In Section 2.1, we saw that ITD and ILD are transformed into the maximum inhibition of specific EI-type auditory nerve cells in the Breebaart model, and the physiological data are in the form of coordinates in the delay-attenuation network.

When there are multiple sound sources, background noises, reflection, diffraction, and reverberation, IC becomes another type of physical data conveying the overall sound field information.

Since spatial hearing on the physiological layer is too complex and uncertain to be incorporated into a computational model for common listeners, we restrict the calculation of perceptible spatial information to that directly related to ITD, ILD, and IC and to the physiological data corresponding to these three cues. In fact, spatial coding systems use these cues to represent spatial information.

We first review the psychoacoustic foundations of PE: mainly the nonlinear frequency resolution (Critical Band, CB [50, 51]) of our hearing system, the frequency-domain spreading functions for noise and tones, and tonality estimation.

To calculate PE, Johnston used a Monaural Hearing Model (MHM, Figure 9). In this model, a 25-subband filterbank filters the incoming audio signals. Each subband has a bandwidth of one CB at the corresponding frequency (CB1-CB25 in Figure 9), increasing from low to high frequency. Each subband also acts as a lossy subchannel, and the loss of audio information is due to the intrinsic noise of the hearing system (ATH) and interchannel interference (the masking effect). The ATH is signal independent, usually given as a table or a function fitted to experimental data. Masking is signal dependent, usually obtained by convolving the tonality-dependent spreading functions with the signal spectra. Combining both, we have the effective channel noises (the per-band noise terms in Figure 9).

Figure 9: Monaural Hearing Model (MHM) used to calculate PE.
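
To make the effective-channel-noise idea concrete, here is a toy sketch that spreads per-band energies with an invented triangular weighting and combines the result with a flat stand-in for the ATH. Real coders use tonality-dependent spreading functions and measured ATH curves, so all constants below are assumptions.

```python
# Effective channel noise per critical band: max(masking estimate, ATH).
import numpy as np

def effective_noise(band_energy, ath, spread=(0.1, 1.0, 0.3)):
    """band_energy, ath: per-CB arrays; spread: leakage to (lower, self, upper)."""
    lo, mid, hi = spread
    masking = np.zeros_like(band_energy)
    for b, e in enumerate(band_energy):
        if b > 0:
            masking[b - 1] += lo * e          # spread toward the lower band
        masking[b] += mid * e
        if b + 1 < len(band_energy):
            masking[b + 1] += hi * e          # spread toward the upper band
    return np.maximum(0.01 * masking, ath)    # 0.01: hypothetical masker offset

# A loud band masks its neighbors; quiet bands fall back to the ATH:
print(effective_noise(np.array([1e4, 1.0, 1.0]), np.full(3, 0.5)))
```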

There is no place for localization in the MHM. The critical limitation of the model is the lack of binaural processing: it covers only spectral-temporal information, not spatial information. The Breebaart delay-attenuation network models exactly this binaural processing. So we borrow the idea of the lossy multichannel structure in the MHM and combine the MHM with the Breebaart model to obtain the Binaural Cue Physiological Perception Model (BCPPM, Figure 10).

Figure 10: Binaural Cue Physiological Perception Model (BCPPM).

The BCPPM consists of 3 modules.

Frequency-to-Place Transform in Cochlea.

This process separates sounds into a bank of subband signals, essentially the subband filtering in the MHM. The subband filter can be implemented by a DFT with spectral lines grouped into subbands according to CB, or by the Cochlear Filter Bank (CFB [52]) proposed by Baumgarte in 2002.

Delay-Attenuation Network.

This is the same as in Figure 6. After the Frequency-to-Place Transform, external audio signals become spike trains of auditory nerve signals, which arrive at the corresponding delay-attenuation networks. The networks then output ITD, ILD, and IC for each critical band. From the location of the maximum inhibition (lowest excitation, the trough of the neural excitation surface in Figure 11), we can derive ITD and ILD. From the gradient of the trough, we can derive IC: a faster descent (larger gradient) implies larger IC (toward 1); a slower descent (smaller gradient) implies smaller IC (toward 0).

Figure 11: An example of an auditory nerve excitation surface for a given ITD and ILD, adapted from [42].
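
The EI-element search can be sketched as follows, under the simplifying assumption that an EI element outputs the residual energy after the internal delay and gain are applied; Breebaart's full model adds adaptation loops and internal noise, which are omitted here.

```python
# Toy EI-type delay-attenuation network for one band: the (delay, gain) cell
# with the minimum residual energy (largest inhibition) marks the ITD/ILD.
import numpy as np

fs = 48000
t = np.arange(1024) / fs
src = np.sin(2 * np.pi * 500 * t)
true_itd, true_ild_db = 8, 6.0                      # 8 samples, 6 dB
left = src * 10 ** (true_ild_db / 20)
right = np.roll(src, true_itd)

delays = np.arange(-16, 17)
gains_db = np.arange(-12.0, 12.5, 0.5)
E = np.empty((len(delays), len(gains_db)))
for i, d in enumerate(delays):
    for j, g in enumerate(gains_db):
        # internal delay/gain tries to cancel the interaural difference
        E[i, j] = np.mean((left * 10 ** (-g / 20) - np.roll(right, -d)) ** 2)

i, j = np.unravel_index(np.argmin(E), E.shape)
print(delays[i], gains_db[j])                       # ~ (8, 6.0)
```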

Effective Channel Noises.

The effective channel noises for ITD, ILD, and IC (the three noise terms in Figure 10) are a simplified way to model the limited precision, intrinsic noises, and intersource interference in our hearing system. Part of the noise comes directly from the granularity of the delay and attenuation elements in Figure 6. Generally, these noise terms are functions of frequency. A related concept is the Just Noticeable Difference (JND) in psychoacoustics, which indicates the overall sensitivity of our auditory system. On the other hand, ITD, ILD, and IC are not independent; there are interactions among them, and the effective channel noises should also incorporate these interactions.

3. Computing Spatial Perceptual Entropy (SPE) Based on BCPPM

In this section, we define SPE using the BCPPM and then discuss in detail the computational implementation of BCPPM, including its 3 core components: the CB filterbank, binaural cue computation, and perceptible information computation (Figure 12).

Figure 12: SPE calculation.

3.1. SPE Definition

From the information theory viewpoint, we see BCPPM as a double-input, multiple-output system (Figure 10). The two inputs are the left ear entrance sound and the right ear entrance sound. The multiple outputs consist of 75 effective ITDs, ILDs, and ICs (25 CBs, each with a tuple of ITD, ILD, and IC).

As in computing PE, we view each path that leads to an output as a lossy subchannel. There are then 75 such subchannels. Unlike PE, what a subchannel conveys is not a subband spectrum but one of the ITD, ILD, and IC of the subband corresponding to the subchannel.

In each subchannel, there are intrinsic channel noises (the resolution of spatial hearing), and among subchannels, there are interchannel interferences (the interaction of binaural cues). There is thus an effective noise for each subchannel.

Under this setting, each subchannel has a channel capacity. We denote by SPE(c), SPE(t), and SPE(l) the capacities of the IC, ITD, and ILD subchannels, respectively. SPE is then defined as the overall capacity, the sum of the capacities of all the subchannels:

$$\mathrm{SPE}=\mathrm{SPE}(c)+\mathrm{SPE}(t)+\mathrm{SPE}(l) \qquad (4)$$

To derive SPE(c), SPE(t), and SPE(l), we need probability models for IC, ITD, and ILD. Although the binaural cues are continuous, the effective noise quantizes them into discrete values. Let L, T, and C denote the discrete ILD, ITD, and IC source probability spaces:

$$L:\begin{pmatrix} l_1 & l_2 & \cdots & l_n \\ p(l_1) & p(l_2) & \cdots & p(l_n) \end{pmatrix},\quad T:\begin{pmatrix} t_1 & t_2 & \cdots & t_m \\ p(t_1) & p(t_2) & \cdots & p(t_m) \end{pmatrix},\quad C:\begin{pmatrix} c_1 & c_2 & \cdots & c_r \\ p(c_1) & p(c_2) & \cdots & p(c_r) \end{pmatrix} \qquad (5)$$

where l_i, t_i, and c_i are the ith discrete values of ILD, ITD, and IC, respectively, and p(l_i), p(t_i), and p(c_i) the corresponding probabilities. Then we have

$$\mathrm{SPE}(l)=-\sum_{i} p(l_i)\log_2 p(l_i) \qquad (6)$$
$$\mathrm{SPE}(t)=-\sum_{i} p(t_i)\log_2 p(t_i) \qquad (7)$$
$$\mathrm{SPE}(c)=-\sum_{i} p(c_i)\log_2 p(c_i) \qquad (8)$$

For some probability distributions, say the uniform distribution, (6), (7), and (8) can be readily calculated.
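
A small sketch of (6), (7), and (8); for a uniform distribution over M levels the entropy reduces to log2(M).

```python
# Discrete entropy of a quantized cue, eqs. (6)-(8).
import numpy as np

def discrete_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

print(discrete_entropy([0.25] * 4))     # uniform over 4 levels -> 2.0 bits
print(discrete_entropy([0.5, 0.25, 0.25]))
```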

3.2. CB Filterbank

We use the same method as in PE to implement the CB filterbank. Audio signals are first transformed to the frequency domain by a 2048-point DFT with 50% overlap between adjacent transform blocks. Each DFT spectrum is then partitioned into 25 CBs according to Table 2 [41]. The basic processing unit is the subspectrum of each CB.

Table 2 Critical Bands for 2048-point DFT, sampling frequency 48 kHz [40].
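
A sketch of this filterbank is given below; for self-containment, the band edges are computed from Zwicker's Bark-scale approximation rather than copied from Table 2, so the exact bin counts per CB will differ slightly.

```python
# CB "filterbank": windowed 2048-point DFT with lines grouped into 25 bands.
import numpy as np

fs, N = 48000, 2048

def bark(f):
    # Zwicker's critical-band-rate approximation
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

freqs = np.fft.rfftfreq(N, d=1.0 / fs)
band_of_line = np.minimum(bark(freqs).astype(int), 24)   # bands 0..24

def cb_subspectra(x):
    """One 50%-overlap analysis block -> list of 25 per-CB subspectra."""
    X = np.fft.rfft(x * np.hanning(N))
    return [X[band_of_line == b] for b in range(25)]

x = np.random.default_rng(2).standard_normal(N)
print([len(s) for s in cb_subspectra(x)])                # lines per CB
```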

3.3. Binaural Cues Computation

ILD is the ratio of the left ear entrance signal intensity to the right ear entrance signal intensity. Since the DFT preserves signal energy, we can use the DFT subspectrum energy ratio to compute ILD on each CB [53]:

$$\mathrm{ILD}(b)=10\log_{10}\frac{\sum_{k=A_b}^{A_{b+1}-1}\big|X_L(k)\big|^2}{\sum_{k=A_b}^{A_{b+1}-1}\big|X_R(k)\big|^2} \qquad (9)$$

where b is the index of the CB, A_b and A_{b+1} are the starting DFT spectral indexes of CB_b and CB_{b+1} (Table 2), and X_L(k) and X_R(k) are the kth spectral lines of the left and right ear entrance signals.

A time shift corresponds to a linear phase shift in the frequency domain. Therefore, we can use the group delay (the slope of the phase-frequency curve) of the subband signal to derive the ITD on each subband:

$$\mathrm{ITD}(b)=-\frac{1}{2\pi\,\Delta f_b}\Big[\arg\!\big(X_L(A_{b+1}-1)\,X_R^*(A_{b+1}-1)\big)-\arg\!\big(X_L(A_b)\,X_R^*(A_b)\big)\Big] \qquad (10)$$

where Δf_b is the bandwidth of CB_b and arg represents the phase of a complex number. A more reliable but also more complex method is to use least-squares fitting to find the group delay and then the ITD:

$$\mathrm{ITD}(b)=-\frac{1}{2\pi}\cdot\frac{K_b\sum_k f_k\,\varphi(k)-\sum_k f_k\sum_k\varphi(k)}{K_b\sum_k f_k^2-\big(\sum_k f_k\big)^2} \qquad (11)$$

The summation range, k = A_b to A_{b+1} − 1, is left out for simplicity; here φ(k) = arg(X_L(k)X_R^*(k)) denotes the interaural phase, f_k the frequency of spectral line k, and K_b the number of lines in CB_b.

Because the time-domain normalized correlation at zero lag equals the real part of the normalized correlation in the frequency domain, the IC of each CB can be derived as follows:

$$\mathrm{IC}(b)=\frac{\mathrm{Re}\big(\sum_k X_L(k)\,X_R^*(k)\big)}{\sqrt{\sum_k\big|X_L(k)\big|^2\,\sum_k\big|X_R(k)\big|^2}} \qquad (12)$$

where the summation range is also A_b to A_{b+1} − 1, and "*" represents complex conjugation.
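
The three computations can be sketched compactly as below; the least-squares fit of (11) is delegated to a first-order polynomial fit of the unwrapped interaural phase, and the test signal is a delayed, attenuated copy so the expected ILD (about 6 dB) and ITD (-0.5 ms) are known.

```python
# Per-CB ILD (9), least-squares ITD (11), and IC (12) from DFT spectra.
import numpy as np

def ild_db(L, R):
    return 10.0 * np.log10(np.sum(np.abs(L) ** 2) / np.sum(np.abs(R) ** 2))

def itd_ls(L, R, freqs):
    phi = np.unwrap(np.angle(L * np.conj(R)))   # interaural phase difference
    slope = np.polyfit(freqs, phi, 1)[0]        # least-squares group delay
    return -slope / (2.0 * np.pi)

def ic(L, R):
    num = np.real(np.sum(L * np.conj(R)))
    den = np.sqrt(np.sum(np.abs(L) ** 2) * np.sum(np.abs(R) ** 2))
    return num / den

# Toy usage: the right channel is a delayed (24 samples), attenuated copy.
fs, N = 48000, 2048
x = np.random.default_rng(3).standard_normal(N)
XL = np.fft.rfft(x)
XR = np.fft.rfft(0.5 * np.roll(x, 24))
freqs = np.fft.rfftfreq(N, d=1.0 / fs)
sel = slice(20, 40)                             # lines of one pseudo-CB
print(ild_db(XL[sel], XR[sel]),                 # ~ 6.02 dB
      itd_ls(XL[sel], XR[sel], freqs[sel]) * 1e3,   # ~ -0.5 ms
      ic(XL[sel], XR[sel]))
```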

3.4. Effective Spatial Perception Data

The resolutions, or quantization steps, of the binaural cues (Figure 12) can be determined by JND experiments. Denote by Δ_t, Δ_l, and Δ_c the resolutions of ITD, ILD, and IC, respectively. Generally, they are signal dependent and frequency dependent. For simplicity, we use the constant values given in [44, 54].

IC has different impacts on ITD and ILD perception. In 2001, Hartmann and Constan reported that the difference between the JND of ILD for correlated noises and that for uncorrelated noises is only 0.5 dB [55]. This can be explained by the fact that signal power is independent of phase, which influences correlation, and lower IC is partly the result of increasing phase noise. This is illustrated in Figure 13: when IC decreases, the gradient along the ILD axis stays almost unchanged, but the gradient along the ITD axis decreases significantly.

Figure 13: The different effects of IC on ITD and ILD perception.

Larger IC usually implies higher ITD perception precision or, equivalently, more spatial information. When IC approaches 1, the activity surface descends very sharply toward the point with the lowest auditory nerve activity. In this case, the uncertainty of ITD is very small, and ITD is determined precisely. When IC decreases toward 0, the surface becomes flatter, leading to larger uncertainty or lower precision of ITD. In the extreme case, when IC = 0, the gradient along the ITD axis is constantly 0, there is no well-defined trough point, and ITD is completely indeterminable.

By the above analysis, we ignore the effect of IC on ILD and only consider the effect of IC on ITD in the SPE computation. Lower IC leads to lower ITD resolution, which is equivalent to a higher JND of ITD. The effective JND on subband b, denoted $\tilde{\Delta}_t(b)$, can then be formulated as follows:

$$\tilde{\Delta}_t(b)=\frac{\Delta_t}{\mathrm{IC}(b)} \qquad (13)$$

From (13) we see that when IC(b) = 1, $\tilde{\Delta}_t(b)$ assumes its minimum Δ_t and the auditory system has the highest resolution for ITD; when 0 < IC(b) < 1, Δ_t < $\tilde{\Delta}_t(b)$ < ∞, so the resolution of ITD is lower but there is still spatial information in ITD; and when IC(b) = 0, $\tilde{\Delta}_t(b)$ = ∞, so the resolution of ITD vanishes and there is no spatial information in ITD.

We then obtain the effective perception data $\tilde{l}(b)$, $\tilde{t}(b)$, and $\tilde{c}(b)$ of ILD, ITD, and IC, respectively, by quantization:

$$\tilde{l}(b)=\left\lfloor\frac{\mathrm{ILD}(b)}{\Delta_l}\right\rfloor,\qquad \tilde{t}(b)=\left\lfloor\frac{\mathrm{ITD}(b)}{\tilde{\Delta}_t(b)}\right\rfloor,\qquad \tilde{c}(b)=\left\lfloor\frac{\mathrm{IC}(b)}{\Delta_c}\right\rfloor \qquad (14)$$

where ⌊·⌋ represents the round-down (floor) function.
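
A sketch of (13) and (14) follows; the base steps stand in for the constant JND values cited from [44, 54], which are not restated here.

```python
# IC-dependent effective ITD step (13), then round-down quantization (14).
import numpy as np

delta_t, delta_l, delta_c = 5e-5, 1.0, 0.1   # hypothetical base resolutions

def effective_data(ild, itd, ic):
    ic = max(ic, 1e-12)                      # IC -> 0 gives an infinite step
    eff_delta_t = delta_t / ic               # eq. (13): lower IC, coarser ITD
    return (np.floor(ild / delta_l),         # eq. (14)
            np.floor(itd / eff_delta_t),
            np.floor(ic / delta_c))

print(effective_data(ild=6.0, itd=-5e-4, ic=0.9))
print(effective_data(ild=6.0, itd=-5e-4, ic=0.1))   # coarser ITD quantization
```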

Suppose that $\tilde{l}(b)$, $\tilde{t}(b)$, and $\tilde{c}(b)$ are uniformly distributed. Then, by (6), (7), and (8), the SPE of ILD, ITD, and IC are

$$\begin{aligned}\mathrm{SPE}(l)&=\frac{1}{N}\sum_{b=1}^{25}\log_2\!\left(2\left\lfloor\frac{|\mathrm{ILD}(b)|^{\alpha}}{\Delta_l(b)}\right\rfloor+1\right),\\ \mathrm{SPE}(t)&=\frac{1}{N}\sum_{b=1}^{25}\log_2\!\left(2\left\lfloor\frac{|\mathrm{ITD}(b)|}{\tilde{\Delta}_t(b)}\right\rfloor+1\right),\\ \mathrm{SPE}(c)&=\frac{1}{N}\sum_{b=1}^{25}\log_2\!\left(2\left\lfloor\frac{\mathrm{IC}(b)}{\Delta_c(b)}\right\rfloor+1\right)\end{aligned} \qquad (15)$$

where N is the number of spectral lines in one transform, or 1024 in this case; ILD(b), ITD(b), and IC(b) can be found from (9), (10) or (11), and (12), respectively; Δ_l(b), Δ_t(b), and Δ_c(b) are the JNDs of ILD, ITD, and IC on CB b, respectively, obtained from subjective listening experiments; and α is the amplitude compression factor, assumed to be 0.6 [5].
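
Mirroring the reconstruction of (15) above (the placement of the compression factor alpha follows our reading of the text, not a verified formula), a per-block SPE sketch:

```python
# SPE of one block under the uniform-distribution assumption, eqs. (4), (13), (15).
import numpy as np

N, alpha = 1024, 0.6                # spectral lines per transform; factor [5]

def spe_block(ild, itd, ic, delta_l=1.0, delta_t=5e-5, delta_c=0.1):
    """Bits per sample from per-CB cue arrays (25 entries each)."""
    ic = np.clip(np.asarray(ic, dtype=float), 1e-12, 1.0)
    eff_dt = delta_t / ic                                   # eq. (13)
    spe_l = np.log2(2 * np.floor(np.abs(ild) ** alpha / delta_l) + 1)
    spe_t = np.log2(2 * np.floor(np.abs(itd) / eff_dt) + 1)
    spe_c = np.log2(2 * np.floor(ic / delta_c) + 1)
    return (spe_l.sum() + spe_t.sum() + spe_c.sum()) / N    # eq. (4)

rng = np.random.default_rng(4)
print(spe_block(ild=rng.uniform(-10, 10, 25),
                itd=rng.uniform(-5e-4, 5e-4, 25),
                ic=rng.uniform(0, 1, 25)))
```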

4. Experiments

We evaluate the SPE of 126 stereo sequences from 3GPP and MPEG, classified into speech, single instrument, simple mixture, and complex mixture, all sampled at 44.1 kHz. For comparison, we also evaluate the PE of these sequences.

Figure 14 gives the computational procedure of SPE: the stereo audio signals are windowed and block transformed to the frequency domain by a 2048-point DFT; then, on the 25 CBs, the binaural cues are derived and transformed into effective spatial perception data, whose entropy is the SPE.

Figure 14: Flowchart of SPE computation.
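
Putting the pieces together, here is an end-to-end toy version of this flowchart, reusing the same assumed constants and Bark-based band grouping as in the earlier sketches.

```python
# Toy SPE pipeline for one stereo frame: window -> DFT -> 25 CBs -> cues ->
# IC-dependent quantization -> uniform-entropy SPE (bits/sample).
import numpy as np

fs, N, alpha = 48000, 2048, 0.6
delta_l, delta_t, delta_c = 1.0, 5e-5, 0.1      # assumed constant JNDs

bark = lambda f: 13 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500) ** 2)
freqs = np.fft.rfftfreq(N, 1 / fs)
band = np.minimum(bark(freqs).astype(int), 24)

def spe_frame(xl, xr):
    XL = np.fft.rfft(xl * np.hanning(N))
    XR = np.fft.rfft(xr * np.hanning(N))
    spe = 0.0
    for b in range(25):
        sel = band == b
        L, R, f = XL[sel], XR[sel], freqs[sel]
        el, er = np.sum(np.abs(L) ** 2), np.sum(np.abs(R) ** 2)
        ild = 10 * np.log10(el / er)                          # eq. (9)
        phi = np.unwrap(np.angle(L * np.conj(R)))
        itd = -np.polyfit(f, phi, 1)[0] / (2 * np.pi) if len(f) > 1 else 0.0
        ic = max(np.real(np.sum(L * np.conj(R))) / np.sqrt(el * er), 1e-12)
        spe += np.log2(2 * np.floor(np.abs(ild) ** alpha / delta_l) + 1)
        spe += np.log2(2 * np.floor(np.abs(itd) * ic / delta_t) + 1)  # (13)
        spe += np.log2(2 * np.floor(ic / delta_c) + 1)
    return spe / N

x = np.random.default_rng(5).standard_normal(N)
print(spe_frame(x, 0.7 * np.roll(x, 16)) * fs / 1000, "kbps")
```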

In the following experiments, Δ_t, Δ_l, and Δ_c assume constant and conservative values, and their frequency dependency is ignored. The overall SPE is the sum of the entropies of the effective IC, ILD, and ITD perception data, as shown in (4).

4.1. Perceptual Spatial Information of Stereo Sequences

In this experiment, we compute the perceptual spatial information by SPE for 4 classes of stereo sequences (Figure 15). Each class consists of 12 sequences sampled at 44.1 kHz; each data point is the average SPE over one sequence, measured in kbps.

Figure 15: Perceptual spatial information of stereo sequences sampled at 44.1 kHz.

From Figure 15 we find that speech sequences generally have the lowest spatial information rate (mean 2.75 kbps), in accordance with the recording practice that voices usually stay at the direct front of the sound field; single instrument and simple mixture sequences have similar spatial information rates (means 3.49 kbps and 3.66 kbps, respectively); and complex mixture sequences generally have the highest spatial information rate (mean 6.90 kbps), which can be explained by the multiple sound sources at diverse sound field locations in this type of sequence.

In Parametric Stereo (PS [56]) coding, it is reported that 7.7 kbps of spatial parameter bitrate is sufficient for transparent spatial audio quality, agreeing very well with our SPE computation.

4.2. Temporal Variation of Spatial Information Rate in a Single Sequence

In this experiment, we choose two sequences from MPEG, es02 (German male speech) and sc03 (contemporary pop music), and compute their SPE frame by frame (Figure 16).

Figure 16: SPE of es02 (speech) and sc03 (pop). (a) Waveform of es02; (b) SPE curve of es02; (c) waveform of sc03; (d) SPE curve of sc03.

The test data show that for es02, with a stable voice from the front, SPE stays at 1-2 kbps; for sc03, with multiple instruments and a strong spatial impression, SPE stays at about 7 kbps. Within either sequence, however, the SPE changes little.

4.3. Overall Perceptual Information in Stereo Sequences

When PE is used to evaluate the perceptual information, only intrachannel redundancy and irrelevancy are exploited; the overall PE is simply the sum of the PE of the left and right channels. With SPE based on BCPPM, interchannel redundancy and irrelevancy are also exploited; the overall perceptual information is about that of one normal audio channel plus some spatial parameters, which has a significantly lower bitrate.

For the above reason, PE gives a much higher bitrate bound than SPE (Figure 17). PE is compatible with traditional perceptual coding schemes, such as MP3 and AAC, in which channels are basically processed individually (except for mid/side stereo and intensity stereo), so PE gives a meaningful bitrate bound for them. But in Spatial Audio Coding (SAC [52, 54, 57–59]), multichannel audio signals are processed as one or two core channels plus spatial parameters. SPE is necessary in this case and generally gives a much lower bitrate bound (about 1/2). This agrees with the sharp bitrate reduction of SAC.

Figure 17: Perceptual information of stereo sequences sampled at 44.1 kHz, evaluated using PE and SPE.

5. Conclusion

We have developed the Binaural Cue Physiological Perception Model (BCPPM) to measure the perceptible information, or Spatial Perceptual Entropy (SPE), in multichannel audio signals and have given a lower bitrate bound for this type of content in multimedia communications. BCPPM models the physical and physiological processing of human spatial hearing as a parallel set of lossy communication subchannels with inter-subchannel interference, and SPE is the overall channel capacity. Each of these subchannels carries ITD, ILD, or IC with additive noise resulting from the intrinsic noises of binaural cue perception and the interference among the cues within the same CB. Experiments on stereo signals of different types have confirmed that SPE is compatible with the spatial parameter bitrates and spatial impressions in SAC.

Nevertheless, SPE gives only the lower bitrate bound for transparent quality. In the future, we will extend SPE to give the bound for a given subjective quality. Then, in mobile, internet, and other communication networks conveying multichannel audio signals, we can use the estimated bound to allocate bandwidth for a particular Quality of Service (QoS), transparent or degraded, and thus save bandwidth or improve the overall QoS. On the other hand, current SAC may benefit from SPE by dynamically allocating bitrate to accommodate varying spatial contents, thus improving quality and reducing the overall bitrate.