# Expanded three-channel mid/side coding for three-dimensional multichannel audio systems

## Abstract

Three-dimensional (3D) audio technologies are booming with the success of 3D video technology. The surge in the number of audio channels produces data volumes that exceed the available transmission bandwidth and storage capacity, so signal compression for 3D audio systems has become an important task. This paper investigates the conventional mid/side (M/S) coding method and discusses the signal correlation properties of three-dimensional multichannel systems. Then, based on the channel triple, a three-channel dependent M/S coding (3D-M/S) method is proposed to reduce interchannel redundancy, and the corresponding transform matrices are presented. Furthermore, a framework is proposed that enables 3D-M/S to compress any number of audio channels. Finally, the masking threshold of the perceptual audio core codec is modified so that the final coding noise meets the perceptual threshold constraint of the original channel signals. Objective and subjective tests with panning signals indicate an increase in coding efficiency compared with independent channel coding and a smaller complexity increase than a PCA method.

### Keywords

Original Signal, Switching Condition, Virtual Source, Decoder Side, Multichannel System

## Introduction

Recently, 3D audio has attracted growing attention and developed quickly, following the booming market for 3D movies. Many 3D audio technologies are now being introduced into audio applications to replace surround sound systems, providing superior sound localization and an immersive feeling. Wave field synthesis (WFS), Ambisonics and vector-based amplitude panning (VBAP) are the three most well-developed technologies. WFS generally follows Huygens' principle to reconstruct the original sound field [1]. Research institutions such as Fraunhofer IDMT and IRCAM in France have studied WFS intensively and are attempting to bring it to theaters and live concert transmission. Ambisonics uses spherical harmonic functions to record the sound field and drive the loudspeakers; it requires a rigorous loudspeaker configuration and gives a good sound field reconstruction at the center [2]. VBAP follows the tangent law in three-dimensional space, using three adjacent loudspeakers to form a sound vector. Because of its simplicity, VBAP is the most common algorithm in 3D signal panning [3]. A 3D system such as the 22.2 multichannel system proposed by NHK in Japan uses VBAP to generate 3D sound images [4]. The 22.2 multichannel system is also included in the MPEG-H standard, currently under development, for rendering 3D audio scenes.

There is a clear trend that 3D audio technology will gradually mature and replace stereo and surround sound [5]. However, a common feature of 3D audio technologies is the large number of sound channels. For instance, a WFS system typically contains dozens or even hundreds of audio channels. The 22.2 system has three layers and 24 audio channels. Although an Ambisonics system can have a flexible order and channel count, it usually uses dozens of channels because fewer channels cause quality deterioration. Compared with two-channel stereo and 5.1 surround sound, the increase in audio channels causes a dramatic increase in 3D audio data. A report from Fraunhofer shows that 37 Mbps is needed for live transmission of WFS [6]. For the 22.2 multichannel system, the uncompressed data rate reaches 28 Mbps [7]. Current storage media and transmission bandwidth can hardly accommodate such data volumes, so the compression of 3D multichannel audio signals has become an important subject.

The well-known Spatial Audio Coding (SAC) models the signals as virtual sound sources in the frequency domain, extracts the interchannel level difference (ICLD), interchannel time difference (ICTD) and interchannel coherence (IC) to represent the direction and width of the virtual sound source, and downmixes the channels to reduce redundancy [8, 9, 10, 11]. The idea of using downmixed sources with spatial parameters was later developed into Spatial Audio Object Coding (SAOC), which efficiently codes multiple input spatial audio objects with interactive and personalized rendering ability [12]. More recently, other investigations have aimed to increase the compression efficiency for multichannel 3D audio signals. In 2007, Goodwin and Jot proposed a PCA-based multichannel compression framework for parametric coding [13], which can enhance specific audio scenarios and provide robust spatial audio coding. In 2008, Cheng et al. proposed Spatially Squeezed Surround Audio Coding (S$^{3}$AC) for parametrically compressing Ambisonics signals [14]. In 2009, Hellerud used an interchannel prediction-based coding method to remove the redundancy between Ambisonics channels [15], which has low algorithmic delay but high computational complexity. Tzagkarakis used a sinusoidal model and linear prediction to parameterize the separate spot microphone channels and then downmixed the residual signals. This coding scheme is more suitable for multichannel signals with weak correlation, and such scenarios require independent channel decoding [16]. In 2010, Pinto et al. used a space/time-frequency transform to decompose WFS signals into plane waves and evanescent waves. Coding gain is obtained by discarding the evanescent waves and perceptually coding the plane wave signals. Coding efficiency increases with the number of audio channels, because the accuracy of the transform decomposition depends on the spatial resolution, which is determined by the number of WFS channels [17, 18]. In 2013, Cheng further proposed a Spatial Localization Quantization Point (SLQP) codec that uses localization cues to compress 3D audio signals [19, 20]. Since SLQP extracts the spatial cues and downmixes the channels, it achieved a high compression ratio for SLQP signals and other 3D audio systems.

To increase the coding efficiency at high bitrates, some non-parametric coding schemes were developed. Yang proposed a scalable multichannel codec that uses the Karhunen-Loeve Transform (KLT) to remove interchannel redundancy [21]. Mid/side (M/S) coding was introduced by J.D. Johnston [22] and adopted by many audio codecs such as MPEG-2 Layer III and MPEG-4 AAC. In 2003, Liu et al. proposed a bit allocation method for M/S coding based on allocation entropy, which increases the objective quality by allocating more bits to the high-energy channel [23]. In 2008, Derrien et al. proposed an error model for M/S coding. The error model enables the quantizers used for channels M and S at the encoder to be tuned with respect to the distortion of L and R at the decoder side, which increases the coding efficiency of M/S without much added complexity [24]. Since M/S coding works as the simplest form of interchannel prediction, Krueger generalized it by using linear prediction instead of the M/S transformation and a residual signal instead of the difference signal [25]. In 2012, Schafer extended Krueger's method to the multichannel case with low algorithmic delay [26]. Recently, M/S coding was combined with parametric stereo coding at low bitrates in the MPEG-USAC standard [27] by predicting the residual channel using spatial cue-based parameters, with the aim of bridging the stereo quality gap between low and high bitrates [28]. M/S coding also works alone at high bitrates, using a novel complex prediction to achieve better performance [29].

The model-based and parametric codecs above can offer a considerable compression ratio. However, those methods need to know the direction of the real audio source to perform object-oriented coding, or they must estimate a virtual source direction for downmixing and parametric coding. In practice, for example in live recording, it is very difficult to obtain the real audio source direction. Downmixing and parametric coding cause interchannel interference such as ‘tone leakage’ artifacts when the channel signals differ greatly [30]. Furthermore, the computational complexity of an audio codec should remain acceptable while maintaining sufficient coding efficiency, and parametric coding can achieve a performance gain only at low bitrates. This paper focuses on the situation in which only the multichannel signals of the audio sources are recorded, not their directions. We consider high-quality, high-bitrate applications and focus on non-parametric coding. Section ‘M/S coding in 3D space’ describes the conventional M/S coding process and presents a three-channel dependent M/S coding (3D-M/S) method. The main idea is to expand M/S coding to three-dimensional audio by designing a new transform matrix, which removes the redundancy of three channels in 3D space rather than just two channels in the horizontal plane. Section ‘3D-M/S psychoacoustic model’ discusses the psychoacoustic model for transformed 3D-M/S signals. Section ‘Framework for general channel configuration’ specifies a new framework that enables 3D-M/S to be applied to more general channel configurations. Section ‘Experiment’ compares 3D-M/S coding with PCA coding and independent channel coding in terms of compression ratio and computational complexity. Section ‘Conclusion’ summarizes and concludes this paper.

## M/S coding in 3D space

### Conventional M/S coding

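In conventional stereo panning, the input signal vector $\mathbf{V}_0 = (C_L, C_R)$ is given by

$$C_L = S\sin\theta, \qquad C_R = S\cos\theta$$

where $S$ is the virtual audio source, $\theta$ is the stereo panning angle and $\theta \in \left[0,\frac{\pi}{2}\right]$. M/S coding can be denoted as two transform matrices $\mathbf{M}_0$ and $\mathbf{M}_1$, where the summation vector of $\mathbf{M}_1$ is denoted as $\mathbf{V}_1=\left(\frac{\sqrt{2}}{2},\frac{\sqrt{2}}{2}\right)$:

$$\mathbf{M}_0=\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}, \qquad \mathbf{M}_1=\begin{pmatrix}\frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2}\\ \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2}\end{pmatrix}$$

The input signal is transformed by $\mathbf{M}_1$ as $(C_M, C_S)=\mathbf{M}_1\mathbf{V}_0$, where $C_M$ and $C_S$ are the sum and difference signals of $C_L$, $C_R$ in $\mathbf{V}_0$. The M/S mode is used only when the two channels are sufficiently correlated, for example when their energy difference is less than a threshold Thr = 2 dB as shown in (4). This avoids frequent switching of the transform and recalculation of the masking threshold [22].

$$\left|10\log_{10}\frac{C_L^2}{C_R^2}\right| < \mathrm{Thr} \qquad (4)$$

This switching condition can be explained by the relationship between $\mathbf{V}_0$ and $\mathbf{V}_1$. Given $\frac{C_L}{C_R}=\tan\theta$, (4) can be denoted as

$$\left|20\log_{10}\tan\theta\right| < \mathrm{Thr}$$

When $\mathbf{V}_0$ is close to $\mathbf{V}_1$, where $\cos\theta \approx \frac{1}{\sqrt{2}}$ and $\theta \approx \frac{\pi}{4}$, the switching condition (4) will be satisfied. The difference signal then has a smaller amplitude than the original signals, and M/S coding will be used. Since $\theta$ is the angle of $\mathbf{V}_0$, switching condition (4) can be represented by the inner product between the signal vector $\mathbf{V}_0$ and the summation vector $\mathbf{V}_1$:

$$\frac{\langle\mathbf{V}_0,\mathbf{V}_1\rangle}{\|\mathbf{V}_0\|\,\|\mathbf{V}_1\|} > \mathrm{Thr}_v \qquad (6)$$

where $\mathrm{Thr}_v$ is the switching threshold corresponding to Thr, expressed as a vector distance.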
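The equivalence between the level-difference test and the inner-product test can be checked numerically. The following is a minimal sketch, assuming the panning law $C_L = S\sin\theta$, $C_R = S\cos\theta$ (so that $C_L/C_R=\tan\theta$) and a normalized inner product; the function names and the closed-form mapping from Thr to a vector threshold are illustrative assumptions, not taken from the paper.

```python
import math

SQRT_HALF = math.sqrt(2) / 2


def ms_transform(c_l, c_r):
    """Unit-norm sum/difference transform: (C_M, C_S) = M_1 (C_L, C_R)."""
    return SQRT_HALF * (c_l + c_r), SQRT_HALF * (c_l - c_r)


def level_condition(theta, thr_db=2.0):
    """Level-difference test with C_L / C_R = tan(theta); needs 0 < theta < pi/2."""
    return abs(20 * math.log10(math.tan(theta))) < thr_db


def vector_threshold(thr_db=2.0):
    """Assumed mapping of the dB threshold onto a normalized inner product."""
    return math.cos(math.atan(10 ** (thr_db / 20)) - math.pi / 4)


def vector_condition(theta, thr_db=2.0):
    """Same test expressed as the cosine between V_0 and V_1 = (sqrt2/2, sqrt2/2)."""
    c_l, c_r = math.sin(theta), math.cos(theta)  # source amplitude S = 1
    cosine = SQRT_HALF * (c_l + c_r) / math.hypot(c_l, c_r)
    return cosine > vector_threshold(thr_db)


if __name__ == "__main__":
    for deg in (30, 40, 45, 50, 60):
        theta = math.radians(deg)
        m, s = ms_transform(math.sin(theta), math.cos(theta))
        print(deg, level_condition(theta), vector_condition(theta),
              round(m, 3), round(s, 3))
```

Under these assumptions the two tests agree for every panning angle, and the difference signal shrinks toward zero as $\theta$ approaches $\frac{\pi}{4}$.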

### M/S coding in three-dimensional space

Consider three adjacent channels $C_1$, $C_2$ and $C_3$ that form a virtual source. If $C_1$ and $C_2$ were grouped for parametric or M/S coding, $C_3$ would be grouped with another, less correlated channel, which decreases the coding performance. If the signals are grouped dynamically instead, a complex correlation analysis algorithm is needed to analyze the six adjacent channels of each channel. Because channel-pair grouping is done per frequency subband, this not only increases the codec complexity dramatically but also fails to reduce the overall redundancy that exists among more than two channels. In brief, the conventional channel pair unit should be redesigned for 3D audio systems.

The input signal vector $\mathbf{V}_0 = (C_1, C_2, C_3)$ is calculated following the tangent model in 3D space as

$$C_1 = S\sin\theta\cos\phi, \qquad C_2 = S\sin\theta\sin\phi, \qquad C_3 = S\cos\theta$$

where $\theta ,\phi \in \left[0,\frac{\pi}{2}\right]$ determine the gain factors of the three channels.

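Three situations arise, depending on the position of the virtual source. First, the virtual source is located at a single channel. There is then little interchannel redundancy to remove, and the identity matrix $\mathbf{M}_0$ will be used. Second, the virtual source is located between two channels. This situation corresponds to a source panned mainly using two channels, or to two channels forming a virtual source with another channel outside the current three. It is similar to conventional stereo audio, and M/S coding can be applied. However, the M/S transform matrix must be modified to adapt to the three-channel condition, as expressed in Equation 9. Third, the source is panned using all three channels. This is a new situation that never occurs in stereo audio. To remove the interchannel redundancy, a new transform matrix $\mathbf{M}_4$ is designed following the rule of conventional M/S coding. The first vector is the summation of the three channels, and the remaining vectors are orthogonal to the first. To guarantee conservation of energy after transformation, unit vectors are used. This matrix realizes sum-difference processing for 3D channels and guarantees that when the three channel signals are nearly the same, two channels contain primarily the difference signal.

$$\mathbf{M}_0=\begin{pmatrix}1&0&0\\0&1&0\\0&0&1\end{pmatrix},\quad \mathbf{M}_1=\begin{pmatrix}1&0&0\\0&\frac{\sqrt{2}}{2}&\frac{\sqrt{2}}{2}\\0&-\frac{\sqrt{2}}{2}&\frac{\sqrt{2}}{2}\end{pmatrix},\quad \mathbf{M}_2=\begin{pmatrix}\frac{\sqrt{2}}{2}&0&\frac{\sqrt{2}}{2}\\0&1&0\\-\frac{\sqrt{2}}{2}&0&\frac{\sqrt{2}}{2}\end{pmatrix},$$

$$\mathbf{M}_3=\begin{pmatrix}\frac{\sqrt{2}}{2}&\frac{\sqrt{2}}{2}&0\\-\frac{\sqrt{2}}{2}&\frac{\sqrt{2}}{2}&0\\0&0&1\end{pmatrix},\quad \mathbf{M}_4=\begin{pmatrix}\frac{\sqrt{3}}{3}&\frac{\sqrt{3}}{3}&\frac{\sqrt{3}}{3}\\-\frac{\sqrt{2}}{2}&\frac{\sqrt{2}}{2}&0\\\frac{\sqrt{6}}{6}&\frac{\sqrt{6}}{6}&-\frac{\sqrt{6}}{3}\end{pmatrix} \qquad (9)$$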

An example is shown in Figure 2. When the source is close to the center of two or three channels, the corresponding matrix produces difference signals with a lower dynamic range than the original channel signals. Under a given masking threshold, far fewer bits are required to quantize the difference signals, which yields the coding gain.
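The effect can be reproduced with a small numerical sketch. It assumes the 3D panning gains $C_1=S\sin\theta\cos\phi$, $C_2=S\sin\theta\sin\phi$, $C_3=S\cos\theta$ and a set of five orthonormal matrices whose summation rows are the vectors $\mathbf{V}_1$ to $\mathbf{V}_4$; the signs of the difference rows and all function names are assumptions made for illustration.

```python
import math

R2, R3, R6 = math.sqrt(2) / 2, math.sqrt(3) / 3, math.sqrt(6) / 6

# Assumed matrices: the summation row of M_i is V_i; difference-row signs are a guess.
MATRICES = {
    0: [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    1: [[1, 0, 0], [0, R2, R2], [0, -R2, R2]],
    2: [[R2, 0, R2], [0, 1, 0], [-R2, 0, R2]],
    3: [[R2, R2, 0], [-R2, R2, 0], [0, 0, 1]],
    4: [[R3, R3, R3], [-R2, R2, 0], [R6, R6, -2 * R6]],
}


def pan(theta, phi, s=1.0):
    """Assumed 3D panning gains for the channel triple (C_1, C_2, C_3)."""
    return [s * math.sin(theta) * math.cos(phi),
            s * math.sin(theta) * math.sin(phi),
            s * math.cos(theta)]


def transform(matrix, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


def is_orthonormal(matrix, tol=1e-12):
    """True if the rows are mutually orthogonal unit vectors."""
    for i, a in enumerate(matrix):
        for j, b in enumerate(matrix):
            dot = sum(x * y for x, y in zip(a, b))
            if abs(dot - (1.0 if i == j else 0.0)) > tol:
                return False
    return True


if __name__ == "__main__":
    centre = pan(math.atan(math.sqrt(2)), math.pi / 4)  # equal gains on all three
    pair = pan(math.pi / 2, math.pi / 4)                # equal gains on C_1, C_2 only
    print([round(x, 3) for x in transform(MATRICES[4], centre)])
    print([round(x, 3) for x in transform(MATRICES[3], pair)])
```

With the source at the center of the triple or of a pair, all signal energy lands in the summation output and the difference outputs vanish, which is where the bit savings come from.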

First, if the input vector is close to $\mathbf{V}_4$ ($\text{cot}\theta \approx \frac{\sqrt{2}}{2}$ and $\phi \approx \frac{\pi}{4}$), matrix $\mathbf{M}_4$ will be chosen to give two difference channels. Second, if only two channels satisfy the conventional M/S switching condition and the projection of the input vector is close to $\mathbf{V}_1$ ($\cot\theta \approx \sin\phi$), $\mathbf{V}_2$ ($\cot\theta \approx \cos\phi$) or $\mathbf{V}_3$ ($\phi \approx \frac{\pi}{4}$), 3D-M/S will select the matrix nearest to the input vector. The distance is measured by the vector distance in expression (6). Conventional M/S coding works in two-dimensional space and has two switching areas, whereas the 3D-M/S switching condition is an expansion of M/S coding whose input vector lies in three-dimensional space and has five switching areas. Following the vector distance switching condition, the switching rule of 3D-M/S can be denoted as

$$\mathbf{M}=\begin{cases}\mathbf{M}_4, & d(\mathbf{V}_0,\mathbf{V}_4) > \mathrm{Thr}_v\\ \mathbf{M}_i, & d(\mathbf{V}_{0i},\mathbf{V}_i) > \mathrm{Thr}_v \ \text{and}\ d(\mathbf{V}_0,\mathbf{V}_i) \ge d(\mathbf{V}_0,\mathbf{V}_j)\\ \mathbf{M}_0, & \text{otherwise}\end{cases} \qquad d(\mathbf{a},\mathbf{b})=\frac{\langle\mathbf{a},\mathbf{b}\rangle}{\|\mathbf{a}\|\,\|\mathbf{b}\|}$$

where $i,j\in\{1,2,3\}$; $\mathbf{V}_{01} = (0, C_2, C_3)$, $\mathbf{V}_{02} = (C_1, 0, C_3)$ and $\mathbf{V}_{03} = (C_1, C_2, 0)$ are the two-channel projections of the input vector $\mathbf{V}_0$; and ${\mathbf{V}}_{1}=\left(0,\frac{\sqrt{2}}{2},\frac{\sqrt{2}}{2}\right)$, ${\mathbf{V}}_{2}=\left(\frac{\sqrt{2}}{2},0,\frac{\sqrt{2}}{2}\right)$, ${\mathbf{V}}_{3}=\left(\frac{\sqrt{2}}{2},\frac{\sqrt{2}}{2},0\right)$ and ${\mathbf{V}}_{4}=\left(\frac{\sqrt{3}}{3},\frac{\sqrt{3}}{3},\frac{\sqrt{3}}{3}\right)$ are the summation vectors of each transform matrix.

**Transformed channel signals with five matrices**

| Matrix | First signal | Second signal | Third signal |
|---|---|---|---|
| $\mathbf{M}_0$ | $S\sin\theta\cos\phi$ | $S\sin\theta\sin\phi$ | $S\cos\theta$ |
| $\mathbf{M}_1$ | $S\sin\theta\cos\phi$ | $\frac{\sqrt{2}}{2}S\sin\theta(\cot\theta+\sin\phi)$ | $\frac{\sqrt{2}}{2}S\sin\theta(\cot\theta-\sin\phi)$ |
| $\mathbf{M}_2$ | $\frac{\sqrt{2}}{2}S\sin\theta(\cot\theta+\cos\phi)$ | $S\sin\theta\sin\phi$ | $\frac{\sqrt{2}}{2}S\sin\theta(\cot\theta-\cos\phi)$ |
| $\mathbf{M}_3$ | $S\sin\theta\sin\left(\phi+\frac{\pi}{4}\right)$ | $S\sin\theta\sin\left(\phi-\frac{\pi}{4}\right)$ | $S\cos\theta$ |
| $\mathbf{M}_4$ | $\frac{\sqrt{6}}{3}S\sin\theta\left(\frac{\sqrt{2}}{2}\cot\theta+\sin\left(\phi+\frac{\pi}{4}\right)\right)$ | $S\sin\theta\sin\left(\phi-\frac{\pi}{4}\right)$ | $\frac{\sqrt{6}}{3}S\sin\theta\left(\frac{\sqrt{2}}{2}\sin\left(\phi+\frac{\pi}{4}\right)-\cot\theta\right)$ |
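The mode decision described in this section can be sketched as follows. This is one possible reading of the prose: $\mathbf{M}_4$ is tried first, then the two-channel matrices whose projection test passes, with ties resolved by the summation vector nearest to the full input vector. The threshold value `THR_V`, derived here from a 2 dB level difference, and the function names are hypothetical.

```python
import math

R2, R3 = math.sqrt(2) / 2, math.sqrt(3) / 3
SUM_VECTORS = {1: (0.0, R2, R2), 2: (R2, 0.0, R2), 3: (R2, R2, 0.0), 4: (R3, R3, R3)}

# Hypothetical threshold: the cosine matching a 2 dB level difference in the stereo case.
THR_V = math.cos(math.atan(10 ** (2.0 / 20)) - math.pi / 4)


def cosine(a, b):
    """Normalized inner product; returns 0.0 if either vector is all-zero."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def projection(v0, i):
    """Two-channel projection V_0i: channel i (1-based) set to zero."""
    return tuple(0.0 if k == i - 1 else x for k, x in enumerate(v0))


def select_mode(v0, thr_v=THR_V):
    """Return the index (0..4) of the matrix to use for one subband."""
    if cosine(v0, SUM_VECTORS[4]) > thr_v:
        return 4
    close = [i for i in (1, 2, 3)
             if cosine(projection(v0, i), SUM_VECTORS[i]) > thr_v]
    if close:
        # Among the passing pairs, pick the summation vector nearest to V_0.
        return max(close, key=lambda i: cosine(v0, SUM_VECTORS[i]))
    return 0


if __name__ == "__main__":
    for v0 in [(1.0, 1.0, 1.0), (1.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 0.5)]:
        print(v0, "-> M_%d" % select_mode(v0))
```

Each subband needs only a handful of multiplications for this decision, in contrast to an eigen-decomposition per subband.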

## 3D-M/S psychoacoustic model

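Here $C_M$ is the first signal after transformation, and $C_S$ and $C_T$ are the second and third signals, respectively. After core codec quantization, independent noise, denoted $N_M$, $N_S$ and $N_T$, is introduced into the three signals. At the decoder side, the inverse transform is applied to these noisy signals, so the coding noise of each reconstructed channel is a combination of $N_M$, $N_S$ and $N_T$, and the masking thresholds of the transformed signals must be set so that this noise stays within the perceptual thresholds of the original channels. The same results can be deduced for the other matrices.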
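How quantization noise spreads back onto the original channels can be illustrated with a short sketch. It assumes an orthonormal transform, so that the decoder's inverse is the transpose, and independent noise on the three transformed signals; the matrix `M3` (sum and difference of the first two channels) and the helper names are assumptions, and the paper's actual modified masking thresholds are not reproduced here.

```python
import math

R2 = math.sqrt(2) / 2
M3 = [[R2, R2, 0.0], [-R2, R2, 0.0], [0.0, 0.0, 1.0]]  # assumed orthonormal matrix


def matvec(matrix, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def decode(matrix, coded):
    """Inverse transform; for an orthonormal matrix the inverse is the transpose."""
    return matvec(transpose(matrix), coded)


def channel_noise_power(matrix, noise_power):
    """Noise power per decoded channel for independent noise powers (N_M, N_S, N_T)."""
    return [sum(c * c * p for c, p in zip(row, noise_power))
            for row in transpose(matrix)]


if __name__ == "__main__":
    original = [0.3, 0.2, 0.1]
    noise = [0.01, -0.02, 0.03]
    coded = [y + n for y, n in zip(matvec(M3, original), noise)]
    print(decode(M3, coded))                         # original plus mixed noise
    print(channel_noise_power(M3, [1.0, 0.0, 0.5]))  # powers seen on C_1, C_2, C_3
```

In this sketch, noise injected into the summation signal is shared between the two channels it was formed from, which is why thresholds computed for the transformed signals cannot simply reuse those of a single original channel.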

## Framework for general channel configuration

Each channel $C$ corresponds to a loudspeaker. Here, a framework based on 3D-M/S coding is proposed, as shown in Figure 5. Because all spatially placed loudspeakers can be decomposed into basic triangle units, this structure enables 3D-M/S coding to work for arbitrary channel configurations. The framework processes the audio channels triangle by triangle until all channels are coded.

In each unit, $C_M$ is the summation channel, and $C_S$ and $C_T$ are the second and third channels, respectively. Every 3D-M/S unit shares two channels with the previous unit, and only one new channel is added. The unit therefore only needs to compress the output channel that contains the signal of the new channel. For all matrices $\mathbf{M}_0$, $\mathbf{M}_1$, $\mathbf{M}_2$, $\mathbf{M}_3$ and $\mathbf{M}_4$, this is $C_T$, the third channel after the 3D-M/S transform. Because every unit outputs only the one channel that contains a new input channel, the whole coding framework keeps the number of channels exactly the same as in the original input signals. And because the output channel contains either a difference signal or an original signal, coding gain can be obtained. At the decoder side, the original signals are recovered by multiplying by the inverse 3D-M/S transform matrix, subband by subband. This framework is also suitable for other methods. For example, replacing 3D-M/S with PCA lets the codec remove interchannel redundancy more effectively.
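The channel bookkeeping of the framework can be sketched as below. This is only one possible reading of the description: the first unit emits all three transformed signals, every later unit emits only its third signal, and the decoder recovers the new channel of each later unit from that signal and the two channels it already holds. The matrices, the triangle list and the function names are assumptions, and quantization is left out.

```python
import math

R2, R3, R6 = math.sqrt(2) / 2, math.sqrt(3) / 3, math.sqrt(6) / 6

# Assumed orthonormal matrices; the new channel of a unit is always its third input.
MATRICES = {
    0: [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    1: [[1, 0, 0], [0, R2, R2], [0, -R2, R2]],
    2: [[R2, 0, R2], [0, 1, 0], [-R2, 0, R2]],
    3: [[R2, R2, 0], [-R2, R2, 0], [0, 0, 1]],
    4: [[R3, R3, R3], [-R2, R2, 0], [R6, R6, -2 * R6]],
}


def matvec(matrix, v):
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


def encode(channels, triangles, modes):
    """Code one coefficient per channel; output has as many values as channels."""
    coded = []
    for k, (tri, mode) in enumerate(zip(triangles, modes)):
        out = matvec(MATRICES[mode], [channels[i] for i in tri])
        coded.extend(out if k == 0 else out[2:])  # later units emit only C_T
    return coded


def decode(coded, triangles, modes, n_channels):
    rec = [0.0] * n_channels
    pos = 0
    for k, (tri, mode) in enumerate(zip(triangles, modes)):
        m = MATRICES[mode]
        if k == 0:
            inverse = [list(col) for col in zip(*m)]  # transpose of orthonormal m
            for i, value in zip(tri, matvec(inverse, coded[:3])):
                rec[i] = value
            pos = 3
        else:
            a, b, new = tri
            row = m[2]  # C_T = row . (c_a, c_b, c_new); solve for c_new
            rec[new] = (coded[pos] - row[0] * rec[a] - row[1] * rec[b]) / row[2]
            pos += 1
    return rec


if __name__ == "__main__":
    channels = [0.5, 0.4, 0.45, 0.1, -0.2]
    triangles = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
    modes = [4, 1, 3]
    coded = encode(channels, triangles, modes)
    print(len(coded), decode(coded, triangles, modes, len(channels)))
```

Five input channels yield five coded signals in this sketch, matching the statement that the framework keeps the channel count unchanged.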

## Experiment

The experiment used five channels ($C_1$, $C_2$, $C_3$, $C_4$, $C_5$) of the spherical 22.2 multichannel configuration, as shown in Figure 4. Since PCA is theoretically the best decorrelation transform and independent channel coding is widely used for 22.2 multichannel compression, the experiment compared the proposed 3D-M/S method with PCA and independent channel coding in terms of bitrate, complexity and objective quality. Three MPEG test sequences (es01, voice; sc03, symphony music; si02, castanets, a transient signal; all mono with 48-kHz sampling) were used as moving virtual sources following the VBAP rule, and four sequences (si03, si01, sc01, es02) were used as discrete fixed-position virtual sources. The virtual sources and their azimuth and altitude panning angles were generated on a per-frame basis. Only point virtual sources were used to test the best performance of the three methods, since subband signals can be regarded as point sources in subband coding when the bandwidths are small enough. Signals with decorrelated elements are beyond the scope of the VBAP model and decrease the coding performance, because their difference signals retain high energy, which depends on the correlation and the energy of the decorrelated elements. Uncorrelated signals with independent audio content are tested at the end.

**Discrete virtual source setting and objective results**

| Method | $\theta$ | $\phi=\frac{7\pi}{32}$ | $\phi=\frac{5\pi}{32}$ | $\phi=\frac{3\pi}{32}$ | $\phi=\frac{\pi}{32}$ | Average |
|---|---|---|---|---|---|---|
| Ind | $\frac{11\pi}{36}$ | 20.76 | – | – | – | 20.76 |
| 3D-M/S | $\frac{11\pi}{36}$ | 21.45 | – | – | – | 21.45 |
| PCA | $\frac{11\pi}{36}$ | 21.95 | – | – | – | 21.95 |
| Ind | $\frac{13\pi}{36}$ | 15.37 | 21.14 | – | – | 18.26 |
| 3D-M/S | $\frac{13\pi}{36}$ | 16.07 | 21.88 | – | – | 18.98 |
| PCA | $\frac{13\pi}{36}$ | 16.54 | 22.74 | – | – | 19.64 |
| Ind | $\frac{15\pi}{36}$ | 23.48 | 23.28 | 22.08 | – | 22.95 |
| 3D-M/S | $\frac{15\pi}{36}$ | 23.77 | 22.93 | 22.54 | – | 23.08 |
| PCA | $\frac{15\pi}{36}$ | 23.86 | 23.78 | 23.49 | – | 23.71 |
| Ind | $\frac{17\pi}{36}$ | 21.82 | 19.06 | 18.93 | 19.36 | 19.79 |
| 3D-M/S | $\frac{17\pi}{36}$ | 21.88 | 18.52 | 19.02 | 19.51 | 19.73 |
| PCA | $\frac{17\pi}{36}$ | 21.92 | 21.36 | 19.18 | 20.57 | 20.76 |

3D-M/S and PCA were applied in each subband in the frequency domain. The three encoders were implemented on top of FAAC-1.28, and the decoders on top of FAAD2-2.7. AAC-LC was used as the core codec, and only the long window was enabled for simplicity. To avoid the influence of FAAC's dynamic bandwidth setting, the experiment fixed the bandwidth at 12 kHz with 35 subbands.

Independent channel coding: Audio signals were sent into the core codec and compressed directly.

3D-M/S: The vector was calculated using the subband energy of the three channels from the AAC psychoacoustic module, with no extra energy computation. Then 3D-M/S matrix switching was performed, and 3 bits were used per mode parameter. The transformed signals were sent into the core codec, and the masking threshold was modified accordingly.

PCA: The eigenvectors were calculated for each subband. Subband signals were transformed using the eigenvector matrix and then sent into the core codec. The covariance matrix was quantized and transmitted to the decoder following a previous KLT-based multichannel audio coding scheme [21], with 4 bits per non-redundant element.
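The side-information cost implied by these settings can be estimated with simple arithmetic. The sketch below assumes 1024-sample AAC-LC long-window frames at 48 kHz, one parameter set per subband per frame, and six non-redundant entries in a symmetric 3×3 covariance matrix; none of these update-rate assumptions is stated explicitly in the setup.

```python
SAMPLE_RATE = 48_000   # Hz, sampling rate of the test sequences
FRAME_LENGTH = 1024    # samples per AAC-LC long-window frame (assumed)
SUBBANDS = 35          # subbands at the fixed 12 kHz bandwidth


def side_info_kbps(bits_per_subband):
    """Side-information rate when parameters are refreshed every frame."""
    frames_per_second = SAMPLE_RATE / FRAME_LENGTH
    return bits_per_subband * SUBBANDS * frames_per_second / 1000


mode_kbps = side_info_kbps(3)     # one 3-bit mode parameter per subband
pca_kbps = side_info_kbps(6 * 4)  # 6 covariance entries, 4 bits each

if __name__ == "__main__":
    print(mode_kbps, pca_kbps)
```

Under these assumptions the estimates come out at about 4.92 and 39.4 kbps, close to the 4.9 and 39.3 kbps parameter rates quoted later in this section, which suggests (without confirming) a per-frame, per-subband parameter update.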

### Objective evaluation

In the sound pressure model used for the evaluation, $p(\omega)$ is the sound pressure at the listening point, $G$ is a proportionality coefficient, $k=\frac{\omega}{c}$ is the wave number and $c$ is the wave speed. $r$ is the distance from the loudspeaker to the listening point; in the spherical 22.2 multichannel configuration, all loudspeakers are at the same distance. The SNR is calculated from this sound pressure.

When the virtual source lies between two channels (between $C_1$ and $C_2$ around the 200th frame in Mov. 1; between $C_3$ and $C_4$ for all frames in Mov. 2; and, in Mov. 3, between $C_3$ and $C_4$ around the start frames, between $C_3$ and $C_5$ around the 100th frame and between $C_4$ and $C_5$ around the end frames), 3D-M/S achieves a higher SNR than independent channel coding and comes close to PCA. Moreover, around the 200th frame in Mov. 1 and Mov. 2, where the two or three channels are nearly the same, $\mathbf{M}_3$ and $\mathbf{M}_4$ remove redundancy to the largest extent and outperform the PCA method. This is because some transformed subband signals fell below the masking threshold, so more bits were reserved for the summation channel. The same results can be seen for the discrete virtual sources in Table 2 at $\phi =\frac{7\pi}{32}$ ($\phi \approx \frac{\pi}{4}$) and at ($\theta =\frac{11\pi}{36},\phi =\frac{7\pi}{32}$), ($\theta =\frac{13\pi}{36},\phi =\frac{5\pi}{32}$), ($\theta =\frac{15\pi}{36},\phi =\frac{3\pi}{32}$) and ($\theta =\frac{17\pi}{36},\phi =\frac{\pi}{32}$), where $\cot\theta \approx \sin\phi$. But when the virtual source lies away from the midpoint between two channels, such as at ($\theta =\frac{15\pi}{36},\phi =\frac{5\pi}{32}$), M/S coding cannot bring coding gain. In conclusion, if the input signals are located in one of the five switching areas, coding gain can be obtained by transforming them into summation and difference signals.

**Bitrate setup and overall SNR**

Quality | Bitrate/channel (kbps) | Average | ||||||||
---|---|---|---|---|---|---|---|---|---|---|

(SNR) | | | | | | | | | ||

Ind | 16.08 | 61.6 | 60.8 | 60.3 | 58.7 | 58.7 | – | – | – | 61.2 |

3D-M/S | 18.21 | 229.5 | 21.7 | 57.9 | 4.9 | 4.9 | 4.9 | 64.7 | ||

PCA | 18.87 | 133.2 | 25.2 | 42.8 | 39.3 | 39.3 | 39.3 | 63.8 |

The PCA parameter bitrate of 39.3 kbps/channel is considerably higher than that of 3D-M/S. If the three channels have little correlation (e.g. channels with different content or ambient sound), the transformed signals will not save any bits, and coding efficiency decreases. To test the three methods under this condition, the virtual sources of three different signals were fixed at three channels and all coded at 64 kbps. The experimental result is shown in Figure 6. Independent channel coding achieves the best performance in this case, while 3D-M/S degrades by about 1 dB and PCA by nearly 7 dB. This is because PCA spends too many bits on parameters, which here cannot bring any coding gain. For 3D-M/S, by contrast, the parameter bits for the modes amount to only 4.9 kbps/channel. This does not reduce the coding efficiency much at medium and high bitrates, which is the main application scenario of M/S coding. The high parameter bitrate of PCA can be alleviated by reducing the refresh rate of the PCA parameters, but this also decreases the coding performance on VBAP signals.

**Time complexity**

| Method | Encoder (s) | Decoder (s) | Ratio |
|---|---|---|---|
| Ind | 2.382 | 0.223 | 100.0% |
| 3D-M/S | 2.604 | 0.306 | 111.7% |
| PCA | 2.977 | 0.416 | 130.2% |
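The Ratio column appears to be the total encoder-plus-decoder time relative to independent channel coding. The short check below reproduces the listed percentages under that reading; the interpretation is an assumption, and the dictionary simply restates the timings from the table.

```python
# (encoder seconds, decoder seconds) as listed in the time complexity table
TIMES = {"Ind": (2.382, 0.223), "3D-M/S": (2.604, 0.306), "PCA": (2.977, 0.416)}


def ratio(method, baseline="Ind"):
    """Total coding time of `method` relative to the baseline method."""
    return sum(TIMES[method]) / sum(TIMES[baseline])


if __name__ == "__main__":
    for name in TIMES:
        print(name, "%.1f%%" % (100 * ratio(name)))
```

Read this way, 3D-M/S adds roughly 12% to the total coding time and PCA roughly 30%.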

### Subjective evaluation

From the above results on three point sources and uncorrelated signals, it can be observed that both the PCA and 3D-M/S methods obtain about 13% SNR improvement for each channel. However, 3D-M/S achieves similar performance at much lower complexity than PCA. This can be explained by regarding the fixed transform matrices as special cases of the PCA vectors. These special vectors are chosen on the assumption that channel signals are either quite similar or quite different. This assumption may not always hold given the diversity of subband signals, but it makes a good compromise between coding efficiency and complexity.

## Conclusion

This paper proposed a 3D-M/S coding method, which inherits the low complexity of conventional M/S coding. Moreover, 3D-M/S performs sum and difference coding triple by triple, rather than pair by pair as in the conventional method. This structure is better suited to a 3D multichannel audio configuration, because three adjacent channels form a triangle and have the maximum redundancy among spatially configured 3D audio channels. It is also convenient to unfold a 3D multichannel structure into plane triangles. With the proposed framework, the 3D-M/S and PCA methods can be applied to more than three channels. An experiment on VBAP signals demonstrates the performance of the proposed method, at relatively low complexity, compared with PCA and independent channel coding. Considering the development of 3D audio technology and its requirement for compression efficiency, a low-complexity 3D audio codec is promising and preferable for practical applications.

## Notes

### Acknowledgements

This work was supported by the National Natural Science Foundation of China (nos. 61231015, 61102127, 61201340, 61201169) and Natural Science Foundation of Hubei (nos. 2011CDB451, 2012FFB04205).

### References

1. Berkhout AJ, de Vries D, Vogel P: Acoustic control by wave field synthesis. *J. Acoust. Soc. Am.* 1993, 93(5):2764-2778. 10.1121/1.405852
2. Gerzon MA: Ambisonics in multichannel broadcasting and video. *J. Audio Eng. Soc.* 1985, 33(11):859-871.
3. Cooperstock J: Multimodal telepresence systems. *IEEE Signal Process. Mag.* 2011, 28:77-86.
4. Staff A: Multichannel audio systems and techniques. *J. Audio Eng. Soc.* 2005, 53(4):329-335.
5. Rumsey F: Cinema sound for the 3-D era. *J. Audio Eng. Soc.* 2013, 61(5):340-344.
6. Nettingsmeier J: Birds on the wire - WFS live transmission project report. *Tech. rep., Fraunhofer* 2008.
7. Sakaida S, Iguchi K, Nakajima N, Nishida Y, Ichigaya A, Nakasu E, Kurozumi M, Gohshi S: The super hi-vision codec. *IEEE International Conference on Image Processing, 2007. ICIP 2007, Volume 1* 2007, I-21–I-24.
8. Baumgarte F, Faller C: Binaural cue coding-part I: psychoacoustic fundamentals and design principles. *IEEE Trans. Speech Audio Process.* 2003, 11(6):509-519. 10.1109/TSA.2003.818109
9. Faller C, Baumgarte F: Binaural cue coding-part II: schemes and applications. *IEEE Trans. Speech Audio Process.* 2003, 11(6):520-531. 10.1109/TSA.2003.818108
10. Oomen W, Schuijers E, Brinker den B, Breebaart J: Advances in parametric coding for high-quality audio. *Audio Engineering Society Convention 114* 2003.
11. Breebaart J, van de Par S, Kohlrausch A, Schuijers E: Parametric coding of stereo audio. *EURASIP J. Adv. Sig. Pr.* 2005, 2005(9):561917.
12. Herre J, Disch S: New concepts in parametric coding of spatial audio: from SAC to SAOC. *2007 IEEE International Conference on Multimedia and Expo* 2007, 1894-1897.
13. Goodwin M, Jot J: Primary-ambient signal decomposition and vector-based localization for spatial audio coding and enhancement. *IEEE International Conference on Acoustics, Speech and Signal Processing, 2007. ICASSP 2007, Volume 1* 2007, I-9–I-12.
14. Cheng B, Ritz C, Burnett I: A spatial squeezing approach to ambisonic audio compression. *IEEE International Conference on Acoustics, Speech and Signal Processing, 2008. ICASSP 2008* 2008, 369-372.
15. Hellerud E, Solvang A, Svensson U: Spatial redundancy in Higher Order Ambisonics and its use for lowdelay lossless compression. *IEEE International Conference on Acoustics, Speech and Signal Processing, 2009. ICASSP 2009* 2009, 269-272.
16. Tzagkarakis C, Mouchtaris A, Tsakalides P: A multichannel sinusoidal model applied to spot microphone signals for immersive audio. *IEEE Trans. Audio Speech Lang. Process.* 2009, 17(8):1483-1497.
17. Pinto F, Vetterli M: Wave field coding in the spacetime frequency domain. *IEEE International Conference on Acoustics, Speech and Signal Processing, 2008. ICASSP 2008* 2008, 365-368.
18. Pinto F, Vetterli M: Space-time-frequency processing of acoustic wave fields: theory, algorithms, and applications. *IEEE Trans. Signal Process.* 2010, 58(9):4608-4620.
19. Cheng B: Spatial squeezing techniques for low bit-rate multichannel audio coding. *PhD thesis*. University of Wollongong 2011.
20. Cheng B, Ritz C, Burnett I, Zheng X: A general compression approach to multi-channel three-dimensional audio. *IEEE Trans. Audio Speech Lang. Process.* 2013, 21(8):1676-1688.
21. Yang D, Ai H, Kyriakakis C, Kuo CC: High-fidelity multichannel audio coding with Karhunen-Loeve transform. *IEEE Trans. Speech Audio Process.* 2003, 11(4):365-380. 10.1109/TSA.2003.814375
22. Johnston J, Ferreira A: Sum-difference stereo transform coding. *1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992. ICASSP-92, Volume 2* 1992, 569-572.
23. Liu CM, Lee WC, Hsiao YH: M/S coding based on allocation entropy. *Proceedings of the 6th International Conference on Digital Audio Effects (DAFx-03)* 2003.
24. Derrien O, Richard G: A new model-based algorithm for optimizing the MPEG-AAC in MS-Stereo. *IEEE Trans. Audio Speech Lang. Process.* 2008, 16(8):1373-1382.
25. Krueger H, Vary P: A new approach for low-delay joint-stereo coding. *2008 ITG Conference on Voice Communication (SprachKommunikation)* 2008, 1-4.
26. Schafer M, Vary P: Hierarchical multi-channel audio coding based on time-domain linear prediction. *2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO)* 2012, 2148-2152.
27. Neuendorf M, Multrus M, Rettelbach N, Fuchs G, Robilliard J, Lecomte J, Wilde S, Bayer S, Disch S, Helmrich C, Lefebvre R, Gournay P, Bessette B, Lapierre J, Kjorling K, Purnhagen H, Villemoes L, Oomen W, Schuijers E, Kikuiri K, Chinen T, Norimatsu T, Seng CK, Oh E, Kim M, Quackenbush S, Grill B: MPEG unified speech and audio coding-the ISO/MPEG standard for high-efficiency audio coding of all content types. In *Audio Engineering Society Convention 132*. Audio Engineering Society, 2012.
28. Multrus M, Neuendorf M, Lecomte J, Fuchs G, Bayer S, Robilliard J, Nagel F, Wilde S, Fischer D, Hilpert J, Rettelbach N, Helmrich C, Disch S, Geiger R, Grill B: MPEG unified speech and audio coding - bridging the gap. In *Microelectronic Systems*. Edited by: Heuberger A, Elst G, Hanke R. Berlin, Heidelberg: Springer Berlin Heidelberg; 2011:351-362.
29. Helmrich C, Carlsson P, Disch S, Edler B, Hilpert J, Neusinger M, Purnhagen H, Robilliard J, Villemoes L, Rettelbach N: Efficient transform coding of two-channel audio signals by means of complex-valued stereo prediction. *2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)* 2011, 497-500.
30. Liu CM, Hsu HW, Lee WC: Compression artifacts in perceptual audio coding. *IEEE Trans. Audio Speech Lang. Process.* 2008, 16(4):681-695.
31. Zotter F, Frank M: All-round ambisonic panning and decoding. *J. Audio Eng. Soc.* 2012, 60(10):807-820.
32. Ando A, Sugimoto T, Irie K: Coding of 22.2 multichannel audio signal by MPEG-AAC. *IEICE Tech. Rep., Volume 113 of EA2013-46* 2013, 75-80.
33. Ando A: Conversion of multichannel sound signal maintaining physical properties of sound in reproduced sound field. *IEEE Trans. Audio Speech Lang. Process.* 2011, 19(6):1467-1475.
34. ITU-T: *Method for the subjective assessment of intermediate sound quality (MUSHRA)*. 2001.

## Copyright information

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.