# Estimation of Interchannel Time Difference in Frequency Subbands Based on Nonuniform Discrete Fourier Transform


## Abstract

Binaural cue coding (BCC) is an efficient technique for spatial audio rendering that uses side information such as the interchannel level difference (ICLD), interchannel time difference (ICTD), and interchannel correlation (ICC). Of the side information, the ICTD plays an important role in shaping the auditory spatial image. However, inaccurate estimation of the ICTD may degrade the audio quality. In this paper, we develop a novel ICTD estimation algorithm based on the nonuniform discrete Fourier transform (NDFT) and integrate it with the BCC approach to improve the decoded auditory image. Furthermore, a new subjective assessment method is proposed for evaluating the auditory image widths of decoded signals. The test results demonstrate that the NDFT-based scheme achieves a much wider and more externalized auditory image than the existing BCC scheme based on the discrete Fourier transform (DFT). It is also found that the present technique, regardless of the image width, does not deteriorate the sound quality at the decoder compared to the traditional scheme without ICTD estimation.

### Keywords

Discrete Fourier Transform, Side Information, Interaural Time Difference, Image Width, Audio Quality

## 1. Introduction

Since 1990, joint stereo coding algorithms have been widely used in two-channel audio coding, and various techniques have been developed for compressing stereo or multichannel audio signals. Recently, the ISO/MPEG standardization group has published a new audio standard, MPEG Surround, which is a feature-rich open standard compression technique for multichannel audio signals [1]. MPEG Surround coding can be regarded as an enhancement of joint stereo coding and an extension of BCC [2, 5]. BCC exploits binaural cue parameters to capture the spatial image of multichannel audio and enables low-bit-rate transmission by transmitting a mono signal plus side information related to binaural perception.

When the BCC scheme is applied to loudspeaker playback or amplitude-panned signals, the time difference cue contributes little to widening and externalizing the auditory image. Furthermore, introducing the ICTD may result in poor audio quality if it is handled improperly. Thus, ICTD panning is less commonly used than ICLD panning. However, for binaural recordings or signals filtered with head-related transfer functions (HRTFs), time difference cues contribute much to a higher audio quality [8]. In particular, at frequencies below about 1–1.5 kHz, the ICTD is an important binaural cue for headphone playback [7]. The subjective test in Section 5 validates that, with the ICTD, the spatial image can be widened significantly and a better overall quality achieved compared to the BCC scheme without the time difference cue.

The generic BCC scheme estimates the ICTD in frequency subbands partitioned according to psychoacoustic critical bands [9]. When the DFT is used for the time-to-frequency transform, the frequency bins are uniformly spaced, whereas the subbands are much narrower at low frequencies than at high frequencies; the low-frequency subbands therefore contain only a few bins. For human auditory perception, however, the spatial cues contained in low-frequency subbands are more important than those in high-frequency subbands. The DFT method may thus fail to analyze subband properties properly, so that the BCC scheme with ICTD estimation is unable to improve the audio quality and may even deteriorate it.

An alternative solution is to employ the nonuniform discrete Fourier transform (NDFT). The advantage of the NDFT is that the location of the frequency bins can be adjusted as required. In this paper, we propose a novel NDFT-based method that estimates the ICTD more accurately than the DFT-based solutions. First, a subband factor is calculated to evaluate the degree of coherence between the two channels and to decide whether it is necessary to estimate the ICTD. A new subjective test, drawing on [8], is designed to assess the proposed BCC scheme, and the results are in accordance with expectations. The rest of this paper is organized as follows. Section 2 introduces the concept of the NDFT. Section 3 discusses ICTD estimation based on the DFT. Section 4 presents the improved ICTD estimation. Subjective tests are described in Section 5. Finally, a brief conclusion is drawn in Section 6.

## 2. Concept of NDFT

### 2.1. Introduction

The traditional DFT is obtained by sampling the continuous frequency domain at *N* points evenly spaced around the unit circle in the *z*-plane. Therefore, if the temporal sample rate and *N* are fixed, all the frequency points are uniformly distributed from zero to the temporal sample rate. From the point of view of human auditory perception, the main drawback of this approach is the equally spaced frequency grid, which forces the transform to cover the whole spectrum at a single resolution fixed by the sampling rate and *N*.

Different from the uniform DFT, the NDFT enables the analysis of arbitrary frequency ranges with irregular intervals. The *N* frequency points of the NDFT are nonuniformly spaced around the unit circle in the *z*-plane. By choosing the frequency points appropriately, the NDFT can change the distribution of frequency points in different subbands. It is possible to increase the number of frequency points in low-frequency bands and accordingly decrease the number in high-frequency bands. The improved frequency accuracy may be helpful for spatial hearing.

### 2.2. Definition of NDFT

The NDFT of a sequence $x(n)$ is obtained by sampling its Fourier transform at arbitrarily chosen frequencies:

$$X(\omega_k) = \sum_{n=0}^{N-1} x(n)\, e^{-j\omega_k n}, \quad k = 0, 1, \ldots, K-1,$$

where $K$ and $N$ are the number of frequency sampling points and temporal sampling points, respectively. $\omega_k$ may be any real number between 0 and $2\pi$. The difference between the DFT and the NDFT lies mainly in the manner of frequency sampling, that is, in the selection of $\omega_k$.
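As an illustration of sampling the spectrum at freely chosen frequencies, the following Python sketch evaluates the sum $\sum_n x(n) e^{-j\omega n}$ directly. The function names `ndft` and `uniform_grid`, the 8-sample test tone, and the frequency grid are assumptions made for this example and are not taken from the paper.

```python
import cmath
import math


def ndft(x, omegas):
    """Evaluate X(w) = sum_n x[n] * exp(-j * w * n) at each angle in omegas."""
    return [
        sum(xn * cmath.exp(-1j * w * n) for n, xn in enumerate(x))
        for w in omegas
    ]


def uniform_grid(n_points):
    """Angles of the ordinary DFT: w_k = 2*pi*k / n_points."""
    return [2.0 * math.pi * k / n_points for k in range(n_points)]


# A tone whose frequency falls halfway between DFT bins 1 and 2 (N = 8).
N = 8
w_tone = 2.0 * math.pi * 1.5 / N
x = [math.cos(w_tone * n) for n in range(N)]

# Uniform sampling: the tone energy leaks into the neighbouring bins.
dft_mag = [abs(v) for v in ndft(x, uniform_grid(N))]

# Nonuniform sampling: a frequency point may sit exactly on the tone.
ndft_mag = [abs(v) for v in ndft(x, [w_tone])]
```

With a uniform grid the function reduces to the ordinary DFT, while a single well-placed nonuniform point captures the off-grid tone with a larger magnitude than any DFT bin. Direct evaluation costs on the order of $KN$ operations, so this form is meant for clarity rather than speed.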

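The NDFT of a sequence $x(n)$ of length $N$ is defined as [10]

$$X(z_k) = \sum_{n=0}^{N-1} x(n)\, z_k^{-n}, \quad k = 0, 1, \ldots, N-1,$$

where $z_0, z_1, \ldots, z_{N-1}$ are $N$ distinct points nonuniformly spaced around the unit circle in the $z$-plane. The matrix form of the NDFT is

$$\mathbf{X} = \mathbf{D}\,\mathbf{x}, \qquad \mathbf{D} = \begin{bmatrix} 1 & z_0^{-1} & \cdots & z_0^{-(N-1)} \\ 1 & z_1^{-1} & \cdots & z_1^{-(N-1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & z_{N-1}^{-1} & \cdots & z_{N-1}^{-(N-1)} \end{bmatrix}.$$

The matrix $\mathbf{D}$ is a Vandermonde matrix determined by the choice of the $N$ points $z_k$. As the $N$ points $z_k$ are distinct, the determinant of $\mathbf{D}$ is nonzero. Therefore, the inverse NDFT exists and is unique, and $\mathbf{x}$ can be calculated by $\mathbf{x} = \mathbf{D}^{-1}\mathbf{X}$.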
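The existence of the inverse for distinct points can be checked numerically. The sketch below builds the Vandermonde matrix for a small, hypothetical set of six angles, applies the forward transform, and recovers the input by solving the linear system with Gaussian elimination. The helper names, the angles, and the test sequence are illustrative assumptions, not values from the paper.

```python
import cmath


def ndft_matrix(omegas, n_samples):
    """Vandermonde matrix with D[k][n] = z_k**(-n), where z_k = exp(j*w_k)."""
    return [[cmath.exp(-1j * w * n) for n in range(n_samples)] for w in omegas]


def matvec(D, x):
    """Matrix-vector product D x."""
    return [sum(d * v for d, v in zip(row, x)) for row in D]


def solve(D, X):
    """Solve D x = X by Gaussian elimination with partial pivoting."""
    n = len(X)
    A = [list(row) + [rhs] for row, rhs in zip(D, X)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(A[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (A[r][n] - s) / A[r][r]
    return x


# Six distinct angles, denser near zero frequency (hypothetical choice).
omegas = [0.0, 0.2, 0.5, 1.0, 2.0, 4.0]
x = [1.0, -2.0, 0.5, 3.0, 0.0, -1.0]

D = ndft_matrix(omegas, len(x))
X = matvec(D, x)          # forward NDFT, X = D x
x_rec = solve(D, X)       # inverse NDFT, x = D^{-1} X
max_err = max(abs(a - b) for a, b in zip(x, x_rec))
```

For this small example the reconstruction error stays near machine precision. Vandermonde matrices can become ill-conditioned as the number of points grows or as points cluster, so a direct solve of this kind is best treated as a demonstration rather than a recipe for a 1024-point transform.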

## 3. ICTD Estimation

### 3.1. Time-to-Frequency Transform

In our schemes, the frame length is 896 samples and the zero padding is 64 samples on each side. Thus, a 1024-point DFT (896 + 2 × 64 = 1024) is carried out to obtain the frequency data. It should be noted that all the selected signals have the same sampling rate of 44.1 kHz.

After the time-to-frequency transform, the two-channel signals are downmixed into a mono sum signal. Meanwhile, the BCC cues are estimated in frequency subbands. According to spatial hearing theory, a nonuniform partition of subbands is chosen. As the spectrum is symmetric, only the first $N/2 + 1$ (513 for a 1024-point DFT) spectral bins are divided into subbands. In this paper, we use 27 subbands to approximate the psychoacoustic critical bands. Table 1 shows the number of spectral bins and the index of the first spectral bin in each subband.

Table 1(a): Number of spectral bins in each subband

Subbands | B1 | B2 | B3 | B4 | B5 | B6 | B7 | B8 | B9 |
---|---|---|---|---|---|---|---|---|---|
Number of bins | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 4 |
Subbands | B10 | B11 | B12 | B13 | B14 | B15 | B16 | B17 | B18 |
Number of bins | 4 | 4 | 4 | 4 | 6 | 6 | 8 | 8 | 12 |
Subbands | B19 | B20 | B21 | B22 | B23 | B24 | B25 | B26 | B27 |
Number of bins | 16 | 16 | 20 | 28 | 36 | 64 | 64 | 80 | 112 |

Table 1(b): Index of the first spectral bin in each subband

Subbands | B1 | B2 | B3 | B4 | B5 | B6 | B7 | B8 | B9 |
---|---|---|---|---|---|---|---|---|---|
First bin index | 0 | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 |
Subbands | B10 | B11 | B12 | B13 | B14 | B15 | B16 | B17 | B18 |
First bin index | 20 | 24 | 28 | 32 | 36 | 42 | 48 | 56 | 64 |
Subbands | B19 | B20 | B21 | B22 | B23 | B24 | B25 | B26 | B27 |
First bin index | 76 | 92 | 108 | 128 | 156 | 192 | 256 | 320 | 400 |
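The two parts of Table 1 are linked by a running sum: the first bin of each subband is the total number of bins in the preceding subbands. The short Python sketch below reproduces the indices from the bin counts and converts them to approximate lower band edges in Hz, using the 44.1 kHz sampling rate and the 1024-point transform stated in the text. The variable and function names are illustrative.

```python
# Bin counts per subband, copied from Table 1(a).
BINS_DFT = [2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 6, 6, 8, 8, 12,
            16, 16, 20, 28, 36, 64, 64, 80, 112]

SAMPLE_RATE_HZ = 44100.0
N_FFT = 1024


def first_bin_indices(bin_counts):
    """Running sum of bin counts: index of the first bin of each subband."""
    indices, start = [], 0
    for count in bin_counts:
        indices.append(start)
        start += count
    return indices


bin_spacing_hz = SAMPLE_RATE_HZ / N_FFT      # about 43.07 Hz per bin
starts = first_bin_indices(BINS_DFT)         # matches the indices of Table 1(b)
lower_edges_hz = [s * bin_spacing_hz for s in starts]
```

With a spacing of about 43 Hz, each of the eight lowest subbands spans only two bins (roughly 86 Hz), which is the sparsity that motivates the NDFT-based scheme in Section 4.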

### 3.2. ICTD Estimation

Using (9) and (11), we can estimate the ICTD.
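Equations (9) and (11) are not reproduced here, so the following Python sketch should not be read as the authors' estimator. It illustrates one common way to obtain a subband time difference, assuming that a pure delay appears as a linear phase in the cross-spectrum of the two channels and that the phase does not wrap inside the subband. The function names, the 64-sample test frame, and the 3-sample circular delay are all hypothetical.

```python
import cmath
import math
import random


def dft(x):
    """Direct O(N^2) DFT, adequate for a short illustrative frame."""
    n = len(x)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(x))
        for k in range(n)
    ]


def subband_delay(X1, X2, bins, n_fft):
    """Least-squares slope (through the origin) of the cross-spectrum phase.

    Returns the delay of channel 2 relative to channel 1 in samples,
    assuming the phase does not wrap inside the subband.
    """
    num = 0.0
    den = 0.0
    for k in bins:
        phase = cmath.phase(X1[k] * X2[k].conjugate())
        num += k * phase
        den += k * k
    return (num / den) * n_fft / (2.0 * math.pi)


random.seed(0)
N = 64
left = [random.uniform(-1.0, 1.0) for _ in range(N)]
DELAY = 3
right = left[-DELAY:] + left[:-DELAY]   # right[n] = left[(n - 3) mod N]

est = subband_delay(dft(left), dft(right), bins=range(1, 5), n_fft=N)
```

Because the slope is fitted over the bins of one subband, a subband with only two bins gives the fit very little data to work with, which is consistent with the concern raised in Section 4 about low-frequency subbands.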

## 4. NDFT-Based ICTD Estimation

From Table 1 in Section 3, it can be seen that the number of spectral bins differs greatly between subbands. There are more spectral bins in high-frequency subbands than in low-frequency subbands. Thus, the ICTD estimated in low-frequency subbands may be inaccurate because too few spectral bins are available. Moreover, when the left and right channels are not coherent, so that no meaningful time difference exists between them, the ICTD may still be estimated as a nonzero value. Here, an NDFT-based method is proposed to improve the ICTD estimation.

For the convenience of comparison, our NDFT-based scheme is also based on a 1024-point transform. In the NDFT method, spectral bins are also partitioned into 27 subbands as shown in Table 2(a). Obviously, the number of spectral bins in each subband is different from the DFT scheme. Correspondingly, the index of the first spectral bin in each subband is given in Table 2(b).

Table 2(a): Number of spectral bins in each subband

Subbands | B1 | B2 | B3 | B4 | B5 | B6 | B7 | B8 | B9 |
---|---|---|---|---|---|---|---|---|---|
Number of bins | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 16 |
Subbands | B10 | B11 | B12 | B13 | B14 | B15 | B16 | B17 | B18 |
Number of bins | 16 | 16 | 16 | 16 | 24 | 24 | 32 | 32 | 12 |
Subbands | B19 | B20 | B21 | B22 | B23 | B24 | B25 | B26 | B27 |
Number of bins | 16 | 16 | 20 | 14 | 18 | 32 | 32 | 40 | 56 |

Table 2(b): Index of the first spectral bin in each subband

Subbands | B1 | B2 | B3 | B4 | B5 | B6 | B7 | B8 | B9 |
---|---|---|---|---|---|---|---|---|---|
First bin index | 0 | 8 | 16 | 24 | 32 | 40 | 48 | 56 | 64 |
Subbands | B10 | B11 | B12 | B13 | B14 | B15 | B16 | B17 | B18 |
First bin index | 80 | 96 | 112 | 128 | 144 | 168 | 192 | 224 | 256 |
Subbands | B19 | B20 | B21 | B22 | B23 | B24 | B25 | B26 | B27 |
First bin index | 268 | 284 | 300 | 320 | 334 | 352 | 384 | 416 | 456 |

where the factor scaling the frequency spacing of the bins is chosen as 1/4 in low-frequency bands, 1 in middle-frequency bands, and 2 in high-frequency bands. The ICTD in each subband can then be estimated using (10).
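The bin counts in Table 2(a) are consistent with dividing the DFT bin counts of Table 1(a) by a factor of 1/4, 1, or 2. The Python sketch below checks this. The assignment of B1 to B17 as low, B18 to B21 as middle, and B22 to B27 as high bands is inferred from the two tables and is an assumption, since the text does not state the band boundaries. All names are illustrative.

```python
from fractions import Fraction

# Bin counts per subband for the uniform DFT, copied from Table 1(a).
BINS_DFT = [2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 6, 6, 8, 8, 12,
            16, 16, 20, 28, 36, 64, 64, 80, 112]

# Spacing factor per subband. The band boundaries (B1-B17 low, B18-B21
# middle, B22-B27 high) are inferred from Tables 1 and 2, not from the text.
FACTORS = [Fraction(1, 4)] * 17 + [Fraction(1)] * 4 + [Fraction(2)] * 6


def rescale_bins(bin_counts, factors):
    """A spacing factor f turns c uniformly spaced bins into c / f bins."""
    return [int(c / f) for c, f in zip(bin_counts, factors)]


bins_ndft = rescale_bins(BINS_DFT, FACTORS)

# First bin index of each subband, as a running sum of the new bin counts.
starts_ndft, start = [], 0
for count in bins_ndft:
    starts_ndft.append(start)
    start += count
total_bins = start
```

Under this reading, the NDFT spends four times as many frequency points on the seventeen lowest subbands and half as many on the six highest, while the total over all subbands stays at 512.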

An empirical value of 0.5 is chosen for the threshold. If the subband factor is larger than 0.5, the ICTD is calculated as side information. Otherwise, the ICTD is not considered in the NDFT-based scheme.
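The decision rule can be sketched as follows. The definition of the subband factor is not reproduced above, so the normalized cross-spectrum magnitude used here is a stand-in chosen for illustration and is not necessarily the authors' measure. Only the threshold of 0.5 comes from the text. The function names and the toy spectra are hypothetical.

```python
import math

COHERENCE_THRESHOLD = 0.5  # empirical value quoted in the text


def coherence_factor(X1, X2, bins):
    """Normalized cross-spectrum magnitude over a subband, in [0, 1].

    Stand-in for the paper's subband factor (an assumption of this sketch).
    """
    cross = sum(X1[k] * X2[k].conjugate() for k in bins)
    p1 = sum(abs(X1[k]) ** 2 for k in bins)
    p2 = sum(abs(X2[k]) ** 2 for k in bins)
    if p1 == 0.0 or p2 == 0.0:
        return 0.0  # an empty subband carries no usable time difference
    return abs(cross) / math.sqrt(p1 * p2)


def should_estimate_ictd(X1, X2, bins, threshold=COHERENCE_THRESHOLD):
    """Estimate the ICTD only when the channels are coherent enough."""
    return coherence_factor(X1, X2, bins) > threshold


# Toy subband spectra: a scaled copy is fully coherent, the other is not.
ref = [1 + 0j, 0.5j, -1 + 0j]
scaled_copy = [2 * v for v in ref]
unrelated = [1 + 0j, 0j, 1 + 0j]

coherent = should_estimate_ictd(ref, scaled_copy, range(3))
incoherent = should_estimate_ictd(ref, unrelated, range(3))
```

Gating the estimate in this way keeps a spurious nonzero ICTD from being transmitted for subbands in which the two channels are essentially unrelated.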

## 5. Subjective Test

### 5.1. Test Design

Subjective tests are conducted following the guidelines of ITU-R BS.1116 and ITU-R BS.1534 [11, 12]. Twelve volunteers from our group (3 females and 9 males) participated as subjects in the test. Having been trained, most of them are experienced listeners. The playback device is a TAKSTAR TS-610 headphone connected to an external sound card (Creative 24-bit Sound Blaster Live).

Eight different kinds of 2-channel stereo audio excerpts are selected as the test material. All of them present a wide auditory image, and most of them are binaural or 3D audio, so that any change in the auditory image is easy for subjects to perceive. Each excerpt is presented in four versions, including the unchanged reference, as follows:

Case A: reference, identical to the original excerpt but hidden in the test;

Case B: DFT-based ICTD estimation, that is, BCC analysis and synthesis with ICLD, ICC, and DFT-based ICTD;

Case C: NDFT-based ICTD estimation, that is, BCC analysis and synthesis with ICLD, ICC, and NDFT-based ICTD;

Case D: without ICTD, that is, BCC analysis and synthesis with ICLD and ICC only.

Grades and scales.

Grade | Overall quality |
---|---|
5 | No difference |
4 | Slight difference, not annoying |
3 | Slightly annoying |
2 | Annoying |
1 | Very annoying |

### 5.2. Image Width Evaluation

where Open image in new window is valued within the range from 0 to 5. As the reference has the widest auditory image, the value of Open image in new window is not 5, and Open image in new window is always less than Open image in new window . Moreover, as the corresponding auditory events are chosen at the same temporal point, the left and right channels would not be interchanged, which is confirmed in our tests. The use of (14) makes it convenient for image width assessment as well as the audio quality evaluation.

### 5.3. Results and Discussions

It can be seen from Figure 6 that the scheme without ICTD yields the worst auditory image, with the lowest scores for all excerpts, because the synthesis process compresses the image width of the original signals. The wider 95% confidence intervals indicate that the auditory image widths of the excerpts are difficult for subjects to perceive. The scores of the NDFT-based scheme are the highest among the three processed schemes. These excerpts are closest to the original signals and have a wider auditory image than the other two kinds of excerpts. This suggests that the NDFT-based ICTD estimation is more accurate than the DFT-based one, as expected. The average scores in the four cases are depicted in the right part of Figure 6, where the value "9" on the abscissa represents "average." The average scores for the NDFT-based scheme (case C), the DFT-based scheme (case B), and the scheme without ICTD (case D) are 4.3, 3.9, and 3.2, respectively, which supports the conclusion that the NDFT-based scheme is superior to the DFT-based scheme.
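For readers who wish to reproduce this kind of summary, the Python sketch below computes a mean score and the half-width of a 95% confidence interval for a panel of 12 listeners. The paper does not state how its intervals were computed, so the use of the Student-t quantile for 11 degrees of freedom is an assumption, and the grades are made-up values for illustration only, not data from the test.

```python
import math
import statistics

T_975_DF11 = 2.201  # two-sided 95% Student-t quantile, 11 degrees of freedom


def mean_and_ci95(scores, t_quantile=T_975_DF11):
    """Mean and half-width of the 95% confidence interval of the mean."""
    mean = statistics.mean(scores)
    half_width = t_quantile * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


# Hypothetical grades from 12 listeners for one excerpt (not from the paper).
scores = [4, 4, 5, 3, 4, 4, 5, 4, 3, 4, 4, 4]
mean, half_width = mean_and_ci95(scores)
```

A wider interval for a given scheme means that the listeners disagreed more about its image width, which is how the interval widths in Figure 6 are interpreted above.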

Results for the overall quality evaluation are shown in Figure 7. In general, the scheme without ICTD may have the best audio quality if image width is disregarded. However, it may change the auditory image, and the decoded audio does not retain an ambient image, which affects the perceived quality and leads to a significant difference from the original audio. Therefore, ICTD estimation should be adopted in BCC schemes to improve the overall quality when image width is considered. It is clear from Figure 7 that the scheme without ICTD has the lowest scores, with an average value of 2.3, whereas the BCC schemes with DFT-based or NDFT-based ICTD estimation score higher. Moreover, the NDFT-based scheme yields higher scores than the DFT-based scheme except for excerpt 4. The right part of Figure 7 shows that the average scores for the DFT-based scheme and the NDFT-based scheme are 3.6 and 4.1, respectively. The NDFT-based scheme thus outperforms the DFT-based scheme and is the best of the three in terms of both audio quality and image width.

## 6. Conclusion

This paper presents a novel algorithm that estimates the interchannel time difference by using the nonuniform discrete Fourier transform. Integrated with the binaural cue coding approach, the algorithm allows the frequency bins to be placed as required. Consequently, the decoded audio image width is improved compared to the traditional DFT-based method. At the same time, adding this module to the BCC scheme does not deteriorate the sound quality.

A subjective test was designed and carried out. The evaluation results indicate that the NDFT-based ICTD scheme is the best of the tested schemes in terms of both audio image width and audio quality.

## Notes

### Acknowledgment

This research was partially supported by the National Natural Science Foundations of China under Grants no. 10474115 and no. 60535030.

### References

- 1.
- 2. Herre J: **From joint stereo to spatial audio coding—recent progress and standardization.** *Proceedings of the 7th International Conference on Digital Audio Effects (DAFx '04), October 2004, Naples, Italy* 157-162.
- 3. Faller C, Baumgarte F: **Efficient representation of spatial audio using perceptual parametrization.** *Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '01), October 2001, New Paltz, NY, USA* 199-202.
- 4. Faller C, Baumgarte F: **Binaural cue coding—part II: schemes and applications.** *IEEE Transactions on Speech and Audio Processing* 2003, **11**(6):520-531. 10.1109/TSA.2003.818108
- 5. Faller C: **Coding of MPEG Surround compatible with different playback formats.** *Proceedings of the 117th Convention of the Audio Engineering Society (AES '04), October 2004, San Francisco, Calif, USA*
- 6. Blauert JP: *Spatial Hearing*. MIT Press, Cambridge, Mass, USA; 1997.
- 7. Faller C: *Parametric coding of spatial audio, Ph.D. thesis*. Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; 2004.
- 8. Tournery C, Faller C: **Improved time delay analysis/synthesis for parametric stereo audio coding.** *Proceedings of the 120th Convention of the Audio Engineering Society (AES '06), May 2006, Paris, France*
- 9. Moore BCJ: *An Introduction to the Psychology of Hearing*. 5th edition. Academic Press, London, UK; 2003.
- 10. Bagchi S, Mitra SK: *The Nonuniform Discrete Fourier Transform and Its Applications in Signal Processing*. Kluwer Academic Publishers, Boston, Mass, USA; 1999.
- 11. Rec. ITU-R BS.1116-1: **Methods for the subjective assessment of small impairments in audio systems including multi-channel sound systems.** 1997.
- 12. Rec. ITU-R BS.1534-1: **Method for the subjective assessment of intermediate quality levels of coding systems.** 2003.

## Copyright information

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.