Abstract
A brief description and comparison of various video/image coding standards such as JPEG, the MPEG and H.26x series, DIRAC (Chap. 7), AVS China (Chap. 3) and VC-1 (Chap. 8) is presented. A similarly brief description of audio coding, followed by performance comparison metrics, concludes the chapter.
1.1 Popular Video and Audio Standards
The combination of multiple sources of video, audio, image and text is usually known as multimedia. The requirements for multimedia communications have increased rapidly in the last two decades in broad areas such as television, entertainment, interactive services, telecommunications, conferencing, the Internet, consumer electronic devices, medicine, security, business, traffic, defense and banking. Usually, video and audio data have to be compressed before storage or transmission because the original data volume is too large, and the compressed data must be decoded before display or further processing. Compression is also referred to as encoding or coding, and decompression as decoding. The software or device that compresses and decompresses video/audio data is therefore called a video/audio encoder (coder) or decoder, respectively; for convenience, the encoder and decoder pair is abbreviated as a codec. Although plenty of video and audio coding algorithms have been developed, it is the video and audio coding standards [B7], which guarantee interoperability between software and hardware provided by multiple vendors, that make multimedia communications practical. Series of video and audio coding standards have been developed by Standards Development Organizations (SDO), including ISO/IEC (the International Organization for Standardization and the International Electrotechnical Commission) [H53] [H54], ITU-T (the Telecommunication Standardization Sector of the International Telecommunication Union, formerly CCITT) [H55], SMPTE (Society of Motion Picture and Television Engineers) [C31], AVS China (the Audio and Video coding Standard of China) [AS1] and the BBC with Dirac [D1] [D5], as well as by well-known companies including Microsoft [C32], Real Networks [R3] and On2 Technologies [P11].
ISO/IEC has developed several video and audio standards including MPEG-1 (ISO/IEC 11172) [S2], MPEG-2 (ISO/IEC 13818) [S3] and MPEG-4 (ISO/IEC 14496) [S8]. ITU-T has also developed several standards, but unlike MPEG, its video and audio standards are separate. ITU-T H.261 [S4], H.262 [S3] (see Note 1), H.263 [S5], H.263+ (H.263 Version 2) [S6] [S7], H.26L [PC1] and H.264 [S10] [H23] are designed for video, while ITU-T G.723.1 [O14] and G.729 [O15] are for audio. Besides these standards, Video Coder 1 (VC-1) [C11] [C14] and Video Coder 2 (VC-2) [C6] by SMPTE, Windows Media Video 9 [C1] [C2] by Microsoft, VP6 [P3] [P5] [P6] and VP7 [P4] by On2 Technologies, Dirac [D1] [D5] by the BBC, and Real Video 9 and Real Video 10 [R1] [R2] by Real Networks are also popular video standards on the Internet and personal computers. In recent years, AVS China [A2, A10, A59–A66] has attracted a great deal of attention from industries around the world related to television, multimedia communications and even chip manufacturing. This standard covers four main technical areas, namely systems, video, audio and digital rights management (DRM), together with some supporting documents such as consistency verification. The second part of the standard, known as AVS1-P2 [A2] (video–Jizhun), was approved as a national standard of China in 2006, and several final drafts of the standard have been completed, including AVS1-P1 (systems) [A1], AVS1-P2 (video–Jiaqiang) [A3], AVS1-P3 (audio) [A4] and AVS1-P7 (mobile video) [A74].
AVS China provides optimized coding performance with the lowest total cost, including transmission and storage cost, implementation cost and intellectual property rights (IPR) cost, because the AVS Working Group took the IPR cost of the adopted technologies into account from the beginning of the standardization process. There are two aspects of IPR cost. One is the IPR for the content itself, which is outside the scope of audio and video coding standards. The other is the IPR cost of the technologies used in the audio and video coding standards.
Some recent research results on AVS China are reported in a special issue of the journal “Signal Processing: Image Communication” [A58–A66]. An overview of the video part of AVS is given in [A59], which also describes the available coding tools and gives examples of the application-driven profiles defined in AVS. In [A60], two context-based entropy coding schemes for the AVS video coding standard are presented: context-based 2D variable length coding (C2DVLC), a low-complexity entropy coding scheme for the AVS Part-2 Jizhun (base) profile, and context-based binary arithmetic coding (CBAC), an enhanced entropy coding scheme for the AVS Part-2 Jiaqiang (enhanced) profile. In [A61], a sub-pixel interpolation filter known as combined adaptive-fixed interpolation (CAFI) with multi-directional filters is proposed to obtain good coding efficiency with low computational complexity. In addition, implementations [A64] [A65], a reconfigurable video coding (RVC) framework [A62], trick modes [A63] and a robust dual watermarking algorithm [A66] are also discussed in this issue.
Popular video standards are listed in Table 1.1, and comparisons of the algorithmic elements of these standards are given in Table 1.2. Table 1.1 is adapted, with minor changes, from T. Ebrahimi and M. Kunt, “Visual data compression for multimedia applications”, Proc. IEEE, vol. 86, pp. 1109–1125, June 1998 [G1]. Recent standards such as H.264/MPEG-4 Part 10, Dirac, AVS China, JPEG-LS, JPEG-XR, JBIG, VC-1 (SMPTE), VC-2, HEVC/NGVC and VP6 (now VC10) have been added.
1.2 Digital Representation of Video
As video is used to record and/or show moving objects, it is composed of a sequence of pictures taken at regular temporal intervals. The number of frames (pictures) per second is called the frame rate. Frame rates below 10 frames per second (fps) are sometimes used for very low bit-rate (below 64 kbps) video communications, while between 10 and 20 fps is more typical for low bit-rate video communications. Sampling at 25 or 30 frames per second is standard for television pictures; 50 or 60 frames per second produces smooth apparent motion [B8].
Video can be divided into analog and digital types. Analog video is represented by an analog signal, which is captured by progressive or interlaced scanning using an analog camera. An example of analog video is the signal used in analog television systems such as PAL [G12] and NTSC [G13] [G14]. Digital video is often captured with a digital camera, although it can also be converted from an analog video signal. In a digital camera the natural scene is projected onto a sensor, such as an array of charge coupled devices (CCD array) [G15], which converts the brightness or color of the scene into digital data. Each image (picture) in a video sequence consists of M by N picture elements (pixels), where M is the number of rows and N the number of columns. For color images, each pixel is usually composed of three color components: Red (R), Green (G) and Blue (B), abbreviated as RGB. Each color component is separately filtered and projected onto a CCD array. Any color can be created by combining R, G and B in varying proportions, and the vectors (R, G, B) over all possible values of R, G and B form a space called the RGB color space. Each color component is represented with a K-bit integer; for ordinary use, K = 8 is sufficient, while larger values of K are needed for more demanding applications such as medical imaging, broadcast, surveillance and studio editing.
Although the RGB color space is well-suited to capture and display color images, the YCbCr space, formed by vectors of (Y, Cb, Cr), is more efficient for compression, where Y represents luminance (brightness) of a pixel, Cb and Cr are the pixel’s chrominance components proportional to the color differences of B–Y and R–Y respectively. If Cb = 0.564(B–Y) and Cr = 0.713(R–Y) as defined in [B8], the mappings from the RGB space to the YCbCr space and vice versa can be realized as follows:
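The equations referred to as (1.1) and (1.2) are not reproduced above; the standard forms consistent with these definitions of Cb and Cr (reconstructed here following [B8], with coefficients rounded to three decimals) are

\[ Y = 0.299R + 0.587G + 0.114B, \qquad C_{b} = 0.564(B - Y), \qquad C_{r} = 0.713(R - Y) \qquad (1.1) \]

\[ R = Y + 1.402\,C_{r}, \qquad G = Y - 0.344\,C_{b} - 0.714\,C_{r}, \qquad B = Y + 1.772\,C_{b} \qquad (1.2) \]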
On the other hand, if Cb = B–Y and Cr = R–Y as in [G6], Eqs. (1.1) and (1.2) change to:
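Assuming the same luma weights as in (1.1), these simpler definitions lead directly to (again a reconstruction rather than a verbatim copy of the equations in [G6])

\[ Y = 0.299R + 0.587G + 0.114B, \qquad C_{b} = B - Y, \qquad C_{r} = R - Y \]

\[ R = Y + C_{r}, \qquad B = Y + C_{b}, \qquad G = Y - 0.509\,C_{r} - 0.194\,C_{b} \]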
The human visual system (HVS) is more sensitive to errors in luminance than to errors in chrominance. This property can be exploited to compress video data further: the Cb and Cr components may be represented with a lower resolution than Y. For example, Cb and Cr can be downsampled to 1/4 the size of Y. This reduces the amount of data required to represent the chrominance components without an obvious effect on visual quality; to the casual observer, there is no obvious difference between an RGB image and a YCbCr image with reduced chrominance resolution. An RGB image is converted to YCbCr before storage or encoding, and the YCbCr image usually has to be converted back to RGB before display. A video standard usually supports several sampling patterns for Y, Cb and Cr. Typical patterns are 4:4:4, 4:2:2 and 4:2:0, as shown in Fig. 1.1.
The numbers in the ratio \( N_{1} : N_{2} : N_{3} \) indicate the relative sampling rates in the horizontal direction, where \( N_{1} \) represents the number of Y samples in both odd and even rows, \( N_{2} \) the number of Cb samples and of Cr samples in an odd row, and \( N_{3} \) the number of Cb samples and of Cr samples in an even row. For example, in the 4:2:0 sampling pattern, \( N_{1} = 4,\,N_{2} = 2,\,N_{3} = 0 \). This means that for every four luminance samples in an odd row there are two Cb samples and two Cr samples, while for every four luminance samples in an even row there are no Cb or Cr samples. In the 4:2:2 sampling pattern, every four luminance samples are accompanied by two Cb samples and two Cr samples in both odd and even rows.
Among these patterns, 4:2:0 is the most popular and is widely used for consumer applications such as video conferencing, digital television and digital versatile disk (DVD) storage, while the 4:2:2 and 4:4:4 patterns are used for high-quality color reproduction. From Figs. 1.1 and 1.2, it is clear that the number of samples in the 4:2:0 pattern is only half the number of samples in the 4:4:4 pattern.
In video encoders, each picture is partitioned into fixed-size macroblocks (MB) that cover a rectangular area of 16 × 16 samples of the luma component and 8 × 8 samples of each chroma component (4:2:0 format). Figure 1.2 shows the three formats known as 4:4:4, 4:2:2 and 4:2:0 video. 4:4:4 is full-bandwidth YUV video, and each macroblock consists of 4 Y blocks and 4 U/V blocks. Being full bandwidth, this format contains as much data as it would in the RGB color space. 4:2:2 contains half as much chrominance information as 4:4:4, and 4:2:0 contains one quarter of it. The 4:2:0 format is typically used in video streaming applications.
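To make the chrominance-information comparison concrete, the following minimal Python sketch (illustrative only; the function name and the CIF frame size are assumptions, not taken from the text) counts the samples per frame for the three formats:

```python
# Samples per frame for the chroma formats discussed above, assuming an
# M x N luma picture; the chroma resolution depends on the sampling format.

def samples_per_frame(rows, cols, fmt="4:2:0"):
    luma = rows * cols
    chroma_each = {"4:4:4": luma, "4:2:2": luma // 2, "4:2:0": luma // 4}[fmt]
    return luma + 2 * chroma_each  # Y plus Cb plus Cr

if __name__ == "__main__":
    for fmt in ("4:4:4", "4:2:2", "4:2:0"):
        # CIF picture (288 rows x 352 columns), used purely as an example
        print(fmt, samples_per_frame(288, 352, fmt))
    # 4:4:4 -> 304128, 4:2:2 -> 202752, 4:2:0 -> 152064 (half of 4:4:4)
```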
It can be summarized from the above discussions that: (1) a digital video is composed of a sequence of pictures, each of which includes M by N pixels; (2) the color of each pixel is determined by three components either in the RGB color space or in the YCbCr space; (3) the RGB color space is more suitable for acquisition and display while the YCbCr space is better for encoding and storage.
1.3 Basic Structure of Video Codec
Although differences exist between video standards, some common tools, such as motion-based temporal redundancy reduction and transform-based spatial redundancy reduction, are used by many of them. Video codecs built from these tools can be divided into two types: one includes adaptive intra prediction [H32], while the other does not. The two basic codec structures are shown in Figs. 1.3 and 1.4. Many video standards, including H.261, H.263, MPEG-1 and MPEG-2, share the coder structure of Fig. 1.3, although minor differences exist. Standards with a codec structure similar to Fig. 1.4 [H8] are usually newer ones, such as H.264, AVS China and VC-1 [C10] [C14]. It is obvious from these figures that transform, inter-frame prediction, intra-frame prediction and entropy coding play very important roles in video codecs.
1.4 Performance Comparison Metrics for Video Codec
Several aspects need to be compared to evaluate the performance of video codecs, including bit rate (or compression ratio), computational cost (or complexity), quality (or distortion), scalability, error robustness and interoperability. The compressed bit rate is the rate required to transmit a coded video sequence; its unit is bits per second, abbreviated as bps (or bits/s), and it is easy to calculate or measure for a compressed video stream. Computational cost refers to the processing power required to code the video sequence.
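As a simple illustration of how the raw data rate, the coded bit rate and the compression ratio relate to each other, here is a minimal Python sketch (the CIF resolution, 30 fps and 384 kbps target are illustrative assumptions, not values from the text):

```python
# Raw bit rate of uncompressed 4:2:0 video and the compression ratio needed
# to reach an assumed target coded bit rate.

def raw_bitrate_420(width, height, fps, bits_per_sample=8):
    # Y has width*height samples; Cb and Cr each have a quarter as many (4:2:0).
    samples_per_frame = width * height * 3 // 2
    return samples_per_frame * bits_per_sample * fps  # bits per second

if __name__ == "__main__":
    raw = raw_bitrate_420(352, 288, 30)   # CIF at 30 fps: ~36.5 Mbit/s
    target = 384_000                      # example coded rate: 384 kbit/s
    print(raw, round(raw / target))       # compression ratio of roughly 95:1
```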
Quality can mean either subjective or objective measured video quality. Mean opinion score (MOS) is one of the metrics to measure subjective quality. Subjective experiments are necessary to get MOS. In subjective experiments, a group of people (typically 15–30) are asked to watch a set of video clips and rate their quality. MOS represents the average rating over all viewers for a given clip. There are a wide variety of subjective testing methods. The ITU has formalized direct scaling methods in various recommendations [U1]–[U3]. Recommended testing procedures include implicit comparisons such as double stimulus continuous quality scale (DSCQS), explicit comparisons such as double stimulus impairment scale (DSIS), or absolute ratings such as single stimulus continuous quality evaluation (SSCQE) or absolute category rating (ACR). More details on subjective testing can be found in [B11]. Video quality is best assessed subjectively, i.e., by real viewers. However, assessments of subjective quality are time consuming and expensive because of the requirements for a large number of viewers and a large amount of video material to be rated; furthermore, they cannot be easily and routinely performed for real time systems.
The purpose of an objective image or video quality evaluation is to automatically assess the quality of images or video sequences in agreement with human quality judgments.
Objective metrics can be classified in different ways. For instance, Winkler and Mohandas [Q21] classify objective metrics into Data Metrics, Picture Metrics and Packet- and Bitstream-Based Metrics, or alternatively into Full-Reference, No-Reference and Reduced-Reference Metrics according to the amount of information required about the reference video. Data Metrics are based only on a byte-by-byte comparison of the data without considering the spatial relationship of pixels; examples are the mean square error (MSE) and the peak-to-peak signal-to-noise ratio (PSNR). In contrast, Picture Metrics specifically account for the effects of distortions and content on perceived quality. Packet- and Bitstream-Based Metrics are designed to measure the impact of network losses on video quality based on parameters that can be extracted from the transport stream and the bitstream with little or no decoding.
Full-reference methods require full access to both the original source sequence and its processed counterpart. They are appropriate for performance testing where there is sufficient time to measure quality and source video is available. Reduced-reference methods operate by extracting a parameter set from the original reference sequence and using this set in place of the actual reference video. Some means of transmitting the reference parameters for use with the reduced-reference method is required. No-reference methods operate only on the processed sequence and have no access to source information. Reduced-reference and no-reference methods are appropriate for live monitoring applications [Q21] [U10].
Over the past few decades, image and video quality assessment has been extensively studied and many different objective criteria have been proposed [U6]–[U8], but PSNR is still the most popular quality metric, especially in rate-distortion performance analysis. To investigate the scope of validity of PSNR in image/video quality assessment, Huynh-Thu and Ghanbari [Q18] selected ten source (reference) video contents (named SRC1 through SRC10) of 8 s duration at CIF resolution, covering a wide range of spatio-temporal characteristics, and encoded them with H.264 at several bit rates from 24 to 800 kbps. A subjective test was also conducted with 40 test sequences and an experimental setup following international standards [U2]. The results showed that, for a given content, PSNR increases monotonically with subjective quality as the bit rate increases, as shown in Fig. 1.5a. In other words, the variation of PSNR is a reliable indicator of the variation of quality within a specified codec and fixed content, so PSNR can be used as a performance metric in the context of codec optimization. On the other hand, Fig. 1.5b and c demonstrate that PSNR is not a reliable method for assessing video quality across different video contents.
Mean square error (MSE), root mean square error (RMSE), normalized mean square error (NMSE), signal-to-noise ratio (SNR), and peak-to-peak signal-to-noise ratio (PSNR) are defined as [Q4]
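The definitions themselves are not reproduced above; in the notation explained immediately below, their commonly used forms (a reconstruction; [Q4] may normalize NMSE and SNR slightly differently) are

\[ \mathrm{MSE} = \frac{1}{MN} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( x_{ij} - y_{ij} \right)^{2}, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad \mathrm{NMSE} = \frac{\sum_{i,j} \left( x_{ij} - y_{ij} \right)^{2}}{\sum_{i,j} x_{ij}^{2}} \]

\[ \mathrm{SNR} = 10 \log_{10} \frac{\sum_{i,j} x_{ij}^{2}}{\sum_{i,j} \left( x_{ij} - y_{ij} \right)^{2}} \ \mathrm{dB}, \qquad \mathrm{PSNR} = 10 \log_{10} \frac{255^{2}}{\mathrm{MSE}} \ \mathrm{dB} \]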
where N is the number of rows and M the number of columns; \( x_{ij} \) is the original pixel value at the position of ith row and jth column; \( y_{ij} \) is the processed (such as decoded) pixel value at the position of ith row and jth column. Here 255 is the peak-to-peak signal value for 8 bit PCM.
In order to improve measurement performance, Feghali et al. [Q17] proposed an effective quality metric named QM that takes into account quantization errors, frame rate (FR) and motion speed. The metric QM is defined as
where PSNR is the peak-to-peak signal-to-noise ratio of the video sequence; a = 0.986 and b = 0.378 are constants; FR (≤30) is the frame rate of the video; and m is the motion-speed parameter, which is in fact the normalized average magnitude of the large motion vectors. The resulting correlation coefficient (listed in Table 1.3) between QM and the assessed subjective quality is as high as 0.93 on average over five video sequences (Football, Ferris wheel, Mobile, Susie and Autumn leaves), which is much better than using PSNR alone. According to (1.10), when FR = 30, QM = PSNR; that is, PSNR predicts the subjective quality well at a frame rate of 30 fps, as shown in Figs. 1.6 and 1.7.
Different from metrics based on PSNR, Liu et al. [Q24] proposed a full-reference metric that measures the overall quality degradation due to both packet losses and lossy compression. This quality metric is called PDMOSCL (predicted degradation of differential mean opinion score due to coding artifacts and packet losses) and is defined as
with
where PDMOSCL is the predicted total quality degradation; \( \text{PDMOS}_{C} \) is the predicted quality degradation contributed by coding artifacts in the loss-free portion of the encoded sequence; \( \text{PDMOS}_{L} \) is the predicted quality degradation due to packet losses; the parameter \( \lambda \) provides an appropriate weighting between the quality degradation caused by coding artifacts and that caused by loss artifacts; \( \text{PD}_{j} \) is the PSNR drop for frame j, with j = 1 denoting the first lost frame; EL is the length (in frames) of the loss-affected video segment (or error length); \( \text{EL}_{\min} \) is a minimal error length; D is the distance (in seconds) from the last erroneous frame to the end of the sequence, and r is a constant determined through least-squares fitting of the subjective ratings to the model; L is the sequence length (number of frames); \( N_{L} \) is the number of losses; \( L_{\text{loss}} \), called the “loss span,” is defined as the distance (in seconds) between the first lost frame and the beginning of the last loss (for a single loss, \( L_{\text{loss}} \) is set to 0); c and k are constants to be determined; s is the roll-off factor of the sigmoid function; \( \text{PSNR}_{T} \) is the transition value of the PSNR curve; and \( \text{DMOS}_{C,\max} \) is the maximum possible perceptual quality degradation due to coding artifacts.
Five videos with different scene contents (described in Table 1.4) are used to generate a large set of test sequences. The scene contents cover indoor people interactions and outdoor sports games, with low to high motion and plain to rich textures, and the videos include a variety of camera motions. All sequences are at QVGA (320 × 240) resolution, the encoding frame rate is 12 or 15 fps, and clip durations range from 20 s to 40 s. Four tests, described in Table 1.5, are conducted to investigate the quality metric. Test 1 and Test 2 are designed to find out how the perceptual quality is affected by packet loss, whereas Test 3 concerns the impact of coding artifacts. The sequences used in the first three tests are all generated from the original video “American Pie” and are used for exploring and training the objective metric PDMOSCL. The sequences used in Test 4 are generated from all five video sources, contain both coding and loss artifacts, and are used for verifying PDMOSCL. The results demonstrate that this metric correlates very well with subjective ratings for a large set of video clips with different loss patterns, coding artifacts and scene contents, as shown in Fig. 1.8.
In addition to the above full-reference metrics, Ninassi et al. [Q23] described a full-reference video quality assessment metric which focuses on the temporal variations of the spatial distortions. These variations are evaluated both at the eye-fixation level and over the entire video sequence.
Pinson and Wolf [Q14] proposed a general purpose video quality metric (VQM), which is defined as:
where si_loss detects a decrease or loss of spatial information (e.g., blurring); hv_loss detects a shift of edges from horizontal and vertical orientation to diagonal orientation, such as might be the case if horizontal and vertical edges suffer more blurring than diagonal edges; chroma_spread detects changes in the spread of the distribution of two-dimensional color samples. Thus, chroma_spread measures color impairments; si_gain measures improvements to quality that result from edge sharpening or enhancements; ct_ati_gain is computed as the product of a contrast feature, measuring the amount of spatial detail, and a temporal information feature, measuring the amount of motion present in the S-T region; chroma_extreme detects severe localized color impairments, such as those produced by digital transmission errors.
This model has been shown by the VQEG FR-TV (Video Quality Experts Group Full Reference Television) Phase II test [U9] to produce excellent estimates of video quality for both 525-line and 625-line video systems, as shown in Figs. 1.9 and 1.10.
No-reference video quality evaluation has been the topic of many studies in the field of visual quality metrics. Oelbaum et al. [Q25] proposed a no-reference metric using 7 features: blurriness, blockiness, spatial activity, temporal predictability, edge-continuity, motion-continuity and color-continuity. If these feature values are denoted by \( p_{i}\ (i = 1, 2, \ldots, 7) \) and a feature vector \( \textbf{p} \) is defined as \( \textbf{p} = \left( p_{1}, p_{2}, \ldots, p_{7} \right)^{\text{T}} \), then the NR quality metric \( \hat{y} \) can be written as:
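The expression that follows is the linear model implied by the description below, written out here for completeness:

\[ \hat{y} = \textbf{b}^{\text{T}} \textbf{p} + b_{0} = b_{0} + \sum_{i=1}^{7} b_{i} p_{i} \]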
where \( \textbf{b} \) is a column vector that contains the individual estimation weights \( b_{i}\ (i = 1, 2, \ldots, 7) \) for each feature \( p_{i} \), and the scalar \( b_{0} \) is the model offset. Four feature classes, known as low rate, blur, blocking and general, are defined to classify video sequences. The four models, one per video class, differ only in the weights \( b_{i} \) (Table 1.6) for the features \( p_{i} \).
For a given video sequence V, an appropriate model is selected by analyzing the features of a low-quality version \( V_{\text{low}} \) of the video, which is produced by encoding the actual video V with a high quantization parameter (QP). The feature values of V are then estimated and the selected model is used to calculate the no-reference quality metric. For more details, please refer to [Q25].
Naccari et al. [Q26] proposed a no-reference video quality monitoring algorithm (called NORM) to automatically assess the channel-induced distortion in a video sequence decoded from an H.264/AVC compliant bitstream that has been transmitted through a noisy channel affected by packet losses. NORM, however, measures only the distortion due to channel losses.
An objective metric that correlates well with subjective quality is the structural similarity index metric SSIM [Q13]. This is described in detail in Appendix C.
1.5 Digital Representation of Audio
The most common way to represent an audio signal digitally is to digitize an analog audio signal using pulse code modulation (PCM) [O1]. The audio signal of each channel is sampled at regular intervals and digitized into PCM codes, which are in fact discrete numbers, with an A/D converter [O7] [O4]. According to the Nyquist sampling theorem [B20], if a signal is sampled instantaneously at regular intervals and at a rate slightly higher than twice the highest signal frequency, then the samples contain all of the information of the original signal. The sampling frequency for audio data acquisition should therefore be determined according to the characteristics of the human auditory system. The human ear has a dynamic range of about 140 dB and a hearing bandwidth of up to 20 kHz [B15] [O32], so an audio signal should be sampled at a rate of at least 40 kHz for high quality. That is why the CD format uses a sampling rate of 44.1 kHz (44,100 samples per second), slightly more than twice the highest frequency we can hear. To avoid aliasing noise, the analog audio signal has to be band-limited by a lowpass filter placed before the sample-and-hold [O3] circuit and the A/D converter. Another important factor is the A/D converter’s resolution (the number of bits), which determines the dynamic range of the audio system: a resolution of 16 bits is only enough to reproduce sounds with a dynamic range of about 96 dB, while 24 bits gives a theoretical dynamic range of 144 dB. Typical sampling rates are 44.1, 48, 96 and 192 kHz, and typical resolutions are 16, 20 and 24 bits. DVD mono and stereo audio supports all of these sampling rates and resolutions [O33], while CD audio supports 44.1 kHz at 16 bits. A sampling rate of 48 kHz at 16-bit resolution yields a data rate of 768 kbit/s per channel, i.e., approximately 1.5 Mbit/s for a stereo signal, as shown in Fig. 1.11.
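As a quick check of the figures quoted above, here is a trivial Python sketch (the helper name is of course arbitrary):

```python
# PCM bit rate = sampling rate x resolution x number of channels.

def pcm_bitrate(sample_rate_hz, bits_per_sample, channels=1):
    return sample_rate_hz * bits_per_sample * channels  # bits per second

if __name__ == "__main__":
    print(pcm_bitrate(48_000, 16))       # 768,000 bit/s per channel
    print(pcm_bitrate(48_000, 16, 2))    # ~1.5 Mbit/s for a stereo signal
    print(pcm_bitrate(44_100, 16, 2))    # CD stereo: 1,411,200 bit/s
    # Dynamic range grows by roughly 6 dB per bit: 16 bits ~ 96 dB, 24 bits ~ 144 dB.
    for bits in (16, 24):
        print(bits, round(6.02 * bits, 1), "dB")
```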
1.6 Basic Structure of Perceptual Audio Coding
The central objective in audio coding is to represent the signal with a minimum number of bits while achieving transparent signal reproduction, i.e., while generating output audio which cannot be distinguished from the original input, even by a sensitive listener (“golden ears”) [O18]. Perceptual audio coding plays an important role in audio coding. Many perceptual audio coding methods have been proposed [O13, O18, O20, O21, O26, O28-O30] and several standards have been developed [O8-O12, O17, O19, A4]. Most perceptual audio coders have a quite similar architecture as shown in Fig. 1.12.
The coders typically segment input signals into quasi-stationary frames ranging from 2 to 50 ms in duration. A time–frequency analysis section then decomposes each analysis frame. The time/frequency analysis approximates the temporal and spectral analysis properties of the human auditory system. It transforms input audio into a set of parameters which can be quantized and encoded according to a perceptual distortion metric. Depending on overall system objectives and design philosophy, the time–frequency analysis section might contain a
- Unitary transform
- Time-invariant bank of uniform bandpass filters
- Time-varying (signal-adaptive), critically sampled bank of non-uniform bandpass filters
- Hybrid transform/filterbank signal analyzer
- Harmonic/sinusoidal analyzer
The choice of time–frequency analysis methodology always involves a fundamental tradeoff between time and frequency resolution requirements. Perceptual distortion control is achieved by a psychoacoustic signal analysis section which estimates signal masking power based on psychoacoustic principles. The psychoacoustic model delivers masking thresholds which quantify the maximum amount of distortion that can be injected at each point in the time–frequency plane, during quantization and encoding of the time–frequency parameters, without introducing audible artifacts in the reconstructed signal. The psychoacoustic model therefore allows the quantization and encoding section to exploit perceptual irrelevancies in the time–frequency parameter set. The quantization and encoding section can also exploit statistical redundancies through classical techniques such as differential pulse code modulation (DPCM) or adaptive DPCM (ADPCM). Quantization can be uniform or pdf-optimized (Lloyd-Max), and it can be performed on either scalars or vectors (VQ). Once a quantized compact parametric set has been formed, remaining redundancies are typically removed through run-length (RL) and entropy (e.g., Huffman, arithmetic, LZW (Lempel–Ziv–Welch) [DC2]) coding techniques. Since the psychoacoustic distortion control model is signal adaptive, most algorithms are inherently variable rate. Fixed channel rate requirements are usually satisfied through buffer feedback schemes, which often introduce encoding delays.
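The following minimal Python sketch illustrates the idea of perceptually driven quantization described above: bits are allocated per band so that the quantization noise stays below the masking threshold delivered by the psychoacoustic model. It is an illustration under simplifying assumptions (one uniform quantizer per band, roughly 6 dB of noise reduction per bit, example numbers), not the bit-allocation procedure of any particular standard.

```python
import numpy as np

def allocate_bits(signal_db, mask_db, max_bits=16):
    """Per-band bit allocation: each extra bit lowers quantization noise by
    about 6 dB, so allocate just enough bits to push the noise below the
    masking threshold in every band."""
    signal_db = np.asarray(signal_db, dtype=float)
    mask_db = np.asarray(mask_db, dtype=float)
    # Noise-to-mask ratio (dB) if a band were quantized with zero bits,
    # i.e. if the quantization noise were as large as the signal itself.
    nmr = signal_db - mask_db
    bits = np.ceil(np.maximum(nmr, 0.0) / 6.02).astype(int)
    return np.minimum(bits, max_bits)

if __name__ == "__main__":
    band_energy = [60, 55, 40, 30, 20]   # example per-band signal levels (dB)
    mask_thresh = [30, 35, 30, 28, 25]   # example masking thresholds (dB)
    print(allocate_bits(band_energy, mask_thresh))   # -> [5 4 2 1 0]
```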
Figures 1.13 and 1.14 show the MPEG audio codec and the AVS audio codec, respectively. Similarities to the basic structure of Fig. 1.12 can easily be found in these standard codecs. More details about MPEG audio and AVS audio can be found in [O11, O12, O19] and [N2, N5, N8], respectively.
1.7 Performance Comparison Metrics for Audio Codec
As with video codecs, several aspects, including the bit rate per channel, computational cost (or complexity) and quality, should be considered when evaluating the performance of audio codecs. The unit of bit rate is again bits per second (bps or bits/s), as for video.
Audio quality evaluation methods can be divided into two types: subjective methods [L3-L6, L8, L16] and objective methods [L7, L10, L11, L15, L17, L19, O27]. ITU-R Recommendation BS.1116 [L4] defines a test procedure for subjective assessment of high-quality audio codecs. The procedure is based on a “double-blind, triple-stimulus with hidden reference” comparison method, in which the listener is presented with three stimuli: the reference (original) signal and the test signals A and B. One of A and B is an impaired (coded) signal, and the other is again the original signal (the hidden reference). Neither the test subject nor the supervisor knows which of A and B is the hidden reference. After listening to all three signals, the subject must pick out the hidden reference from A and B, and then grade the other (coded) signal relative to the reference stimulus using the five-grade impairment scale shown in Fig. 1.15. From best to worst, the coding distortion is graded as “imperceptible (5),” “perceptible but not annoying (4.0–4.9),” “slightly annoying (3.0–3.9),” “annoying (2.0–2.9),” or “very annoying (1.0–1.9).”
If the symbols SDG, \( G_{i} \) and \( G_{r} \) are used to represent the subjective difference grade, the score assigned to the actual impaired signal and the score assigned to the actual hidden reference signal, respectively, then the SDG is defined as
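With this notation the definition is simply the difference between the two grades:

\[ \mathrm{SDG} = G_{i} - G_{r} \]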
The default value of \( G_{r} \) is 5. Negative values of SDG are obtained when the subject identifies the hidden reference correctly, and positive values are obtained if the subject misidentifies it. Over many subjects and many trials, mean impairment scores are calculated and used to evaluate codec performance relative to the ideal case.
In ITU-R Recommendation BS.1534 [L8], a subjective method known as MUSHRA [L16] (MUltiple Stimulus with Hidden Reference and Anchors) is developed for the assessment of intermediate quality levels of coding systems. MUSHRA is a double-blind multi-stimulus test method with one known reference, one hidden reference and one or more hidden anchors; at least one of the anchors is required to be a low-passed version of the reference signal. According to the MUSHRA guidelines, the subjects grade the stimuli on a continuous quality scale divided into five equal intervals labeled, from top to bottom, excellent, good, fair, poor and bad. The scores are then normalized to the range between 0 and 100, where 0 corresponds to the bottom of the scale (bad quality).
Since subjective tests are both time consuming and expensive, many objective audio quality evaluation methods have been developed for automatic assessment of audio quality. A review and recent developments of audio quality assessment are presented in [L19], which includes a brief technical summary of the perceptual evaluation of audio quality (PEAQ) algorithm developed in ITU standard BS.1387 [L7].
PEAQ is designed only to objectively grade signals with extremely small impairments. The block diagram shown in Fig. 1.16 describes PEAQ with two main parts: the psychoacoustic model and the cognitive model. The psychoacoustic model contains many different blocks that model the various individual parts of the human auditory system. It transforms the time domain input signals into a basilar membrane representation (i.e. a model of the basilar membrane in the human auditory system). The cognitive model simulates the cognitive processing of the human brain. It processes the parameters produced by the psychoacoustic ear models to form a quality score. Figure 1.16 also tells us that PEAQ is an intrusive algorithm that produces a single score as the quality metric by comparing two input signals: a reference (original) signal and a degraded (coded) signal.
PEAQ has two versions: a basic version and an advanced version. The former is used in applications where computational efficiency is an issue, and the latter is used where accuracy is of the utmost importance. The main structural difference between the two is that the basic version has only one peripheral ear model (an FFT [B12] based ear model), whereas the advanced version has two (FFT [B12] based and filter-bank based ear models). The basic version produces 11 model output variables (MOVs) whereas the advanced version produces only 5 MOVs. The MOVs are output features based on loudness, modulation, masking and adaptation. The MOVs are the inputs to a neural network which is trained to map them to a single ODG (overall difference grade) score. The ODG score predicts the perceptual quality of the degraded signal and can range from 0 to −4, where 0 represents a signal with imperceptible distortion and −4 a signal with very annoying distortion.
The FFT [B12] based ear model, which is used in both versions of PEAQ, processes frames of samples in the frequency domain. The filter-bank based ear model, which is used only in the advanced version of PEAQ, processes the data in the time domain. Both ear model outputs are involved in producing the MOVs, which are mapped to a single ODG quality score using a neural network in the cognitive model, as shown in Fig. 1.17. For more details about the FFT based ear model and the filter-bank based ear model, please refer to [L7] and [L19].
The MOVs are based on a range of parameters such as loudness, amplitude modulation, adaptation and masking parameters. The MOVs also model concepts such as linear distortion, bandwidth, NMR, modulation difference and noise loudness. They are generally calculated as averages of these parameters, taken over the duration of the test and reference signals; typically, more than one MOV is derived from each class of parameters (modulation, loudness, bandwidth etc.). A detailed description of the MOVs is available in [L7] and [L19].
Segmental SNR [L1, L2, L12, L18] is a simple objective voice (speech) quality measure defined in (1.18) as an average of the SNR (signal-to-noise ratio) values of short segments.
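The equation itself is not reproduced above; the standard form, written with the symbols defined in the next sentence (a reconstruction, so [L2] or [L12] may differ in details such as segment indexing or per-segment clipping), is

\[ \mathrm{SNR}_{\mathrm{seg}} = \frac{10}{M_{B}} \sum_{m=0}^{M_{B}-1} \log_{10} \frac{\displaystyle\sum_{n=mN_{s}}^{(m+1)N_{s}-1} x^{2}(n)}{\displaystyle\sum_{n=mN_{s}}^{(m+1)N_{s}-1} \left( x(n) - d(n) \right)^{2}} \qquad (1.18) \]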
where x(n) represents the original speech signal, d(n) the distorted speech signal, n the sample index, \( N_{s} \) the segment length, and \( M_{B} \) the number of segments in the speech signal. Mermelstein [L1] defined the segmental SNR in another way. A performance measure in terms of \( \text{SNR}_{\text{seg}} \) is a good estimator of the voice quality of waveform codecs, but its performance is poor for vocoders, where the aim is to generate the same speech sound rather than to reproduce the speech waveform itself. Furthermore, the correlation of segmental SNR with subjective perceptual quality is low (only 0.531 as given in [L17]). Although segmental SNR is not independently suitable for performance evaluation of perceptual audio coders, it can be used as one component of a new, well-performing perceptual quality metric formed by a linear combination with other components [L17].
Kandadai et al. [L14] applied the mean structural similarity to objective evaluation of perceptual audio quality. There are two approaches. In the first approach, the audio sequences are split into temporal frames of length 128 with 50 % overlap and then the structural similarity (SSIM) [Q13] is applied to each frame separately. The mean SSIM (MSSIM) is calculated by averaging the individual SSIM values for each frame. This method is referred to as the temporal MSSIM (T-MSSIM). In the second approach, a 256-point modified discrete cosine transform (MDCT) [O5, O6] with a 50 % overlapping window is used to analyze the audio sequence into a time–frequency representation, and the SSIM is applied to the two-dimensional blocks of the time–frequency representation. This method is referred to as the time–frequency MSSIM (TF-MSSIM). The correlation coefficients of T-MSSIM and TF-MSSIM with the MUSHRA [L8] subjective quality are 0.98 and 0.976 respectively. This indicates that the MSSIM and the subjective tests are highly correlated.
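A minimal sketch of the first (temporal) approach follows, assuming a basic single-scale SSIM applied to 1-D frames; the stabilizing constants, the synthetic test signal and the function names are assumptions for illustration, not values from [L14]:

```python
import numpy as np

def ssim_1d(x, y, c1=1e-4, c2=9e-4):
    """Basic SSIM [Q13] applied to one pair of 1-D audio frames."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def t_mssim(reference, test, frame_len=128):
    """Temporal MSSIM: length-128 frames with 50 % overlap, then averaged."""
    hop = frame_len // 2
    scores = [ssim_1d(reference[i:i + frame_len], test[i:i + frame_len])
              for i in range(0, len(reference) - frame_len + 1, hop)]
    return float(np.mean(scores))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.standard_normal(48_000)               # 1 s of test "audio" at 48 kHz
    deg = ref + 0.05 * rng.standard_normal(48_000)  # mildly distorted copy
    print(t_mssim(ref, deg))                        # close to 1 for mild distortion
```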
In [L9], the energy equalization quality metric (EEQM) was developed for quality assessment of highly impaired audio. In this method, the original audio spectrogram is truncated with a threshold \( T_{\text{EEQM}} \) (coefficients of the original audio spectrogram with magnitudes above \( T_{\text{EEQM}} \) are retained, while the others are set to zero). For each specific value of \( T_{\text{EEQM}} \), the energy of the truncated spectrogram is evaluated and compared with the energy of the bandpass spectrogram of the reconstructed signal. \( T_{\text{EEQM}} \) is adjusted with an iterative optimization algorithm so that the truncated version of the original spectrogram and the bandpass spectrogram of the reconstructed signal have equal energies and similar time–frequency characteristics. The value of \( T_{\text{EEQM}} \) in the optimal case is then used as a measure of the impairment in the test signal. Furthermore, \( T_{\text{EEQM}} \) is combined with the model output variables (MOVs) [L7] to create a simple and robust universal metric for audio quality [L13].
Spectral band replication (SBR) [O22, O23] is a new audio coding tool that significantly improves the coding gain of perceptual coders and speech coders. Currently, there are three different audio coders that have shown a vast improvement by the combination with SBR: MPEG-AAC, MPEG-Layer II and MPEG-Layer III (mp3), all three being parts of the open ISO-MPEG standard. The combination of AAC and SBR will be used in the standardized Digital Radio Mondiale (DRM) system, and SBR is currently also being standardized within MPEG-4. SBR is a so-called bandwidth extension technique, where a major part of a signal’s bandwidth is reconstructed from the lowband on the receiving side. The paper [O23] focuses on the technical details of SBR and in particular on the filter bank, which is the basis of the SBR process.
The combination of AAC and SBR, aacPlus, is the most efficient audio coder today, improving the already powerful AAC coder in coding efficiency by more than 30 %. The foundation of the SBR system is the complex modulated QMF bank. The complex valued representation permits modification of the subband samples without introducing excessive aliasing.
1.8 Summary
This chapter has presented a brief description and comparison of various video/image coding standards such as JPEG, the MPEG and H.26x series, DIRAC (Chap. 7), AVS China (Chap. 3) and VC-1 (Chap. 8), followed by a similarly brief description of audio coding and the corresponding performance comparison metrics. These aspects are further amplified in Chap. 2.
Notes
1. H.262 also has audio coding among its several parts.