Video Coding Standards

Akramullah, Shahriar

doi:10.1007/978-1-4302-6713-3_3

Shahriar Akramullah¹

20k Accesses
3 Altmetric

Abstract

The ubiquity of video has been possible owing to the establishment of a common representation of video signals through international standards, with the goal of achieving common formats and interoperability of products from different vendors.

You have full access to this open access chapter, Download chapter PDF

Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

The ubiquity of video has been possible owing to the establishment of a common representation of video signals through international standards, with the goal of achieving common formats and interoperability of products from different vendors.

In the late 1980s, the need for standardization of digital video was recognized, and special expert groups from the computer, communications, and video industries came together to formulate practical, low-cost, and easily implementable standards for digital video storage and transmission. To determine these standards, the expert groups reviewed a variety of video data compression techniques, data structures, and algorithms, eventually agreeing upon a few common technologies, which are described in this chapter in some detail.

In the digital video field, many international standards exist to address various industry needs. For example, standard video formats are essential to exchange digital video between various products and applications. Since the amount of data necessary to represent digital video is huge, the data needs to be exchanged in compressed form—and this necessitates video data compression standards. Depending on the industry, standards were aimed at addressing various aspects of end-user applications. For example, display resolutions are standardized in the computer industry, digital studio standards are standardized in the television industry, and network protocols are standardized in the telecommunication industry. As various usages of digital video have emerged to bring these industries ever closer together, standardization efforts have concurrently addressed those cross-industry needs and requirements. In this chapter we discuss the important milestones in international video coding standards, as well as other video coding algorithms popular in the industry.

Overview of International Video Coding Standards

International video coding standards are defined by committees of experts from organizations like the International Standards Organization (ISO) and the International Telecommunications Union (ITU). The goal of this standardization is to have common video formats across the industry, and to achieve interoperability among different vendors and video codec related hardware and software manufacturers.

The standardization of algorithms started with image compression schemes such as JBIG (ITU-T Rec. T82 and ISO/IEC 11544, March 1993) for binary images used in fax and other applications, and the more general JPEG (ITU-T Rec. T81 and ISO/IEC 10918-1), which includes color images as well. The JPEG standardization activities started in 1986, but the standard was ratified in 1992 by ITU-T and in 1994 by ISO. The main standardization activities for video compression algorithms started in the 1980s, with the ITU-T H.261, ratified in 1988, which was the first milestone standard for visual telecommunication. Following that effort, standardization activities increased with the rapid advancement in the television, film, computer, communication, and signal processing fields, and with the advent of new usages requiring contributions from all these diverse industries. These efforts subsequently produced MPEG-1, H.263, MPEG-2, MPEG-4 Part 2, AVC/H.264, and HEVC/H.265 algorithms. In the following sections, we briefly describe the major international standards related to image and video coding.

JPEG

The JPEG is a continuous-tone still image compression standard, designed for applications like desktop publishing, graphic arts, color facsimile, newspaper wirephoto transmission, medical imaging, and the like. The baseline JPEG algorithm uses a DCT-based coding scheme where the input is divided into 8×8 blocks of pixels. Each block undergoes a two-dimensional forward DCT, followed by a uniform quantization. The resulting quantized coefficients are scanned in a zigzag order to form a one-dimensional sequence where high-frequency coefficients, likely to be zero-valued, are placed later in the sequence to facilitate run-length coding. After run-length coding, the resulting symbols undergo more efficient entropy coding.

The first DCT coefficient, often called the DC coefficient as it is a measure of the average value of the pixel block, is differentially coded with respect to the previous block. The run length of the AC coefficients, which have non-zero frequency, are coded using variable-length Huffman codes, assigning shorter codewords for more probable symbols. Figure 3-1 shows the JPEG codec block diagram.

H.261

H.261 is the first important and practical digital video coding standard adopted by the industry. It is the video coding standard for audiovisual services at p×64 kbps, where p = 1, . . ., 30, primarily aimed at providing videophone and video-conferencing services over ISDN, and unifying all aspects of video transmission for such applications in a single standard.^{Footnote 1}The objectives include delivering video at real time, typically at 15–30 fps, with minimum delay (less than 150 ms). Although a successful standard providing industry-wide interoperability, common format, and compression techniques, H.261 is obsolete and is rarely used today.

In H.261, it is mandatory for all codecs to operate at quarter-CIF (QCIF) video format, while the use of CIF is optional. Since the uncompressed video bit rates for CIF and QCIF at 29.97 fps are 26.45 Mbps and 9.12 Mbps, respectively, it is extremely difficult to transport these video signals using an ISDN channel while providing reasonable video quality. To accomplish this goal, H.261 divides a video into a hierarchical block structure comprising pictures, groups of blocks (GOB), macroblocks (MB), and blocks. A macroblock consists of four 8×8 luma blocks and two 8×8 chroma blocks; a 3×11 array of macroblocks, in turn, constitutes a GOB. A QCIF picture has three GOBs, while a CIF picture has 12. The hierarchical data structure is shown in Figure 3-2.

The H.261 source coding algorithm is a hybrid of intra-frame and inter-frame coding, exploiting spatial and temporal redundancies. Intra-frame coding is similar to baseline JPEG, where block-based 8×8 DCT is performed, and the DCT coefficients are quantized. The quantized coefficients undergo entropy coding using variable-length Huffman codes, which achieves bit-rate reduction using statistical properties of the signal.

Inter-frame coding involves motion-compensated inter-frame prediction and removes the temporal redundancy between pictures. Prediction is performed only in the forward direction; there is no notion of bi-directional prediction. While the motion compensation is performed with integer-pel accuracy, a loop filter can be switched into the encoder to improve picture quality by removing coded high-frequency noise when necessary. Figure 3-3 shows a block diagram of the H.261 source encoder.

MPEG-1

In the 1990s, instigated by the market success of compact disc digital audio, CD-ROMs made remarkable inroads into the data storage domain. This prompted the inception of the MPEG-1 standard, targeted and optimized for applications requiring 1.2 to 1.5 Mbps with video home system (VHS)-quality video. One of the initial motivations was to fit compressed video into widely available CD-ROMs; however, a surprisingly large number of new applications have emerged to take advantage of the highly compressed video with reasonable video quality provided by the standard algorithm. MPEG-1 remains one of the most successful developments in the history of video coding standards. Arguably, however, the most well-known part of the MPEG-1 standard is the MP3 audio format that it introduced. The intended applications for MPEG-1 include CD-ROM storage, multimedia on computers, and so on. The MPEG-1 standard was ratified as ISO/IEC 11172 in 1991. The standard consists of the following five parts:

1.
Systems: Deals with storage and synchronization of video, audio, and other data.
2.
Video: Defines standard algorithms for compressed video data.
3.
Audio: Defines standard algorithms for compressed audio data.
4.
Conformance: Defines tests to check correctness of the implementation of the standard.
5.
Reference Software: Software associated with the standard as an example for correct implementation of the encoding and decoding algorithms.

The MEPG-1 bitstream syntax is flexible and consists of six layers, each performing a different logical or signal-processing function. Figure 3-4 depicts various layers arranged in an onion structure.

MPEG-1 is designed for coding progressive video sequences, and the recommended picture size is 360×240 (or 352×288, a.k.a. CIF) at about 1.5 Mbps. However, it is not restricted to this format, and can be applied to higher bit rates and larger image sizes. The intended chroma format is 4:2:0 with 8 bits of pixel depth. The standard mandates real-time decoding and supports features to facilitate interactivity with stored bitstreams. It only specifies syntax for the bitstream and the decoding process, allowing sufficient flexibility for the encoder implementation. Encoders are usually designed to meet specific usage needs, but they are expected to provide sufficient tradeoffs between coding efficiency and complexity.

The main goal of the MPEG-1 video algorithm, as in any other standard, is to achieve the highest possible video quality for a given bit rate. Toward this goal, the MPEG-1 compression approach is similar to that of H.261: it is also a hybrid of intra- and inter-frame redundancy-reduction techniques. For intra-frame coding, the frame is divided into 8×8 pixel blocks, which are transformed to frequency domain using 8×8 DCT, quantized, zigzag scanned, and the run length of the generated bits are coded using variable-length Huffman codes.

Temporal redundancy is reduced by computing a difference signal, namely the prediction error, between the original frame and its motion-compensated prediction constructed from a reconstructed reference frame. However, temporal redundancy reduction in MPEG-1 is different from H.261 in a couple of significant ways:

MPEG-1 permits bi-directional temporal prediction, providing higher compression for a given picture quality than would be attainable using forward-only prediction. For bi-directional prediction, some frames are encoded using either a past or a future frame in display order as the prediction reference. A block of pixels can be predicted from a block in the past reference frame, from a block in the future reference frame, or from the averge of two blocks, one from each reference frame. In bi-directional prediction, higher compression is achieved at the expense of greater encoder complexity and additional coding delay. However, it is still very useful for storage and other off-line applications.
Further, MPEG-1 introduces half-pel (a.k.a. half-pixel) accuracy for motion compensation and eliminates the loop filter. The half-pel accuracy partly compensates for the benefit provided by the H.261 loop filter in that high-frequency coded noise does not propagate and coding efficiency is not sacrificed.

The video sequence layer specifies parameters such as the size of the video frames, frame rate, bit rate, and so on. The group of pictures (GOP) layer provides support for random access, fast search, and editing. The first frame of a GOP must be intra-coded (I-frame), where compression is achieved only in the spatial dimension using DCT, quantization, and variable-length coding. The I-frame is followed by an arrangement of forward-predictive coded frames (P-frames) and bi-directionally predictive coded frames (B-frames). I-frames provide ability for random access to the bitstream and for fast search (or VCR-like trick play, such as fast-forward and fast-rewind), as they are coded independently and serve as entry points for further decoding.

The picture layer deals with a particular frame and contains information of the frame type (I, P, or B) and the display order of the frame. The bits corresponding to the motion vectors and the quantized DCT coefficients are packages in the slice layer, the macroblock layer, and the block layer. A slice is a contiguous segment of the macroblocks. In the event of a bit error, the slice layer helps resynchronize the bitstream during decoding. The macroblock layer contains the associated motion vector bits and is followed by the block layer, which consists of the coded quantized DCT coefficients. Figure 3-5 shows the MPEG picture structure in coding and display order, which applies to both MPEG-1 and MPEG-2.

MPEG-2

MPEG-2 was defined as the standard for generic coding of moving pictures and associated audio. The standard was specified by a joint technical committee of the ISO/IEC and ITU-T, and was ratified in 1993, both as the ISO/IEC international standard 13818 and as the ITU-T Recommendation H.262.

With a view toward resolving the existing issues in MPEG-1, the standardization activity in MPEG-2 focused on the following considerations:

Extend the number of audio compression channels from 2 channels to 5.1 channels.
Add standardization support for interlaced video for broadcast applications.
Provide more standard profiles, beyond the Constrained Parameters Bitstream available in MPEG-1, in order to support higher-resolution video contents.
Extend support for color sampling from 4:2:0, to include 4:2:2 and 4:4:4.

For MPEG standards, the standards committee addressed video and audio compression, as well as system considerations for multiplexing the compressed audio-visual data. In MPEG-2 applications, the compressed video and audio elementary streams are multiplexed to construct a program stream; several program streams are packetized and combined to form a transport stream before transmission. However, in the following discussion, we will focus on MPEG-2 video compression.

MPEG-2 is targeted for a variety of applications at a bit rate of 2 Mbps or more, with a quality ranging from good-quality NTSC to HDTV. Although widely used as the format of digital television signal for terrestrial, cable, and direct-broadcast satellite TV systems, other typical applications include digital videocassette recorders (VCR), digital video discs (DVD), and the like. As a generic standard supporting a variety of applications generally ranging from 2 Mbps to 40 Mbps, MPEG-2 targets a compression ratio in the range of 30 to 40. To provide application independence, MPEG-2 supports a variety of video formats with resolutions ranging from source input format (SIF) to HDTV. Table 3-1 shows some typical video formats used in MPEG-2 applications.

Table 3-1. Typical MPEG-2 Paramters

Full size table

The aim of MPEG-2 is to provide better picture quality while keeping the provisions for random access to the coded bitstream. However, it is a rather difficult task to accomplish. Owing to the high compression demanded by the target bit rates, good picture quality cannot be achieved by intra-frame coding alone. Contrarily, the random-access requirement is best satisfied with pure intra-frame coding. This dilemma necessitates a delicate balance between the intra- and inter-picture coding. And this leads to the definition of I, P, and B pictures, similar to MPEG-1. I-frames are the least compressed, and contain approximately the full information of the picture in a quantized form in frequency domain, providing robustness against errors. The P-frames are predicted from past I- or P-frames, while the B-frames offer the greatest compression by using past and future I- or P-frames for motion compensation. However, B-frames are the most vulnerable to channel errors.

An MPEG-2 encoder first selects an appropriate spatial resolution for the signal, followed by a block-matching motion estimation to find the displacement of a macroblock (16×16 or 16×8 pixel area) in the current frame relative to a macroblock obtained from a previous or future reference frame, or from their average. The search for the best matching block is based on the mean absolute difference (MAD) distortion criterion; the best matching occurs when the accumulated absolute values of the pixel differences for all macroblocks are minimized. The motion estimation process then defines a motion vector representing the displacement of the current block’s location from the best matched block’s location. To reduce temporal redundancy, motion compensation is used both for causal prediction of the current picture from a previous reference picture and for non-causal, interpolative prediction from past and future reference pictures. The prediction of a picture is constructed based on the motion vectors.

To reduce spatial redundancy, the difference signal—that is, the prediction error—is further compressed using the block transform coding technique that employs the two-dimensional orthonormal 8×8 DCT to remove spatial correlation. The resulting transform coefficients are ordered in an alternating or zigzag scanning pattern before they are quantized in an irreversible process that discards less important information. In MPEG-2, adaptive quantization is used at the macroblock layer, allowing smooth bit-rate control and perceptually uniform video quality. Finally, the motion vectors are combined with the residual quantized coefficients, and are transmitted using variable-length Huffman codes. The Huffman coding tables are pre-determined and optimized for a limited range of compression ratios appropriate for some target applications. Figure 3-6 shows an MPEG-2 video encoding block diagram.

The bitstream syntax in MPEG-2 is divided into subsets known as profiles, which specify constraints on the syntax. Profiles are further divided into levels, which are sets of constraints imposed on parameters in the bitstream. There are five profiles defined in MPEG-2:

Main: Aims at the maximum quality of standard definition pictures.
Simple: Is directed to memory savings by not interpolating pictures.
SNR scalable: Aims to provide better signal-to-noise ratio on demand by using more than one layer of quantization.
Spatially scalable: Aims to provide variable resolution on demand by using additional layers of weighted and reconstructed reference pictures.
High: Intended to support 4:2:2 chroma format and full scalability.

Within each profile, up to four levels are defined:

Low: Provides compatibility with H.261 or MPEG-1.
Main: Corresponds to conventional TV.
High 1440: Roughly corresponds to HDTV, with 1,440 samples per line.
High: Roughly corresponds to HDTV, with 1,920 samples per line.

The Main profile, Main level (MP @ ML) reflects the initial focus of MPEG-2 with regard to entertainment applications. The permitted profile-level combinations are: Simple profile with Main level, Main profile with all levels, SNR scalable profile with Low and Main levels, Spatially scalable profile with High 1440 level, and High profile with all levels except Low level.

The bitstream syntax can also be divided as follows:

Non-scalable syntax: A super-set of MPEG-1, featuring extra compression tools for interlaced video signals along with variable bit rate, alternate scan, concealment motion vectors, intra-DCT format, and so on.
Scalable syntax: A base layer similar to the non-scalable syntax and one or more enhancement layers with the ability to enable the reconstruction of useful video.

The structure of the compressed bitstream is shown in Figure 3-7. The layers are similar to those of MPEG-1. A compressed video sequence starts with a sequence header containing picture resolutions, picture rate, bit rate, and so on. There is a sequence extension header in MPEG-2 containing video format, color primaries, display resolution, and so on. The sequence extension header may be followed by an optional GOP header having the time code, which is subsequently followed by a frame header containing temporal reference, frame type, video buffering verifier (VBV) delay, and so on. The frame header can be succeeded by a picture coding extension containing interlacing, DCT type and quantizer-scale type information, which is usually followed by a slice header to facilitate resynchronization. Inside a slice, several macroblocks are grouped together, where the macroblock address and type, motion vector, coded block pattern, and so on are placed before the actual VLC-coded quantized DCT coefficients for all the blocks in a macroblock. The slices can start at any macroblock location, and they are not restricted to the beginning of macroblock rows.

Since the MPEG-2 base layer is a super-set of MPEG-1, standard-compliant decoders can decode MPEG-1 bitstreams providing backward compatibility. Furthermore, MPEG-2 is capable of selecting the optimum mode for motion-compensated prediction, such that the current frame or field can be predicted either from the entire reference frame or from the top or bottom field of the reference frame, thereby finding a better relationship of the fields. MPEG-2 also adds the alternate scanning pattern, which suits interlaced video better than the zigzag scanning pattern. Besides, a choice is offered between linear and nonlinear quantization tables, and up to 11 bits DC precision is supported for intra macroblocks. These are improvements on MPEG-1, which does not support nonlinear quantization tables and provides only 8 bits of intra-DC precision. At the same bit rate, MPEG-2 yields better quality than MPEG-1, especially for interlaced video sources. Moreover, MPEG-2 is more flexible for parameter variation at a given bit rate, helping a smoother buffer control. However, these benefits and improvements come at the expense of increased complexity.

H.263

H.263 defined by ITU-T is aimed at low-bit-rate video coding but does not specify a constraint on video bit rate; such constraints are given by the terminal or the network. The objective of H.263 is to provide significantly better picture quality than its predecessor, H.261. Conceptually, H.263 is network independent and can be used for a wide range of applications, but its target applications are visual telephony and multimedia on low-bit-rate networks like PSTN, ISDN, and wireless networks. Some important considerations for H.263 include small overhead, low complexity resulting in low cost, interoperability with existing video communication standards (e.g., H.261, H.320), robustness to channel errors, and quality of service (QoS) parameters. Based on these considerations, an efficient algorithm is developed, which gives manufacturers the flexibility to make tradeoffs between picture quality and complexity. Compared to H.261, it provides the same subjective image quality at less than half the bit rate.

Similar to other standards, H.263 uses inter-picture prediction to reduce temporal redundancy and transform coding of the residual prediction error to reduce spatial redundancy. The transform coding is based on 8×8 DCT. The transformed signal is quantized with a scalar quantizer, and the resulting symbol is variable length coded before transmission. At the decoder, the received signal is inverse quantized and inverse transformed to reconstruct the prediction error signal, which is added to the prediction, thus creating the reconstructed picture. The reconstructed picture is stored in a frame buffer to serve as a reference for the prediction of the next picture. The encoder consists of an embedded decoder where the same decoding operation is performed to ensure the same reconstruction at both the encoder and the decoder.

H.263 supports five standard resolutions: sub-QCIF (128×96), QCIF (176×144), CIF (352×288), 4CIF (704×576), and 16CIF (1408×1152), covering a large range of spatial resolutions. Support for both sub-QCIF and QCIF formats in the decoder is mandatory, and either one of these formats must be supported by the encoder. This requirement is a compromise between high resolution and low cost.

A picture is divided into 16×16 macroblocks, consisting of four 8×8 luma blocks and two spatially aligned 8×8 chroma blocks. One or more macroblocks rows are combined into a group of blocks (GOB) to enable quick resynchronization in the event of transmission errors. Compared to H.261, the GOB structure is simplified; GOB headers are optional and may be used based on the tradeoff between error resilience and coding efficiency.

For improved inter-picture prediction, the H.263 decoder has a block motion compensation capability, while its use in the encoder is optional. One motion vector is transmitted per macroblock. Half-pel precision is used for motion compensation, in contrast to H.261, where full-pel precision and a loop filter is used. The motion vectors, together with the transform coefficients, are transmitted after variable-length coding. The bit rate of the coded video may be controlled by preprocessing or by varying the following encoder parameters: quantizer scale size, mode selections, and picture rate.

In addition to the core coding algorithm described above, H.263 includes four negotiable coding options, as mentioned below. The first three options are used to improve inter-picture prediction, while the fourth is related to lossless coding. The coding options increase the complexity of the encoder but improve picture quality, thereby allowing tradeoff between picture quality and complexity.

Unrestricted motion vector (UMV) mode: In the UMV mode, motion vectors are allowed to point outside the coded picture area, enabling a much better prediction, particularly when a reference macroblock is partly located outside the picture area and part of it is not available for prediction. Those unavailable pixels would normally be predicted using the edge pixels instead. However, this mode allows utilization of the complete reference macroblock, producing a gain in quality, especially for the smaller picture formats when there is motion near the picture boundaries. Note that, for the sub-QCIF format, about 50 percent of all the macroblocks are located at or near the boundary.
Advanced prediction (AP) mode: In this optional mode, the overlapping block motion compensation (OBMC) is used for luma, resulting in a reduction in blocking artifacts and improvement in subjective quality. For some macroblocks, four 8×8 motion vectors are used instead of a 16×16 vector, providing better prediction at the expense of more bits.
PB-frames (PB) mode: The principal purpose of the PB-frames mode is to increase the frame rate without significantly increasing the bit rate. A PB-frame consists of two pictures coded as one unit. The P-picture is predicted from the last decoded P-picture, and the B-picture is predicted both from the last and from the current P-pictures. Although the names “P-picture” and “B-picture” are adopted from MPEG, B-pictures in H.263 serve an entirely different purpose. The quality of the B-pictures is intentionally kept low, in particular to minimize the overhead of bi-directional prediction, while such overhead is important for low-bit-rate applications. B-pictures use only 15 to 20 percent of the allocated bit rate, but result in better subjective impression of smooth motion.
Syntax-based arithmetic coding (SAC) mode: H.263 is optimized for very low bit rates. As such, it allows the use of optional syntax-based arithmetic coding mode, which replaces the Huffman codes with arithmetic codes for variable-length coding. While Huffman codes must use an integral number of bits, arithmetic coding removes this restriction, thus producing a lossless coding with reduced bit rate.

The video bitstream of H.263 is arranged in a hierarchical structure composed of the following layers: picture layer, group of blocks layer, macroblock layer, and block layer. Each coded picture consists of a picture header followed by coded picture data arranged as group of blocks. Once the transmission of the pictures is completed, an end-of-sequence (EOS) code and, if needed, stuffing bits (ESTUF) are transmitted. There are some optional elements in the bitstream. For example, temporal reference of B-pictures (TRB) and the quantizer parameter (DBQUANT) are only available if the picture type (PTYPE) indicates a B-picture. For P-pictures, a quantizer parameter PQUANT is transmitted.

The GOB layer consists of a GOB header followed by the macroblock data. The first GOB header in each picture is skipped, while for other GOBs, a header is optional and is used based on available bandwidth. Group stuffing (GSTUF) may be necessary for a GOB start code (GBSC). Group number (GN), GOB frame ID (GFID), and GOB quantizer (GQUANT) can be present in the GOB header.

Each macroblock consists of a macroblock header followed by the coded block data. A coded macroblock is indicated by a flag called COD; for P-pictures, all the macroblocks are coded. A macroblock type and coded block pattern for chroma (MCBPC) are present when indicated by COD or when PTYPE indicates an I-picture. A macroblock mode for B-pictures (MODB) is present for non-intra macroblocks for PB-frames. The luma coded block pattern (CBPY), and the codes for the differential quanitizer (DQUANT) and motion vector data (MVD or MVD_2-4 for advanced prediction), may be present according to MCBPC. The CBP and motion vector data for B-blocks (CBPB and MVDB) are present only if the coding mode is B (MODB). As mentioned before, in the normal mode a macroblock consists of four luma and two chroma blocks; however, in PB-frames mode a macroblock can be thought of as containing 12 blocks. The block structure is made up of intra DC followed by the transform coefficients (TCOEF). For intra macroblocks, intra DC is sent for every P-block in the macroblock. Figure 3-8 shows the structure of various H.263 layers.

MPEG-4 (Part 2)

MPEG-4, formally the standard ISO/IEC 14496, was ratified by ISO/IEC in March 1999 as the standard for multimedia data representation and coding. In addition to video and audio coding and multiplexing, MPEG-4 addresses coding of various two- or three-dimensional synthetic media and flexible representation of audio-visual scene and composition. As the usage of multimedia developed and diversified, the scope of MPEG-4 was extended from its initial focus on very low bit-rate coding of limited audio-visual materials to encompass new multimedia functionalities.

Unlike pixel-based treatment of video in MPEG-1 or MPEG-2, MPEG-4 supports content-based communication, access, and manipulation of digital audio-visual objects, for real-time or non-real-time interactive or non-interactive applications. MPEG-4 offers extended functionalities and improves upon the coding efficiency provided by previous standards. For instance, it supports variable pixel depth, object-based transmission, and a variety of networks including wireless networks and the Internet. Multimedia authoring and editing capabilities are particularly attractive features of MPEG-4, with the promise of replacing existing word processors. In a sense, H.263 and MPEG-2 are embedded in MPEG-4, ensuring support for applications such as digital TV and videophone, while it is also used for web-based media streaming.

MPEG-4 distinguishes itself from earlier video coding standards in that it introduces object-based representation and coding methodology of real or virtual audio-visual (AV) objects. Each AV object has its local 3D+T coordinate system serving as a handle for the manipulation of time and space. Either the encoder or the end-user can place an AV object in a scene by specifying a co-ordinate transformation from the object’s local co-ordinate system into a common, global 3D+T co-ordinate system, known as the scene co-ordinate system. The composition feature of MPEG-4 makes it possible to perform bitstream editing and authoring in compressed domain.

One or more AV objects, including their spatio-temporal relationships, are transmitted from an encoder to a decoder. At the encoder, the AV objects are compressed, error-protected, multiplexed, and transmitted downstream. At the decoder, these objects are demultiplexed, error corrected, decompressed, composited, and presented to an end user. The end user is given an opportunity to interact with the presentation. Interaction information can be used locally or can be transmitted upstream to the encoder.

The transmitted stream can either be a control stream containing connection setup, the profile (subset of encoding tools), and class definition information, or be a data stream containing all other information. Control information is critical, and therefore it must be transmitted over reliable channels; but the data streams can be transmitted over various channels with different quality of service.

Part 2 of the standard deals with video compression. As the need to support various profiles and levels was growing, Part 10 of the standard was introduced to handle such demand, which soon became more important and commonplace in the industry than Part 2. However, MPEG-4 Part 10 can be considered an independent standardization effort as it does not provide backward compatibility with MPEG-4 Part 2. MPEG-4 Part 10, also known as advanced video coding (AVC), is discussed in the next section.

MPEG-4 Part 2 is an object-based hybrid natural and synthetic coding standard. (For simplicity, we will refer to MPEG-4 Part 2 simply as MPEG-4 in the following discussion.) The structure of the MPEG-4 video is hierarchical in nature. At the top layer is a video session (VS) composed of one or more video objects (VO). A VO may consist of one or more video object layers (VOL). Each VOL consists of an ordered time sequence of snapshots, called video object planes (VOP). The group of video object planes (GOV) layer is an optional layer between the VOL and the VOP layer. The bitstream can have any number of the GOV headers, and the frequency of the GOV header is an encoder issue. Since the GOV header indicates the absolute time, it may be used for random access and error-recovery purposes.

The video encoder is composed of a number of encoders and corresponding decoders, each dedicated to a separate video object. The reconstructed video objects are composited together and presented to the user. The user interaction with the objects such as scaling, dragging, and linking can be handled either in the encoder or in the decoder.

In order to describe arbitrarily shaped VOPs, MPEG-4 defines a VOP by means of a bounding rectangle called a VOP window. The video object is circumscribed by the tightest VOP window, such that a minimum number of image macroblocks are coded. Each VO consists of three main functions: shape coding, motion compensation, and texture coding. In the event of a rectangular VOP, the MPEG-4 encoder structure is similar to that of the MPEG-2 encoder, and shape coding can be skipped. Figure 3-9 shows the structure of a video object encoder.

The shape information of VOP is referred to as the alpha plane in MPEG-4. The alpha plane has the same format as the luma and its data indicates the characteristics of the relevant pixels, whether or not the pixels are within a video object. The shape coder compresses the alpha plane. Binary alpha planes are encoded by modified content-based arithmetic encoding (CAE), while gray-scale alpha planes are encoded by motion-compensated DCT, similar to texture coding. The macroblocks that lie completely outside the object (transperant macroblocks) are not processed for the motion or texture coding; therefore, no overhead is required to indicate this mode, since this transperancy information can be obtained from shape coding.

Motion estimation and compensation are used to reduce temporal redundancies. A padding technique is applied on the reference VOP that allows polygon matching instead of block matching for rectangular images. Padding methods aim at extending arbitrarily shaped image segments to a regular block grid by filling in the missing data corresponding to signal extrapolation such that common block-based coding techniques can be applied. In addition to the basic motion compensation technique, unrestricted motion compensation, advanced prediction mode, and bi-directional motion compensation are supported by MPEG-4 video to obtain a significant improvement in quality at the expense of very little increased complexity.

The intra and residual data after motion compensation of VOPs are coded using a block-based DCT scheme, similar to previous standards. Macroblocks that lie completely inside the VOP are coded using a technique identical to H.263; the region outside the VOP within the contour macroblocks (i.e., macroblocks with an object edge) can either be padded for regular DCT transformation or can use shape adaptive DCT (SA-DCT). Transperant blocks are skipped and are not coded in the bitstream.

MPEG-4 supports scalable coding of video objects in spatial and temporal domains, and provides error resilience across various media. Four major tools, namely video packet re-synchronization, data partitioning, header extension code, and reversible VLC, provide loss-resilience properties such as resynchronization, error detection, data recovery, and error concealment.

AVC

The Advanced Video Coding (AVC), also known as the ITU-T H.264 standard (ISO/IEC 14496-10), is currently the most common video compression format used in the industry for video recording and distribution. It is also known as MPEG-4 Part 10. The AVC standard was ratified in 2003 by the Joint Video Team (JVT) of the ITU-T Video Coding Experts Group (VCEG) and ISO/IEC Moving Picture Experts Group (MPEG) standardization organizations. One of the reasons the AVC standard is so well known is that it is one of the three compression standards for Blu-ray (the others being MPEG-2 and VC-1), and it is also widely used by Internet streaming applications like YouTube and iTunes, software applications like Flash Player, software frameworks like Silverlight, and various HDTV broadcasts over terrestrial, cable, and satellite channels.

The AVC video coding standard has the same basic functional elements as previous standards MPEG-4 Part 2, MPEG-2, H.263, MPEG-1, and H.261. It uses a lossy predictive, block-based hybrid DPCM coding technique. This involves transform for reduction of spatial correlation, quantization for bit-rate control, motion-compensated prediction for reduction of temporal correlation, and entropy encoding for reduction of statistical correlation. However, with a goal of achieving better coding performance than previous standards, AVC incorporates changes in the details of each functional element by including in-picture prediction, a new 4×4 transform, multiple reference pictures, variable block sizes, and a quarter-pel precision for motion compensation, a deblocking filter, and improved entropy coding. AVC also introduces coding concepts such as generalized B-slices, which supports not only bidirectional forward-backward prediction pair but also forward-forward and backward-backward prediction pairs. There are several other tools, including direct modes and weighted prediction, defined by AVC to obtain a very good prediction of the source signal so that the error signal has a minimum energy. These tools help AVC perform significantly better than prior standards for a variety of applications. For example, compared to MPEG-2, AVC typically obtains the same quality at half the bit rate, especially for high-resolution contents coded at high bit rates.

However, the improved coding efficiency comes at the expense of additional complexity to the encoder and decoder. So, to compensate, AVC utilizes some methods to reduce the implementation complexity—for example, multiplier-free integer transform is introduced where multiplication operations for the transform and quantization are combined. Further, to facilitate applications on noisy channel conditions and error-prone environments such as the wireless networks, AVC utilizes some methods to exploit error resilience to network noise. These include flexible macroblock ordering (FMO), switched slice, redundant slice methods, and data partitioning.

The coded AVC bitstream has two layers, the network abstraction layer (NAL) and video coding layer (VCL). The NAL abstracts the VCL data to help transmission on a variety of communication channels or storage media. A NAL unit specifies both byte-stream and packet-based formats. The byte-stream format defines unique start codes for the applications that deliver the NAL unit stream as an ordered stream of bytes or bits, encapsulated in network packets such as MPEG-2 transport streams. Previous standards contained header information about slice, picture, and sequence at the start of each element, where loss of these critical elements in a lossy environment would render the rest of the element data useless. AVC resolves this problem by keeping the sequence and picture parameter settings in the non-VCL NAL units that are transmitted with greater error-protection. The VCL unit contains the core video coded data, consisting of video sequence, picture, slice, and macroblock.

Profile and Level

A profile is a set of features of the coding algorithm that are identified to meet certain requirements of the applications. This means that some features of the coding algorithm are not supported in some profiles. The standard defines 21 sets of capabilities, targeting specific classes of applications.

For non-scalable two-dimensional video applications, the following are the important profiles:

Constrained Baseline Profile: Aimed at low-cost mobile and video communication applications, the Constrained Baseline Profile uses the subset of features that are in common with the Baseline, Main, and High Profiles.
Baseline Profile: This profile is targeted for low-cost applications that require additional error resiliency. As such, on top of the features supported in the Constrained Baseline Profile, it has three features for enhanced robustness. However, in practice, Constrained Baseline Profile is more commonly used than Baseline Profile. The bitstreams for these two profiles share the same profile identifier code value.
Extended Profile: This is intended for video streaming. It has higher compression capability and more robustness than Baseline Profile, and it supports server stream switching.
Main Profile: Main profile is used for standard-definition digital TV broadcasts, but not for HDTV broadcasts, for which High Profile is primarily used.
High Profile: It is the principal profie for HDTV broadcast and for disc storage, such as the Blu-ray Disc storage format.
Progressive High Profile: This profile is similar to High profile, except that it does not support the field coding tools. It is intended for applications and displays using progressive scanned video.
High 10 Profile: Mainly for premium contents with 10-bit per sample decoded picture precision, this profile adds 10-bit precision support to the High Profile.
High 4:2:2 Profile: This profile is aimed at professional applications that use interlaced video. On top of the High 10 Profile, it adds support for the 4:2:2 chroma subsampling format.
High 4:4:4 Predictive Profile: Further to the High 4:2:2 Profile, this profile supports up to 4:4:4 chroma sampling and up to 14 bits per sample precision. It additionally supports lossless region coding and the coding of each picture as three separate color planes.

In addition to the above profiles, the Scalable Video Coding (SVC) extension defines five more scalable profiles: Scalable Constrained Baseline, Scalable Baseline, Scalable High, Scalable Constrained High, and Scalable High Intra profiles. Also, the Multi-View Coding (MVC) extension adds three more profiles for three-dimensional video—namely Stereo High, Multiview High, and Multiview Depth High profiles. Furthermore, four intra-frame-only profiles are defined for professional editing applications: High 10 Intra, High 4:2:2 Intra, High 4:4:4 Intra, and CAVLC 4:4:4 Intra profiles.

Levels are constraints that specify the degree of decoder performance needed for a profile; for example, a level designates the maximum picture resolution, bit rate, frame rate, and so on that the decoder must adhere to within a profile. Table 3-2 shows some examples of level restrictions; for full description, see the standard specification.^{Footnote 2}

Table 3-2. Examples of Level Restrictions in AVC

Full size table

Picture Structure

The video sequence has frame pictures or field pictures. The pictures usually comprise three sample arrays, one luma and two chroma sample arrays (RGB arrays are supported in High 4:4:4 Profile only). AVC supports either progressive-scan or interlaced-scan, which may be mixed in the same sequence. Baseline Profile is limited to progressive scan.

Pictures are divided into slices. A slice is a sequence of a flexible number of macroblocks within a picture. Multiple slices can form slice groups; there is macroblock to slice group mapping to determine which slice group includes a particular macroblock. In the 4:2:0 format, each macroblock has one 16×16 luma and two 8×8 chroma sample arrays, while in the 4:2:2 and 4:4:4 formats, the chroma sample arrays are 8 ×16 and 16 ×16, respectively. A picture may be partitioned into 16×16 or smaller partitions with various shapes such as 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4. These partitions are used for prediction purposes. Figure 3-10 shows the different partitions.

Coding Algorithm

In the AVC algorithm, the encoder may select between intra and inter coding for various partitions of each picture. Intra coding (I) provides random access points in the bitstream where decoding can begin and continue correctly. Intra coding uses various spatial prediction modes to reduce spatial redundancy within a picture. In addition, AVC defines inter coding that uses motion vectors for block-based inter-picture prediction to reduce temporal redundancy. Inter coding are of two types: predictive (P) and bi-predictive (B). Inter coding is more efficient as it uses inter prediction of each block of pixels relative to some previously decoded pictures. Prediction is obtained from a deblocked version of previously reconstructed pictures that are used as references for the prediction. The deblocking filter is used in order to reduce the blocking artifacts at the block boundaries. Motion vectors and intra prediction modes may be specified for a variety of block sizes in the picture. Further compression is achieved by applying a transform to the prediction residual to remove spatial correlation in the block before it is quantized. The intra prediction modes, the motion vectors, and the quantized transform coefficient information are encoded using an entropy code such as context-adaptive variable length codes (CAVLC) or context adaptive binary arithmetic codes (CABAC). A block diagram of the AVC coding algorithm, showing the encoder and decoder blocks, is presented in Figure 3-11.

Intra Prediction

In the previous standards, intra-coded macroblocks were coded independently without any reference, and it was necessary to use intra macroblocks whenever a good prediction was not available for a predicted macroblock. As intra macroblocks use more bits than the predicted ones, this was often less efficient for compression. To alleviate this drawback—that is, to reduce the number of bits needed to code an intra picture—AVC introduced intra prediction, whereby a prediction block is formed based on previously reconstructed blocks belonging to the same picture. Fewer bits are needed to code the residual signal between the current and the predicted blocks, compared to the coding the current block itself.

The size of an intra prediction block for the luma samples may be 4×4, 8×8, or 16×16. There are several intra prediction modes, out of which one mode is selected and coded in the bitstream. AVC defines a total of nine intra prediction modes for 4×4 and 8×8 luma blocks, four modes for a 16×16 luma block, and four modes for each chroma block. Figure 3-12 shows an example of intra prediction modes for a 4×4 block. In this example, [a, b, . . ., p] are the predicted samples of the current block, which are predicted from already decoded left and above blocks with samples [A, B, . . ., M]; the arrows show the direction of the prediction, with each direction indicated as a intra prediction mode in the coded bitstream. For mode 0 (vertical), the prediction is formed by extrapolation

of the samples above, namely [A, B, C, D]. Similarly for mode 1 (horizontal), left samples [I, J, K, L] are extrapolated. For mode 2 (DC prediction), the average of the above and left samples are used as the prediction. For mode 3 (diagonal down left), mode 4 (diagonal down right), mode 5 (vertical right), mode 6 (horizontal down), mode 7 (vertical left), and mode 8 (horizontal up), the predicted samples are formed from a weighted average of the prediction samples A—M.

Inter Prediction

Inter prediction reduces temporal correlation by using motion estimation and compensation. As mentioned before, AVC partitions the picture into several shapes from 16×16 down to 4×4 for such predictions. The motion compensation results in reduced information in the residual signal, although for the smaller partitions, an overhead of bits is incurred for motion vectors and for signaling the partition type.

Intra prediction can be applied to blocks as small as 4×4 luma samples with up to a quarter-pixel (a.k.a. quarter-pel) motion vector accuracy. Sub-pel motion compensation gives better compression efficiency than using integer-pel alone; while quarter-pel is better than half-pel, it involves more complex computation. For luma, the half-pel samples are generated first and are interpolated from neighboring integer-pel samples using a six-tap finite impulse response (FIR) filter with weights (1, -5, 20, 20, -5, 1)/32. With the half-pel samples available, quarter-pel samples are produced using bilinear interpolation between neighboring half- or integer-pel samples. For 4:2:0 chroma, eighth-pel samples correspond to quarter-pel luma, and are obtained from linear interpolation of integer-pel chroma samples. Sub-pel motion vectors are differentially coded relative to predictions from neighboring motion vectors. Figure 3-13 shows the location of sub-pel predictions relative to full-pel.

For inter prediction, reference pictures can be used from a list of previously reconstructed pictures, which are stored in the picture buffer. The distance of a reference picture from the current picture in display order determines whether it is a short-term or a long-term reference. Long-term references help increase the motion search range by using multiple decoded pictures. As a limited size of picture buffer is used, some pictures may be marked as unused for reference, and may be deleted from the reference list in a controlled manner to keep the memory size practical.

Transform and Quantization

The AVC algorithm uses block-based transform for spatial redundancy removal, as the residual signal from intra or inter prediction is divided into 4×4 or 8×8 (High profile only) blocks, which are converted to transform domain before they are quantized. The use of 4×4 integer transform in AVC results in reduced ringing artifacts compared to those produced by previous standards using fixed 8×8 DCT. Also, multiplications are not necessary at this smaller size. AVC introduced the concept of hierarchical transform structure, in which the DC components of neighboring 4×4 luma transforms are grouped together to form a 4×4 block, which is transformed again using a Hadamard transform for further improvement in compression efficiency.

Both 4×4 and 8×8 transforms in AVC are integer transforms based on DCT. The integer transform, post-scaling, and quantization are grouped together in the encoder, while in the decoder the sequence is inverse quantization, pre-scaling, and inverse integer transform. For a deeper understanding of the process, consider the matrix H below.

A 4×4 DCT can be done using this matrix and the formula: X = HFH ^T , where H ^T is the transpose of the matrix H, F is the input 4×4 data block, and X is the resulting 4×4 transformed block. For DCT, the variables a, b, and c are as follows:

The AVC algorithm simplifies these coefficients with approximations, and still maintains orthogonaity property by using:

Further simplification is made to avoid multiplication by combining the transform with the quantization step, using a scaled transform X = ĤFĤ ^T ⊗ SF, where,

and SF is a 4×4 matrix representing the scaling factors needed for orthonormality, and ⊗ represents element-by-element multiplication. The transformed and quantized signal Y with components Yi,j is obtained by appropriate quantization using one of the 52 available quantizer levels (a.k.a. quantization step size, Qstep) as follows:

In the decoder, the received signal Y is scaled with Qstep and SF as the inverse quantization and a part of inverse transform to obtain the inverse transformed block X′ with components X′i ,j:

The 4×4 reconstructed block is: where the integer inverse transform matrix is given by:

In addition, in the hierarchical transform approach for 16×16 intra mode, the 4×4 luma intra DC coefficients are further transformed using a Hadamard transform:

In 4:2:0 color sampling, for the chroma DC coefficients, the transform matrix is as follows:

In order to increase the compression gain provided by run-length coding, two scanning orders are defined to arrange the quantized coefficients before entropy coding, namely zigzag scan and field scan, as shown in Figure 3-14. While zigzag scanning is suitable for progressively scanned sources, alternate field scanning helps interlaced contents.

Entropy Coding

Earlier standards provided entropy coding using fixed tables of variable-length codes, where the tables were predefined by the standard based on the probability distributions of a set of generic videos. It was not possible to optimize those Huffman tables for specific video sources. In contrast, AVC uses different VLCs to find a more appropriate code for each source symbol based on the context characteristics. Syntax elements other than the residual data are encoded using the Exponential-Golomb codes. The residual data is rearranged through zigzag or alternate field scanning, and then coded using context-adaptive variable length codes (CAVLC) or, optionally for Main and High profiles, using context-adaptive binary arithmetic codes (CABAC). Compared to CAVLC, CABAC provides higher coding efficiency at the expense of greater complexity.

CABAC uses an adaptive binary arithmetic coder, which updates the probability estimation after coding each symbol, and thus adapts to the context. The CABAC entropy coding has three main steps:

Binarization: Before arithmetic coding, a non-binary symbol such as a transform coefficient or motion vector is uniquely mapped to a binary sequence. This mapping is similar to converting a data symbol into a variable-length code, but in this case the binary code is further encoded by the arithmetic coder prior to transmission.
Context modeling: A probability model for the binarized symbol, called the context model, is selected based on previously encoded syntax element.
Binary arithmetic coding: In this step, an arithmetic coder encodes each element according to the selected context model, and subsequently updates the model.

Flexible Interlaced Coding

In order to provide enhanced interlaced coding capabilities, AVC supports macroblock-adaptive frame-field (MBAFF) coding and picture-adaptive frame-field (PAFF) coding techniques. In MBAFF, a macroblock pair structure is used for pictures coded as frames, allowing 16×16 macroblocks in field mode. This is in contrast to MPEG-2, where field mode processing in a frame-coded picture could only support 16×8 half-macroblocks. In case of PAFF, it is allowed to mix pictures coded as complete frames with combined fields with those coded as individual single fields.

In-Loop Deblocking

Visible and annoying blocking artifacts are produced owing to block-based transform in intra and inter prediction coding, and the quantization of the transform coefficients, especially for higher quantization scales. In an effort to mitigate such artifacts at the block boundaries, AVC provides deblocking filters, which also prevents propagation of accumulated coded noise.

A deblocking filter is not new; it was introduced in H.261 as an optional tool, and had some success in reducing temporal propagation of coded noise, as integer-pel accuracy in motion compensation alone was insufficient in reducing such noise. However, in MPEG-1 and MPEG-2, a deblocking filter was not used owing to its high complexity. Instead, the half-pel accurate motion compensation, where the half-pels were obtained by bilinear filtering of integer-pel samples, played the role of smoothing out the coded noise.

However, despite the complexity, AVC uses a deblocking filter to obtain higher coding efficiency. As it is part of the prediction loop, with the removal of the blocking artifacts from the predicted pictures, a much closer prediction is obtained, leading to a reduced-energy error signal. The deblocking filter is applied to horizontal or vertical edges of 4×4 blocks. The luma filtering is performed on four 16-sample edges, and the chroma filtering is performed on two 8-sample edges. Figure 3-15 shows the deblocking boundaries.

Error Resilience

AVC provides features for enhanced resilience to channel errors, which include NAL units, redundant slices, data partitioning, flexible macroblock ordering, and so on. Some of these features are as follows:

Network Abstraction Layer (NAL): By defining NAL units, AVC allows the same video syntax to be used in many network environments. In previous standards, header information was part of a syntax element, thereby exposing the entire syntax element to be rendered useless in case of erroneous reception of a single packet containing the header. In contrast, in AVC, self-contained packets are generated by decoupling information relevant to more than one slice from the media stream. The high-level crucial parameters, namely the Sequence Parameter Set (SPS) and Picture Parameter Set (PPS), are kept in NAL units with a higher level of error protection. An active SPS remains unchanged throughout a coded video sequence, and an active PPS remains unchanged within a coded picture.
Flexible macroblock ordering (FMO): FMO is also known as slice groups. Along with arbitrary slice ordering (ASO), this technique re-orders the macroblocks in pictures, so that losing a packet does not affect the entire picture. Missing macroblocks can be regenerated by interpolating from neighboring reconstructed macroblocks.
Data partitioning (DP): This is a feature providing the ability to separate syntax elements according to their importance into different packets of data. It enables the application to have unequal error protection (UEP).
Redundant slices (RS): This is an error-resilience feature in AVC that allows an encoder to send an extra representation of slice data, typically at lower fidelity. In case the primary slice is lost or corrupted by channel error, this representation can be used instead.

HEVC

The High Efficiency Video Coding (HEVC), or the H.265 standard (ISO/IEC 23008-2), is the most recent joint video coding standard ratified in 2013 by the ITU-T Video Coding Experts Group (VCEG) and ISO/IEC Moving Picture Experts Group (MPEG) standardization organizations. It follows the earlier standard known as AVC or H.264, also defined by the same MPEG and VCEG Joint Collaborative Team on Video Coding (JCT-VC), with a goal of addressing the growing popularity of ever higher resolution videos, high-definition (HD, 1920 × 1080), ultra-high definition (UHD, 4k × 2k), and beyond. In particular, HEVC addresses two key issues: increased video resolution and increased use of parallel processing architectures. As such, HEVC algorithm has a design target of achieving twice the compression efficiency achievable by AVC.

Picture Parititioning and Structure

In earlier standards, macroblocks were the basic coding building block, which contains a 16×16 luma block, and typically two 8×8 chroma blocks for 4:2:0 color sampling. In HEVC, the analogous structure is the coding tree unit (CTU), also known as the largest coding unit (LCU), containing a luma coding tree block (CTB), corresponding chroma CTBs, and syntax elements. In a CTU, the luma block size can be 16×16, 32×32, or 64×64, specified in the bitstream sequence parameter set. CTUs can be further partitioned into smaller square blocks using a tree structure and quad-tree signaling.

The quad-tree specifies the coding units (CU), which forms the basis for both prediction and transform. The coding units in a coding tree block are traversed and encoded in Z-order. Figure 3-16 shows an example of ordering in a 64×64 CTB.

A coding unit has one luma and two chroma coding blocks (CB), which can be further split in size and can be predicted from corresponding prediction blocks (PB), depending on the prediction type. HEVC supports variable PB sizes, ranging from 64×64 down to 4×4 samples. The prediction residual is coded using the transform unit (TU) tree structure. The luma or chroma coding block residual may be identical to the corresponding transform block (TB) or may be further split into smaller transform blocks. Transform blocks can only have square sizes 4×4, 8×8, 16×16, and 32×32. For the 4×4 transform of intra-picture prediction residuals, in addition to the regular DCT-based integer transform, an integer transform based on a form of discrete sine transform (DST) is also specified as an alternative. This quad-tree structure is generally considered the biggest contributor for the coding efficiency gain of HEVC over AVC.

HEVC simplies coding and does not support any interlaced tool, as interlaced scanning is no longer used in displays and as interlaced video is becoming substantially less common for distribution. However, interlaced video can still be coded as a sequence of field pictures. Metadata syntax is available in HEVC to allow encoders to indicate that interlace-scanned video has been sent by coding one of the following:

Each field (i.e., the even or odd numbered lines of each video frame) of interlaced video as a separate picture
Each interlaced frame as an HEVC coded picture

This provides an efficient method of coding interlaced video without inconveniencing the decoders with a need to support a special decoding process for it.

Profiles and Levels

There are three profiles defined by HEVC: the Main profile, the Main 10 profile, and the Still picture profile, of which currently Main is the most commonly used. It requires 4:2:0 color format and imposes a few restrictions; for instance, bit depth should be 8, groups of LCUs forming rectangular tiles must be at least 256×64, and so on (tiles are elaborated later in this chapter in regard to parallel processing tools). Many levels are specified, ranging from 1 to 6.2. A Level-6.2 bitstream could support as large a resolution as 8192×4320 at 120 fps.^{Footnote 3}