1 Introduction

Video quality assessment has been an important topic for the last decades, notably in the television broadcasting domain. The continuously growing coverage and increasing transmission bit-rates of residential Internet connectivity have led not only to the expected growth of IPTV services and consumption but also to an unexpected number of added-value services, both from commercial providers, such as downloadable video content and video-on-demand offers, and from customer-to-customer offers such as video sharing platforms.

Customer satisfaction, notably in terms of transmitted video quality, is the key to the success of these services. Topics which previously had limited application are now gaining more interest. One example in which automated measurement of perceived video quality plays an important role is inter-channel bitrate allocation. It is well known that fast-moving content such as sports requires higher bitrates than slow content, such as a single-person news presentation, in order to provide comparable perceived quality. On satellite channels and in Digital Video Broadcasting (DVB), often only a limited number of content channels could be bundled, two to eight being a typical limitation, and the combined bit-rate was fixed. Splitting the bit-rate unequally between the various programs was therefore limited to quantized values that did not vary significantly over time. This is different when the Internet is used as transmission channel. Each household with its DSL subscriber line may demand several video streams from different server locations. Head-end stations aggregate several subscriber lines, and the combined traffic volume may be limited. In a spatially local area, several hundred video bit-streams may be streamed at the same time, allowing for and benefitting from a high temporal flexibility in individual bitrate allocation. In the ideal case, instead of randomly dropping packets, all subscribers receive the desired Quality of Experience (QoE) for their service, which is continuously monitored by automated measurements that take into consideration the transmitted bitstream features and the decoded video’s quality.

The objective measurement of QoE in the network or at the receiving side is the key aspect of this scenario. QoE is often used with ambiguous meaning. In this paper, we understand QoE as it has recently been defined: “the degree of delight or annoyance of the user of an application or service. It results from the fulfillment of his or her expectations with respect to the utility and/or enjoyment of the application or service in the light of the user’s personality and current state” [35].

This paper focuses on the prediction of one aspect of QoE: perceived video quality. It has important interactions with other aspects of QoE such as audio, environmental setup, or interactivity which are outside of the scope of this paper.

Objective video quality models can be divided into different categories depending on whether they use reference information or not, as well as on whether they use bit-stream and/or video data. Historically, as most video quality models worked only on the video data, the terms Full-Reference (FR), Reduced-Reference (RR) and No-Reference (NR) were used (and are still used) to describe the amount of information available from the reference video. In this case, the reference video corresponds to a high-quality uncompressed version of the video. FR models require access to the complete reference or original video. An RR model, on the other hand, extracts key information from both the reference and the degraded video and compares these. As such, only a reduced amount of data needs to be transferred to either the sending point or the receiving point in order to compare the videos and predict perceived quality. An NR model works only on the degraded video and does not require a reference; such models are also sometimes called zero-reference models.

As per the ITU standardization activities, objective quality measurement methods have been classified into the following five main categories depending on the type of input data that is used for quality assessment: media-layer models, parametric packet-layer models, parametric planning models, bitstream-layer models and hybrid models [13, 55]. Parametric packet-layer models inspect only packet headers, e.g., of IP, Real-time Transport Protocol (RTP) or User Datagram Protocol (UDP) packets, and make quality predictions from this information [25]. Such models are particularly interesting when video traffic is encrypted, or in the case of Digital Rights Management (DRM). Bit-stream models go further and also analyze the encoded (impaired) bit-stream itself, extracting the information needed to make a quality prediction without fully decoding and processing the actual video [26]. Hybrid models combine video pixel information with bit-stream information, possibly also fully decoding the video payload.

Media-layer models use the speech or video signal to compute the Quality of Experience (QoE). These models do not require any information about the system under test. Parametric planning models make use of quality planning parameters for networks and terminals to predict the QoE; as a result, they require a priori knowledge about the system that is being tested. In [13] the media-layer models are classified into traditional point-based metrics (e.g., MSE and PSNR), Natural Visual Characteristics oriented metrics, and Perceptual (HVS) oriented metrics. The Natural Visual Characteristics metrics are further classified into Natural Visual Statistics and Natural Visual Features based methods. Similarly, the HVS methods are further classified into DCT domain, DWT domain and pixel domain models. The objective methods can also be classified, in terms of their usability in the context of adaptive streaming solutions, into out-of-service methods and in-service methods. In the out-of-service methods, no time constraints are imposed and the original sequence can be available.

We can generalize the concept of FR, RR and NR by also including packet-header models, bit-stream models and hybrid models together with the pure video-based models, providing a classification based on the amount of reference information used by the models; a model may, for example, be identified as a Hybrid Bitstream-NR model.

The usefulness and applicability of these models differ considerably. FR models have been considered the most accurate, although the VQEG Phase II test showed that RR models can be at least as accurate [61]. FR and RR models can therefore be used for offline tuning of encoders and for comparisons between them. They can also be useful for generating training data for the development of NR models [11]. RR models are often argued to be an alternative for quality monitoring purposes in the network, but in practice they are not. The reason is that the reference information processed at the sender side to generate the auxiliary RR information must in most cases be based on high-quality, preferably uncompressed, versions of the videos. In practice, the network provider does not possess such a version, as the content provider sends a coded video signal for distribution. This version of the video is often of rather low quality in itself, and most RR models are not designed to compare against a low-quality reference.

Research on Hybrid-NR perceptual measurements faces two main categories of problems. The first category relates to technical details, such as capturing a transmitted video bit-stream, parsing the information from the current complex video compression standards, accessing the decoded video’s pixel data, etc. The second category relates to the complexity of algorithmically modeling the Human Visual System (HVS). This comprises prominent open questions and fundamental research topics such as contrast sensitivity, spatial and motion masking, visual saliency, etc.

The first category can be tackled by centralizing the effort and maintaining technical tools for the research community in a collaborative effort. This has recently been established by the Joint Effort Group (JEG) of the Video Quality Experts Group (VQEG), notably in their Hybrid project [52, 60]. Within JEG-Hybrid, a number of tools are made freely available for, amongst others, automatic monitoring and gathering of information at the video and network level. All information is gathered in structured XML files, which enable easy processing of the captured data. Furthermore, instead of having to parse the received encoded video bit-streams, JEG-Hybrid proposed the use of an XML-based file structure for representing the content of the (impaired) video bit-stream. These Hybrid Input XML files (HMIX files) contain the information in a human-readable format and enable quick and efficient processing of the video without having to write a complete parser. All software tools can be downloaded from VQEG’s Tools and Subjective Labs Setup website [54].

The second category, the development of algorithmic prediction of the behavior of the HVS, is the subject of this paper. In most publications, either an isolated aspect of the HVS is analyzed in detail, or a (complex) measurement algorithm is presented and verified on a video database annotated with subjective votes. The first approach lacks verification on general content; the second approach fails to prove that the HVS has been modeled successfully.

A recent review of objective video quality measurement algorithms has been published in [13]. The authors proposed a classification of the general methodologies for predicting video quality into categories such as natural visual statistics or frequency-domain HVS-based approaches. They also compared the performance of several models on a subjective database.

The approach of this paper is to first identify typical perceptual degradations that may be annotated by naive human observers, for example in Open Profiling approaches [53]. Publications tackling these isolated perceptual degradations will be referenced. In a second step, existing published (complex) models are analyzed with respect to their capability of predicting these perceptual artifacts.

The paper is limited in the sense that it cannot and will not cover either aspect exhaustively. It shall provide a non-exhaustive list of algorithms that have previously been used to measure isolated degradations, in order to stimulate comparison. It shall further show which types of degradation measurements have previously been used in combined measurement algorithms. A continuous update of this list is expected to take place within the VQEG JEG group. The goal of this paper is to help identify the artifacts and their measurement algorithms that may form the advantage of a Hybrid-NR measurement, which has access to both bit-stream measurement indicators and pixel-based indicators.

The paper is organized as follows. In Section 2, the development of objective video quality prediction algorithms is introduced. For one of the identified components, the perceptual features and their algorithmic indicators, existing methods are reviewed in Section 3. Complex algorithms for video quality measurement are reviewed in Section 4 with particular emphasis on their indicators. Section 5 proposes methods to identify the scope of the indicators and suggests optimization criteria. Currently ongoing collaborative efforts are documented in Section 6 before the paper is concluded in Section 7.

2 Development of objective video quality prediction

The structure of a video quality measurement algorithm is often a mixture of different concepts. The concepts originate from different methods to tackle the algorithmic prediction process.

The first method consists of mimicking the processing steps of the HVS. This is mostly applicable in the pixel domain. Typically, the reconstruction of the visual scene sent from the sender’s side, as performed by the display device with all its limitations, is modeled first. The required steps include knowledge about the display device properties such as size, maximum luminance, and gamma curve, the position of the observer relative to the display device, and environmental factors such as surrounding light sources that may lead to reflections from the screen and that may change the observer’s pupil size. The technical background may be found in [42, 45, 47] and a practical example of the required steps is provided in [57]. Depending on the targeted precision of the model, the optics of the eye are taken into account, and biomedical modeling of retinal processes is considered. Further modeling of higher-level processes may be taken into consideration provided that corresponding models exist. Most of these models are based on an explanatory theory of the outcome of several psychophysical experiments. The interested reader is referred to [8, 22]. Due to the complexity of the HVS and the close interaction with scene interpretation from memory, the processing usually only covers the first few steps of perception. This method is most efficient for degradations which are close to perception thresholds; normalizing the video input data to the human detection threshold may also improve performance for supra-threshold, i.e., immediately visible, degradations [40].

The second approach also takes into consideration isolated aspects that are recognizable by the human observer but does not aim at a constructive reproduction of the HVS’s processing chain. Instead, psychophysical experiments are conducted in order to model the relationship to the quality prediction for a complex but isolated aspect. One example is the required frame rate for video transmission. Psychophysical experiments on the contrast sensitivity threshold using sinusoidal stimuli have been conducted in [15]. The appearance of natural video sequences and, more importantly, the influence on perceived quality may be different. Therefore, several experiments have been conducted with natural video sequences in order to evaluate the quality degradation. This led to algorithms that may predict this particular perceptual degradation, which also depends on the video resolution, as will be explained later in Section 5. This approach is usually taken when the artifacts are too complex to integrate into the HVS’s processing chain and require interpretation by human observers.

For measuring such aspects, two different situations arise when estimating the impact on the human observer. In the first, the required technical parameter is known a priori or can be extracted easily from the received data, such as the spatial resolution. The second situation occurs when the parameter is not directly accessible but needs to be estimated either from bit-stream data or from the decoded video sequence. Examples of this type of analysis are blockiness and blurriness, which can either be measured using a spatial frequency analysis of the decoded video or be estimated using the QP value and the distribution of the residual Discrete Cosine Transform (DCT) coefficients.

A later step in the development of a video quality measurement algorithm is to take into consideration the interaction of different artifacts. This includes training on specifically designed subjective experiments that use natural content in high quality as reference and that perform processing steps typical for the planned application area of the objective quality prediction. The combinations of different artifacts that result from these processing steps are then evaluated in a subjective experiment to measure the impact of the complex combination of artifacts.

The typical development cycle of a measurement method is depicted in Fig. 1. The top of the diagram contains the subjective assessment experiments that are required for developing a perceptual measurement. On the left-hand side, the isolated subjective experiments for individual quality degradations are shown with three steps: isolating a psychophysical effect such as contrast sensitivity, designing appropriate stimuli, such as Gabor patterns, and finally conducting the subjective experiment using an appropriate setup and methodology. Commonly used subjective assessment methodologies have been standardized in International Telecommunication Union (ITU) Recommendations ITU-T P.910 and P.911, and ITU-R BT.500. On the right-hand side, the creation of the datasets for combined artifact measurement is detailed as consisting of the selection of the Source Reference Sequences (SRCs), the choice of the Hypothetical Reference Circuits (HRCs), such as a specific video encoder and transmission chain, and finally the subjective experiment, which may require a different setup and methodology.

Fig. 1 Development steps for an objective video quality measurement algorithm

Each of these experiments usually leads to three outcomes. The first and most immediate result is the video data that was shown to the observers, mostly in the form of processed video sequences (PVSs) and possibly bit-stream data. The second result is the set of votes that were given by the observers on these sequences. Finally, the third outcome is the analysis of the votes with respect to the degradations present in the video files, which may be modeled and implemented in an algorithm to automatically predict the outcome of the subjective experiment. Often, verification is also performed, showing the performance on the same data as was used for development and training. Validation, i.e. measuring performance on a different data set, preferably created by a different process and/or research group, is seldom performed.

The lower part of the diagram shows the processing performed in an algorithm, starting with individual indicators that may interact, in terms of weighting, during the spatial and temporal pooling stage. A typical example is visual saliency, which may be employed for learning about the size of the most interesting object shown on the screen but which may also be used to weight degradations with information on relative spatial importance.
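
As an illustration of such weighting, a minimal sketch of saliency-weighted spatial pooling is given below. The per-pixel distortion map, the saliency map, and the normalization are assumptions made for this sketch and are not taken from any specific published model.

```python
import numpy as np

def saliency_weighted_pooling(distortion_map, saliency_map, eps=1e-8):
    """Pool a per-pixel distortion map into one frame-level score, giving
    more weight to visually salient regions (illustrative sketch only)."""
    weights = saliency_map / (saliency_map.sum() + eps)  # normalize saliency to sum to 1
    return float((weights * distortion_map).sum())        # saliency-weighted average distortion
```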

It should be noted that the conceptual diagram representation may differ from the implementation. In particular, integrating required and possibly shared preprocessing steps, such as color space conversions, into each indicator allows for easier modularization. In implementations, this redundancy may be removed for speed gains; for example, caching structures in object-oriented frameworks may be employed.
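
A minimal sketch of such a caching structure is shown below, assuming a simple Python framework in which each indicator requests shared preprocessing results (here a luminance plane) from a per-frame cache; the class and method names are illustrative.

```python
import numpy as np

class FrameCache:
    """Shares preprocessing results (e.g., a color space conversion) between
    indicators so that each conversion is computed only once per frame."""
    def __init__(self, rgb_frame):
        self.rgb = rgb_frame
        self._cache = {}

    def luma(self):
        # Rec. 601 luma conversion, computed lazily and reused by all indicators.
        if "luma" not in self._cache:
            r, g, b = self.rgb[..., 0], self.rgb[..., 1], self.rgb[..., 2]
            self._cache["luma"] = 0.299 * r + 0.587 * g + 0.114 * b
        return self._cache["luma"]
```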

From the diagram, it becomes obvious that the prediction performance of a video quality measurement algorithm depends on the quality of its indicators and the combination of the indicators. This may be helped by repeated verification procedures during the development using subjectively evaluated datasets and rigorous evaluation of the performance of each indicator within the complete framework.

It is evident that the combination of a larger set of indicators and the training of their combinations pose stability issues with more complex training methods such as neural networks, while simpler training methods may result in suboptimal prediction. For similar reasons, the prediction performance depends on the size, quality, and feature spread of the training and verification databases.

The literature currently lists, on the one hand, a large number of possible indicator algorithms and combination methods. On the other hand, there is a long list of complete algorithms for video quality prediction. The two following sections will review a few examples of each category.

3 Perceptual features and their algorithmic indicators

The term perceptual feature will be used in the following to identify and describe an isolated aspect of perception, explicitly or implicitly experienced by the human observer. The list of perceptual features mentioned in the literature is endless. The following list aims at covering important indicators on the one hand and at demonstrating the variety of perceptual features on the other. While only a short excerpt is given in this paper, a longer and continuously updated list will be established and made available by the VQEG-JEG group on their homepage [60].

3.1 Overall still image based features

In most cases, video quality assessment algorithms first analyze individual images and then perform temporal pooling to determine the quality of the complete video sequence. The overall quality resulting from this per-image processing step may be identified as a feature in its own right.

Still Image Difference to Reference: For each individual image, most FR measurement algorithms estimate an individual quality value that corresponds to the perceived difference between the original and the degraded version. For RR and NR measurements, a difference to an optimal reference image is often predicted. Human observers may be asked to rate this difference in an image based double stimulus paired comparison experiment on a 7-point adjectival categorical judgement scale on selected images of the video sequence.
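
A minimal sketch of this two-stage structure, a per-frame full-reference difference followed by temporal pooling, is given below; PSNR is used purely as a placeholder frame difference measure and the arithmetic mean as a placeholder pooling rule.

```python
import numpy as np

def frame_psnr(ref, deg, max_val=255.0):
    """Per-frame full-reference difference; PSNR serves only as a placeholder."""
    mse = np.mean((ref.astype(np.float64) - deg.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def sequence_quality(ref_frames, deg_frames):
    """Analyze individual frames first, then pool temporally (here: mean)."""
    scores = [frame_psnr(r, d) for r, d in zip(ref_frames, deg_frames)]
    return float(np.mean(scores))
```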

3.2 Classification of content related perception

Even without considering the degraded image, several properties of the content itself may be taken into consideration. These factors either play an important role in weighting the perception of degradations, such as masking effects, or they relate to a different kind of interpretation of the perceived content, for example when cartoons or drawings are evaluated.

3.2.1 Contrast sensitivity

It has long been known that the human visual system is differentially sensitive to patterns containing different spatio-temporal frequencies, a property usually referred to as contrast sensitivity and described by the contrast sensitivity function (CSF) [6]. A standard model for contrast detection based on the spatial CSF was developed in [63]. The CSF differs for achromatic and chromatic channels: the achromatic CSF has a higher sensitivity for high frequencies than the chromatic ones. This has also been exploited successfully in standard colour formats, e.g. YUV 4:2:0 [46], where the achromatic channel Y has twice the resolution in each direction compared to the chromatic channels U and V. In principle, the spatio-temporal CSF is not space-time separable, as pointed out by e.g. [33], but others have later found reasonable approximations of the CSF that are separable in space and time [12]. Contrast sensitivity can also be modeled, inspired by the connection between the ganglion cells and the lateral geniculate nucleus (LGN), as processing divided into two channels, Parvo and Magno, where the Parvo channel loosely handles high spatial frequencies and low temporal frequencies, whereas the Magno channel handles low spatial frequencies and high temporal frequencies [1, 10].
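
For illustration, one widely cited analytic approximation of the achromatic spatial CSF is the Mannos-Sakrison formula, sketched below; it is given here only as an example of such a function and is not necessarily the model developed in [63].

```python
import numpy as np

def csf_mannos_sakrison(f_cpd):
    """Mannos-Sakrison approximation of the achromatic spatial contrast
    sensitivity; f_cpd is the spatial frequency in cycles per degree."""
    f = np.asarray(f_cpd, dtype=np.float64)
    return 2.6 * (0.0192 + 0.114 * f) * np.exp(-(0.114 * f) ** 1.1)
```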

3.2.2 Masking

The visibility of a pattern is affected by the patterns surrounding it. Legge and Foley defined masking as “any destructive interaction or interference among transient stimuli that are closely coupled in space and time” [19, 36]. The effect is mostly negative, i.e., the detectability of the pattern is decreased. However, this is not always the case: for some low-contrast maskers, the detectability may actually increase. Masking has been exploited in several image discrimination models, e.g. [1, 39, 64], and video discrimination models, e.g. [9, 59, 66].

3.2.3 Spatial masking

The presence of a texture pattern in an image region may mask the visibility of similar patterns in its neighborhood. This has been studied with various psychophysical stimuli and led to different models [36]. A computational model for natural image content at typical luminance levels for broadcast TVs has been discussed in [44].

3.2.4 Temporal masking

Moving textured regions not only attract attention but also mask surrounding textures [20]. Masking by a moving edge has been analyzed in [21]. A special case of temporal masking in video quality analysis is scene cuts; a detailed analysis may be found in [56].

3.3 Classification of spatial degradations

Technical constraints such as maximal allowed bit-rate allocation often introduce a perceptually complex pattern of degradations. The degradations have been isolated and categorized. This allows for studying the origin of each such degradation, for example in a video coding algorithm. It also allows for developing individual algorithms that measure each such degradation.

3.3.1 Blockiness

Blockiness or block artifacts are often caused by the use of lossy compression. They stem from the independent transform coding of N×N pixel blocks (usually 4 × 4 or 8 × 8 pixels) in most of the currently used video coding algorithms, including H.261-H.265, MPEG-4 Part 2 and Part 10, and MPEG-2. These algorithms quantize the DCT coefficients of each block separately, which causes noise shaping and leads to coding artifacts in the form of discontinuities at coded block boundaries [65]. The resulting sudden color intensity changes are most evident in uniform areas of an image and are caused by the removal of the least significant DCT coefficients.

Block artifacts can be measured locally for each coding block. For example, the absolute luminance differences between pairs of adjacent pixels within a coding block and between pairs of pixels straddling the boundary to neighboring blocks may be calculated. The totals of the within-block and between-block differences may then be compared and averaged over the entire frame [51].
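
A minimal sketch of such a blockiness indicator is shown below, assuming an 8 × 8 block grid aligned with the frame origin and considering only vertical block boundaries (horizontal boundaries would be handled analogously); the exact formulation of [51] may differ.

```python
import numpy as np

def blockiness(luma, block=8):
    """Ratio of the average luminance difference across block boundaries to the
    average difference inside blocks; values well above 1 suggest visible block
    edges (illustrative sketch)."""
    y = luma.astype(np.float64)
    diff_h = np.abs(np.diff(y, axis=1))        # differences between horizontal neighbors
    cols = np.arange(diff_h.shape[1])
    boundary = (cols % block) == (block - 1)   # pixel pairs straddling a block edge
    across = diff_h[:, boundary].mean()
    within = diff_h[:, ~boundary].mean()
    return float(across / (within + 1e-8))
```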

3.3.2 Blurriness

Blur is mostly caused by the removal of high-frequency DCT coefficients or by the introduction of loop filters intended to counteract blockiness, both of which amount to low-pass filtering. The effect is seen as a loss of detail in the image, reducing sharp edges and the texture of objects. Moderate blur effects may occur due to loop filters in current encoding standards or due to the combination of image patches from bi-directionally predicted coding blocks. While these effects usually lead to perceived smoothness for luminance signals, the same effects on chrominance coding may lead to smearing at the edges of areas with contrasting color values.

Measurement of this artifact may be based on the cosine of the angle between perpendiculars to planes in adjacent pixels [51].
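
As a simpler illustration (deliberately not the plane-angle method of [51]), blurriness can also be estimated from how little the gradient energy of a frame changes when it is blurred once more: an already blurry frame is hardly affected by additional low-pass filtering.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def reblur_blurriness(luma, sigma=1.0):
    """Ratio of gradient energy after and before an extra Gaussian blur; values
    close to 1 indicate an already blurry frame, lower values a sharp one
    (illustrative sketch, not the method of [51])."""
    y = luma.astype(np.float64)

    def grad_energy(img):
        gy, gx = np.gradient(img)
        return float(np.sum(gx ** 2 + gy ** 2))

    return grad_energy(gaussian_filter(y, sigma)) / (grad_energy(y) + 1e-8)
```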

3.3.3 Geometric distortions

Geometric distortions may be caused by various types of image adaptations such as re-scaling due to aspect ratio conversion. For objective quality assessment, this artifact requires not only the measurement of the perceptual impact on quality but also a sophisticated adaptation to a non-distorted image [14].

3.3.4 Deinterlacing artifacts

Since the start of television broadcast, video sequences have been transmitted in a line-interlaced format. Several different perceptual artifacts are related to this technology, ranging from the inversion of the top field and the bottom field to the de-interlacing algorithms used in current display technologies. A technical overview has been provided in [16], while de-interlacing techniques have been subjectively compared in [72].

3.3.5 Spatial error concealment

Packet losses or bit inversions lead to missing content at the receiver side, which is often replaced by previously transmitted content or by in-painting from surrounding regions. The result is often an isolated image region that does not fit the surrounding perceptual information. Due to the prediction used in video coding, this artifact has a temporal duration, as the affected image region tends to grow when its blocks are used for further image prediction. This is particularly visible at a scene cut, when a mixture of different contents may occur.

3.3.6 Channel switches

Voluntary and involuntary channel switches form a special kind of artifact because they may introduce content mixtures similar to error concealment artifacts or annoy the user by pauses in the transmission [34].

3.4 Classification of temporal degradations

3.4.1 Frame rate

The number of frames per second in video transmission is often less than the perceptual maximum for a given viewing angle. In [41] a model is proposed to estimate the normalized quality as a function of the frame rate. The model consists of an inverted exponential function and incorporates a parameter that characterizes how fast the perceived quality drops as the frame rate decreases; this parameter depends on content characteristics such as motion and resolution. At the same spatial resolution, sequences with faster motion have a higher falling rate. Also, for the same sequence, QCIF resolution shows a higher falling rate than CIF.
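
A hedged sketch of such an inverted-exponential mapping is given below; the exact parameterization used in [41] may differ, and the decay parameter b is a content-dependent assumption chosen here only for illustration.

```python
import numpy as np

def normalized_quality_vs_framerate(t, t_max=30.0, b=7.0):
    """Normalized quality as an inverted exponential of the frame rate t (fps),
    equal to 1 at t_max; b controls how fast quality drops as the frame rate
    decreases and depends on content and resolution (illustrative sketch of
    the model family described in [41], not its exact parameterization)."""
    t = np.asarray(t, dtype=np.float64)
    return (1.0 - np.exp(-b * t / t_max)) / (1.0 - np.exp(-b))
```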

3.4.2 Frame freezing and frame skipping

Video transmission over networks is often impaired by delays or outages. When video content cannot be decoded at a given time, playback pauses; when it resumes, it may either continue with the next frame or skip a few frames to compensate for the previously accumulated delay. This type of distortion may appear in both TCP and UDP transmission. In [50] the authors propose a metric to compute MOS as a function of the number of pauses, the average length of the pauses that happened in the same temporal segment, a weighting factor that represents the degree of degradation each segment adds to the total video degradation, the duration in seconds of each segment, and the number of temporal segments of the video. It is interesting to note that they divide the video into segments and find the weighting factors for each segment by solving a system of equations. It was found that the initial video segment is more relevant, i.e., it has a higher impairment weight than the other temporal segments; pauses at the beginning of the video have a stronger negative effect on user QoE. In [71] the authors conclude that the perceptual impact of frame freezes and frame skips is highly content dependent and that viewers prefer a scenario in which a single but long freeze event occurs over a scenario in which frequent short freezes occur.
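
The per-segment bookkeeping described above can be sketched as follows; only the input statistics named in the text are computed, while the actual mapping to MOS and the segment weights of [50] are not reproduced here and would have to be fitted to subjective data.

```python
def pause_features_per_segment(pause_events, segment_length_s, num_segments):
    """Collect, per temporal segment, the number of pauses and their average
    length; pause_events is a list of (start_time_s, duration_s) tuples.
    Illustrative bookkeeping only; the MOS mapping of [50] is not reproduced."""
    counts = [0] * num_segments
    total_len = [0.0] * num_segments
    for start, duration in pause_events:
        seg = min(int(start // segment_length_s), num_segments - 1)
        counts[seg] += 1
        total_len[seg] += duration
    avg_len = [t / c if c else 0.0 for t, c in zip(total_len, counts)]
    return counts, avg_len
```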

3.5 Classification of attention related indicators

3.5.1 Visual attention and saliency

Determining the most important image regions in a video simplifies spatial pooling but may also help in determining whether the observer’s attention is immersed in a single object or whether he or she is freely scanning the complete image region. A typical example is an interview situation, where faces attract attention and degradations in the surroundings are less noticeable. Predictions may therefore be improved by using saliency awareness [17, 38].

3.5.2 Visibility duration of objects

Newly appearing objects in a video scene require time for their cognitive understanding. During this time, the perception of artifacts is reduced [3, 17].

3.5.3 Forgiveness effect

An article on the temporal characterization of the forgiveness effect, in which a controlled degree of blockiness was introduced into the videos [23], shows that the forgiveness effect is initially greater when good-quality material follows a high-level degradation compared to a low-level degradation. However, with increasing periods of good-quality material, subjects become more forgiving of low-level degradation.

4 Existing video quality estimation algorithms

The previous section listed isolated perceptual artifacts and models that were proposed to estimate the severity of each artifact. Some of the models have been implemented as algorithms, and in some cases references were provided for measuring the influence of an isolated artifact under a particular condition.

On the opposite end of the scale, there are algorithms which are meant to measure a certain group of artifacts that may occur in a realistic transmission scenario. These algorithms are a compromise between execution speed, model accuracy, and prediction accuracy. The latter two are distinguished as the appropriateness of modeling an isolated aspect of human perception and the prediction performance of the model in the given scenario. For example, correctly modeling the forgiveness effect may be important, but as in most cases only 8–12 s of video were considered, its influence on the model’s performance may be negligible.

Table 1 lists a selection of measurement methods, from very simple models to complex models that underwent validation and standardization. As most of the validation has been performed within VQEG, the application areas or scopes were retained even outside of the standardization efforts, as the same video sequences were used for performance evaluations. In VQEG-SD Phase I and Phase II, standard definition (SD) television was considered as scope, without extreme low-bitrate coding conditions, frame rate reduction, or packet losses. In VQEG-MM Phase I, typical multimedia (MM) resolutions and conditions of that time were considered, i.e. Quarter Common Intermediate Format (QCIF), CIF, and VGA. Very low bit-rates were included as well as frame rate reduction and network packet losses. In VQEG-HD Phase I, High Definition (HD) content was degraded by conditions similar to those used in VQEG-MM.

Table 1 Video quality measurement algorithms and their indicators

Most of the currently employed video quality measurement algorithms are FR or RR models, as the performance of NR models may not be expected to be sufficient for industrial application.

For each of the models, the table shows to which extent a perceptual artifact has been taken into consideration. The scale follows the Absolute Category Rating scale. A value of five (5) indicates that the modeling is very elaborate; a value of four (4) indicates that substantial parts of the model are dedicated to measuring this particular artifact. When the model takes the artifact into consideration but is not particularly targeted at measuring it, a value of three (3) was given. Two (2) indicates that the model may take this degradation into consideration but includes no dedicated indicator for it. One (1) indicates that the model handles the artifact only while estimating the quality related to a different artifact; the prediction performance may therefore be limited. A dash (-) signifies that the artifact was not considered for the model according to its description.

In the case of bitstream-based objective video quality metrics, quality is estimated by analyzing the received video bitstream. As such, no full decoding is performed and different parameters are extracted from the (impaired) encoded video bitstream. In [2], parameters are extracted at the frame, macroblock and motion vector level to continuously estimate visual quality. These parameters are used to calculate a number of indicators such as error duration, error propagation, and frame freezing. Finally, the indicators are combined and temporal pooling is applied. Yang et al. [70] propose an NR bitstream-based metric which measures video quality using three key factors: picture distortions resulting from quantization, quality degradation caused by network impairments, and temporal effects of the Human Visual System (HVS). Characteristics of the HVS such as spatial and temporal masking are also taken into account when calculating the picture quality. Both spatial and temporal pooling are applied in their quality assessment framework. Recently, ITU-T Study Group 12 (SG12) also published Recommendation P.1202.1, describing parametric non-intrusive bitstream assessment of low-resolution video streaming quality. High-resolution video applications, including IPTV, are still under investigation.

In order to reliably measure spatial degradations such as blockiness and blurring, access to pixel data is required. In the case of hybrid quality metrics, both the encoded (impaired) video stream and the decoded video data are made available. This approach has, for example, been used in [18] and [37] for measuring video quality by detecting perceptual artifacts. The authors in [30, 31, 48] extract different parameters from the received and decoded video stream to model the influence of packet loss on impairment visibility. Features are extracted to identify the spatial and temporal extent of the loss and are combined with the mean square error of the initial error averaged over the macroblocks initially lost. The V-Factor [67] objective video quality metric inspects the Transport Stream (TS) and Packetized Elementary Stream (PES) headers as well as the encoded and decoded video signal to estimate perceived visual quality. V-Factor also considers content characteristics, compression mechanism, bandwidth constraints and network impairments such as jitter, delay and packet loss to measure video quality in real time. Yamagishi et al. [69] extend a packet-layer model with content-based information extracted from the decoded video stream in order to estimate video quality per content type. As such, information from the packet headers and the video signal is combined. Similarly, results in [32] also show that the overall prediction accuracy of a pure bitstream-based objective video quality model can be significantly increased when pixel-based features are taken into account. From the bitstream, features are obtained at the slice, macroblock and motion vector level. These features are then combined with pixel-based features such as blockiness, blurriness, activity, predictability, and motion, edge, and color continuity.

VQEG’s JEG is also particularly interested in the construction, validation and evaluation of hybrid bitstream-based objective video quality assessment.

5 Scope considerations for the combination of indicators

A critical question in the development of a video quality measurement algorithm concerns the choice of the indicators and their combination. While experts can usually perceptually distinguish between different artifact types, as listed in Section 4, measurement algorithms are often not sufficiently discriminative. Each indicator algorithm usually targets a certain artifact for which it was designed and trained. This artifact or class of artifacts may be said to be “in scope” for the algorithm. Other artifact types may be sufficiently close to also be predicted, but with less accuracy; these shall be termed as being “in extended scope”. The remaining artifacts should be considered as being “out of scope”. For an example of this definition, the reader is referred to [5].

In an ideal measurement algorithm, each considered artifact has exactly one indicator. Each such indicator performs perfectly “in scope”, does not have an “in extended scope” range, and stays neutral, i.e. reports “no artifact present”, when “out of scope” degradations occur.

In practice, these conditions do not hold true, and, more importantly, they have so far not been taken into consideration in model development. A comprehensive analysis of each indicator is often only available for “in scope” conditions, while the other conditions are not evaluated.

In some cases it is algorithmically determined that an indicator will not measure a degradation; for example, an image-based indicator will not measure temporal degradations. This is no longer true when the indicator is used as part of a complex algorithm, because the algorithm may have side effects. In VQEG-MM, for example, a PSNR variant [68] was used on videos with temporal degradations such as pausing and skipping. The algorithm only estimated a constant time offset for the complete video sequence, and thus non-matching frames were compared by PSNR in the case of pauses or skips, enabling the algorithm to predict these degradations to some extent.
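
This side effect can be illustrated with a small sketch: if only one global temporal offset is estimated, frozen (repeated) frames in the degraded sequence are compared against moving reference frames, so the per-frame PSNR drops during the freeze even though no per-frame realignment is performed. The offset search below is an assumption for illustration and not the alignment procedure of [68].

```python
import numpy as np

def psnr(a, b, max_val=255.0):
    """Plain per-frame PSNR between two equally sized frames."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def global_offset_psnr(ref_frames, deg_frames, max_offset=5):
    """Estimate a single constant temporal offset (illustrative exhaustive
    search) and compute per-frame PSNR with that fixed offset; during freezes
    the repeated degraded frames then score low against the moving reference."""
    best_off, best_mean = 0, -np.inf
    for off in range(-max_offset, max_offset + 1):
        pairs = list(zip(ref_frames[max(off, 0):], deg_frames[max(-off, 0):]))
        if not pairs:
            continue
        mean_score = np.mean([psnr(r, d) for r, d in pairs])
        if mean_score > best_mean:
            best_off, best_mean = off, mean_score
    # Per-frame PSNR with the single best offset, including frozen segments.
    return best_off, [psnr(r, d) for r, d in
                      zip(ref_frames[max(best_off, 0):], deg_frames[max(-best_off, 0):])]
```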

The determination of a particular scope for each indicator and for the overall model necessarily requires a validation process. This validation process usually uses data that was unknown to the model during its development. A possible alternative may be to evaluate each algorithm on the psychophysical databases that were used to develop the individual indicators and to learn about the behavior of the other, “out of scope” or “in extended scope”, indicators.

It should also be taken into consideration that some indicators may take binary precedence over others. Some may be useful only in a certain range of a wider application scope. For example, evaluating contrast sensitivity thresholds may be useful when the content is of very high quality, while this indicator is turned off when a reduced frame rate has been detected.

Models often require intensive training, and the numerically optimized training results are often not comprehensible from a vision modeling point of view.

Table 1 provided a list of algorithms whose design scope was fixed in test plan documents that were later used in standardization. The table itself rates each model with respect to the presence and the algorithmic elaboration of each indicator. In order to determine the experimental scope of each indicator, extensive validation experiments would be required, and the results for the individual indicators may differ from those of their combination due to the model’s processing framework and the training performed.

One issue in this effort is the development of a common objective scale. Usually, each objective model is allowed a fitting to a subjective dataset before evaluation; in most cases linear, third-order monotonic or sigmoidal fittings are employed. However, when the influence of indicators shall be combined and the rating scale shall extend beyond the range that can be measured in one single subjective experiment with sufficient precision, the development of an objective scale is required. For example, the Just Noticeable Difference scale [62] would allow for a subjectively and objectively meaningful absolute scale by specifying a percentage of detection probability of differences between images. This scale is compatible with data obtained from paired comparison experiments when using the Thurstone-Mosteller or Bradley-Terry model [7, 58]. Their usage for video sequences would have to be analyzed. Several subjective experiments would be required to establish a link to other commonly used subjective scales and also for evaluating the combination of subjective experiments.
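
For illustration, a minimal sketch of the commonly used monotonic logistic fitting between objective scores and subjective MOS is shown below, in the spirit of the mappings applied before evaluation in VQEG validations; the initial parameter guesses are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, b1, b2, b3, b4):
    """Four-parameter monotonic logistic mapping from objective score to MOS."""
    return b2 + (b1 - b2) / (1.0 + np.exp(-(x - b3) / b4))

def fit_objective_to_mos(objective_scores, mos):
    """Fit the logistic mapping before computing correlation/RMSE, as is
    commonly done prior to evaluating a model (illustrative sketch)."""
    x = np.asarray(objective_scores, dtype=np.float64)
    y = np.asarray(mos, dtype=np.float64)
    spread = (x.max() - x.min()) / 4.0
    p0 = [y.max(), y.min(), float(np.median(x)), spread if spread > 0 else 1.0]
    params, _ = curve_fit(logistic, x, y, p0=p0, maxfev=10000)
    return params
```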

6 Towards a joint development of video quality measurement

In the video coding community, the most appropriate algorithms have long been identified, and continuous work has led to a succession of successful standards, each of which often reduced the bit-rate by about half at similar perceived quality. In video quality prediction, various standards and algorithms exist that have been developed in an isolated manner; their evolution is difficult to measure.

From the previous sections the following process for a joint development of video quality estimation may be derived:

  1. Creation of psychophysical stimuli databases which allow the measurement of isolated vision and/or quality aspects. Results, both in terms of the video sequences of the stimuli and the subjective assessment results, need to be made available.

  2. Creation of natural and synthetic video databases impaired by isolated perceptual artifacts, also made publicly available.

  3. Development of a common objective quality scale and of tools to realign currently employed subjective quality assessment scales to this common scale.

  4. Test, development and refinement of individual indicators, identifying each time their performance “in scope” with the previously developed databases, their performance “in extended scope” and their neutrality in “out of scope” conditions.

  5. Development of frameworks for the combination of several indicators with respect to their identified scope performance.

This outline covers only the most important work packages. For example, the correct temporal, spatial and color alignment of video sequences is not mentioned, nor is the complex influence of object recognition and visual saliency. The performance also depends on the temporal distribution of the artifacts, and on content features such as spatial complexity, temporal complexity, camera and object motion, and scene changes. A major problem in the development of an algorithm is also the interdependency of the chosen indicators; some may be meant to measure second- or third-order effects which are partially already covered by the “extended scopes” of others.

Bridging the gap between perception modeling and computer algorithms requires international cooperation and structured research.

Recently, the VQEG JEG-Hybrid group has started work towards the development of a Hybrid NR video quality measurement that aims at establishing a collaborative way of developing objective video quality measurements. Their work will take into consideration the above-mentioned five steps. As previous work on various steps is available and iterations are required, the work will advance on several steps in parallel. VQEG’s JEG is free and open to everyone, both from academia and from industry; no subscription fees are involved in joining VQEG JEG. Contributions can be made concerning every step involved in subjective and objective video quality assessment. VQEG JEG also collaborates with other entities such as the COST IC1003 “Qualinet” network.

VQEG JEG has chosen to develop a Hybrid algorithm. A toolchain is available that allows parsing the bitstream into a human-readable, XML-based file format. All kinds of measurement algorithms and scopes are considered; the current main focus is on Hybrid NR model indicators.

Compared to FR models, Hybrid-NR models allow measurement at the client side, and retrieving bitstream information in conjunction with analyzing the decoded video sequence may result in higher prediction accuracy. As an example, bitstream information may help localize packet delays or packet losses in transmission, while the decoded video may be used to estimate the efficiency of error concealment. Theoretical limits such as homogeneous and inhomogeneous transcoding will be considered. For example, a video sequence encoded at a low bitrate first and then transcoded to a higher bit-rate, thus mostly encoding the previous coding errors, is difficult to detect in a pure bitstream model. Industrial application difficulties such as encryption may be considered and simulated. While NR models may have difficulties in estimating the quality of particular non-natural contents, notably cartoons or drawings, their quality may be estimated by bitstream analysis. This interrelationship between bitstream analysis and NR estimation immediately leads to the above-mentioned integration of indicators with respect to their individual scope.

7 Conclusions

The research on subjective and objective video quality assessment is often considered an interdisciplinary task. A lot of effort has been spent by different research communities to identify and measure perceptual attributes of the HVS, to develop, train, verify, and validate individual indicators, or to develop perceptually meaningful spatial and temporal pooling algorithms. Data mining and machine learning have been applied to combine the different predictions into a global quality score. It has, however, been shown that some of the isolated indicators may not be suitable for the complete application range or may behave erroneously when confronted with degradations outside of their scope. In this context, the importance of creating various subjectively evaluated datasets has been emphasized. Each such database may contain psychophysical stimuli, an isolated degradation type, or an overall scope evaluation. The combination of the subjective votes given on each individual database needs to be tackled, leading to the notion of an objective video quality scale. The individual indicators may then be analyzed for the degradations which they are supposed to measure, i.e. “in scope” degradations, those which they may measure with lower prediction performance (“extended scope”), and their neutral behavior in all other conditions (“out of scope”). VQEG JEG’s Hybrid group is working on organizing collaboration towards these goals.