Barriers Towards No-reference Metrics Application to Compressed Video Quality Analysis: on the Example of No-reference Metric NIQE

A. Zvezdakova1, D. Kulikov1,2, D. Kondranin3, D. Vatolin4
azvezdakova@graphics.cs.msu.ru | dkulikov@graphics.cs.msu.ru | denis.kondranin@graphics.cs.msu.ru | dmitriy@graphics.cs.msu.ru
1 Lomonosov Moscow State University, Moscow, Russia; 2 Dubna State University, Dubna, Russia

This paper analyses the application of the no-reference metric NIQE to the task of video-codec comparison. A number of issues in the metric's behavior on videos were detected and described. The metric produces outlying scores on black and solid-colored frames. The proposed averaging technique for metric quality scores improved the results in some cases. In addition, NIQE gives low quality scores to videos with detailed textures and better scores to the same videos at lower bit rates, because compression blurs these textures. Although NIQE showed natural results for many tested videos, it is not universal and currently cannot be used for video-codec comparisons.

Keywords: video quality, no-reference metric, quality measuring, video-codec comparison.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Today video content makes up the largest part of world Internet traffic (more than 70%). According to forecasts [1], its share will grow to 82% in 2022. This trend leads to the creation of new encoding standards and improvements in existing encoders. A number of video-codec comparisons are conducted to find the best codecs for different tasks and use cases and to help users and customers find appropriate encoders for their needs. The goal of video encoding is to deliver high visual quality at a reduced file size, so the only reliable way to compare the quality of encoded videos is to perform a subjective evaluation. It requires a proven methodology and a large number of observers to achieve reasonable results. In general, subjective comparisons are still very expensive to perform; however, there are some services which help researchers to perform high-quality subjective comparisons [2]. This obstacle increases the importance of objective metrics for video quality comparison.

Objective quality metrics can be divided into three general categories: full-reference metrics, no-reference metrics and reduced-reference metrics. Full-reference metrics are easy to interpret and useful for estimating video compression quality. Unlike full-reference metrics, which require the source video for comparison with the compressed one, no-reference metrics are useful when the source is not available and the quality of the compressed video has to be estimated. This case is common, for example, in cloud encoding, when videos are uploaded already compressed by the built-in encoder of a smartphone or a non-professional camera. Reduced-reference metrics require only part of the information about the source video and can also be used in some of the listed cases.

2. Related work

There is a number of no-reference metrics which were created using databases with subjective quality scores. Such quality assessment models were trained to estimate subjective quality, so their scores depend on the training and testing sets. For example, DIIVINE (2011) [10], LBIQ (2011) [12], BRISQUE (2012) [9] and V-Bliinds (2012) [11] were trained on the LIVE data set. In 2015, a metric called IL-NIQE [15] was proposed. It was based on the NIQE [8] metric, which is studied in this paper, but used a multivariate Gaussian (MVG) model to predict the quality of image patches instead of using a single global MVG model for an image.

Another group contains metrics which weren't trained on any data sets and use only data from a source image to estimate its quality. For example, CORNIA (2012) [14] combined feature and regression training. Recently several approaches which use neural network architectures have been developed. The authors of COME (2018) [13] proposed an approach based on the convolutional neural network AlexNet and multi-regression which outperformed V-Bliinds on a number of video sets.

No-reference metrics are created to approximate users' perception of video quality, but when estimating the quality of encoding and compression, they can be used only as a supplement to reference metrics. No-reference metrics cannot become the main criterion for encoder comparison, because otherwise an encoder could win the comparison by producing a visually ideal result which has little in common with the input video. The authors of this paper have been organizing worldwide video-codec comparisons for 16 years. Currently, the full-reference metric SSIM is used in these comparisons as the main metric, supplemented with a number of additional metrics (PSNR, VMAF). At the same time, several researchers and industry experts suggest measuring no-reference metrics and taking them into account in video-codec comparisons. This paper describes the authors' experience of using the no-reference metric NIQE (Naturalness Image Quality Evaluator) [8], created by Anish Mittal, Rajiv Soundararajan and Alan C. Bovik, in a video-codec comparison. This metric is one of the most popular nowadays and shows good results for image quality assessment.

We used NIQE to assess the quality of encoded video sequences during the video-codec comparison. The main idea of the NIQE metric is to construct a collection of quality-aware features and fit them to a multivariate Gaussian (MVG) model. The NIQE score represents the degree of distortion in the frame: the lower the score, the higher the quality of the frame. Accordingly, rate-distortion graphs for encoded videos look unusually inverted, so on the plots in this paper NIQE scores are presented inverted to make the results more familiar and easier to interpret.

There is an open MATLAB implementation provided by the authors [5]. To increase computational speed, we used the implementation from MSU Video Quality Measurement Tool (VQMT), which is currently faster. The tool has a free version (which includes NIQE) and can be downloaded [6]. Speed was important in this case because the metric was used for video quality assessment.

3. Experimental setup

For the evaluation, 28 different FullHD video sequences generated by real users were used, with frame rates from 24 to 60 frames per second. The videos were chosen from the MSU video collection, which consists of 15,833 videos. The collection was divided into 28 clusters by spatiotemporal complexity [7], and one sequence from each cluster, close to the cluster center, was chosen for the final testing set. Each video was encoded by the x264 and x265 encoders. There were three encoding use cases ("fast", "universal" and "ripping") based on different encoding speed/quality ratios, and 7 different bit rates from 1 Mbps to 12 Mbps. The overall number of encoded streams evaluated by NIQE is 1176 (28 videos × 2 encoders × 3 use cases × 7 bit rates).

The final video set was used in the 2018 Moscow State University (MSU) video-codec comparison [3]. The comparison results are available at the link, but the NIQE results were not published online because of several issues found in the application of NIQE to video quality measurement. Some of them were noted in the original article; others were resolved with our proposed averaging technique, which is described in this article. Unfortunately, some issues cannot be fixed without improving the metric (completing the training set or other fixes). In this article, we suggest a method of processing metric results to solve the detected problems of applying the metric to videos.

4. Metric behavior on videos

For most of the encoded videos, NIQE showed results which reflected the usual perceptual video quality at different bit rates. But there were some cases in which NIQE showed results with some issues; the following sections describe the detected issues and their reasons.

4.1 Cases with relevant results

According to the authors, NIQE is not applicable to unnatural distortions in scenes and to scenes from an unnatural source (e.g. computer graphics), as such scenes were not used during the training. However, we checked metric scores on cartoons from our video set.

On Sita (a part of a cartoon movie), the rate-distortion curve looks inverted (Fig. 1a): NIQE shows worse quality scores for high bit rates than for low bit rates. This means that the metric is indeed not applicable to this type of content. On Sintel (a part of a CGI movie trailer), NIQE showed non-monotonic scores for the x265 encoder across the bit rates of the fast use case, but acceptable results for the universal and ripping use cases (Fig. 1b). Thus, the metric is said to be not applicable to cartoons, but we found that it works for some types of realistic animation, such as video gaming (sequences Witcher3, Rust).

There were some examples where the rate-distortion curve looked unnatural, but the metric correctly assigned worse visual quality to higher bit rates. For example, on the Hera video sequence (a part of a music clip with grain effects) NIQE showed a worse score for x264 encoding at 4000 kbps than at 2000 kbps in the fast use case (Fig. 2). The metric had better scores for almost all frames of the lower bit rate. This is shown in the example frame in Fig. 3, where x264 encoding of the video at 4000 kbps produced worse visual quality and more compression artifacts than at 2000 kbps.

Fig. 1. Rate-distortion graph for animation: (a) Sita video sequence, (b) Sintel video sequence. Axes: Bitrate, Mbps; NIQE (inversed), Y (higher is better). Curves: x264, x265.

Fig. 2. Rate-distortion graph for Hera. Axes: Bitrate, Mbps; NIQE (inversed), Y (higher is better). Curves: x264, x265.

Fig. 3. Frame 208 from Hera video sequence, codec: x264, fast use case. Left: bitrate 2000 kbps, NIQE = 8.04; right: bitrate 4000 kbps, NIQE = 11.11. According to NIQE, left image is visually better.

4.2 Cases with irrelevant results

4.2.1 Dark scenes

The metric was said to be not applicable to cartoons, but some other types of video content also had inaccurate NIQE scores. One of the most frequent cases is video sequences with completely black frames (for example, at the beginning). According to NIQE, these frames are perceptually worse than the other frames and have an extremely high metric score. This might happen because such content is absent from the training data used to create NIQE.

For example, for x264 encoding NIQE showed a worse score at 2000 kbps than at 1000 kbps on the Fire video sequence (Fig. 4). It contains a close-up shot of a fire in the dark. In this sequence, the metric showed better scores on a group of frames where the camera started a slow movement.

Another example which demonstrates this issue is presented in Fig. 5. The Music clip video sequence was quite complicated for many encoders in the MSU comparison. It consists of short scenes which switch quickly and a lot of special effects, such as red sparkles and grain. NIQE shows unnatural results on this sequence for all use cases: the rate-distortion curve is not monotonic because of anomalously large values on dark frames.

Fig. 4. Rate-distortion graph for Fire. Axes: Bitrate, Mbps; NIQE (inversed), Y (higher is better). Curves: x264, x265.

Fig. 5. Music clip video sequence: (a) Rate-distortion graph (Bitrate, Mbps; NIQE (inversed), Y; curves: x264, x265), (b) Per-frame NIQE scores (Frame number; NIQE (inversed), Y; curves: x265 (1 Mbps), x265 (2 Mbps)).

The videos described above contained completely black or dark frames. In these videos, NIQE had large values mostly on these frames, which was the main reason for the wrong overall quality score for the entire video. The following examples demonstrate another case in which NIQE was not applicable to video quality estimation.

4.2.2 Noisy scenes/scenes with lots of details

A number of cases where the metric took wrong values appear in videos with noise or a lot of small, textured details, like sand, water waves and grass. For the x265-encoded Bay time-lapse sequence, NIQE showed a worse score at 2000 kbps than at 1000 kbps in the universal use case (Fig. 6). This video contains a scene with water and grass, and the grass and the waves on the water are smoother in the lower-bit-rate video stream.

In another example, NIQE showed a worse score at 4000 kbps than at 2000 kbps in the ripping use case on the Playground video sequence for both encoders. This video contains a lot of bright frames with highly structured and detailed grass and sand. Such texture is quite complicated for compression, and at low bit rates there were visible compression artifacts, but NIQE had a worse score at high bit rates (Fig. 7). This happened because NIQE perceives finely textured grass as noise, while blurred compressed grass was considered visually better by NIQE. This is why the rate-distortion curve looks inverted at bit rates higher than 2000 kbps.
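The non-monotonic and inverted rate-distortion curves discussed in Sections 4.1 and 4.2 can be flagged automatically before the plots are inspected by eye. The following Python sketch is only an illustration and was not part of the described experiments: the function name find_rd_inversions, the tolerance parameter and the sample numbers are hypothetical. It assumes raw (not inverted) NIQE scores, for which lower is better, and reports every pair of neighbouring bit rates at which the score becomes worse as the bit rate grows.

```python
def find_rd_inversions(bitrates_kbps, niqe_scores, tolerance=0.0):
    """Return (lower_bitrate, higher_bitrate) pairs where NIQE gets worse
    (numerically larger) although the bit rate increases.

    Raw NIQE scores are assumed: lower means better quality.
    """
    # Sort the points by bit rate so that neighbours can be compared.
    points = sorted(zip(bitrates_kbps, niqe_scores))
    inversions = []
    for (rate_lo, score_lo), (rate_hi, score_hi) in zip(points, points[1:]):
        # A higher bit rate is expected to give an equal or lower NIQE score.
        if score_hi > score_lo + tolerance:
            inversions.append((rate_lo, rate_hi))
    return inversions


# Hypothetical per-sequence scores, not taken from the experiments.
bitrates = [1000, 2000, 4000, 6000]
scores = [5.1, 4.6, 4.9, 4.3]
print(find_rd_inversions(bitrates, scores))  # prints [(2000, 4000)]
```

A non-empty result does not by itself say whether the metric or the encoder is at fault; as the Hera example shows, the frames at the reported bit rates still have to be checked visually.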
Fig. 6. Rate-distortion graph for Bay time lapse. Axes: Bitrate, Mbps; NIQE (inversed), Y (higher is better). Curves: x264, x265.

Fig. 7. Rate-distortion graph for Playground. Axes: Bitrate, Mbps; NIQE (inversed), Y (higher is better). Curves: x264, x265.

4.3 Proposed processing technique

During the analysis of per-frame NIQE results, it was revealed that values greater than 40 do not usually appear in most of the video frames. Extreme values often occur in solid-colored or dark frames. We proposed and applied a special averaging technique to eliminate these cases. Our NIQE score for the video V was computed in the following way:

\[
Score_V = \frac{\sum_i m_i \cdot k_i}{\sum_i k_i}, \quad i \in [0, N],
\qquad
k_i =
\begin{cases}
1, & m_i \in [0, 15), \\
-0.04 \cdot m_i + 1.6, & m_i \in [15, 40), \\
0, & m_i \in [40, +\infty),
\end{cases}
\tag{1}
\]

where
m_i is the NIQE score for frame i,
k_i is the weighting coefficient for the m_i score,
N is the number of frames.

The proposed averaging formula helped to improve NIQE scores for some of the video sequences. The following results demonstrate the corrected rate-distortion curves, which can be compared to the original results presented above.

With the proposed averaging technique, the rate-distortion curve for Forest dog does not contain outlying points (Fig. 8b). Another example where the results were corrected by the proposed averaging for both encoders is the Music clip video sequence (Fig. 9). The non-monotonic curve of x264 encoding was caused by the high spatial complexity of this video.

Fig. 8. Forest dog video sequence: (a) Original rate-distortion graph, (b) Rate-distortion graph after smart averaging. Axes: Bitrate, Mbps; NIQE (inversed), Y (higher is better). Curves: x264, x265.

Fig. 9. Rate-distortion graph for Music clip after smart averaging. Axes: Bitrate, Mbps; NIQE (inversed), Y (higher is better). Curves: x264, x265.

5. Correlation with subjective scores

The obtained NIQE quality scores were compared to the subjective scores on a part of the test videos. A pairwise subjective comparison was conducted as one of the parts of the 2018 MSU Video-Codec Comparison, where a total of 22542 valid answers were received from 473 subjects. The detailed description and methodology can be found in the report [4]. Five videos were used in this comparison, and none of them contained animated scenes or black frames for which NIQE could show inaccurate results. In addition, several full-reference quality metrics were measured (SSIM, PSNR, VMAF and their variations).

The Pearson correlation coefficient was calculated for the results on each video separately (Fig. 10). The averaged correlation scores across all videos reveal that NIQE has the lowest correlation with subjective scores (0.85), while VMAF v0.6.1 for phones has the highest correlation (0.99). It should also be noted that at the moment NIQE has an even lower correlation with subjective quality than PSNR (0.98), which has long been considered to have low similarity to subjective quality in comparisons of compression algorithms.

The lowest correlation of NIQE with subjective scores was obtained for the Playground video sequence. As described above, NIQE showed worse scores for detailed textures (grass and sand) in this video sequence, which is illustrated in Fig. 11.

Fig. 10. Correlation between objective quality metrics and subjective scores. Vertical axis: Pearson corr. coef. Metrics: NIQE, Y; PSNR, Y; VMAF v0.6.1, Y; SSIM, Y; VMAF v0.6.1 Phone, Y. Videos: Crowd Run, Ducks Take Off, Mountain Mike, Playground, Red Kayak.

Fig. 11. Frame 58 from Playground video sequence, codec: x265, ripping use case. Left: bitrate 2000 kbps, NIQE = 3.24; right: bitrate 4000 kbps, NIQE = 4.40. According to NIQE, left image is visually better.

6. Conclusion

During the experiments, NIQE showed good results for most of the videos. But still, there are many cases for which the metric is not applicable. This is why NIQE is not universal and cannot be used in video-codec comparisons at the moment. The results of this comparison show NIQE deficiencies that need to be corrected, such as its application to animated cartoons, videos with completely black and solid-colored frames, noise and highly detailed/textured frames. For example, the abundance of fine details (grass, sand, grain effects) increases the values of NIQE despite the high bit rate of the encoded video, which leads to incorrect results. At the same time, in the original paper, NIQE was said to be not applicable to computer graphics, but in our investigation it was found that the metric works for some types of animation (particularly for screen captures of video gaming).

7. Acknowledgments

Special thanks to Georgiy Osipov, who helped to analyze all detected issues and improved the NIQE implementation in MSU VQMT. This work was partially supported by the Russian Foundation for Basic Research under Grant 19-01-00785a.

8. References

[1] Cisco Report VNI 2017-2022, 2018 update. https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-741490.html
[2] Crowd-sourced subjective quality evaluation platform subjectify.us
[3] HEVC Video Codec Comparison 2018 (Thirteen MSU Video Codec Comparison). http://compression.ru/video/codec_comparison/hevc_2018/
[4] HEVC Video Codec Comparison 2018 (Thirteen MSU Video Codec Comparison), Part II: FullHD Content, Subjective Evaluation. http://compression.ru/video/codec_comparison/hevc_2018/#subjective_report
[5] MathWorks Documentation: Naturalness Image Quality Evaluator (NIQE) no-reference image quality score. https://www.mathworks.com/help/images/ref/niqe.html
[6] MSU Quality Measurement Tool: Download Page. http://compression.ru/video/quality_measure/vqmt_download.html
[7] C. Chen, S. Inguva, A. Rankin, and A. Kokaram, “A subjective study for the design of multi-resolution ABR video streams with the VP9 codec,” in Electronic Imaging, 2016(2), pp. 1-5.
[8] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a «completely blind» image quality analyzer,” in IEEE Signal Processing Letters, 2012, 20(3), pp. 209-212.
[9] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” in IEEE Transactions on Image Processing, 2012, 21(12), pp. 4695-4708.
[10] A. K. Moorthy and A. C. Bovik, “Blind image quality assessment: From natural scene statistics to perceptual quality,” in IEEE Transactions on Image Processing, 2011, 20(12), pp. 3350-3364.
[11] M. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the DCT domain,” in IEEE Transactions on Image Processing, 2012, 21(8), pp. 3339-3352.
[12] H. Tang, N. Joshi, and A. Kapoor, “Learning a blind measure of perceptual image quality,” in IEEE CVPR, 2011, pp. 305-312.
[13] C. Wang, S. Li, and W. Zhang, “COME for No-Reference Video Quality Assessment,” in 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2018.
[14] P. Ye, J. Kumar, L. Kang, and D. Doermann, “Unsupervised feature learning framework for no-reference image quality assessment,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2012, pp. 1098-1105.
[15] L. Zhang, L. Zhang, and A. C. Bovik, “A feature-enriched completely blind image quality evaluator,” in IEEE Transactions on Image Processing, 2015, 24(8), pp. 2579-2591.
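As a supplementary illustration of the averaging technique from Section 4.3, equation (1) can be written as a short Python sketch. This is not the implementation used in MSU VQMT; the function names frame_weight and smart_average and the sample frame scores are hypothetical. The sketch assumes raw, non-negative per-frame NIQE scores, and the behaviour when every frame receives zero weight is our assumption, since equation (1) does not define it.

```python
def frame_weight(niqe_score):
    """Weighting coefficient k_i from equation (1) for a per-frame score m_i."""
    if niqe_score < 15:
        return 1.0  # typical frames keep full weight
    if niqe_score < 40:
        # Linear fall-off from 1 at m_i = 15 down to 0 at m_i = 40.
        return -0.04 * niqe_score + 1.6
    return 0.0  # extreme scores (e.g. black frames) are ignored


def smart_average(frame_scores):
    """Weighted mean of per-frame NIQE scores according to equation (1)."""
    weights = [frame_weight(m) for m in frame_scores]
    total_weight = sum(weights)
    if total_weight == 0:
        # Assumption: equation (1) is undefined when all frames are discarded.
        raise ValueError("all frames have zero weight")
    return sum(m * k for m, k in zip(frame_scores, weights)) / total_weight


# Hypothetical sequence: three ordinary frames and one black frame.
scores = [5.0, 6.0, 7.0, 600.0]
print(sum(scores) / len(scores))  # plain mean: 154.5
print(smart_average(scores))      # weighted mean: 6.0
```

In this example a single outlying frame dominates the plain mean, while the weighted mean stays at the level of the ordinary frames, which is the effect described for the Forest dog and Music clip sequences.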