Hacking VMAF with Video Color and Contrast Distortion

A. Zvezdakova1, S. Zvezdakov1, D. Kulikov1,2, D. Vatolin1
azvezdakova@graphics.cs.msu.ru | szvezdakov@graphics.cs.msu.ru | dkulikov@graphics.cs.msu.ru | dmitriy@graphics.cs.msu.ru
1 Lomonosov Moscow State University, Moscow, Russia; 2 Dubna State University, Dubna, Russia

Video quality measurement plays an important role in many applications. Full-reference quality metrics, which are usually used in video-codec comparisons, are expected to reflect any changes in videos. In this article, we consider different color corrections of compressed videos which increase the values of the full-reference metric VMAF and almost do not decrease another widely used metric, SSIM. The proposed video contrast enhancement approach shows that the metric is inapplicable in some cases to video-codec comparisons, as it may be used for cheating in the comparisons via tuning to improve the metric values.

Keywords: video quality, quality measuring, video-codec comparison, quality tuning, reference metrics, color correction.

1. Introduction

At the moment, video content accounts for a significant part of worldwide network traffic, and its share is expected to grow to 71% by 2021 [1]. Therefore, the quality of encoded videos is becoming increasingly important, which leads to growing interest in the development of new video quality assessment methods. As new video codec standards appear, the existing standards are being improved. In order to choose one or another video encoding solution, it is necessary to have appropriate tools for video quality assessment. Since the best method of video quality assessment is subjective evaluation, which is quite expensive in terms of time and cost, all objective methods are being improved in an attempt to approach the ground-truth solution (subjective evaluation).

Methods for evaluating the quality of encoded videos can be divided into three categories [9]: full-reference, reduced-reference and no-reference. Full-reference metrics are the most common, as their results are easily interpreted — usually as an assessment of the degree of distortions in the video and their visibility to the observer. The only drawback of this approach compared to the others is the need to have the original video for comparison with the encoded one, which is often not available.

One of the widely used full-reference metrics gaining popularity in the area of video quality assessment is Video Multimethod Assessment Fusion (VMAF) [5], announced by Netflix. It is an open-source learning-based solution. Its main idea is to combine multiple elementary video quality features, such as Visual Information Fidelity (VIF) [12], the Detail Loss Metric (DLM) [11] and temporal information (TI, the difference between two neighboring frames), and to train a support vector machine (SVM) regression on subjective data. The resulting regressor is used for estimating per-frame quality scores on new videos. The scheme [7] of this metric is shown in Fig. 1.

Fig. 1. The scheme of the VMAF algorithm.

Despite increasing attention to this metric, many video quality analysis projects, such as Moscow State University's (MSU) Annual Video Codec Comparison [2], still use other common metrics developed many years ago, such as structural similarity (SSIM) and even peak signal-to-noise ratio (PSNR), which are based only on difference characteristics of two images. At the same time, many readers of the reports of these comparisons send requests to use new metrics such as VMAF. The main obstacle to a full transition to VMAF is the non-versatility of this metric and its not fully adequate results on some types of videos [4].

The main goal of our investigation was to prove the non-universality of the current version of the VMAF algorithm. In this paper, we describe video color and contrast transformations which increase the VMAF score while keeping the SSIM score the same or better. The possibility of improving a full-reference metric score by adding transformations to a distorted image means that the metric can be cheated in some cases. Such transformations may allow competitors, for example, to cheat in video-codec comparisons if they "tune" their codecs to increase VMAF quality scores. The types of video distortions that we were looking for change the visual quality of the video, which should lead to a decrease in the value of any full-reference metric. The fact that they lead to an increase in the value of VMAF is a significant obstacle to using VMAF as the main quality indicator for all types of videos and proves the need to modify the original VMAF algorithm.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Study Method

During testing of the VMAF algorithm for video-codec comparisons, we noticed that it reacts to contrast distortions, so we chose color and contrast adjustments as the basic types of the searched video transformations. Two famous and common approaches to color adjustment were tested to find the best strategy for increasing VMAF scores. Two cases of applying the transformations to the video were tested: applying the transformation before and after video encoding. In general, there was no significant difference between these options, because the compression step can be omitted when increasing VMAF with color enhancement. Therefore, in the following we describe only the first case, with adjustment before compression, and we keep the compression step, because in our work VMAF tuning is considered in the context of video-codec comparisons.

We chose four videos which represent different spatial and temporal complexity [8], content and contrast to test transformations which may influence VMAF scores. All videos have Full HD resolution and a high bit rate. Bay time-lapse and Red kayak were filmed in flat colors, which usually require color post-processing. Three of the videos (Crowd run, Red kayak and Speed bag) were taken from the open video collection on media.xiph.org, and one was taken from the MSU video collection used for selecting test video sets for the annual video-codec comparison [2]. The description (and sources) of the first three videos can be found on site [6]; the remaining Bay time-lapse video sequence contains a scene with grass and with waves on the water.

SSIM and VMAF scores were calculated for each video processed with the considered color enhancement algorithms with different parameters. As mentioned before, after color correction the videos were compressed with the medium preset of the x264 encoder at 3 Mbps. Then the differences between the metric scores of the processed videos and of the original video were calculated to compare how the color corrections influenced the quality scores. Fig. 2 shows this difference for the SSIM metric on the Bay time-lapse video sequence for different parameter values of the unsharp mask algorithm. The corresponding scores for the VMAF quality metric are presented in Fig. 3.

Fig. 2. SSIM scores for different parameters of unsharp mask on the Bay time-lapse video sequence.
Fig. 3. VMAF scores for different parameters of unsharp mask on the Bay time-lapse video sequence.

On these plots, higher values mean that the objective quality of the color-adjusted video was better according to the metric. VMAF shows better scores for a high radius and a medium amount of unsharp mask, while SSIM becomes worse for a high radius and a high amount.
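The parameter sweep behind Figs. 2 and 3 is easy to prototype. The sketch below is a minimal stand-in under stated assumptions, not the paper's actual pipeline: it applies a hand-rolled unsharp mask to a synthetic luma plane (the study uses skimage.filters.unsharp_mask on real frames) and scores the result with a simplified single-window SSIM in place of the windowed SSIM and VMAF used in the study.

```python
import numpy as np

def gaussian_kernel(radius):
    # 1-D Gaussian; sigma tied to radius as a rough stand-in for a blur radius
    sigma = max(radius / 2.0, 1e-6)
    x = np.arange(-int(radius), int(radius) + 1, dtype=float)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, radius):
    # separable blur: convolve each row, then each column
    k = gaussian_kernel(radius)
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, rows)

def unsharp_mask(img, radius, amount):
    # classic unsharp mask: add back 'amount' of the detail removed by blurring
    return np.clip(img + amount * (img - gaussian_blur(img, radius)), 0.0, 1.0)

def global_ssim(a, b, data_range=1.0):
    # single-window SSIM over the whole image; real SSIM slides a local window
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2))

rng = np.random.default_rng(0)
frame = rng.random((64, 64))  # stand-in for one luma plane
for radius in (2.0, 9.0):
    for amount in (0.1, 1.0):
        score = global_ssim(frame, unsharp_mask(frame, radius, amount))
        print(f"radius={radius:4.1f} amount={amount:4.1f} SSIM={score:.3f}")
```

A full reproduction would additionally score each processed clip with the vmaf tool and plot the score deltas over the (radius, amount) grid, as in Figs. 2 and 3.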
Three versions of VMAF were tested: 0.6.1, 0.6.2 and 0.6.3. The implementations of all three metric versions from the MSU Video Quality Measurement Tool [3] were used. The results did not differ much, so the following plots are presented for the latest version (0.6.3) of VMAF.

3. Proposed Tuning Algorithm

For color and brightness adjustment, two well-known and widespread image processing algorithms were chosen: unsharp mask and histogram equalization. We used the implementations of these algorithms available in the open-source scikit-image [13] library. In this library, unsharp mask has two parameters which influence image levels: radius (the radius of the Gaussian blur) and amount (how much contrast is added at the edges). For histogram equalization, the clip-limit parameter was analyzed. In order to find optimal configurations of the equalization parameters, the multi-objective optimization algorithm NSGA-II [10] was used. Only the limits for the parameters were given to the genetic algorithm, and it was applied to find the best parameters for each test video.

The optimal values of the algorithm parameters can be estimated from the difference in these plots (Figs. 2 and 3). For the other color adjustment algorithm, histogram equalization, one parameter was optimized, and the results are presented in Fig. 4 together with the results of unsharp mask.

Fig. 4. Comparison of VMAF and SSIM scores for different configurations of unsharp mask and histogram equalization on the Bay time-lapse video sequence. The results in the second quadrant, where the SSIM values were not changed and the VMAF values increased, are of interest to us.

According to these results, for some configurations of histogram equalization VMAF becomes significantly better (from 68 to 74) while SSIM does not change much (a decrease from 0.88 to 0.86). The results differ slightly for the other videos. On the Crowd run video sequence, VMAF was not increased by unsharp mask (Fig. 5a) and was increased a little by histogram equalization. For the Red kayak and Speed bag videos, unsharp mask could significantly increase VMAF while only slightly decreasing SSIM (Fig. 5b and Fig. 5c).

4. Results

The following examples of frames from the test videos demonstrate color corrections which increased VMAF and almost did not influence the values of SSIM. Unsharp mask with radius = 2.843 and amount = 0.179 increased VMAF without a significant decrease of SSIM for Bay time-lapse (Fig. 6a and Fig. 6b). The images before and after masking look equivalent (a comparison in a checkerboard view is in Fig. 7) and have similar SSIM scores, while the VMAF score is better after the transformation.

Fig. 5 (a). Color tuning results for the Crowd run video sequence.
Fig. 5 (b). Color tuning results for the Red kayak video sequence.

(a) Without color correction: VMAF = 68.160, SSIM = 0.879. (b) After unsharp mask: VMAF = 72.009, SSIM = 0.878.
Fig. 6. Frame 5 from the Bay time-lapse video sequence and its histogram with and without contrast correction. The two images and their histograms look equivalent.

Fig. 7. Checkerboard comparison of frame 5 from the Bay time-lapse video sequence before and after distortions. The two images look almost equivalent.
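The second transform of Section 3 is contrast-limited histogram equalization, where the clip limit caps how strongly any gray level is stretched (in scikit-image this is exposure.equalize_adapthist with its clip_limit and kernel_size parameters). As a rough illustration of the clip limit alone, here is a simplified global (non-tiled) variant in plain NumPy; real CLAHE additionally equalizes per tile and interpolates between tiles, which is omitted here.

```python
import numpy as np

def clipped_equalize(img, clip_limit=0.01, nbins=256):
    # Global, simplified contrast-limited equalization: clip each histogram
    # bin at clip_limit * npixels, spread the excess evenly over all bins,
    # then remap pixels through the resulting CDF.
    hist, _ = np.histogram(img, bins=nbins, range=(0.0, 1.0))
    limit = max(int(clip_limit * img.size), 1)
    excess = np.maximum(hist - limit, 0).sum()
    hist = np.minimum(hist, limit) + excess // nbins
    cdf = np.cumsum(hist).astype(float)
    cdf /= cdf[-1]
    idx = np.clip((img * nbins).astype(int), 0, nbins - 1)
    return cdf[idx]

rng = np.random.default_rng(1)
# low-contrast stand-in frame: values packed around mid-gray
frame = np.clip(0.5 + 0.05 * rng.standard_normal((64, 64)), 0.0, 1.0)
for clip_limit in (1.0, 0.004):
    eq = clipped_equalize(frame, clip_limit=clip_limit)
    print(f"clip_limit={clip_limit}: std {frame.std():.3f} -> {eq.std():.3f}")
```

With clip_limit = 1.0 nothing is clipped and the result approaches full equalization, while a small clip limit, such as the values found by the parameter search in this paper, keeps the mapping much closer to the identity and thus tempers the contrast change.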
Fig. 5 (c). Color tuning results for the Speed bag video sequence.
Fig. 5. Comparison of VMAF and SSIM scores for different configurations of unsharp mask and histogram equalization on the tested video sequences. The results in the second quadrant, where the SSIM values were not changed and the VMAF values increased, are of interest to us.

For the Crowd run sequence, histogram equalization with kernel size = 8 and clip limit = 0.00419 also increased VMAF (Fig. 8a and Fig. 8b). The video is more contrasted, so the decrease in SSIM was more significant. However, the two images also look similar (Fig. 9) and have similar SSIM scores, while VMAF showed a better score after the contrast transformation. Red kayak looked better according to VMAF after unsharp mask with radius = 9.436 and amount = 0.045. For Speed bag, the following parameters of unsharp mask allowed VMAF to be increased greatly without influencing SSIM: radius = 9.429, amount = 0.114.

(a) Without color correction: VMAF = 51.005, SSIM = 0.715. (b) After histogram equalization: VMAF = 53.083, SSIM = 0.712.
Fig. 8. Frame 1 from the Crowd run video sequence and its histogram with and without color correction. The two images and their histograms look almost similar.

Fig. 9. Checkerboard comparison of frame 1 from the Crowd run video sequence before and after distortions.

5. Conclusion

Video quality reference metrics are used to show the difference between original and distorted streams and are expected to take worse values when any transformations are applied to the original video. However, sometimes it is possible to deceive objective metrics. In our article, we described a way to increase the values of the popular full-reference metric VMAF. If the video is not contrasted, VMAF can be increased by color adjustments without influencing SSIM. In the other case, a contrasted video can also be tuned for VMAF, but with a little SSIM worsening.

Although VMAF has become popular and important, particularly for video codec developers and customers, there are still a number of issues in its application. This is why SSIM is used in many competitions, as well as in the MSU video-codec comparisons, as the main objective quality metric. We want to draw attention to this problem and hope to see progress in this area, which is likely to happen since the metric is being actively developed. Our further research will involve a subjective comparison of the proposed color adjustments to the original videos and the development of novel approaches for metric tuning.

6. Acknowledgments

This work was partially supported by the Russian Foundation for Basic Research under Grant 19-01-00785a.

7. References

[1] Cisco Visual Networking Index: Forecast and Methodology, 2016-2021.
[2] HEVC Video Codec Comparison 2018 (Thirteenth MSU Video Codec Comparison). http://compression.ru/video/codec_comparison/hevc_2018/
[3] MSU Quality Measurement Tool: Download Page. http://compression.ru/video/quality_measure/vqmt_download.html
[4] Perceptual Video Quality Metrics: Are They Ready for the Real World? Available online: https://www.ittiam.com/perceptual-video-quality-metrics-ready-real-world
[5] VMAF: Perceptual video quality assessment based on multi-method fusion, Netflix, Inc., 2017. https://github.com/Netflix/vmaf
[6] Xiph.org Video Test Media [derf's collection]. https://media.xiph.org/video/derf/
[7] C. G. Bampis, Z. Li, and A. C. Bovik, "Spatiotemporal feature integration and model fusion for full reference video quality assessment," in IEEE Transactions on Circuits and Systems for Video Technology, 2018.
[8] C. Chen, S. Inguva, A. Rankin, and A. Kokaram, "A subjective study for the design of multi-resolution ABR video streams with the VP9 codec," in Electronic Imaging, 2016(2), pp. 1-5.
[9] S. Chikkerur, V. Sundaram, M. Reisslein, and L. J. Karam, "Objective video quality assessment methods: A classification, review, and performance comparison," in IEEE Transactions on Broadcasting, 57(2), pp. 165-182, 2011.
[10] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," in IEEE Transactions on Evolutionary Computation, 6(2), pp. 182-197, 2002.
[11] S. Li, F. Zhang, L. Ma, and K. N. Ngan, "Image quality assessment by separately evaluating detail losses and additive impairments," in IEEE Transactions on Multimedia, 2011, 13(5), pp. 935-949.
[12] H. R. Sheikh and A. C. Bovik, "Image information and visual quality," in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004, vol. 3, pp. iii-709.
[13] S. van der Walt, J. L. Schonberger, J. Nunez-Iglesias, F. Boulogne, J. D. Warner, N. Yager, E. Gouillart, T. Yu, and the scikit-image contributors, "scikit-image: Image processing in Python," PeerJ, 2:e453, 2014.