Hacking VMAF with Video Color and Contrast Distortion

A. Zvezdakova1, S. Zvezdakov1, D. Kulikov1,2, D. Vatolin1
azvezdakova@graphics.cs.msu.ru | szvezdakov@graphics.cs.msu.ru | dkulikov@graphics.cs.msu.ru | dmitriy@graphics.cs.msu.ru
1 Lomonosov Moscow State University, Moscow, Russia; 2 Dubna State University, Dubna, Russia

Video quality measurement plays an important role in many applications. Full-reference quality metrics, which are usually used in video-codec comparisons, are expected to reflect any changes in videos. In this article, we consider different color corrections of compressed videos which increase the values of the full-reference metric VMAF and almost do not decrease another widely used metric, SSIM. The proposed video contrast enhancement approach shows that the metric is inapplicable in some cases to video-codec comparisons, as it may be used for cheating in the comparisons via tuning to improve the metric values.

Keywords: video quality, quality measuring, video-codec comparison, quality tuning, reference metrics, color correction.

1. Introduction

At the moment, video content accounts for a significant part of worldwide network traffic, and its share is expected to grow to 71% by 2021 [1]. Therefore, the quality of encoded videos is becoming increasingly important, which leads to growing interest in the development of new video quality assessment methods. As new video codec standards appear, the existing standards are being improved. In order to choose one or another video encoding solution, it is necessary to have appropriate tools for video quality assessment. Since the best method of video quality assessment is subjective evaluation, which is quite expensive in terms of time and cost, all objective methods are being improved in an attempt to approach the ground-truth solution (subjective evaluation).

Methods for evaluating the quality of encoded videos can be divided into three categories [9]: full-reference, reduced-reference and no-reference. Full-reference metrics are the most common, as their results are easily interpreted — usually as an assessment of the degree of distortions in the video and their visibility to the observer. The only drawback of this approach compared to the others is the need to have the original video for comparison with the encoded one, which is often not available.

One of the widely used full-reference metrics gaining popularity in the area of video quality assessment is Video Multimethod Assessment Fusion (VMAF) [5], announced by Netflix. It is an open-source learning-based solution. Its main idea is to combine multiple elementary video quality features, such as Visual Information Fidelity (VIF) [12], the Detail Loss Metric (DLM) [11] and temporal information (TI, the difference between two neighboring frames), and to train a support vector machine (SVM) regression on subjective data. The resulting regressor is used for estimating per-frame quality scores on new videos. The scheme [7] of this metric is shown in Fig. 1.

Fig. 1. The scheme of the VMAF algorithm.

Despite increasing attention to this metric, many video quality analysis projects, such as Moscow State University's (MSU) Annual Video Codec Comparison [2], still use other common metrics developed many years ago, such as structural similarity (SSIM) and even peak signal-to-noise ratio (PSNR), which are based only on difference characteristics of two images. At the same time, many readers of the reports of these comparisons send requests to use new metrics such as VMAF. The main obstacle to a full transition to VMAF is the non-versatility of this metric and its not fully adequate results on some types of videos [4].

The main goal of our investigation was to prove the non-universality of the current version of the VMAF algorithm. In this paper, we describe video color and contrast transformations which increase the VMAF score while keeping the SSIM score the same or better. The possibility of improving a full-reference metric score by adding transformations to a distorted image means that the metric can be cheated in some cases. Such transformations may allow competitors, for example, to cheat in video-codec comparisons if they "tune" their codecs to increase VMAF quality scores. The types of video distortions that we were looking for change the visual quality of the video, which should lead to a decrease in the value of any full-reference metric. The fact that they lead to an increase in the value of VMAF is a significant obstacle to using VMAF as the main quality indicator for all types of videos and proves the need to modify the original VMAF algorithm.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Study Method

During testing of the VMAF algorithm for video-codec comparisons, we noticed that it reacts to contrast distortions, so we chose color and contrast adjustments as the basic types of the searched video transformations. Two famous and common approaches to color adjustment were tested to find the best strategy for increasing VMAF scores. Two cases of applying the transformations to the video were tested: applying the transformation before and after video encoding. In general, there was no significant difference between these options, because the compression step can be omitted when increasing VMAF with color enhancement. Therefore, in the following we describe only the first case, with adjustment before compression, and we keep the compression step, because in our work VMAF tuning is considered in the context of video-codec comparisons.

We chose four videos which represent different spatial and temporal complexity [8], content and contrast to test transformations which may influence VMAF scores. All videos have Full HD resolution and a high bit rate. Bay time-lapse and Red kayak were filmed in flat colors, which usually require color post-processing. Three of the videos (Crowd run, Red kayak and Speed bag) were taken from the open video collection on media.xiph.org, and one was taken from the MSU video collection used for selecting test video sets for the annual video-codec comparison [2]. The description (and sources) of the first three videos can be found on site [6]; the remaining Bay time-lapse video sequence contains a scene with grass and with waves on the water.

SSIM and VMAF scores were calculated for each video processed with the considered color enhancement algorithms with different parameters. As mentioned before, after color correction the videos were compressed with the medium preset of the x264 encoder at 3 Mbps. Then the differences between the metric scores of the processed videos and of the original video were calculated to compare how the color corrections influenced the quality scores. Fig. 2 shows this difference for the SSIM metric on the Bay time-lapse video sequence for different parameter values of the unsharp mask algorithm. The corresponding scores for the VMAF quality metric are presented in Fig. 3.

Fig. 2. SSIM scores for different parameters of unsharp mask on the Bay time-lapse video sequence.
Fig. 3. VMAF scores for different parameters of unsharp mask on the Bay time-lapse video sequence.

On these plots, higher values mean that the objective quality of the color-adjusted video was better according to the metric. VMAF shows better scores for a high radius and a medium amount of unsharp mask, while SSIM becomes worse for a high radius and a high amount.
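The parameter sweep behind Figs. 2 and 3 is easy to prototype. The sketch below is a minimal stand-in under stated assumptions, not the paper's actual pipeline: it applies a hand-rolled unsharp mask to a synthetic luma plane (the study uses skimage.filters.unsharp_mask on real frames) and scores the result with a simplified single-window SSIM in place of the windowed SSIM and VMAF used in the study.

```python
import numpy as np

def gaussian_kernel(radius):
    # 1-D Gaussian; sigma tied to radius as a rough stand-in for a blur radius
    sigma = max(radius / 2.0, 1e-6)
    x = np.arange(-int(radius), int(radius) + 1, dtype=float)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, radius):
    # separable blur: convolve each row, then each column
    k = gaussian_kernel(radius)
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, rows)

def unsharp_mask(img, radius, amount):
    # classic unsharp mask: add back 'amount' of the detail removed by blurring
    return np.clip(img + amount * (img - gaussian_blur(img, radius)), 0.0, 1.0)

def global_ssim(a, b, data_range=1.0):
    # single-window SSIM over the whole image; real SSIM slides a local window
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2))

rng = np.random.default_rng(0)
frame = rng.random((64, 64))  # stand-in for one luma plane
for radius in (2.0, 9.0):
    for amount in (0.1, 1.0):
        score = global_ssim(frame, unsharp_mask(frame, radius, amount))
        print(f"radius={radius:4.1f} amount={amount:4.1f} SSIM={score:.3f}")
```

A full reproduction would additionally score each processed clip with the vmaf tool and plot the score deltas over the (radius, amount) grid, as in Figs. 2 and 3.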
Three versions of VMAF were tested: 0.6.1, 0.6.2 and 0.6.3. The implementations of all three metric versions from the MSU Video Quality Measurement Tool [3] were used. The results did not differ much, so the following plots are presented for the latest version (0.6.3) of VMAF.

3. Proposed Tuning Algorithm

For color and brightness adjustment, two well-known and widespread image processing algorithms were chosen: unsharp mask and histogram equalization. We used the implementations of these algorithms available in the open-source scikit-image [13] library. In this library, unsharp mask has two parameters which influence image levels: radius (the radius of the Gaussian blur) and amount (how much contrast is added at the edges). For histogram equalization, the clip-limit parameter was analyzed. In order to find optimal configurations of the equalization parameters, the multi-objective optimization algorithm NSGA-II [10] was used. Only the limits for the parameters were given to the genetic algorithm, and it was applied to find the best parameters for each test video.

The optimal values of the algorithm parameters can be estimated from the difference in these plots (Figs. 2 and 3). For the other color adjustment algorithm, histogram equalization, one parameter was optimized, and the results are presented in Fig. 4 together with the results of unsharp mask.

Fig. 4. Comparison of VMAF and SSIM scores for different configurations of unsharp mask and histogram equalization on the Bay time-lapse video sequence. The results in the second quadrant, where the SSIM values were not changed and the VMAF values increased, are of interest to us.

According to these results, for some configurations of histogram equalization VMAF becomes significantly better (from 68 to 74) while SSIM does not change much (a decrease from 0.88 to 0.86). The results differ slightly for the other videos. On the Crowd run video sequence, VMAF was not increased by unsharp mask (Fig. 5a) and was increased a little by histogram equalization. For the Red kayak and Speed bag videos, unsharp mask could significantly increase VMAF while only slightly decreasing SSIM (Fig. 5b and Fig. 5c).

4. Results

The following examples of frames from the test videos demonstrate color corrections which increased VMAF and almost did not influence the values of SSIM. Unsharp mask with radius = 2.843 and amount = 0.179 increased VMAF without a significant decrease of SSIM for Bay time-lapse (Fig. 6a and Fig. 6b). The images before and after masking look equivalent (a comparison in a checkerboard view is in Fig. 7) and have similar SSIM scores, while the VMAF score is better after the transformation.

Fig. 5 (a). Color tuning results for the Crowd run video sequence.
Fig. 5 (b). Color tuning results for the Red kayak video sequence.

(a) Without color correction: VMAF = 68.160, SSIM = 0.879. (b) After unsharp mask: VMAF = 72.009, SSIM = 0.878.
Fig. 6. Frame 5 from the Bay time-lapse video sequence and its histogram with and without contrast correction. The two images and their histograms look equivalent.

Fig. 7. Checkerboard comparison of frame 5 from the Bay time-lapse video sequence before and after distortions. The two images look almost equivalent.
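The second transform of Section 3 is contrast-limited histogram equalization, where the clip limit caps how strongly any gray level is stretched (in scikit-image this is exposure.equalize_adapthist with its clip_limit and kernel_size parameters). As a rough illustration of the clip limit alone, here is a simplified global (non-tiled) variant in plain NumPy; real CLAHE additionally equalizes per tile and interpolates between tiles, which is omitted here.

```python
import numpy as np

def clipped_equalize(img, clip_limit=0.01, nbins=256):
    # Global, simplified contrast-limited equalization: clip each histogram
    # bin at clip_limit * npixels, spread the excess evenly over all bins,
    # then remap pixels through the resulting CDF.
    hist, _ = np.histogram(img, bins=nbins, range=(0.0, 1.0))
    limit = max(int(clip_limit * img.size), 1)
    excess = np.maximum(hist - limit, 0).sum()
    hist = np.minimum(hist, limit) + excess // nbins
    cdf = np.cumsum(hist).astype(float)
    cdf /= cdf[-1]
    idx = np.clip((img * nbins).astype(int), 0, nbins - 1)
    return cdf[idx]

rng = np.random.default_rng(1)
# low-contrast stand-in frame: values packed around mid-gray
frame = np.clip(0.5 + 0.05 * rng.standard_normal((64, 64)), 0.0, 1.0)
for clip_limit in (1.0, 0.004):
    eq = clipped_equalize(frame, clip_limit=clip_limit)
    print(f"clip_limit={clip_limit}: std {frame.std():.3f} -> {eq.std():.3f}")
```

With clip_limit = 1.0 nothing is clipped and the result approaches full equalization, while a small clip limit, such as the values found by the parameter search in this paper, keeps the mapping much closer to the identity and thus tempers the contrast change.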
Fig. 5 (c). Color tuning results for the Speed bag video sequence.
Fig. 5. Comparison of VMAF and SSIM scores for different configurations of unsharp mask and histogram equalization on the tested video sequences. The results in the second quadrant, where the SSIM values were not changed and the VMAF values increased, are of interest to us.

For the Crowd run sequence, histogram equalization with kernel size = 8 and clip limit = 0.00419 also increased VMAF (Fig. 8a and Fig. 8b). The video is more contrasted, so the decrease in SSIM was more significant. However, the two images also look similar (Fig. 9) and have similar SSIM scores, while VMAF showed a better score after the contrast transformation. Red kayak looked better according to VMAF after unsharp mask with radius = 9.436 and amount = 0.045. For Speed bag, the following parameters of unsharp mask allowed VMAF to be increased greatly without influencing SSIM: radius = 9.429, amount = 0.114.

(a) Without color correction: VMAF = 51.005, SSIM = 0.715. (b) After histogram equalization: VMAF = 53.083, SSIM = 0.712.
Fig. 8. Frame 1 from the Crowd run video sequence and its histogram with and without color correction. The two images and their histograms look almost similar.

Fig. 9. Checkerboard comparison of frame 1 from the Crowd run video sequence before and after distortions.

5. Conclusion

Video quality reference metrics are used to show the difference between original and distorted streams and are expected to take worse values when any transformations are applied to the original video. However, sometimes it is possible to deceive objective metrics. In our article, we described a way to increase the values of the popular full-reference metric VMAF. If the video is not contrasted, VMAF can be increased by color adjustments without influencing SSIM. In the other case, a contrasted video can also be tuned for VMAF, but with a little SSIM worsening.

Although VMAF has become popular and important, particularly for video codec developers and customers, there are still a number of issues in its application. This is why SSIM is used in many competitions, as well as in the MSU video-codec comparisons, as the main objective quality metric. We want to draw attention to this problem and hope to see progress in this area, which is likely to happen since the metric is being actively developed. Our further research will involve a subjective comparison of the proposed color adjustments to the original videos and the development of novel approaches for metric tuning.

6. Acknowledgments

This work was partially supported by the Russian Foundation for Basic Research under Grant 19-01-00785a.

7. References

[1] Cisco Visual Networking Index: Forecast and Methodology, 2016-2021.
[2] HEVC Video Codec Comparison 2018 (Thirteenth MSU Video Codec Comparison). http://compression.ru/video/codec_comparison/hevc_2018/
[3] MSU Quality Measurement Tool: Download Page. http://compression.ru/video/quality_measure/vqmt_download.html
[4] Perceptual Video Quality Metrics: Are They Ready for the Real World? Available online: https://www.ittiam.com/perceptual-video-quality-metrics-ready-real-world
[5] VMAF: Perceptual video quality assessment based on multi-method fusion, Netflix, Inc., 2017. https://github.com/Netflix/vmaf
[6] Xiph.org Video Test Media [derf's collection]. https://media.xiph.org/video/derf/
[7] C. G. Bampis, Z. Li, and A. C. Bovik, "Spatiotemporal feature integration and model fusion for full reference video quality assessment," in IEEE Transactions on Circuits and Systems for Video Technology, 2018.
[8] C. Chen, S. Inguva, A. Rankin, and A. Kokaram, "A subjective study for the design of multi-resolution ABR video streams with the VP9 codec," in Electronic Imaging, 2016(2), pp. 1-5.
[9] S. Chikkerur, V. Sundaram, M. Reisslein, and L. J. Karam, "Objective video quality assessment methods: A classification, review, and performance comparison," in IEEE Transactions on Broadcasting, 57(2), pp. 165-182, 2011.
[10] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," in IEEE Transactions on Evolutionary Computation, 6(2), pp. 182-197, 2002.
[11] S. Li, F. Zhang, L. Ma, and K. N. Ngan, "Image quality assessment by separately evaluating detail losses and additive impairments," in IEEE Transactions on Multimedia, 2011, 13(5), pp. 935-949.
[12] H. R. Sheikh and A. C. Bovik, "Image information and visual quality," in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004, vol. 3, pp. iii-709.
[13] S. van der Walt, J. L. Schonberger, J. Nunez-Iglesias, F. Boulogne, J. D. Warner, N. Yager, E. Gouillart, T. Yu, and the scikit-image contributors, "scikit-image: Image processing in Python," PeerJ, 2:e453, 2014.