        Evaluating Video Quality by Differentiating
        Between Spatial and Temporal Distortions

              Meisam Jamshidi Seikavandi1 and Seyed Ali Amirshahi2
    1
     Nikoo Dana Fanavari Delfan, Technology and Science Park of Lorestan, Iran
                             meisamjam@gmail.com
2
  The Norwegian Colour and Visual Computing Laboratory, Norwegian University of
                    Science and Technology, Gjøvik, Norway
                           s.ali.amirshahi@ntnu.no


         Abstract. To objectively evaluate the quality of videos, different state-
         of-the-art Image Quality Metrics (IQMs) have been used to introduce
         different Video Quality Metrics (VQMs). While such approaches are able
         to evaluate the spatial quality of the frames in a video, they are not
         able to address the temporal aspects of video quality. In this study,
         we introduce a new full-reference VQM which takes advantage of a
         Convolutional Neural Network (CNN) based IQM to evaluate the quality
         of each frame. Using other techniques such as visual saliency detection,
         we are then able to differentiate between spatial and temporal distortions
         and use different pooling techniques to evaluate the quality of the video.
         Our results show that by detecting the type of distortion (spatial or
         temporal) affecting the video, the proposed VQM can evaluate the quality
         of the video with a higher accuracy.

         Keywords: Video Quality Assessment, Video Saliency, Image Saliency,
         Spatial Distortion, Temporal Distortion, Temporal pooling.


1       Introduction
With the huge amount of video we have access to in our daily life, evaluat-
ing the quality of videos is an essential part of any application that deals with
videos. Although subjective assessment is still considered the primary standard
for Video Quality Assessment (VQA), it is time-consuming and financially ex-
pensive to perform on a regular basis. For this and many other reasons, in the
last few decades, objective assessment of video quality has attracted much atten-
tion. Objective assessment methods, known as Video Quality Metrics (VQMs)
have been widely used to estimate the quality of videos [7]. Depending on the
availability of the reference video, VQMs can be classified into full-reference,
reduced-reference, and no-reference. Full-reference VQMs need access to the reference
video, while reduced-reference metrics require partial information about the reference
video and no-reference metrics only have access to the test video.
    Copyright © 2020 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0). Colour and Visual
    Computing Symposium 2020, Gjøvik, Norway, September 16-17, 2020.

     Full-reference VQMs can be further classified into error-sensitivity-based meth-
ods [10,18], structural-similarity-based approaches [39], information-fidelity-based
approaches [35], spatial-temporal approaches [34], saliency-based approaches [25], and
network-aware approaches [13]. Many of the mentioned methods are extended versions of
Image Quality Metrics (IQMs), generally followed by a pooling stage that bridges
Image Quality Assessment (IQA) and VQA. In
this study, we combine different approaches, such as metrics based on the use of
Convolutional Neural Networks (CNNs) and saliency techniques, to calculate a
series of quality values for the video frames. Using different weighting techniques
that depend on the type of distortion (spatial or temporal) affecting the video,
the quality of the frames is pooled to represent the video quality score.
     Keeping in mind that VQA has been a field of research for over two decades,
it is no surprise that a high number of different VQMs have been introduced. Like
any other field of research in image processing and computer vision, early VQMs
were based on introducing different single or multiple handcrafted features for
VQA. While initially these features were pure mathematical techniques such as the
Mean Square Error (MSE) and the Peak Signal to Noise Ratio (PSNR) [40], over time a
shift towards features that try to model the Human Visual System (HVS) can be seen in
VQMs [17,29,33]. While the early VQMs were mostly focused on the spatial aspects of
the video [10,18,39], with the introduction of temporal features VQMs showed an
improvement in their performance [15,19,21,24]. Since most VQMs provide spatial,
temporal, and/or spatial-temporal quality values, different pooling techniques have
been used to combine them into a single score. As an example, saliency maps have been
used to assign different weights to different regions of a frame or to different
frames, leading to different VQMs [3,6].
Finally, in recent years, with the introduction of state-of-the-art machine learn-
ing techniques and especially Convolutional Neural Networks (CNNs), again, a
big improvement has been observed in the accuracy of VQMs [2,14,19,41].
     Our contribution in this paper can be summarized as follows: 1) by using video
saliency maps, we introduce a spatial-temporal dimension to a state-of-the-art IQM
and use the approach for video quality assessment; 2) by applying temporal and
spatial-temporal pooling techniques to the frame quality scores, two different
quality scores are calculated for the video; 3) a new content-based evaluation is
introduced that is able to detect the type of distortion (temporal or spatial) and
propose a VQM based on the detected distortion.
     The rest of the paper is organized as follows: in Section 2, we provide a
detailed description of the proposed approach, while the experimental results are
presented in Section 3. Finally, Section 4 concludes the work and outlines the future
directions we plan to take to extend it.


2   Proposed Approach

Our proposed VQM (Figure 1) is based on the extraction of spatial and temporal
features from the video. Apart from the main VQM introduced in this study,


[Figure 1: block diagrams of the proposed pipeline; see the caption below.]

     (a) Spatial and spatial-temporal quality assessment for each frame.

(b) Different pooling methods used between the quality of each frame for evaluating
the quality of the video.

Fig. 1. Pipeline used for calculating different VQMs proposed in this study. In the
figure blocks with a magenta shade correspond to spatial features, blocks with a green
shade correspond to temporal features, and blocks with a gradient shade of magenta
to green correspond to spatial-temporal features.




and to better study the influence of the different features used in our VQM, we
also propose other VQMs based solely on one or multiple features introduced.

2.1   Spatial Approach
As pointed out in Section 1, IQA and VQA are closely linked. In fact, when
it comes to extracting spatial features from videos, a high number of features
used for evaluating the video quality were initially introduced in IQMs. In other
words, in the case of spatial features, different VQMs simply apply features
introduced in different IQMs to each frame of the video [2,11,27].
    In this study, we aim to introduce a new VQM which takes advantage of the IQM
proposed in [8]. In [8], Amirshahi et al. propose an IQM that is based on calculating
the similarity between the feature maps extracted at different convolutional layers
of a pre-trained CNN. Their hypothesis, which was inspired by the use of Pyramid
Histogram of Orientation Gradients (PHOG) [12] features for calculating
self-similarity in images [4,9,30], is that the more similar the feature maps of the
test and reference images are at different convolutional layers, the more similar the
quality of the two images is. To calculate the similarity between two feature maps,
they take the following steps (a compact sketch of these steps is given after the
list):
 1. From the reference (IR ) and test (IT ) images feature maps are extracted at
    different convolutional layers.
 2. For the test image I_T in convolutional layer n, the histogram

        h(I_T, n, L) = \Big( \sum_{i=1}^{X} \sum_{j=1}^{Y} F(I_T, n, L, 1)(i,j),\;
                             \sum_{i=1}^{X} \sum_{j=1}^{Y} F(I_T, n, L, 2)(i,j),\; \cdots,
                             \sum_{i=1}^{X} \sum_{j=1}^{Y} F(I_T, n, L, z)(i,j),\; \cdots,
                             \sum_{i=1}^{X} \sum_{j=1}^{Y} F(I_T, n, L, M)(i,j) \Big)          (1)

    is calculated. In Eq. (1), L corresponds to the level of the spatial pyramid at
    which the histograms are calculated, and F(I_T, n, L, z) corresponds to feature
    map z in the n-th convolutional layer of image I_T at level L, with a size of
    X × Y. To take a pyramid approach, Amirshahi et al. recursively divide the
    feature maps into four equal sub-regions, resulting in different h histograms
    (Eq. (1)) at different levels (L) of spatial resolution. The division and
    calculation of h continues until the smallest side of the smallest sub-region is
    equal to or larger than seven pixels.
 3. The quality of the test image at level L for convolutional layer n is then
    calculated by

        m_{IQM}(I_T, n, L) = d_{HIK}(h(I_T, n, L), h(I_R, n, L))
                           = \sum_{i=1}^{n} \min\big( h_i(I_T, n, L), h_i(I_R, n, L) \big).   (2)

 4. The concatenation of all m_{IQM}(I_T, n, l) values

        m_{IQM}(I_T, n) = \big( m_{IQM}(I_T, n, 1), m_{IQM}(I_T, n, 2), \cdots,
                                m_{IQM}(I_T, n, l), \cdots, m_{IQM}(I_T, n, L) \big),         (3)
    would then be used by

        IQ(I_T, n) = \frac{1 - \sigma(m_{IQM}(I_T, n))}{\sum_{l=1}^{L} \frac{1}{l}}
                     \sum_{l=1}^{L} \frac{1}{l} \cdot m_{IQM}(I_T, n, l)                      (4)

    to calculate the quality of the test image at convolutional layer n. In Eq. (4),
    σ(mIQM (IT , n)) corresponds to the standard deviation among the values in
    mIQM (IT , n).
 5. Finally, the overall quality of the test image is calculated using a geometric
    mean of all quality scores at different convolutional layers

        IQ(I_T) = \sqrt[N]{\prod_{n=1}^{N} IQ(I_T, n)}                                        (5)

      where N corresponds to the total number of convolutional layers.
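
    For clarity, the following is a minimal NumPy sketch of steps 1-5, assuming the
feature maps of a frame have already been extracted from a pre-trained CNN (e.g.,
Alexnet) and are passed in as one array of shape (M, X, Y) per convolutional layer.
The function names, the histogram normalisation, and the handling of sub-regions
smaller than seven pixels are our own simplifications rather than the exact
implementation of [8].

```python
import numpy as np

def pyramid_histograms(feat, max_level=3):
    """Eq. (1): at every pyramid level, sum each of the M feature maps over each
    spatial sub-region; the sub-region histograms of a level are concatenated."""
    M, X, Y = feat.shape
    hists = []
    for level in range(max_level + 1):
        splits = 2 ** level
        h = []
        for bx in np.array_split(np.arange(X), splits):
            for by in np.array_split(np.arange(Y), splits):
                if len(bx) < 7 or len(by) < 7:      # skip sub-regions below 7 pixels
                    continue
                h.append(feat[:, bx][:, :, by].sum(axis=(1, 2)))
        if h:
            hists.append(np.concatenate(h))
    return hists

def hik(h_t, h_r):
    """Eq. (2): histogram intersection between the (normalised) histograms."""
    h_t = h_t / (h_t.sum() + 1e-12)
    h_r = h_r / (h_r.sum() + 1e-12)
    return float(np.minimum(h_t, h_r).sum())

def layer_quality(feat_t, feat_r):
    """Eqs. (2)-(4): per-level similarities pooled into one score for a layer."""
    m = np.array([hik(ht, hr) for ht, hr in zip(pyramid_histograms(feat_t),
                                                pyramid_histograms(feat_r))])
    w = 1.0 / np.arange(1, len(m) + 1)              # the 1/l weights of Eq. (4)
    return (1.0 - m.std()) * np.dot(w, m) / w.sum()

def frame_quality(feats_t, feats_r):
    """Eq. (5): geometric mean of the layer scores of one frame."""
    per_layer = [layer_quality(ft, fr) for ft, fr in zip(feats_t, feats_r)]
    return float(np.prod(per_layer) ** (1.0 / len(per_layer)))
```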
   While the study presented in [8] was mainly focused on the use of the Alexnet
model [20], it was shown that it is also possible to use other, deeper CNN models
such as VGG16 and VGG19 [36]. Different studies have shown the flexibility of the
mentioned IQM and how it can be extended to improve the performance of other IQMs
[5]. For this reason, in this study, we take advantage of this IQM to evaluate the
spatial quality of the video frames (Figure 1).
   To evaluate the spatial quality of the test video (V_T), the average quality of
the frames

        VQ_1(V_T) = \frac{\sum_{i=1}^{N} IQ(V_{F_{T_i}})}{N}                                  (6)

could be used. In Eq. (6), V_{F_{T_i}} represents the i-th frame and N corresponds to
the number of frames in the test video.
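
   Given the frame_quality() sketch above, Eq. (6) is simply the mean over frames
(the helper name below is hypothetical):

```python
def vq1(video_feats_t, video_feats_r):
    """Eq. (6): average frame quality over the N frames of the test video."""
    scores = [frame_quality(ft, fr) for ft, fr in zip(video_feats_t, video_feats_r)]
    return float(sum(scores) / len(scores))
```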

2.2     Spatial-Temporal Approach
It is clear that without taking into account the temporal aspects of a video, any
VQM would lack accuracy. In our approach, the first feature extracted from the videos
is visual saliency, which is linked to the spatial-temporal aspects of the video
quality. Different studies have shown the important role visual saliency
plays in IQA and VQA [3]. While there are a considerable number of different
methods to calculate the saliency maps of images and video, the Graph-Based
Visual Saliency (GBVS) [16] approach is one of the well-known techniques which
has shown good accuracy for image and video saliency detection. It is important
to point out that while image saliency calculation is purely based on spatial
features, when it comes to video saliency, temporal aspects of the video are
also considered and so video saliency calculation could be linked to the spatial-
temporal properties of the video.
    In our approach, we first calculate the saliency map for the test and reference
videos. The saliency map of each frame is then resized to the size of the input
of the network. Mirroring the pooling layers of the pre-trained CNN model used, we
apply max-pooling to the saliency map of each frame, resulting in a set of saliency
maps whose sizes match the sizes of the feature maps at the different convolutional
layers of our model. The pooled saliency maps are then used as pixel-wise weights for
the features in the different convolutional layers. This allows us to give higher
weights to regions of the feature maps that are more salient to the observer. The
quality of the video is then calculated by

        VQ_2(V_T) = \frac{\sum_{i=1}^{N} IQ(SW V_{F_{T_i}})}{N}                               (7)

in which IQ(SW V_{F_{T_i}}) corresponds to the quality of V_{F_{T_i}} when the
saliency map of the frame is used as a weighting function on the feature maps at each
convolutional layer (Figure 1).
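
    As an illustration of this weighting, the sketch below (our own simplification,
not the authors' code) max-pools a precomputed GBVS saliency map down to the spatial
size of each layer's feature maps and multiplies it into the feature maps before the
histograms of Eq. (1) are built; VQ_2 then reuses frame_quality() from the sketch in
Section 2.1 on the weighted features, with the test and reference frames weighted by
their own video saliency maps as in Figure 1(a).

```python
import numpy as np

def pool_saliency(sal, target_hw):
    """Repeatedly 2x2 max-pool a frame's saliency map until it is no larger than the
    spatial size of a layer's feature maps (even sizes assumed), then crop to fit."""
    s = np.asarray(sal, dtype=float)
    while (s.shape[0] // 2 >= target_hw[0]
           and s.shape[0] % 2 == 0 and s.shape[1] % 2 == 0):
        s = s.reshape(s.shape[0] // 2, 2, s.shape[1] // 2, 2).max(axis=(1, 3))
    return s[:target_hw[0], :target_hw[1]]

def saliency_weighted_features(frame_feats, sal):
    """Weight every feature map of every layer by the pooled saliency map, so that
    salient regions dominate the histograms of Eq. (1)."""
    weighted = []
    for layer_feat in frame_feats:                   # layer_feat has shape (M, X, Y)
        w = pool_saliency(sal, layer_feat.shape[1:])
        weighted.append(layer_feat * w[None, :, :])
    return weighted

def vq2(video_feats_t, video_feats_r, sal_t, sal_r):
    """Eq. (7): average over frames of the frame quality computed on the
    saliency-weighted feature maps of the test and reference frames."""
    scores = [frame_quality(saliency_weighted_features(ft, st),
                            saliency_weighted_features(fr, sr))
              for ft, fr, st, sr in zip(video_feats_t, video_feats_r, sal_t, sal_r)]
    return float(np.mean(scores))
```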


2.3   Temporal Approach

Although different VQMs try to take into account the spatial and temporal aspects of
the video, most VQMs provide a single quality score for each video. To reach this
single quality score, different pooling techniques are used to combine the quality
scores of all video frames. While careful attention has been paid to how the quality
score of each frame is calculated, most, if not all, pooling approaches are based on
some version of averaging the quality scores of all frames. The average value,
geometric mean, harmonic mean, and Minkowski mean are some of the different types of
averaging used in different VQMs. It is clear that using any type of averaging on the
quality values of the frames could result in disregarding different aspects of the
video that could be linked to the HVS. In this study, to better link the video
quality score to how observers react to the change of quality in a video clip, we try
a new approach for pooling the quality scores of the frames.
    Recent studies such as [28] have suggested that the overall perceptual quality
of a video is highly dependent on the temporal variation of the video quality.
That is, with an increase in the temporal variation of the video quality along
the video sequence, the video quality declines. To address this aspect, we use
the variation of the quality scores of the frames in our pooling approach. While
the variance of the quality scores of the frames could be a good description of
the quality fluctuation in the video, it only provides a general description of the
video quality. To better consider the temporal variation of the video quality, we
calculate the variance of quality scores in a specific time frame. That is, for the
ith frame of the test video (VTi ) we calculate

             LocalVar(V_{T_i}) = \sigma^2\big( VQ_{T_{i-L}}, \cdots, VQ_{T_i}, \cdots, VQ_{T_{i+L}} \big).   (8)

In Eq. (8), the length of the local window in which we calculate the variance
(σ 2 ) of the frame quality score is 2L + 1. Based on our experimental results, the
best value for L is 2 resulting in a window of five frames. To introduce a better
regional representation of the quality score for the video we calculate the video
quality using
        VWVQ(V_T) = \frac{\sum_{i=1}^{N} W_i \times IQ(V_{F_{T_i}})}{\sum_{i=1}^{N} W_i},

        W_i = \begin{cases} 1 & \text{if } LocalVar(V_{T_i}) > GlobalVar(V_T) \\
                            0 & \text{if } LocalVar(V_{T_i}) < GlobalVar(V_T) \end{cases}     (9)
In Eq. (9), W_i corresponds to the weight given to the quality score of the i-th
frame of the video (V_{F_{T_i}}) and GlobalVar(V_T) represents the variance of all
the frame quality scores in the test video. From Eq. (9) it is clear that a weight of
one is given to the quality of a frame if the local variance of the quality scores is
larger than the global variance, while a weight of zero is given if the local
variance is lower than the global variance. Simply said, the quality of a frame is
only considered if the change of video quality in a given local interval
([V_{T_{i-L}}, V_{T_{i+L}}]) is bigger than the change of frame quality over the
total duration of the video.
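
    A compact sketch of this pooling, assuming the per-frame quality scores have
already been computed; the window length L = 2 follows the text, while the shrinking
windows at the sequence borders and the fallback for the degenerate all-zero case are
our own choices:

```python
import numpy as np

def vwvq(frame_scores, L=2):
    """Eqs. (8)-(9): keep a frame only when the variance of its local window of
    2L+1 quality scores exceeds the variance of all the frame scores."""
    q = np.asarray(frame_scores, dtype=float)
    global_var = q.var()
    local_var = np.array([q[max(0, i - L): i + L + 1].var() for i in range(len(q))])
    w = (local_var > global_var).astype(float)
    if w.sum() == 0:                  # degenerate case: fall back to plain averaging
        return float(q.mean())
    return float(np.dot(w, q) / w.sum())
```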

2.4   Spatial vs. Spatial-Temporal Distortion Detection
Although saliency maps have mostly been used to detect salient regions in images
and/or videos, studies such as [26,31] have used saliency maps to differentiate
between salient and non-salient frames. While this labeling process is simply done by
calculating the total energy of the saliency map in each frame, we go one step
further. That is, by comparing the saliency maps calculated for the frames using the
GBVS video and image saliency techniques, we are able to differentiate between frames
that are mostly influenced by spatial or by spatial-temporal distortions (see Section
2.2 for a description of the difference between saliency maps calculated with the
image and video saliency techniques). The following steps are taken for this process:
 1. Assuming the total energy of the saliency in the ith frame in the test video
    (VF Ti ) is equal to
                E_{V_{T_i}} = \sum_{x=1}^{X} \sum_{y=1}^{Y} \big( Video\_Sal(V_{F_{T_i}})(x, y) \big)^2,     (10)

    we calculate similar values for EVRi , EITi , and EIRi which represent the
    total energy of the ith frame in the reference video using a video saliency
    approach, the total energy of the ith frame in the test video using an image
    saliency approach, and the total energy of the ith frame in the reference video
    using an image saliency approach respectively. In Eq. (10), the ith frame has
    a size of X × Y and Video Sal represents the video saliency function used.
 2. The difference between the total salient energy of the reference and test
    frame using video and image saliency is calculated by
                dE_{V_{T_i}} = \frac{| E_{V_{T_i}} - E_{V_{R_i}} |}{E_{V_{R_i}}},                            (11)

                dE_{I_{T_i}} = \frac{| E_{I_{T_i}} - E_{I_{R_i}} |}{E_{I_{R_i}}}.                            (12)
   While dEVTi could be linked to spatial-temporal aspects of the video, dEITi
   is linked to the spatial aspects.
 3. Assuming that the reference video has a higher quality than the test video, in
    the cases in which dE_{V_{T_i}} is larger than dE_{I_{T_i}} it can be interpreted
    that the distortion most likely has a stronger spatial-temporal effect on the
    video than a purely spatial one (a short sketch of this energy comparison is
    given after the list).
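
    A short sketch of this energy comparison, assuming the GBVS video and image
saliency maps have already been computed for every frame (the epsilon guarding
against division by zero is our own addition):

```python
import numpy as np

def saliency_energy(sal_map):
    """Eq. (10): total energy of a frame's saliency map."""
    return float((np.asarray(sal_map, dtype=float) ** 2).sum())

def relative_energy_difference(sal_test, sal_ref, eps=1e-12):
    """Eqs. (11)-(12): relative difference in salient energy between the test and the
    reference frame; pass video saliency maps for dEV, image saliency maps for dEI."""
    e_t, e_r = saliency_energy(sal_test), saliency_energy(sal_ref)
    return abs(e_t - e_r) / (e_r + eps)

def frame_is_spatial_temporal(dEV_i, dEI_i):
    """Step 3: treat the distortion on a frame as mainly spatial-temporal when dEV
    exceeds dEI, and as mainly spatial otherwise."""
    return dEV_i > dEI_i
```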
    While in Section 2.3 the variance weighted VQM (VWVQ) was introduced, in this
section, by detecting the type of distortion (spatial or spatial-temporal), we
introduce the energy weighted VQM

        EWVQ(V_T) = \frac{\sum_{i=1}^{N} EW_i \times IQ(V_{F_{T_i}})}{\sum_{i=1}^{N} EW_i},

        EW_i = \begin{cases} dE_{V_{T_i}} & \text{if } dE_{I_{T_i}} < dE_{V_{T_i}} \\
                             dE_{I_{T_i}} & \text{if } dE_{I_{T_i}} > dE_{V_{T_i}} \end{cases}               (13)
    So far we have introduced two quality values for each video (the variance
weighted video quality and the energy weighted video quality). Since the two methods
follow different approaches, it is highly possible to find situations in which one of
them performs better than the other. Although finding a perfect rule that ideally
detects such situations is challenging, as a first step we introduce the following
two parameters

        dEV_{ALL} = \sum_{i=1}^{N} dE_{V_{T_i}},                                              (14)

        dEI_{ALL} = \sum_{i=1}^{N} dE_{I_{T_i}}.                                              (15)
The final video quality score which we refer to as the combined video quality is
then calculated by

      combined\,VQ(V_T) = \big( W_H \times VWVQ + (1 - W_H) \times EWVQ \big),

                W_H = \begin{cases} 1 & \text{if } dEI_{ALL} < dEV_{ALL} \\
                                    0 & \text{if } dEI_{ALL} > dEV_{ALL} \end{cases}          (16)

using the dEV_{ALL} and dEI_{ALL} values introduced earlier. Obviously, a better
weighting approach than a simple zero-or-one choice for W_H is possible; we address
this issue in the following sections.
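
    The two pooling rules can then be combined as sketched below (again a
simplification under the same assumptions; vwvq() and the per-frame dEV/dEI values
come from the earlier sketches):

```python
import numpy as np

def ewvq(frame_scores, dEV, dEI):
    """Eq. (13): weight each frame's quality by the larger of its two relative
    energy differences (spatial-temporal dEV or spatial dEI)."""
    q, dEV, dEI = (np.asarray(a, dtype=float) for a in (frame_scores, dEV, dEI))
    ew = np.where(dEI < dEV, dEV, dEI)
    return float(np.dot(ew, q) / (ew.sum() + 1e-12))

def combined_vq(frame_scores, dEV, dEI, L=2):
    """Eqs. (14)-(16): the binary switch W_H selects between the variance-weighted
    and the energy-weighted pooling of the same frame quality scores."""
    dEV, dEI = np.asarray(dEV, dtype=float), np.asarray(dEI, dtype=float)
    WH = 1.0 if dEI.sum() < dEV.sum() else 0.0
    return WH * vwvq(frame_scores, L) + (1.0 - WH) * ewvq(frame_scores, dEV, dEI)
```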


3   Experimental Results
To evaluate the performance of the proposed VQMs we calculate the correlation
between the subjective scores in different subjective datasets and the objective
quality scores from the VQMs.

3.1   Datasets Used
To test the accuracy of our proposed VQMs, two different datasets which are widely
used in the scientific community are used. While one dataset (CSIQ) focuses on
covering different types of distortion, the other (Netflix) mainly focuses on videos
and distortions typical of video streaming for entertainment use.

Computational and Subjective Image Quality (CSIQ) video dataset [1]
contains 12 reference videos and 216 distorted videos from six different types of
distortion. All videos in the dataset are in the raw YUV420 format with a reso-
lution of 832 × 480 pixels, and with a duration of 10 seconds at different frame
rates (24, 25, 30, 50, or 60 fps). Among the six distortions, four are linked to dif-
ferent compression-based distortions: H.264 compression (H.264), HEVC/H.265
compression (HEVC), Motion JPEG compression (MJPEG), and Wavelet-based
compression using the Snow codec (SNOW). The other two types of distortion covered in
this dataset are packet loss (PLoss) and additive white noise (WNoise).

Netflix public dataset used the Double Stimulus Impairment Scale (DSIS) method to
collect its subjective scores. In the DSIS method, the reference and distorted videos
are displayed sequentially. Since the focus of this dataset is on the quality of
entertainment-oriented video streams, a consumer-grade TV under controlled ambient
lighting was used in the subjective experiments. Distorted videos with a lower
resolution than the reference were upscaled to the source resolution before being
displayed on the TV. Observers evaluated the quality of the videos while sitting on a
couch in a living room-like environment and were asked to assess the impairment on a
scale of one (very annoying) to five (not noticeable). The scores from all observers
were combined to generate a Differential Mean Opinion Score (DMOS) for each distorted
video, and the results were normalized to the range of zero to 100, where it was
assumed that the reference video has a subjective quality score of 100 [23].

3.2   Results and Discussion
To calculate the accuracy of the proposed VQMs, the Spearman as well as the linear
and non-linear Pearson correlations were calculated between our objective scores and
the subjective scores provided in the different datasets. In this paper, due to space
limitations, we only provide the non-linear Pearson correlation results. From the
results we can observe that:
 – In the case of each separate distortion, the proposed spatial based VQM
   (V Q1 ) is able to evaluate the video quality with a relatively high correlation
   (Table 1). This correlation value (average of .89) drops dramatically (.77)
   when videos independent of their distortions are evaluated. This finding can
   be linked to the fact that depending on the type of distortion, different
   spatial features affect the video quality.

Table 1. Non-linear Pearson correlation values for different distortions using the V Q1
values at different convolutional layers in the Alexnet model.

     dataset   distortion   CONV1   CONV2   CONV3   CONV4   CONV5   All

     CSIQ      H.264        .95     .97     .96     .96     .96     .96
     CSIQ      PLoss        .74     .79     .77     .77     .76     .79
     CSIQ      MJPEG        .45     .89     .93     .93     .89     .90
     CSIQ      Wavelet      .89     .87     .86     .85     .85     .86
     CSIQ      WNoise       .87     .92     .91     .92     .92     .92
     CSIQ      HEVC         .89     .92     .90     .91     .90     .91
     CSIQ      ALL          .70     .77     .75     .76     .77     .77
     Netflix   -            .77     .83     .84     .84     .86     .84


Table 2. Non-linear Pearson correlation values for different distortions using the V Q2
values at different convolutional layers in the Alexnet model.

     dataset   distortion   CONV1   CONV2   CONV3   CONV4   CONV5   All

     CSIQ      H.264        .90     .95     .94     .93     .95     .94
     CSIQ      PLoss        .74     .84     .83     .84     .86     .84
     CSIQ      MJPEG        .50     .92     .91     .90     .90     .90
     CSIQ      Wavelet      .88     .91     .90     .91     .92     .92
     CSIQ      WNoise       .91     .94     .93     .93     .92     .95
     CSIQ      HEVC         .93     .93     .92     .93     .91     .93
     CSIQ      ALL          .77     .82     .82     .82     .82     .82
     Netflix   -            .78     .90     .88     .89     .92     .90



 – Similar to the IQM proposed in [8], quality scores using V Q1 in the mid-
   convolutional layers (CONV3 and CONV4 in the case of the Alexnet model)
   show a higher correlation value (Table 1). Amirshahi et al. have linked this
   issue in the case of images to the nature of deeper convolutional layers which
   are more focused on patterns and textures seen in the image.
 – Results from calculating V Q2 for the different distortions in the case of the
   CSIQ dataset (Table 2) show an average increase of 0.02 in the correlation values
   compared to the V Q1 VQM. From the results, it is interesting to observe that the
   most significant increases in the correlation value from the V Q1 to the V Q2
   VQM are in the case of PLoss (0.05) and WNoise (0.03).
 – When it comes to the case of all videos in the dataset independent of the
   type of distortion, V Q2 shows a better performance than V Q1 . This increase
   of approximately 0.06 shows that by simply giving a higher weight to more
   salient regions of the feature map we could increase the accuracy of the
   VQM.
 – We can see that, compared to V Q2 , the results from V W V Q2 (Table 3) decrease
   for all individual distortions in the CSIQ dataset, while the overall results do
   not show any change.

Table 3. Non-linear Pearson correlation values for different distortions using the
V W V Q2 values at different convolutional layers in the Alexnet model.

     dataset   distortion   CONV1   CONV2   CONV3   CONV4   CONV5   All

     CSIQ      H.264        .89     .94     .93     .94     .95     .94
     CSIQ      PLoss        .70     .81     .77     .79     .85     .80
     CSIQ      MJPEG        .51     .91     .90     .89     .89     .90
     CSIQ      Wavelet      .88     .90     .89     .90     .92     .91
     CSIQ      WNoise       .89     .90     .88     .89     .87     .90
     CSIQ      HEVC         .93     .93     .93     .93     .91     .93
     CSIQ      ALL          .77     .82     .80     .80     .82     .82
     Netflix   -            .71     .87     .86     .87     .90     .87


Table 4. Non-linear Pearson correlation values for different distortions using the
EW V Q2 values at different convolutional layers in the Alexnet model.

     dataset   distortion   CONV1   CONV2   CONV3   CONV4   CONV5   All

     CSIQ      H.264        .90     .94     .93     .93     .96     .95
     CSIQ      PLoss        .71     .79     .80     .80     .83     .81
     CSIQ      MJPEG        .51     .88     .87     .86     .89     .88
     CSIQ      Wavelet      .87     .91     .88     .89     .92     .90
     CSIQ      WNoise       .89     .92     .92     .92     .91     .92
     CSIQ      HEVC         .93     .94     .93     .93     .92     .94
     CSIQ      ALL          .75     .83     .81     .81     .83     .83
     Netflix   -            .85     .90     .90     .90     .91     .91




 – Using the saliency-energy-based weighting (EW V Q2 ) shows a small improvement of
   0.01 in the performance for H.264, HEVC, and all distortions (Table 4). This can
   be linked to the fact that, by using saliency, EW V Q2 covers both spatial and
   temporal aspects of the video quality.
 – In the case of combined V Q2 (Table 5), the results show an increase in the CONV1
   and CONV2 layers. Comparing EW V Q2 to V W V Q2 , an improvement can also be seen
   in the CONV3 and CONV4 layers. We can observe that combined V Q2 has a better or
   the same performance in the case of the H.264, MJPEG, and HEVC compressions, which
   could imply that these compression distortions are easier for WH to discriminate.
 – Compared to other state-of-the-art VQMs (Table 6), the proposed approach performs
   as well as or better in the case of the H.264 and MJPEG compressions. This can be
   linked to the compatibility of our method with the structure of such compression
   methods and to how WH is able to discriminate between them. In the case of the
   WNoise and PLoss distortions, our proposed approach does not show a competitive
   performance. This could be linked to the fact that the saliency methods used

Table 5. Non-linear Pearson correlation values for different distortions using the
Combined V Q2 values at different convolutional layers in the Alexnet model.

      dataset   distortion   CONV1   CONV2   CONV3   CONV4   CONV5   All

      CSIQ      H.264        .90     .95     .93     .94     .95     .95
      CSIQ      PLoss        .63     .73     .70     .71     .78     .72
      CSIQ      MJPEG        .50     .91     .90     .90     .91     .91
      CSIQ      Wavelet      .88     .91     .90     .90     .92     .90
      CSIQ      WNoise       .89     .91     .90     .91     .90     .91
      CSIQ      HEVC         .94     .94     .93     .94     .92     .94
      CSIQ      ALL          .78     .83     .81     .81     .82     .83
      Netflix   -            .84     .89     .90     .90     .91     .90


Table 6. Non-linear Pearson correlation values for different distortions in the CSIQ
dataset in comparison with state-of-the-art VQMs.

       VQM               H.264   PLoss   MJPEG   Wavelet   WNoise   HEVC   ALL

       SSIM               .95     .84     .80      .89       .97     .96    .76
       VIF [35]           .95     .92     .91      .92       .96     .96    .72
       STMAD [38]         .96     .87     .89      .87       .89     .92    .82
       ViS3 [37]          .93     .82     .81      .93       .93     .96    .81
       MOVIE [34]         .90     .88     .87      .89       .85     .93    .78
       V-BLIINDS [32]     .94     .76     .85      .90       .93     .92    .84
       SACONVA [22]       .91     .81     .85      .85       .90     .90    .86
       V Q1               .96     .79     .90      .86       .92     .91    .77
       V Q2               .94     .84     .90      .92       .95     .93    .82
       V W V Q2           .94     .80     .90      .91       .90     .93    .82
       EW V Q2            .95     .81     .88      .90       .92     .94    .83
       Combined V Q2      .95     .72     .91      .90       .91     .94    .83



   could not capture the imposed transformation loss as well as they capture the
   selected compression distortions.
 – Finally, our experiments showed that, as in the case of the IQM introduced in
   [8], the depth of the network (in our case, the use of VGG-16 and VGG-19 [36])
   did not have any significant impact on the performance of the proposed VQM.


3.3    Content and Compression Analysis

Our experiments show a link between the content and the compression method on the one
hand and the video and image saliency, and thus the performance of our VQM, on the
other. To be more specific, the difference between the video saliency and the image
saliency of a frame can provide a better understanding of the content. Likewise, the
difference between the image or video saliency of the test and the reference video
provides information about the


        (a) Flowervase            (b) Chipmunks                (c) Keiba

         Fig. 2. Sample frames from three video clips in the CSIQ dataset.


compression method used in the video. Thus, WH includes information about video
quality, content, and distortion. Our experimental results show that, instead of
using a binary zero-or-one value for WH , a fuzzy approach for selecting the value of
WH could improve the accuracy of our proposed VQM. That is, depending on the amount
of temporal and structural variation in the video, WH could take different values.
For example, our initial study has shown that, in the case of the CSIQ database, WH
would have a low value for the Flowervase video (Figure 2(a)), while the Chipmunks
and Keiba videos (Figures 2(b) and (c), respectively) would be assigned high WH
values.
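
    Purely as an illustration of such a fuzzy alternative (this is our own
hypothetical example, not a weighting evaluated in this study), the hard switch WH
could be replaced by a soft value derived from the two accumulated energy
differences:

```python
import numpy as np

def soft_wh(dEV, dEI, eps=1e-12):
    """Hypothetical soft switch: instead of the hard 0/1 of Eq. (16), W_H grows
    towards 1 as the spatial-temporal energy difference dEV_ALL dominates dEI_ALL,
    matching the limit behaviour of the hard rule."""
    dEV_all, dEI_all = float(np.sum(dEV)), float(np.sum(dEI))
    return dEV_all / (dEV_all + dEI_all + eps)
```

When one of the two sums strongly dominates, this value approaches the hard
zero-or-one choice of Eq. (16).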


4   Conclusion and Future works
In conclusion, we proposed a set of VQMs inspired by a CNN-based IQM which
effectively assesses spatial features. Saliency maps of the videos added a
spatial-temporal aspect to our method, yielding a series of quality scores for each
frame of the video. Different pooling schemes were then applied to these quality
scores to introduce two different video quality scores for each video. Finally, using
a saliency-based approach to compare spatial and temporal distortions, one of the two
mentioned scores is presented as the final video quality. The proposed measure was
tested on the CSIQ and the Netflix public datasets. Our experimental results show
that by simply differentiating between spatial and temporal distortions, our VQM
achieves a better accuracy. The proposed approach performs well in the case of
compression-based distortions, while its accuracy drops in the case of distortions
affected by transformation loss.
    As discussed in Section 3.3, identifying the content and distortion type of a
video based on the spatial-temporal and spatial saliency could also improve the
performance of the VQM. Further study of this issue, and the selection of a better
weighting function between the spatial and spatial-temporal VQMs, are part of the
future work we plan to perform.


References
 1. CSIQ video quality database, http://vision.eng.shizuoka.ac.jp
 2. Ahn, S., Lee, S.: Deep blind video quality assessment based on temporal human
    perception. In: ICIP. pp. 619–623 (2018)
 3. Amirshahi, S.A.: Towards a perceptual metric for video quality assessment. Mas-
    ter’s thesis, Norwegian University of Science and Technology (NTNU) (2010)
 4. Amirshahi, S.A.: Aesthetic quality assessment of paintings. Verlag Dr. Hut (2015)
 5. Amirshahi, S.A., Kadyrova, A., Pedersen, M.: How do image quality metrics per-
    form on contrast enhanced images? In: EUVIP. pp. 232–237 (2019)
 6. Amirshahi, S.A., Larabi, M.C.: Spatial-temporal video quality metric based on an
    estimation of qoe. In: QoMEX. pp. 84–89 (2011)
 7. Amirshahi, S.A., Pedersen, M.: Future directions in image quality. In: CIC.
    vol. 2019, pp. 399–403 (2019)
 8. Amirshahi, S.A., Pedersen, M., Yu, S.X.: Image quality assessment by comparing
    cnn features between images. J ELECTRON IMAGING 2017(12), 42–51 (2017)
 9. Amirshahi, S.A., Redies, C., Denzler, J.: How self-similar are artworks at different
    levels of spatial resolution? In: CAE. pp. 93–100 (2013)
10. Antkowiak, J., Jamal Baina, T., Baroncini, F.V., Chateau, N., FranceTelecom, F.,
    Pessoa, A.C.F., Stephanie Colonnese, F., Contin, I.L., Caviedes, J., Philips, F.:
    Final report from the video quality experts group on the validation of objective
    models of video quality assessment march 2000 (2000)
11. Bampis, C.G., Li, Z., Bovik, A.C.: Spatiotemporal feature integration and model
    fusion for full reference video quality assessment. IEEE T CIRC SYST VID 29(8),
    2256–2270 (2018)
12. Bosch, A., Zisserman, A., Munoz, X.: Representing shape with a spatial pyramid
    kernel. In: CIVR. pp. 401–408 (2007)
13. Chan, A., Zeng, K., Mohapatra, P., Lee, S.J., Banerjee, S.: Metrics for evaluating
    video streaming quality in lossy ieee 802.11 wireless networks. In: Infocom. pp. 1–9
    (2010)
14. Dendi, S.V.R., Krishnappa, G., Channappayya, S.S.: Full-reference video quality
    assessment using deep 3d convolutional neural networks. In: NCC. pp. 1–5 (2019)
15. Freitas, P.G., Akamine, W.Y., Farias, M.C.: Using multiple spatio-temporal fea-
    tures to estimate video quality. Signal Process. Image Commun. 64, 1–10 (2018)
16. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: NIPS. pp. 545–552
    (2007)
17. Hekstra, A.P., Beerends, J.G., Ledermann, D., De Caluwe, F., Kohler, S., Koenen,
    R., Rihs, S., Ehrsam, M., Schlauss, D.: Pvqm–a perceptual video quality measure.
    Signal Process. Image Commun. 17(10), 781–798 (2002)
18. Huynh-Thu, Q., Ghanbari, M.: Scope of validity of psnr in image/video quality
    assessment. Electron. Lett. 44(13), 800–801 (2008)
19. Kim, W., Kim, J., Ahn, S., Kim, J., Lee, S.: Deep video quality assessor: From
    spatio-temporal visual sensitivity to a convolutional neural aggregation network.
    In: ECCV. pp. 219–234 (2018)
20. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
    volutional neural networks. In: NIPS. pp. 1097–1105 (2012)
21. Li, X., Guo, Q., Lu, X.: Spatiotemporal statistics for video quality assessment.
    IEEE T IMAGE PROCESS 25(7), 3329–3342 (2016)
22. Li, Y., Po, L.M., Cheung, C.H., Xu, X., Feng, L., Yuan, F., Cheung, K.W.: No-
    reference video quality assessment with 3d shearlet transform and convolutional
    neural networks. IEEE T CIRC SYST VID 26(6), 1044–1057 (2015)
23. Li, Z., Aaron, A., Katsavounidis, I., Moorthy, A., Manohara, M.: Toward a practical
    perceptual video quality metric. The Netflix Tech Blog 6 (2016)
24. Liu, K.H., Liu, T.J., Liu, H.H., Pei, S.C.: Spatio-temporal interactive laws feature
    correlation method to video quality assessment. In: ICMEW. pp. 1–6 (2018)
25. Ma, Q., Zhang, L., Wang, B.: New strategy for image and video quality assessment.
    J ELECTRON IMAGING 19(1), 011019 (2010)
26. Maczyta, L., Bouthemy, P., Le Meur, O.: Cnn-based temporal detection of motion
    saliency in videos. Pattern Recognit. Lett. 128, 298–305 (2019)
27. Men, H., Lin, H., Saupe, D.: Spatiotemporal feature combination model for no-
    reference video quality assessment. In: QoMEX. pp. 1–3 (2018)
28. Ninassi, A., Le Meur, O., Le Callet, P., Barba, D.: Considering temporal variations
    of spatial visual distortions in video quality assessment. IEEE J. Sel. Topics Signal
    Process. 3(2), 253–265 (2009)
29. Ong, E., Lin, W., Lu, Z., Yao, S.: Colour perceptual video quality metric. In: ICIP.
    vol. 3, pp. III–1172 (2005)
30. Redies, C., Amirshahi, S.A., Koch, M., Denzler, J.: Phog-derived aesthetic mea-
    sures applied to color photographs of artworks, natural scenes and objects. In:
    ECCV. pp. 522–531 (2012)
31. Roja, B., Sandhya, B.: Saliency based assessment of videos from frame-wise quality
    measures. In: IACC. pp. 639–644 (2017)
32. Saad, M.A., Bovik, A.C., Charrier, C.: Blind prediction of natural video quality.
    IEEE T IMAGE PROCESS 23(3), 1352–1365 (2014)
33. Sector, I.T.S.: Objective perceptual multimedia video quality measurement in the
    presence of a full reference. ITU-T Recommendation J 247, 18 (2008)
34. Seshadrinathan, K., Bovik, A.C.: Motion tuned spatio-temporal quality assessment
    of natural videos. IEEE T IMAGE PROCESS 19(2), 335–350 (2009)
35. Sheikh, H.R., Bovik, A.C.: Image information and visual quality. IEEE T IMAGE
    PROCESS 15(2), 430–444 (2006)
36. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
    image recognition. arXiv preprint arXiv:1409.1556 (2014)
37. Vu, P.V., Chandler, D.M.: Vis3: an algorithm for video quality assessment via
    analysis of spatial and spatiotemporal slices. J ELECTRON IMAGING 23(1),
    013016 (2014)
38. Vu, P.V., Vu, C.T., Chandler, D.M.: A spatiotemporal most-apparent-distortion
    model for video quality assessment. In: ICIP. pp. 2505–2508 (2011)
39. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment:
    from error visibility to structural similarity. IEEE T IMAGE PROCESS 13(4),
    600–612 (2004)
40. Winkler, S.: Digital video quality: vision models and metrics. John Wiley & Sons
    (2005)
41. You, J., Korhonen, J.: Deep neural networks for no-reference video quality assess-
    ment. In: ICIP. pp. 2349–2353 (2019)