<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Video Quality by Differentiating Between Spatial and Temporal Distortions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Meisam Jamshidi Seikavandi</string-name>
          <email>meisamjam@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seyed Ali Amirshahi</string-name>
          <email>s.ali.amirshahi@ntnu.no</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Nikoo Dana Fanavari Delfan, Technology and Science Park of Lorestan</institution>
          ,
          <country country="IR">Iran</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The Norwegian Colour and Visual Computing Laboratory, Norwegian University of Science and Technology</institution>
          ,
          <addr-line>Gjøvik</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>To objectively evaluate the quality of videos, different state-of-the-art Image Quality Metrics (IQMs) have been used to introduce different Video Quality Metrics (VQMs). While such approaches are able to evaluate the spatial quality of the frames in the video, they are not able to address the temporal aspects of the video quality. In this study, we introduce a new full-reference VQM which is based on taking advantage of a Convolutional Neural Network (CNN) based IQM to evaluate the quality of the frames. Using other techniques such as visual saliency detection, we are then able to differentiate between spatial and temporal distortions and use different pooling techniques to evaluate the quality of the video. Our results show that by detecting the type of distortion (spatial or temporal) affecting the video quality, the proposed VQM can evaluate the quality of the video with a higher accuracy.</p>
      </abstract>
      <kwd-group>
        <kwd>Video Quality Assessment</kwd>
        <kwd>Video Saliency</kwd>
        <kwd>Image Saliency</kwd>
        <kwd>Spatial Distortion</kwd>
        <kwd>Temporal Distortion</kwd>
        <kwd>Temporal pooling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With the huge amount of video we have access to in our daily life,
evaluating the quality of videos is an essential part of any application that deals with
videos. Although subjective assessment is still considered the primary standard
for Video Quality Assessment (VQA), it is time-consuming and financially
expensive to perform on a regular basis. For this and many other reasons, in the
last few decades, objective assessment of video quality has attracted much
attention. Objective assessment methods, known as Video Quality Metrics (VQMs)
have been widely used to estimate the quality of videos [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Depending on the
availability of the reference video, VQMs can be classified into full-reference,
reduced-reference, and no-reference. Full-reference VQMs need access to the
reference video, while reduced-reference metrics require partial information about the
reference video and no-reference metrics only have access to the test video.
      </p>
      <p>
        Full-reference VQMs can be further classified into error sensitivity based
methods [
        <xref ref-type="bibr" rid="ref10 ref18">10,18</xref>
        ], structural similarity based approaches [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ], information fidelity based
approaches [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ], spatial-temporal approaches [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], saliency-based approaches
[
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], and network-aware approaches [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Many of the mentioned methods are
extended versions of Image Quality Metrics (IQMs), generally followed by a
pooling structure to bridge between Image Quality Assessment (IQA) and VQA. In
this study, we combine different approaches, such as metrics based on the use of
Convolutional Neural Networks (CNNs) and saliency techniques, to calculate a
series of quality values for the video frames. Using different weighting techniques
that depend on the type of distortion (spatial or temporal) affecting the video,
the quality of the frames is pooled to represent the video quality score.
      </p>
      <p>
        Keeping in mind that VQA has been a field of research for over two decades,
it is no surprise that a high number of different VQMs have been introduced. Like
any other field of research in image processing and computer vision, early VQMs
were based on introducing different single or multiple handcrafted features for
VQA. While initially these features were purely mathematical techniques such as
Mean Square Error (MSE) and Peak Signal to Noise Ratio (PSNR) [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ], over time
a shift towards introducing features that try to model the Human Visual System
(HVS) has been seen in VQMs [
        <xref ref-type="bibr" rid="ref17 ref29 ref33">17,29,33</xref>
        ]. Unlike the early VQMs which were mostly
focused on spatial aspects of the video [
        <xref ref-type="bibr" rid="ref10 ref18 ref39">10,18,39</xref>
        ], with the introduction of
temporal features VQMs showed improvement in their performance [
        <xref ref-type="bibr" rid="ref15 ref19 ref21 ref24">15,19,21,24</xref>
        ]. Since
most VQMs provide a spatial, temporal, and/or spatial-temporal quality
value for the videos, different pooling techniques have been used. As an example,
the use of saliency maps for providing different weights to different regions in
a frame, or to different frames, has been used to introduce different VQMs [
        <xref ref-type="bibr" rid="ref3 ref6">3,6</xref>
        ].
Finally, in recent years, with the introduction of state-of-the-art machine
learning techniques, especially CNNs, another
big improvement has been observed in the accuracy of VQMs [
        <xref ref-type="bibr" rid="ref14 ref19 ref2 ref41">2,14,19,41</xref>
        ].
      </p>
      <p>Our contributions in this paper can be summarized as follows: 1) by using video
saliency maps, we introduce a spatial-temporal dimension to a state-of-the-art IQM and
use the approach for video quality assessment; 2) by applying temporal and
spatial-temporal pooling techniques, two different quality scores are calculated
for the video; 3) a new content-based evaluation is introduced
that is able to detect the type of distortion (temporal or spatial) and propose a
VQM based on the detected distortion.</p>
      <p>The rest of the paper is organized as follows: in Section 2, we provide a
detailed description of the proposed approach, while the experimental results are
presented in Section 3. Finally, Section 4 provides a conclusion of the work and
the future directions we plan to take to extend it.</p>
    </sec>
    <sec id="sec-2">
      <title>Proposed Approach</title>
      <p>Our proposed VQM (Figure 1) is based on the extraction of spatial and temporal
features from the video. Apart from the main VQM introduced in this study,
and to better study the influence of the different features used in our VQM, we
also propose other VQMs based solely on one or multiple of the introduced features.</p>
      <p>[Figure 1: (a) Spatial and spatial-temporal quality assessment for each frame.
(b) Different pooling methods used on the quality of each frame for evaluating
the quality of the video.]</p>
        <sec id="sec-2-8-1">
          <title>Spatial Approach</title>
          <p>
            As pointed out in Section 1, IQA and VQA are closely linked. In fact, when
it comes to extracting spatial features from videos, a high number of the features
used for evaluating video quality were initially introduced in IQMs. In other
words, in the case of spatial features, different VQMs extract spatial features
introduced in different IQMs from each frame of the video [
            <xref ref-type="bibr" rid="ref11 ref2 ref27">2,11,27</xref>
            ].
          </p>
          <p>
            In this study, we aim to introduce a new VQM which takes advantage of
the IQM proposed in [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]. In [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], Amirshahi et al. propose a new IQM which is
based on calculating the similarity between the feature maps extracted at different
convolutional layers of a pre-trained CNN. Their hypothesis, which was
inspired by the use of Pyramid Histogram of Orientation Gradients (PHOG) [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]
features for calculating self-similarity in images introduced in [
            <xref ref-type="bibr" rid="ref30 ref4 ref9">4,9,30</xref>
            ], is that the
more similar the feature maps at different convolutional layers are, the more similar the
quality of the test and reference images is. To calculate the similarity between two
feature maps, they take the following steps:
1. From the reference ($I_R$) and test ($I_T$) images, feature maps are extracted at
different convolutional layers.
2. For the test image $I_T$ in convolutional layer $n$, the histogram
$$h(I_T, n, L) = \Big( \sum_{i=1}^{X}\sum_{j=1}^{Y} F(I_T, n, L, 1)(i,j), \; \sum_{i=1}^{X}\sum_{j=1}^{Y} F(I_T, n, L, 2)(i,j), \; \dots, \; \sum_{i=1}^{X}\sum_{j=1}^{Y} F(I_T, n, L, z)(i,j), \; \dots, \; \sum_{i=1}^{X}\sum_{j=1}^{Y} F(I_T, n, L, M)(i,j) \Big) \quad (1)$$
is calculated. In Eq. (1), $L$ corresponds to the level of the spatial pyramid at which the
histograms are calculated, and $F(I_T, n, L, z)$ corresponds to feature map
$z$ (out of the $M$ feature maps) in the $n$th convolutional layer of image $I_T$ at level $L$, with a size of $X \times Y$.
To take a pyramid approach, Amirshahi et al. divide the feature maps into four
equal sub-regions, resulting in different $h$ histograms (Eq. (1)) at different
levels ($L$) of the spatial resolution. The division and calculation of $h$
continue until the smallest side of the smallest sub-region is equal to or
larger than seven pixels.
3. The quality of the test image at level $L$ for convolutional layer $n$ is then
calculated by the histogram intersection kernel
$$m_{IQM}(I_T, n, L) = d_{HIK}\big(h(I_T, n, L), h(I_R, n, L)\big) = \sum_{i} \min\big(h_i(I_T, n, L), h_i(I_R, n, L)\big). \quad (2)$$
4. The concatenation of all $m_{IQM}(I_T, n, l)$ values
$$m_{IQM}(I_T, n) = \big(m_{IQM}(I_T, n, 1), m_{IQM}(I_T, n, 2), \dots, m_{IQM}(I_T, n, l), \dots, m_{IQM}(I_T, n, L)\big) \quad (3)$$
is then used in
$$IQ(I_T, n) = \sigma\big(m_{IQM}(I_T, n)\big) \times \frac{1}{\sum_{l=1}^{L} \frac{1}{l}} \sum_{l=1}^{L} \frac{1}{l}\, m_{IQM}(I_T, n, l) \quad (4)$$
to calculate the quality of the test image at convolutional layer $n$. In Eq. (4),
$\sigma(m_{IQM}(I_T, n))$ corresponds to the standard deviation among the values in
$m_{IQM}(I_T, n)$.
5. Finally, the overall quality of the test image is calculated using the geometric
mean of the quality scores at the different convolutional layers
$$IQ(I_T) = \sqrt[N]{\prod_{n=1}^{N} IQ(I_T, n)}, \quad (5)$$
where $N$ corresponds to the total number of convolutional layers.
          </p>
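          <p>As an illustration, the following is a minimal NumPy sketch of Eqs. (1)-(5). It assumes the convolutional feature maps of the test and reference images have already been extracted with some pre-trained CNN and are given as arrays of shape (M, X, Y); the histogram normalisation inside the intersection kernel and all function names are our own illustrative choices, not the authors' implementation.</p>
          <preformat>
import numpy as np

def pyramid_histograms(fmap, min_side=7):
    """Eq. (1): sum each of the M feature maps over every sub-region of a spatial pyramid."""
    hists, regions = [], [fmap]
    while True:
        # one entry per feature map and per sub-region at the current pyramid level
        hists.append(np.concatenate([r.sum(axis=(1, 2)) for r in regions]))
        next_regions = []
        for r in regions:
            _, x, y = r.shape
            if min(x, y) // 2 &lt; min_side:   # stop once sub-regions would get smaller than 7 px
                return hists
            next_regions += [r[:, :x // 2, :y // 2], r[:, :x // 2, y // 2:],
                             r[:, x // 2:, :y // 2], r[:, x // 2:, y // 2:]]
        regions = next_regions

def layer_quality(fmap_test, fmap_ref):
    """Eqs. (2)-(4): HIK similarity per pyramid level, pooled with 1/l weights and scaled by the std."""
    h_t, h_r = pyramid_histograms(fmap_test), pyramid_histograms(fmap_ref)
    # normalised histogram intersection per level (the normalisation is our assumption)
    m = np.array([np.minimum(ht / ht.sum(), hr / hr.sum()).sum() for ht, hr in zip(h_t, h_r)])
    weights = 1.0 / np.arange(1, len(m) + 1)          # the 1/l weights of Eq. (4)
    return np.std(m) * np.sum(weights * m) / weights.sum()

def image_quality(feats_test, feats_ref):
    """Eq. (5): geometric mean of the per-layer quality scores."""
    q = [layer_quality(t, r) for t, r in zip(feats_test, feats_ref)]
    return float(np.prod(q) ** (1.0 / len(q)))

# toy usage with random stand-in feature maps for two convolutional layers
rng = np.random.default_rng(0)
feats_ref = [rng.random((64, 55, 55)), rng.random((192, 27, 27))]
feats_test = [f + 0.1 * rng.random(f.shape) for f in feats_ref]
print(image_quality(feats_test, feats_ref))
          </preformat>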
          <p>
            While the study presented in [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] was mainly focused on the use of the Alexnet
model [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ], nevertheless, it was shown that it would be possible to use other
deeper CNN models such as VGG16 and VGG19 [
            <xref ref-type="bibr" rid="ref36">36</xref>
            ]. Different studies have
shown the flexibility of the mentioned IQM and how it can be extended to
improve the performance of other IQMs [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. For this reason, in this study, we
take advantage of this IQM to evaluate the spatial quality of the video
frames (Figure 1).
          </p>
          <p>
            To evaluate the spatial quality of the test video ($V_T$), the average quality of
the frames is used:
$$VQ_1(V_T) = \frac{\sum_{i=1}^{N} IQ(V_{F_{T_i}})}{N}, \quad (6)$$
where $IQ(V_{F_{T_i}})$ corresponds to the quality of the $i$th frame of the test video and $N$ is the total
number of frames.
          </p>
        </sec>
        <sec id="sec-2-2">
          <title>Spatial-Temporal Approach</title>
          <p>
            It is clear that without taking into account the temporal aspects of a video, any
VQM would lack accuracy. In our approach, the first feature extracted from
the videos is visual saliency, which is linked to the spatial-temporal aspects of
the video quality. Different studies have shown the important role visual saliency
plays in IQA and VQA [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. While there is a considerable number of different
methods to calculate the saliency maps of images and videos, the Graph-Based
Visual Saliency (GBVS) [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] approach is one of the well-known techniques and
has shown good accuracy for image and video saliency detection. It is important
to point out that while image saliency calculation is purely based on spatial
features, video saliency also considers the temporal aspects of the video, and so
video saliency calculation can be linked to the spatial-temporal properties of the video.
          </p>
          <p>In our approach, we first calculate the saliency maps for the test and reference
videos. The saliency map of each frame is then resized to the size of the input
of the network. Similar to the layers of the pre-trained CNN model used, we
apply max-pooling to the calculated saliency map of each frame, resulting in
different saliency maps, each corresponding to the size of the feature maps at
each convolutional layer of our model. The resized saliency maps are then
used as pixel-wise weights for the features in the different convolutional layers. This
allows us to give higher weights to regions in the feature maps that are more
salient to the observer. The quality of the video is then calculated by
$$VQ_2(V_T) = \frac{\sum_{i=1}^{N} IQ(SWV_{F_{T_i}})}{N}, \quad (7)$$
in which $IQ(SWV_{F_{T_i}})$ corresponds to the quality of $V_{F_{T_i}}$ where the saliency
map of the frame has been used as a weighting function on the feature maps at
each convolutional layer (Figure 1).</p>
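          <p>A possible implementation of this saliency weighting and of the pooling of Eq. (7) is sketched below. It reuses the image_quality() helper from the earlier sketch; the nearest-neighbour resizing stands in for the max-pooling described above, and all names are illustrative assumptions.</p>
          <preformat>
import numpy as np

def resize_to(saliency, x, y):
    """Nearest-neighbour resize of a saliency map to a feature-map resolution (stand-in for max-pooling)."""
    xs = np.arange(x) * saliency.shape[0] // x
    ys = np.arange(y) * saliency.shape[1] // y
    return saliency[np.ix_(xs, ys)]

def weight_features(feats, saliency):
    """Multiply every feature map of every layer by the saliency map resized to that layer."""
    return [f * resize_to(saliency, f.shape[1], f.shape[2]) for f in feats]

def vq2(frame_feats_test, frame_feats_ref, frame_saliency):
    """Eq. (7): average the saliency-weighted frame qualities over the N frames of the test video."""
    scores = [image_quality(weight_features(ft, s), weight_features(fr, s))
              for ft, fr, s in zip(frame_feats_test, frame_feats_ref, frame_saliency)]
    return float(np.mean(scores))
          </preformat>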
        </sec>
        <sec id="sec-2-8-2">
          <title>Temporal Approach</title>
          <p>Although different VQMs try to take into account the spatial and temporal
aspects of the video, most VQMs provide a single quality score for
each video. To reach this single quality score, different pooling techniques are
used to combine the quality scores of all video frames. While careful attention has
been paid to how the quality score of each frame is calculated, most, if not all,
pooling approaches are based on some version of averaging the quality scores of
all frames. The average value, geometric mean, harmonic mean, and Minkowski
mean are some of the different types of averaging used in different VQMs. It is
clear that using any type of averaging on the frame quality values could
result in disregarding different aspects of the video that could be linked to the
HVS. In this study, to better link the video quality score to how observers react
to the change of quality in a video clip, we try a new approach for pooling the
quality scores of the frames.</p>
          <p>
            Recent studies such as [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ] have suggested that the overall perceptual quality
of a video is highly dependent on the temporal variation of the video quality.
That is, with an increase in the temporal variation of the video quality along
the video sequence, the video quality declines. To address this aspect, we use
the variation of the quality scores of the frames in our pooling approach. While
the variance of the quality scores of the frames could be a good description of
the quality fluctuation in the video, it only provides a general description of the
video quality. To better consider the temporal variation of the video quality, we
calculate the variance of quality scores in a specific time frame. That is, for the
$i$th frame of the test video ($V_{T_i}$) we calculate
          </p>
          <p>
$$LocalVar(V_{T_i}) = \sigma^2\big(VQ_{T_{i-L}}, \dots, VQ_{T_i}, \dots, VQ_{T_{i+L}}\big). \quad (8)$$
In Eq. (8), the length of the local window in which we calculate the variance
($\sigma^2$) of the frame quality scores is $2L+1$. Based on our experimental results, the
best value for $L$ is 2, resulting in a window of five frames. To introduce a better
regional representation of the quality score for the video, we calculate the video
quality using</p>
          <p>
$$VWVQ(V_T) = \frac{\sum_{i=1}^{N} W_i \, IQ(V_{F_{T_i}})}{\sum_{i=1}^{N} W_i}, \qquad W_i = \begin{cases} 0, &amp; LocalVar(V_{T_i}) &lt; GlobalVar(V_T) \\ 1, &amp; LocalVar(V_{T_i}) &gt; GlobalVar(V_T) \end{cases} \quad (9)$$
In Eq. (9), $W_i$ corresponds to the weight given to the quality score of the $i$th
frame in the video ($V_{F_{T_i}}$) and $GlobalVar(V_T)$ represents the variance of all the
frame quality scores in the test video. From Eq. (9) it is clear that if the variance
of the local quality scores is larger than the variance of the global quality scores, a
weight of one is given to the frame quality, but if the variance of the local quality
scores is lower than the variance of the global quality scores, a weight of zero is given to
the frame quality. Simply said, the quality of a frame is only considered if the
change of video quality in a given local interval ($[V_{T_{i-L}}, V_{T_{i+L}}]$) is bigger than
the change of frame quality over the total duration of the video.</p>
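          <p>A minimal sketch of the pooling of Eqs. (8)-(9) is given below; the clipping of the window at the first and last $L$ frames and the fallback to plain averaging when no frame qualifies are our own assumptions.</p>
          <preformat>
import numpy as np

def vwvq(frame_scores, L=2):
    """Eqs. (8)-(9): keep a frame only when its local quality variance exceeds the global one."""
    q = np.asarray(frame_scores, dtype=float)
    global_var = q.var()
    weights = np.zeros_like(q)
    for i in range(len(q)):
        lo, hi = max(0, i - L), min(len(q), i + L + 1)   # window of 2L+1 frames, clipped at the borders
        if q[lo:hi].var() &gt; global_var:                  # Eq. (9): W_i = 1, otherwise W_i stays 0
            weights[i] = 1.0
    if weights.sum() == 0:                               # fallback to plain averaging (our assumption)
        return float(q.mean())
    return float((weights * q).sum() / weights.sum())

print(vwvq([0.90, 0.88, 0.50, 0.87, 0.90, 0.91, 0.60, 0.89]))
          </preformat>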
        </sec>
        <sec id="sec-2-8-3">
          <title>Spatial vs. Spatial-Temporal Distortion Detection</title>
          <p>
            Although saliency maps have mostly been used to detect salient regions in
images and/or videos, studies such as [
            <xref ref-type="bibr" rid="ref26 ref31">26,31</xref>
            ] have used saliency maps to
differentiate between salient and non-salient frames. While this labeling process is
simply done by calculating the total energy of the saliency map in each frame,
we take one step further. That is, by comparing the saliency maps calculated for
the frames using the GBVS video and image saliency techniques, we are able
to differentiate between frames that are mostly influenced by spatial or
spatial-temporal distortions (see Section 2.2 for a detailed description of the difference
between saliency maps calculated for frames using the image and video saliency
techniques). The following steps are taken for this process:
1. Assuming the total energy of the saliency in the $i$th frame of the test video
($V_{F_{T_i}}$) is equal to
$$E_{V_{T_i}} = \sum_{x=1}^{X}\sum_{y=1}^{Y} \big(Video\_Sal(V_{F_{T_i}})(x, y)\big)^2, \quad (10)$$
we calculate similar values for $E_{V_{R_i}}$, $E_{I_{T_i}}$, and $E_{I_{R_i}}$, which represent the
total energy of the $i$th frame of the reference video using a video saliency
approach, the total energy of the $i$th frame of the test video using an image
saliency approach, and the total energy of the $i$th frame of the reference video
using an image saliency approach, respectively. In Eq. (10), the $i$th frame has
a size of $X \times Y$ and $Video\_Sal$ represents the video saliency function used.
2. The difference between the total salient energy of the reference and test
frames using video and image saliency is calculated by
$$dE_{V_{T_i}} = \frac{|E_{V_{T_i}} - E_{V_{R_i}}|}{E_{V_{R_i}}}, \qquad dE_{I_{T_i}} = \frac{|E_{I_{T_i}} - E_{I_{R_i}}|}{E_{I_{R_i}}}. \quad (11)$$
While $dE_{V_{T_i}}$ could be linked to the spatial-temporal aspects of the video, $dE_{I_{T_i}}$
is linked to the spatial aspects.
3. Assuming that the reference video has a higher quality compared to the test
video, in the cases in which $dE_{V_{T_i}}$ is larger than $dE_{I_{T_i}}$ it can be interpreted
that the distortion likely has a stronger spatial-temporal effect on the video
than just a spatial effect.</p>
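          <p>The following sketch illustrates Eqs. (10)-(11) and the labeling of step 3, assuming the GBVS video- and image-saliency maps of the test and reference frames are available as arrays; the function names are illustrative assumptions.</p>
          <preformat>
import numpy as np

def saliency_energy(sal_map):
    """Eq. (10): total energy of a saliency map."""
    return float((np.asarray(sal_map, dtype=float) ** 2).sum())

def relative_energy_diff(sal_test, sal_ref):
    """Eq. (11): relative difference between the salient energies of the test and reference frames."""
    e_t, e_r = saliency_energy(sal_test), saliency_energy(sal_ref)
    return abs(e_t - e_r) / e_r

def frame_distortion_type(vid_sal_t, vid_sal_r, img_sal_t, img_sal_r):
    """Step 3: label the frame by whichever saliency energy changed more between test and reference."""
    dEV = relative_energy_diff(vid_sal_t, vid_sal_r)
    dEI = relative_energy_diff(img_sal_t, img_sal_r)
    label = ("spatial-temporal" if dEV &gt; dEI else "spatial")
    return label, dEV, dEI
          </preformat>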
          <p>While in Section 2.3 the variance weighted VQM ($VWVQ$) was introduced,
in this section, by detecting the type of distortion (spatial or spatial-temporal),
we introduce the energy weighted VQM
$$EWVQ(V_T) = \frac{\sum_{i=1}^{N} EW_i \, IQ(V_{F_{T_i}})}{\sum_{i=1}^{N} EW_i}, \quad (12)$$
in which $EW_i$ corresponds to the energy-based weight given to the quality score of the $i$th frame.</p>
          <p>While until now we have introduced two video quality values for each video
(the variance weighted video quality and the energy weighted video quality),
we believe that since the two methods use different approaches, it is highly
possible to find situations in which one of the methods performs better than the other.
Although finding a perfect metric that ideally detects this issue is challenging,
as a first step we introduce the following two parameters:
$$dE_{V_{ALL}} = \sum_{i=1}^{N} dE_{V_{T_i}}, \quad (13) \qquad dE_{I_{ALL}} = \sum_{i=1}^{N} dE_{I_{T_i}}. \quad (14)$$
The final video quality score, which we refer to as the combined video quality, is
then calculated by
$$combined\_VQ(V_T) = W_H \times VWVQ + (1 - W_H) \times EWVQ, \quad (15)$$
$$W_H = \begin{cases} 1, &amp; dE_{I_{ALL}} &lt; dE_{V_{ALL}} \\ 0, &amp; dE_{I_{ALL}} &gt; dE_{V_{ALL}} \end{cases} \quad (16)$$
using the $dE_{V_{ALL}}$ and $dE_{I_{ALL}}$ values introduced earlier. Obviously, finding a better
weighting approach than a simple zero or one for the $W_H$ value is a better
option, which we will address in the next sections.</p>
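          <p>Given the per-frame $dE$ values of the sketch above and the two pooled scores, the combination rule of Eqs. (13)-(16) can be sketched as follows, with the hard zero/one switch of Eq. (16).</p>
          <preformat>
def combined_vq(vwvq_score, ewvq_score, dEV_per_frame, dEI_per_frame):
    """Eqs. (13)-(16): pick between the variance- and energy-weighted scores with a hard 0/1 switch."""
    dEV_all, dEI_all = sum(dEV_per_frame), sum(dEI_per_frame)   # Eqs. (13)-(14)
    W_H = 1.0 if dEI_all &lt; dEV_all else 0.0                     # Eq. (16)
    return W_H * vwvq_score + (1.0 - W_H) * ewvq_score          # Eq. (15)
          </preformat>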
        </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Results</title>
      <p>To evaluate the performance of the proposed VQMs we calculate the correlation
between the subjective scores in different subjective datasets and the objective
quality scores from the VQMs.</p>
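      <p>As an illustration, such an evaluation can be sketched with SciPy as below; note that the non-linear Pearson correlation reported in Section 3.2 additionally involves fitting a monotonic mapping between objective and subjective scores, which is omitted here.</p>
      <preformat>
from scipy.stats import pearsonr, spearmanr

def evaluate_vqm(objective_scores, subjective_scores):
    """Pearson and Spearman correlations between a VQM's scores and the subjective (D)MOS values."""
    plcc, _ = pearsonr(objective_scores, subjective_scores)
    srocc, _ = spearmanr(objective_scores, subjective_scores)
    return plcc, srocc
      </preformat>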
      <sec id="sec-3-1">
        <title>Datasets Used</title>
        <p>To test the accuracy of our proposed VQMs, two different datasets which are
widely used in the scientific community are used. While one dataset (CSIQ) is
focused on covering different types of distortion, the other (NETFLIX) is mainly
focused on including videos and distortions in video streaming for entertainment
use.</p>
        <p>
          Computational and Subjective Image Quality (CSIQ) video dataset [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]
contains 12 reference videos and 216 distorted videos from six different types of
distortion. All videos in the dataset are in the raw YUV420 format with a
resolution of 832 × 480 pixels and a duration of 10 seconds at different frame
rates (24, 25, 30, 50, or 60 fps). Among the six distortions, four are linked to
different compression-based distortions: H.264 compression (H.264), HEVC/H.265
compression (HEVC), Motion JPEG compression (MJPEG), and Wavelet-based
compression using the Snow codec (SNOW). The PLoss and WNoise distortions
are the other two types of distortions covered in this dataset.
        </p>
        <p>
          The Netflix public dataset used the Double Stimulus Impairment Scale (DSIS)
method to collect its subjective scores. In the DSIS method, the reference and
distorted videos are displayed sequentially. Since the focus of this dataset was to
evaluate the quality of video streams aimed at entertainment, a consumer-grade TV
under controlled ambient lighting was used in the subjective experiments.
Distorted videos with a lower resolution than the reference were upscaled to the
source resolution before being displayed on the TV. Observers evaluated the quality
of the videos while sitting on a couch in a living room-like environment and
were asked to assess the impairment on a scale of one (very annoying) to five
(not noticeable). The scores from all observers were combined to generate a
Differential Mean Opinion Score (DMOS) for each distorted video and results
were normalized in the range of zero to 100, where it was assumed that the reference
video has a subjective quality score of 100 [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Results and Discussion</title>
        <p>
          To calculate the accuracy of the proposed VQMs in our experiments, the linear
Spearman and the linear and non-linear Pearson correlations were calculated
between our objective scores and the subjective scores provided in the different
datasets. In this paper, due to space limitations, we only provide
the non-linear Pearson correlation results. From the results we can observe that:
- In the case of each separate distortion, the proposed spatial based VQM
($VQ_1$) is able to evaluate the video quality with a relatively high correlation
(Table 1). This correlation value (average of 0.89) drops dramatically (0.77)
when videos are evaluated independent of their distortions. This finding can
be linked to the fact that, depending on the type of distortion, different
spatial features affect the video quality.
- Similar to the IQM proposed in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], quality scores using $VQ_1$ in the mid-convolutional
layers (CONV3 and CONV4 in the case of the Alexnet model)
show a higher correlation value (Table 1). Amirshahi et al. have linked this
issue in the case of images to the nature of deeper convolutional layers, which
are more focused on patterns and textures seen in the image.
- Results from calculating $VQ_2$ for different distortions in the case of the
CSIQ dataset (Table 2) show an average increase of 0.02 in correlation values
compared to the $VQ_1$ VQM. From the results, it is interesting to observe that the
most significant increase in the correlation value from $VQ_1$ to $VQ_2$
is in the case of PLoss (0.05) and WNoise (0.03).
- When it comes to all videos in the dataset, independent of the
type of distortion, $VQ_2$ shows a better performance than $VQ_1$. This increase
of approximately 0.06 shows that by simply giving a higher weight to more
salient regions of the feature maps we can increase the accuracy of the
VQM.
- We can see that compared to $VQ_2$, results from $VWVQ_2$ (Table 3) decrease
for all individual distortions in the CSIQ dataset, while the overall results do
not show any change.
- Using saliency-based weighting ($EWVQ_2$) shows a small improvement (0.01) in
the performance for H.264, HEVC, and all distortions (Table 4). This
can be linked to the fact that by using saliency, $EWVQ_2$
covers both spatial and temporal aspects of the video quality.
- In the case of $combined\_VQ_2$ (Table 5), results show an increase in the CONV1
and CONV2 layers. Comparing $EWVQ_2$ to $VWVQ_2$, an improvement can
also be seen in the CONV3 and CONV4 layers. We can observe that $combined\_VQ_2$
has a better or equal performance in the case of the H.264, MJPEG, and
HEVC compressions, which could imply that the mentioned compressions
are more discriminative for $W_H$ to detect.
- Compared to other state-of-the-art VQMs (Table 6), the proposed approach
performs better than or as well as the others in the case of the H.264 and MJPEG
compressions. This can be linked to the compatibility of our method with
the structure of such compression methods and how $W_H$ is able to
discriminate between these compression methods. In the case of the WNoise and
PLoss distortions, our proposed approach does not show a competitive
performance. This could be linked to the fact that the saliency methods used
could not follow the imposed transformation loss as well as the selected compression
distortions.
- Finally, our experiments showed that, as for the IQM introduced in
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], the depth of the network (in our case, the use of VGG-16 and VGG-19
[
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]) did not have any significant impact on the performance of the proposed
VQM.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Content and Compression Analysis</title>
        <p>[Figure 2: (a) Flowervase, (b) Chipmunks, (c) Keiba.]</p>
        <p>Our experiments show a link between the content and the compression method on one
side and the video and image saliency, and thus the performance of our VQM, on the other.
To be more specific, the difference between video saliency and image saliency can provide
a better understanding of the content. Likewise, the difference between the image
or video saliency of the test and reference videos provides information about the
compression method used in the video. Thus, $W_H$ includes information
about video quality, content, and distortion. Experimental results show that
instead of having a value of zero or one for $W_H$, a fuzzy approach for selecting
the value of $W_H$ could improve the accuracy of our proposed VQM.
That is, depending on the amount of temporal and structural variation in the
video, $W_H$ could take different values. For example, our initial study has shown
that in the case of the CSIQ database, $W_H$ would have a low value for the Flowervase
video (Figure 2(a)), while the Chipmunks and Keiba videos (Figures 2(b) and
(c), respectively) would be assigned high $W_H$ values.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future works</title>
      <p>In conclusion, we proposed a set of different VQMs that are inspired by a
CNN-based IQM which assesses the spatial features effectively. Saliency maps of videos
added a spatial-temporal approach to our method, yielding a series of quality
scores for each frame in the video. Different pooling schemes are then applied to these
quality scores to introduce two different video quality scores for the video.
Finally, using a saliency based approach to compare spatial and temporal
distortions, one of the two mentioned scores is presented as the final video quality.
The proposed measure was tested on the CSIQ and the Netflix public datasets.
Our experimental results show that by simply differentiating between spatial
and temporal distortions, our VQM can achieve a better accuracy. The proposed
approach performs well in the case of compression based distortions, while its
accuracy drops in the case of distortions affected by transformation loss.</p>
      <p>As we discussed in Section 3.3, finding the content and distortion type of
a video based on spatial-temporal and spatial saliency could also improve the
performance of the VQM. Further study of this issue and selecting the ideal
weighting function for the two spatial and spatial-temporal VQMs will be part
of the future work we plan to perform.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>1. CSIQ video quality database</article-title>
          , http://vision.eng.shizuoka.ac.jp
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ahn</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Deep blind video quality assessment based on temporal human perception</article-title>
          .
          <source>In: ICIP</source>
          . pp.
          <volume>619</volume>
          {
          <issue>623</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Amirshahi</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          :
          <article-title>Towards a perceptual metric for video quality assessment</article-title>
          .
          <source>Master's thesis</source>
          , Norwegian University of Science and
          <source>Technology (NTNU)</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Amirshahi</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          :
          <article-title>Aesthetic quality assessment of paintings</article-title>
          .
          <source>Verlag Dr. Hut</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Amirshahi</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kadyrova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedersen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>How do image quality metrics perform on contrast enhanced images</article-title>
          ? In: EUVIP. pp.
          <volume>232</volume>
          {
          <issue>237</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Amirshahi</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larabi</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          :
          <article-title>Spatial-temporal video quality metric based on an estimation of qoe</article-title>
          . In: QoMEX. pp.
          <volume>84</volume>
          {
          <issue>89</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Amirshahi</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedersen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Future directions in image quality</article-title>
          .
          <source>In: CIC</source>
          . vol.
          <year>2019</year>
          , pp.
          <volume>399</volume>
          {
          <issue>403</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Amirshahi</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedersen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>S.X.</given-names>
          </string-name>
          :
          <article-title>Image quality assessment by comparing cnn features between images</article-title>
          .
          <source>J ELECTRON IMAGING</source>
          <year>2017</year>
          (
          <volume>12</volume>
          ),
          <volume>42</volume>
          {
          <fpage>51</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Amirshahi</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Redies</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Denzler</surname>
          </string-name>
          , J.:
          <article-title>How self-similar are artworks at different levels of spatial resolution? In: CAE</article-title>
          . pp.
          <volume>93</volume>
          {
          <issue>100</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Antkowiak</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Jamal Baina,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Baroncini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.V.</given-names>
            ,
            <surname>Chateau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>FranceTelecom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Pessoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.C.F.</given-names>
            ,
            <surname>Stephanie Colonnese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Contin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.L.</given-names>
            ,
            <surname>Caviedes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Philips</surname>
          </string-name>
          ,
          <string-name>
            <surname>F.</surname>
          </string-name>
          :
          <article-title>Final report from the video quality experts group on the validation of objective models of video quality assessment march</article-title>
          <year>2000</year>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Bampis</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bovik</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          :
          <article-title>Spatiotemporal feature integration and model fusion for full reference video quality assessment</article-title>
          .
          <source>IEEE T CIRC SYST VID 29(8)</source>
          ,
          <volume>2256</volume>
          {
          <fpage>2270</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Bosch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Munoz</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Representing shape with a spatial pyramid kernel</article-title>
          .
          <source>In: CIVR</source>
          . pp.
          <volume>401</volume>
          {
          <issue>408</issue>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Chan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeng</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohapatra</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banerjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Metrics for evaluating video streaming quality in lossy ieee 802.11 wireless networks</article-title>
          .
          <source>In: Infocom</source>
          . pp.
          <volume>1</volume>
          {
          <issue>9</issue>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Dendi</surname>
            ,
            <given-names>S.V.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krishnappa</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Channappayya</surname>
            ,
            <given-names>S.S.</given-names>
          </string-name>
          :
          <article-title>Full-reference video quality assessment using deep 3d convolutional neural networks</article-title>
          .
          <source>In: NCC</source>
          . pp.
          <volume>1</volume>
          {
          <issue>5</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>P.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akamine</surname>
          </string-name>
          , W.Y.,
          <string-name>
            <surname>Farias</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          :
          <article-title>Using multiple spatio-temporal features to estimate video quality</article-title>
          .
          <source>Signal Process. Image Commun</source>
          .
          <volume>64</volume>
          ,
          <issue>1</issue>
          {
          <fpage>10</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Harel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Graph-based visual saliency</article-title>
          .
          <source>In: NIPS</source>
          . pp.
          <volume>545</volume>
          {
          <issue>552</issue>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Hekstra</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beerends</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ledermann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Caluwe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kohler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koenen</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rihs</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ehrsam</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schlauss</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>PVQM - a perceptual video quality measure</article-title>
          .
          <source>Signal Process. Image Commun</source>
          .
          <volume>17</volume>
          (
          <issue>10</issue>
          ),
          <volume>781</volume>
          {
          <fpage>798</fpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Huynh-Thu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghanbari</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Scope of validity of psnr in image/video quality assessment</article-title>
          .
          <source>Electron. Lett</source>
          .
          <volume>44</volume>
          (
          <issue>13</issue>
          ),
          <volume>800</volume>
          {
          <fpage>801</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahn</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Deep video quality assessor: From spatio-temporal visual sensitivity to a convolutional neural aggregation network</article-title>
          .
          <source>In: ECCV</source>
          . pp.
          <volume>219</volume>
          {
          <issue>234</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.E.:
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          .
          <source>In: NIPS</source>
          . pp.
          <volume>1097</volume>
          {
          <issue>1105</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Spatiotemporal statistics for video quality assessment</article-title>
          .
          <source>IEEE T IMAGE PROCESS</source>
          <volume>25</volume>
          (
          <issue>7</issue>
          ),
          <volume>3329</volume>
          {
          <fpage>3342</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Po</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheung</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheung</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          :
          <article-title>No-reference video quality assessment with 3D shearlet transform and convolutional neural networks</article-title>
          .
          <source>IEEE T CIRC SYST VID 26(6)</source>
          ,
          <volume>1044</volume>
          {
          <fpage>1057</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aaron</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katsavounidis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moorthy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manohara</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Toward a practical perceptual video quality metric</article-title>
          .
          <source>The Netflix Tech Blog</source>
          <volume>6</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>K.H.</given-names>
          </string-name>
          , Liu,
          <string-name>
            <surname>T.J.</surname>
          </string-name>
          , Liu,
          <string-name>
            <given-names>H.H.</given-names>
            ,
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.C.</surname>
          </string-name>
          :
          <article-title>Spatio-temporal interactive laws feature correlation method to video quality assessment</article-title>
          .
          <source>In: ICMEW</source>
          . pp.
          <volume>1</volume>
          {
          <issue>6</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>New strategy for image and video quality assessment</article-title>
          .
          <source>J ELECTRON IMAGING</source>
          <volume>19</volume>
          (
          <issue>1</issue>
          ),
          <volume>011019</volume>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Maczyta</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouthemy</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Meur</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Cnn-based temporal detection of motion saliency in videos</article-title>
          .
          <source>Pattern Recognit. Lett</source>
          .
          <volume>128</volume>
          ,
          <issue>298</issue>
          {
          <fpage>305</fpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Men</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saupe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Spatiotemporal feature combination model for no-reference video quality assessment</article-title>
          .
          <source>In: QoMEX</source>
          . pp.
          <volume>1</volume>
          {
          <issue>3</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Ninassi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Meur</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Callet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barba</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Considering temporal variations of spatial visual distortions in video quality assessment</article-title>
          .
          <source>IEEE J. Sel. Topics Signal Process</source>
          .
          <volume>3</volume>
          (
          <issue>2</issue>
          ),
          <fpage>253</fpage>
          –
          <lpage>265</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Ong</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Colour perceptual video quality metric</article-title>
          .
          <source>In: ICIP</source>
          . vol.
          <volume>3</volume>
          , pp.
          <fpage>III-1172</fpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Redies</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amirshahi</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Denzler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>PHOG-derived aesthetic measures applied to color photographs of artworks, natural scenes and objects</article-title>
          .
          <source>In: ECCV</source>
          . pp.
          <fpage>522</fpage>
          –
          <lpage>531</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Roja</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sandhya</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Saliency based assessment of videos from frame-wise quality measures</article-title>
          .
          <source>In: IACC</source>
          . pp.
          <fpage>639</fpage>
          –
          <lpage>644</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Saad</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bovik</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charrier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Blind prediction of natural video quality</article-title>
          .
          <source>IEEE T IMAGE PROCESS</source>
          <volume>23</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1352</fpage>
          –
          <lpage>1365</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <collab>ITU-T (International Telecommunication Union, Telecommunication Standardization Sector)</collab>
          :
          <article-title>Objective perceptual multimedia video quality measurement in the presence of a full reference</article-title>
          .
          <source>ITU-T Recommendation J.247</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Seshadrinathan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bovik</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          :
          <article-title>Motion tuned spatio-temporal quality assessment of natural videos</article-title>
          .
          <source>IEEE T IMAGE PROCESS</source>
          <volume>19</volume>
          (
          <issue>2</issue>
          ),
          <fpage>335</fpage>
          –
          <lpage>350</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Sheikh</surname>
            ,
            <given-names>H.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bovik</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          :
          <article-title>Image information and visual quality</article-title>
          .
          <source>IEEE T IMAGE PROCESS</source>
          <volume>15</volume>
          (
          <issue>2</issue>
          ),
          <fpage>430</fpage>
          –
          <lpage>444</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <surname>Vu</surname>
            ,
            <given-names>P.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chandler</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          :
          <article-title>ViS3: an algorithm for video quality assessment via analysis of spatial and spatiotemporal slices</article-title>
          .
          <source>J ELECTRON IMAGING</source>
          <volume>23</volume>
          (
          <issue>1</issue>
          ),
          <fpage>013016</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          38.
          <string-name>
            <surname>Vu</surname>
            ,
            <given-names>P.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vu</surname>
            ,
            <given-names>C.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chandler</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          :
          <article-title>A spatiotemporal most-apparent-distortion model for video quality assessment</article-title>
          .
          <source>In: ICIP</source>
          . pp.
          <fpage>2505</fpage>
          –
          <lpage>2508</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          39.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bovik</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sheikh</surname>
            ,
            <given-names>H.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simoncelli</surname>
            ,
            <given-names>E.P.</given-names>
          </string-name>
          :
          <article-title>Image quality assessment: from error visibility to structural similarity</article-title>
          .
          <source>IEEE T IMAGE PROCESS</source>
          <volume>13</volume>
          (
          <issue>4</issue>
          ),
          <fpage>600</fpage>
          –
          <lpage>612</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          40.
          <string-name>
            <surname>Winkler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Digital video quality: vision models and metrics</article-title>
          . John Wiley &amp; Sons (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          41.
          <string-name>
            <surname>You</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Korhonen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep neural networks for no-reference video quality assessment</article-title>
          .
          <source>In: ICIP</source>
          . pp.
          <fpage>2349</fpage>
          –
          <lpage>2353</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>