=Paper=
{{Paper
|id=Vol-2688/paper12
|storemode=property
|title=Evaluating Video Quality by Differentiating Between Spatial and Temporal Distortions
|pdfUrl=https://ceur-ws.org/Vol-2688/paper12.pdf
|volume=Vol-2688
|authors=Meisam Jamshidi Seikavandi,Seyed Ali Amirshahi
|dblpUrl=https://dblp.org/rec/conf/cvcs/SeikavandiA20
}}
==Evaluating Video Quality by Differentiating Between Spatial and Temporal Distortions==
Meisam Jamshidi Seikavandi (1) and Seyed Ali Amirshahi (2)

(1) Nikoo Dana Fanavari Delfan, Technology and Science Park of Lorestan, Iran, meisamjam@gmail.com
(2) The Norwegian Colour and Visual Computing Laboratory, Norwegian University of Science and Technology, Gjøvik, Norway, s.ali.amirshahi@ntnu.no

Abstract. To objectively evaluate the quality of videos, different state-of-the-art Image Quality Metrics (IQMs) have been used to introduce different Video Quality Metrics (VQMs). While such approaches are able to evaluate the spatial quality of the frames in the video, they are not able to address the temporal aspects of the video quality. In this study, we introduce a new full-reference VQM which takes advantage of a Convolutional Neural Network (CNN) based IQM to evaluate the quality of the frames. Using other techniques such as visual saliency detection, we are then able to differentiate between spatial and temporal distortions and use different pooling techniques to evaluate the quality of the video. Our results show that by detecting the type of distortion (spatial or temporal) affecting the video quality, the proposed VQM can evaluate the quality of the video with a higher accuracy.

Keywords: Video Quality Assessment, Video Saliency, Image Saliency, Spatial Distortion, Temporal Distortion, Temporal Pooling.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Colour and Visual Computing Symposium 2020, Gjøvik, Norway, September 16-17, 2020.

1 Introduction

With the huge amount of video we have access to in our daily life, evaluating the quality of videos is an essential part of any application that deals with videos. Although subjective assessment is still considered the primary standard for Video Quality Assessment (VQA), it is time-consuming and financially expensive to perform on a regular basis. For this and many other reasons, objective assessment of video quality has attracted much attention in the last few decades. Objective assessment methods, known as Video Quality Metrics (VQMs), have been widely used to estimate the quality of videos [7]. Depending on the availability of the reference video, VQMs can be classified into full-reference, reduced-reference, and no-reference. Full-reference VQMs need access to the reference video, reduced-reference metrics require partial information about the reference video, and no-reference metrics only have access to the test video.

Full-reference VQMs can be further classified into error-sensitivity-based methods [10,18], structural-similarity-based approaches [39], information-fidelity-based approaches [35], spatial-temporal approaches [34], saliency-based approaches [25], and network-aware approaches [13]. Many of the mentioned methods are extended versions of Image Quality Metrics (IQMs), generally followed by a pooling structure that bridges Image Quality Assessment (IQA) and VQA. In this study, we combine different approaches, such as metrics based on the use of Convolutional Neural Networks (CNNs) and saliency techniques, to calculate a series of quality values for the video frames. Using different weighting techniques that depend on the type of distortion (spatial or temporal) affecting the video, the quality of the frames is pooled to represent the video quality score.
Keeping in mind that VQA has been a field of research for over two decades, it is no surprise that a high number of different VQMs have been introduced. Like any other field of research in image processing and computer vision, early VQMs were based on introducing different single or multiple handcrafted features for VQA. While initially these features were purely mathematical techniques such as the Mean Square Error (MSE) and the Peak Signal to Noise Ratio (PSNR) [40], over time a shift towards features that try to model the Human Visual System (HVS) is seen in VQMs [17,29,33]. Unlike the early VQMs, which were mostly focused on spatial aspects of the video [10,18,39], with the introduction of temporal features VQMs showed an improvement in their performance [15,19,21,24]. Since most VQMs provide a spatial, temporal, and/or spatial-temporal quality value for the videos, different pooling techniques have been used. As an example, saliency maps that give different weights to different regions in a frame, or to different frames, have been used to introduce different VQMs [3,6]. Finally, in recent years, with the introduction of state-of-the-art machine learning techniques and especially Convolutional Neural Networks (CNNs), again, a big improvement has been observed in the accuracy of VQMs [2,14,19,41].

Our contribution in this paper can be summarized as follows: 1) by using video saliency maps we introduce a spatial-temporal dimension to a state-of-the-art IQM and use the approach for video quality assessment; 2) by applying temporal and spatial-temporal pooling techniques, two different quality scores are calculated for each frame in the video; 3) a new content-based evaluation is introduced that is able to detect the type of distortion (temporal or spatial) and propose a VQM based on the distortion detected.

The rest of the paper is organized as follows: in Section 2, we provide a detailed description of the proposed approach, while the experimental results are presented in Section 3. Finally, Section 4 provides a conclusion of the work and the future directions we plan to take to extend it.

2 Proposed Approach

Our proposed VQM (Figure 1) is based on the extraction of spatial and temporal features from the video.

Fig. 1. Pipeline used for calculating the different VQMs proposed in this study. (a) Spatial and spatial-temporal quality assessment for each frame. (b) Different pooling methods used between the quality of each frame for evaluating the quality of the video.
In the figure, blocks with a magenta shade correspond to spatial features, blocks with a green shade correspond to temporal features, and blocks with a gradient shade from magenta to green correspond to spatial-temporal features.

Apart from the main VQM introduced in this study, and to better study the influence of the different features used in our VQM, we also propose other VQMs based solely on one or more of the introduced features.

2.1 Spatial Approach

As pointed out in Section 1, IQA and VQA are closely linked. In fact, when it comes to extracting spatial features from videos, a high number of features used for evaluating video quality were initially introduced in IQMs. In other words, in the case of spatial features, different VQMs extract spatial features introduced in different IQMs on each frame of the video [2,11,27]. In this study, we aim to introduce a new VQM which takes advantage of the IQM proposed in [8].

In [8], Amirshahi et al. propose an IQM based on calculating the similarity between feature maps extracted at different convolutional layers of a pre-trained CNN. Their hypothesis, inspired by the use of Pyramid Histogram of Orientation Gradients (PHOG) [12] features for calculating self-similarity in images [4,9,30], is that the more similar the feature maps at the different convolutional layers are, the more similar the quality of the test and reference images is. To calculate the similarity between two feature maps, they take the following steps (a code sketch of these steps is given after the list):

1. From the reference (I_R) and test (I_T) images, feature maps are extracted at different convolutional layers.

2. For the test image I_T, in convolutional layer n the histogram

h(I_T, n, L) = \left( \sum_{i=1}^{X} \sum_{j=1}^{Y} F(I_T, n, L, 1)(i,j), \; \sum_{i=1}^{X} \sum_{j=1}^{Y} F(I_T, n, L, 2)(i,j), \; \cdots, \; \sum_{i=1}^{X} \sum_{j=1}^{Y} F(I_T, n, L, z)(i,j), \; \cdots, \; \sum_{i=1}^{X} \sum_{j=1}^{Y} F(I_T, n, L, M)(i,j) \right)    (1)

is calculated. In Eq. (1), L corresponds to the level of the spatial pyramid at which the histograms are calculated, and F(I_T, n, L, z) corresponds to feature map z in the n-th convolutional layer of image I_T at level L, with a size of X × Y. To take a pyramid approach, Amirshahi et al. divide the feature maps into four equal sub-regions, resulting in different h histograms (Eq. (1)) at different levels (L) of the spatial resolution. The division and calculation of h continues as long as the smallest side of the smallest sub-region is equal to or larger than seven pixels.

3. The quality of the test image at level L for convolutional layer n is then calculated by

m_{IQM}(I_T, n, L) = d_{HIK}(h(I_T, n, L), h(I_R, n, L)) = \sum_{i} \min\big(h_i(I_T, n, L), h_i(I_R, n, L)\big),    (2)

where d_{HIK} is the histogram intersection kernel and h_i denotes the i-th bin of the histogram.

4. The concatenation of all m_{IQM}(I_T, n, l) values,

m_{IQM}(I_T, n) = \big(m_{IQM}(I_T, n, 1), m_{IQM}(I_T, n, 2), \cdots, m_{IQM}(I_T, n, l), \cdots, m_{IQM}(I_T, n, L)\big),    (3)

is then used in

IQ(I_T, n) = \frac{1 - \sigma(m_{IQM}(I_T, n))}{\sum_{l=1}^{L} \frac{1}{l}} \cdot \sum_{l=1}^{L} \frac{1}{l} \, m_{IQM}(I_T, n, l)    (4)

to calculate the quality of the test image at convolutional layer n. In Eq. (4), σ(m_{IQM}(I_T, n)) corresponds to the standard deviation among the values in m_{IQM}(I_T, n).

5. Finally, the overall quality of the test image is calculated as the geometric mean of the quality scores at the different convolutional layers,

IQ(I_T) = \sqrt[N]{\prod_{n=1}^{N} IQ(I_T, n)},    (5)

where N corresponds to the total number of convolutional layers.
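The following is a minimal NumPy sketch of steps 2-5, assuming the feature maps of the test and reference images have already been extracted from the CNN and grouped by pyramid level. The function names and the L1 normalisation of the histograms before the intersection are our own assumptions and not details taken from [8].

```python
import numpy as np

def layer_histogram(feature_maps):
    """Eq. (1), simplified: sum of activations of each feature map.
    feature_maps: array of shape (M, X, Y). The split into 4^L sub-regions at
    deeper pyramid levels is omitted here for brevity."""
    return feature_maps.sum(axis=(1, 2))

def hik(h_test, h_ref):
    """Histogram intersection kernel d_HIK of Eq. (2); the L1 normalisation is an assumption."""
    h_test = h_test / (np.sum(h_test) + 1e-12)
    h_ref = h_ref / (np.sum(h_ref) + 1e-12)
    return float(np.minimum(h_test, h_ref).sum())

def layer_quality(test_maps_per_level, ref_maps_per_level):
    """Eqs. (3)-(4): quality of the test image at one convolutional layer.
    *_per_level: lists indexed by pyramid level l = 1..L, each entry of shape (M, X, Y)."""
    m = np.array([hik(layer_histogram(t), layer_histogram(r))
                  for t, r in zip(test_maps_per_level, ref_maps_per_level)])
    w = 1.0 / np.arange(1, len(m) + 1)              # the 1/l weights of Eq. (4)
    return (1.0 - m.std()) * np.sum(w * m) / np.sum(w)

def image_quality(test_layers, ref_layers):
    """Eq. (5): geometric mean of the per-layer qualities over the N convolutional layers."""
    per_layer = np.array([layer_quality(t, r) for t, r in zip(test_layers, ref_layers)])
    return float(per_layer.prod() ** (1.0 / len(per_layer)))
```

In the full method, `test_layers` and `ref_layers` would hold the feature maps of one video frame and the corresponding reference frame, extracted from a pre-trained model such as AlexNet.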
While the study presented in [8] was mainly focused on the AlexNet model [20], it was shown that other, deeper CNN models such as VGG16 and VGG19 [36] can also be used. Different studies have shown the flexibility of the mentioned IQM and how it can be extended to improve the performance of other IQMs [5]. For this reason, in this study we take advantage of this IQM to evaluate the spatial quality of the video frames (Figure 1). To evaluate the spatial quality of the test video (V_T), the average quality of the frames,

VQ_1(V_T) = \frac{\sum_{i=1}^{N} IQ(VF_{T_i})}{N},    (6)

could be used. In Eq. (6), VF_{T_i} represents the i-th frame and N corresponds to the number of frames in the test video.

2.2 Spatial-Temporal Approach

It is clear that without taking into account the temporal aspects of a video, any VQM would lack accuracy. In our approach, the first feature extracted from the videos is visual saliency, which is linked to the spatial-temporal aspects of the video quality. Different studies have shown the important role visual saliency plays in IQA and VQA [3]. While there is a considerable number of different methods to calculate the saliency maps of images and videos, the Graph-Based Visual Saliency (GBVS) [16] approach is one of the well-known techniques which has shown good accuracy for both image and video saliency detection. It is important to point out that while image saliency calculation is purely based on spatial features, video saliency calculation also considers temporal aspects of the video and can therefore be linked to the spatial-temporal properties of the video.

In our approach, we first calculate the saliency maps for the test and reference videos. The saliency map of each frame is then resized to the size of the input of the network. Similar to the layers of the pre-trained CNN model used, we apply max-pooling to the calculated saliency map of each frame, resulting in different saliency maps, each corresponding to the size of the feature maps at a convolutional layer of our model. The calculated saliency maps are then used as pixel-wise weights for the features in the different convolutional layers. This allows us to give higher weights to regions in the feature maps that are more salient to the observer. The quality of the video is then calculated by

VQ_2(V_T) = \frac{\sum_{i=1}^{N} IQ(SWVF_{T_i})}{N},    (7)

in which IQ(SWVF_{T_i}) corresponds to the quality of VF_{T_i} where the saliency map of the frame has been used as a weighting function on the feature maps at each convolutional layer (Figure 1).
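As an illustration of the saliency weighting described above, the following is a minimal NumPy sketch, assuming the per-frame GBVS saliency map has already been computed and resized to the network input and that it is at least as large as the feature maps. The block-wise max-pooling, the normalisation of the saliency weights, and the function names are our own assumptions rather than details taken from the paper.

```python
import numpy as np

def maxpool2d(sal, out_h, out_w):
    """Down-sample a 2-D saliency map to (out_h, out_w) with block-wise max-pooling."""
    h, w = sal.shape
    bh, bw = h // out_h, w // out_w
    return sal[:out_h * bh, :out_w * bw].reshape(out_h, bh, out_w, bw).max(axis=(1, 3))

def saliency_weight_layer(feature_maps, frame_saliency):
    """Weight every feature map of one conv layer pixel-wise by the frame's saliency map.
    feature_maps: (M, X, Y); frame_saliency: 2-D map resized to the network input."""
    sal = maxpool2d(frame_saliency, feature_maps.shape[1], feature_maps.shape[2])
    sal = sal / (sal.max() + 1e-12)        # assumed normalisation to [0, 1]
    return feature_maps * sal[None, :, :]  # broadcast the same weights over all M maps

def video_quality_mean(frame_scores):
    """Eqs. (6)/(7): the video score is the plain average of the per-frame scores."""
    return float(np.mean(frame_scores))
```

In the full pipeline, the weighted feature maps of the test and reference frames would then be fed to the IQM sketched in Section 2.1 to obtain IQ(SWVF_{T_i}).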
2.3 Temporal Approach

Although different VQMs try to take into account the spatial and temporal aspects of the video, most VQMs provide a single quality score for each video. To reach this single quality score, different pooling techniques are used to combine the quality scores of all video frames. While careful attention has been paid to how the quality score of each frame is calculated, most, if not all, pooling approaches are based on some version of averaging the quality scores of all frames. The arithmetic mean, geometric mean, harmonic mean, and Minkowski mean are some of the different types of averaging used in different VQMs. It is clear that using any type of averaging on the frame quality values could result in disregarding different aspects of the video that could be linked to the HVS.

In this study, to better link the video quality score to how observers react to the change of quality in a video clip, we try a new approach for pooling the quality scores of the frames. Recent studies such as [28] have suggested that the overall perceptual quality of a video is highly dependent on the temporal variation of the video quality. That is, with an increase in the temporal variation of the video quality along the video sequence, the video quality declines. To address this aspect, we use the variation of the quality scores of the frames in our pooling approach. While the variance of the quality scores of all frames could be a good description of the quality fluctuation in the video, it only provides a general description of the video quality. To better consider the temporal variation of the video quality, we calculate the variance of the quality scores in a specific time frame. That is, for the i-th frame of the test video (V_{T_i}) we calculate

LocalVar(V_{T_i}) = \sigma^2\big(VQ_{T_{i-L}}, \cdots, VQ_{T_i}, \cdots, VQ_{T_{i+L}}\big),    (8)

where VQ_{T_i} denotes the quality score of the i-th frame. In Eq. (8), the length of the local window in which we calculate the variance (σ^2) of the frame quality scores is 2L + 1. Based on our experimental results, the best value for L is 2, resulting in a window of five frames. To introduce a better regional representation of the quality score for the video, we calculate the video quality using

VWVQ(V_T) = \frac{\sum_{i=1}^{N} W_i \times IQ(VF_{T_i})}{\sum_{i=1}^{N} W_i}, \quad W_i = \begin{cases} 1 & \text{if } LocalVar(V_{T_i}) < GlobalVar(V_T) \\ 0 & \text{if } LocalVar(V_{T_i}) > GlobalVar(V_T) \end{cases}.    (9)

In Eq. (9), W_i corresponds to the weight given to the quality score of the i-th frame in the video (VF_{T_i}), and GlobalVar(V_T) represents the variance of all the frame quality scores in the test video. From Eq. (9) it is clear that if the variance of the local quality scores is larger than the variance of the global quality scores, a weight of one is given to the frame quality, but if the variance of the local quality scores is lower than the variance of the global quality scores, a weight of zero is given to the frame quality. Simply said, the quality of a frame is only considered if the change of video quality in a given local interval ([V_{T_{i-L}}, V_{T_{i+L}}]) is bigger than the change of frame quality over the total duration of the video.
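Below is a minimal NumPy sketch of this variance-gated pooling (Eqs. (8)-(9)), assuming the per-frame quality scores have already been computed. The handling of the window at the sequence boundaries and the fallback when every weight is zero are our own assumptions. Note also that the inequality printed in Eq. (9) and the textual description point in opposite directions; the sketch follows the textual description, and the comparison can simply be flipped if the equation is taken as authoritative.

```python
import numpy as np

def variance_weighted_vq(frame_scores, L=2):
    """Variance-gated temporal pooling of Eqs. (8)-(9).
    frame_scores: per-frame quality scores, e.g. IQ(SWVF_Ti); L=2 gives a 5-frame window."""
    q = np.asarray(frame_scores, dtype=float)
    global_var = q.var()                                  # GlobalVar(V_T)
    weights = np.zeros_like(q)
    for i in range(len(q)):
        lo, hi = max(0, i - L), min(len(q), i + L + 1)    # 2L+1 window, clipped at the ends (assumption)
        local_var = q[lo:hi].var()                        # LocalVar(V_Ti) of Eq. (8)
        # Keep the frame when its local quality variation exceeds the global variation,
        # as described in the text; flip the comparison to follow Eq. (9) as printed.
        weights[i] = 1.0 if local_var > global_var else 0.0
    if weights.sum() == 0:                                # fallback (assumption): plain average
        return float(q.mean())
    return float(np.sum(weights * q) / np.sum(weights))
```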
2.4 Spatial vs. Spatial-Temporal Distortion Detection

Although saliency maps have mostly been used to detect salient regions in images and/or videos, studies such as [26,31] have used saliency maps to differentiate between salient and non-salient frames. While this labeling process is simply done by calculating the total energy of the saliency map in each frame, we go one step further. That is, by comparing the saliency maps calculated for the frames using the GBVS video and image saliency techniques, we are able to differentiate between frames that are mostly influenced by spatial or spatial-temporal distortions (see Section 2.2 for a description of the difference between saliency maps calculated using the image and video saliency techniques). The following steps are taken for this process:

1. Assuming the total energy of the saliency in the i-th frame of the test video (VF_{T_i}) is equal to

E_{V_{T_i}} = \sum_{x=1}^{X} \sum_{y=1}^{Y} \big( \text{Video\_Sal}(VF_{T_i})(x, y) \big)^2,    (10)

we calculate similar values for E_{V_{R_i}}, E_{I_{T_i}}, and E_{I_{R_i}}, which represent the total energy of the i-th frame in the reference video using a video saliency approach, the total energy of the i-th frame in the test video using an image saliency approach, and the total energy of the i-th frame in the reference video using an image saliency approach, respectively. In Eq. (10), the i-th frame has a size of X × Y and Video_Sal represents the video saliency function used.

2. The difference between the total salient energy of the reference and test frames using video and image saliency is calculated by

dE_{V_{T_i}} = \frac{| E_{V_{T_i}} - E_{V_{R_i}} |}{E_{V_{R_i}}},    (11)

dE_{I_{T_i}} = \frac{| E_{I_{T_i}} - E_{I_{R_i}} |}{E_{I_{R_i}}}.    (12)

While dE_{V_{T_i}} could be linked to the spatial-temporal aspects of the video, dE_{I_{T_i}} is linked to the spatial aspects.

3. Assuming that the reference video has a higher quality than the test video, in the cases in which dE_{V_{T_i}} is larger than dE_{I_{T_i}} it can be interpreted that the distortion likely has a stronger spatial-temporal effect on the video than a purely spatial one.

While in Section 2.3 the variance weighted VQM (VWVQ) was introduced, in this section, by detecting the type of distortion (spatial or spatial-temporal), we introduce the energy weighted VQM

EWVQ(V_T) = \frac{\sum_{i=1}^{N} EW_i \times IQ(VF_{T_i})}{\sum_{i=1}^{N} EW_i}, \quad EW_i = \begin{cases} dE_{V_{T_i}} & \text{if } dE_{I_{T_i}} < dE_{V_{T_i}} \\ dE_{I_{T_i}} & \text{if } dE_{I_{T_i}} > dE_{V_{T_i}} \end{cases}.    (13)

While until now we have introduced two video quality scores for each video (the variance weighted video quality and the energy weighted video quality), we believe that since the two methods use different approaches, it is highly possible to find situations in which one of the methods performs better than the other. Although finding a perfect metric that ideally detects this issue is challenging, as a first step we introduce the following two parameters:

dE_{V_{ALL}} = \sum_{i} dE_{V_{T_i}},    (14)

dE_{I_{ALL}} = \sum_{i} dE_{I_{T_i}}.    (15)

The final video quality score, which we refer to as the combined video quality, is then calculated by

\text{combined } VQ(V_T) = W_H \times VWVQ + (1 - W_H) \times EWVQ, \quad W_H = \begin{cases} 1 & \text{if } dE_{I_{ALL}} < dE_{V_{ALL}} \\ 0 & \text{if } dE_{I_{ALL}} > dE_{V_{ALL}} \end{cases},    (16)

using the dE_{V_{ALL}} and dE_{I_{ALL}} values introduced earlier. Obviously, finding a better weighting approach for W_H than a simple zero or one would be a better option, which we address in the following sections.
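The following is a minimal NumPy sketch of Eqs. (10)-(16), assuming the GBVS video and image saliency maps of each frame are available as arrays. Treating EW_i as the larger of the two relative energy differences follows directly from the piecewise definition in Eq. (13); the small epsilon guarding against division by zero is our own addition.

```python
import numpy as np

EPS = 1e-12  # guard against division by zero (our own addition)

def saliency_energy(sal_map):
    """Eq. (10): total saliency energy of one frame (sum of squared saliency values)."""
    return float(np.sum(np.square(sal_map)))

def relative_energy_diff(e_test, e_ref):
    """Eqs. (11)-(12): relative difference between the test and reference saliency energies."""
    return abs(e_test - e_ref) / (e_ref + EPS)

def energy_weighted_vq(frame_scores, dE_video, dE_image):
    """Eq. (13): pool the frame scores with EW_i = max(dE_V, dE_I) per frame."""
    q = np.asarray(frame_scores, dtype=float)
    ew = np.maximum(np.asarray(dE_video), np.asarray(dE_image))
    return float(np.sum(ew * q) / (np.sum(ew) + EPS))

def combined_vq(vwvq, ewvq, dE_video, dE_image):
    """Eqs. (14)-(16): pick between the two pooled scores via the summed energy differences."""
    w_h = 1.0 if np.sum(dE_image) < np.sum(dE_video) else 0.0
    return w_h * vwvq + (1.0 - w_h) * ewvq
```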
3 Experimental Results

To evaluate the performance of the proposed VQMs, we calculate the correlation between the subjective scores in different subjective datasets and the objective quality scores from the VQMs.

3.1 Datasets Used

To test the accuracy of our proposed VQMs, two different datasets which are widely used in the scientific community are used. While one dataset (CSIQ) is focused on covering different types of distortion, the other (Netflix) is mainly focused on videos and distortions found in video streaming for entertainment use.

The Computational and Subjective Image Quality (CSIQ) video dataset [1] contains 12 reference videos and 216 distorted videos covering six different types of distortion. All videos in the dataset are in the raw YUV420 format with a resolution of 832 × 480 pixels and a duration of 10 seconds at different frame rates (24, 25, 30, 50, or 60 fps). Among the six distortions, four are linked to different compression-based distortions: H.264 compression (H.264), HEVC/H.265 compression (HEVC), Motion JPEG compression (MJPEG), and wavelet-based compression using the Snow codec (SNOW). The PLoss and WNoise distortions are the other two types of distortion covered in this dataset.

The Netflix public dataset used the Double Stimulus Impairment Scale (DSIS) method to collect its subjective scores. In the DSIS method, the reference and distorted videos are displayed sequentially. Since the focus of this dataset is the quality of video streams for entertainment, a consumer-grade TV under controlled ambient lighting was used in the subjective experiments. Distorted videos with a lower resolution than the reference were upscaled to the source resolution before being displayed on the TV. Observers evaluated the quality of the videos while sitting on a couch in a living-room-like environment and were asked to assess the impairment on a scale of one (very annoying) to five (not noticeable). The scores from all observers were combined to generate a Differential Mean Opinion Score (DMOS) for each distorted video, and the results were normalized in the range of zero to 100, where it was assumed that the reference video has a subjective quality score of 100 [23].

3.2 Results and Discussion

To calculate the accuracy of the proposed VQMs, the Spearman correlation as well as the linear and non-linear Pearson correlations were calculated between our objective scores and the subjective scores provided in the different datasets. In this paper, due to space limitations, we only provide the non-linear Pearson correlation results. From the results we can observe that:

– In the case of each separate distortion, the proposed spatial-based VQM (VQ1) is able to evaluate the video quality with a relatively high correlation (Table 1). This correlation value (average of .89) drops dramatically (.77) when videos are evaluated independently of their distortions. This finding can be linked to the fact that, depending on the type of distortion, different spatial features affect the video quality.

Table 1. Non-linear Pearson correlation values for different distortions using the VQ1 values at different convolutional layers in the AlexNet model.
  dataset  distortion  CONV1  CONV2  CONV3  CONV4  CONV5  All
  CSIQ     H.264        .95    .97    .96    .96    .96    .96
  CSIQ     PLoss        .74    .79    .77    .77    .76    .79
  CSIQ     MJPEG        .45    .89    .93    .93    .89    .90
  CSIQ     Wavelet      .89    .87    .86    .85    .85    .86
  CSIQ     WNoise       .87    .92    .91    .92    .92    .92
  CSIQ     HEVC         .89    .92    .90    .91    .90    .91
  CSIQ     ALL          .70    .77    .75    .76    .77    .77
  Netflix  -            .77    .83    .84    .84    .86    .84

Table 2. Non-linear Pearson correlation values for different distortions using the VQ2 values at different convolutional layers in the AlexNet model.
  dataset  distortion  CONV1  CONV2  CONV3  CONV4  CONV5  All
  CSIQ     H.264        .90    .95    .94    .93    .95    .94
  CSIQ     PLoss        .74    .84    .83    .84    .86    .84
  CSIQ     MJPEG        .50    .92    .91    .90    .90    .90
  CSIQ     Wavelet      .88    .91    .90    .91    .92    .92
  CSIQ     WNoise       .91    .94    .93    .93    .92    .95
  CSIQ     HEVC         .93    .93    .92    .93    .91    .93
  CSIQ     ALL          .77    .82    .82    .82    .82    .82
  Netflix  -            .78    .90    .88    .89    .92    .90

– Similar to the IQM proposed in [8], quality scores using VQ1 at the mid convolutional layers (CONV3 and CONV4 in the case of the AlexNet model) show a higher correlation value (Table 1). Amirshahi et al. have linked this, in the case of images, to the nature of the deeper convolutional layers, which are more focused on the patterns and textures seen in the image.
– Results from calculating VQ2 for the different distortions in the CSIQ dataset (Table 2) show an average increase of 0.02 in the correlation values compared to the VQ1 VQM. From the results, it is interesting to observe that the most significant increases in the correlation value from the VQ1 to the VQ2 VQM are in the case of PLoss (0.05) and WNoise (0.03).

– When it comes to all videos in the dataset, independent of the type of distortion, VQ2 shows a better performance than VQ1. This increase of approximately 0.06 shows that by simply giving a higher weight to the more salient regions of the feature maps we can increase the accuracy of the VQM.

– We can see that, compared to VQ2, the results for VWVQ2 (Table 3) decrease for all individual distortions in the CSIQ dataset, while the overall results do not show any change.

Table 3. Non-linear Pearson correlation values for different distortions using the VWVQ2 values at different convolutional layers in the AlexNet model.
  dataset  distortion  CONV1  CONV2  CONV3  CONV4  CONV5  All
  CSIQ     H.264        .89    .94    .93    .94    .95    .94
  CSIQ     PLoss        .70    .81    .77    .79    .85    .80
  CSIQ     MJPEG        .51    .91    .90    .89    .89    .90
  CSIQ     Wavelet      .88    .90    .89    .90    .92    .91
  CSIQ     WNoise       .89    .90    .88    .89    .87    .90
  CSIQ     HEVC         .93    .93    .93    .93    .91    .93
  CSIQ     ALL          .77    .82    .80    .80    .82    .82
  Netflix  -            .71    .87    .86    .87    .90    .87

Table 4. Non-linear Pearson correlation values for different distortions using the EWVQ2 values at different convolutional layers in the AlexNet model.
  dataset  distortion  CONV1  CONV2  CONV3  CONV4  CONV5  All
  CSIQ     H.264        .90    .94    .93    .93    .96    .95
  CSIQ     PLoss        .71    .79    .80    .80    .83    .81
  CSIQ     MJPEG        .51    .88    .87    .86    .89    .88
  CSIQ     Wavelet      .87    .91    .88    .89    .92    .90
  CSIQ     WNoise       .89    .92    .92    .92    .91    .92
  CSIQ     HEVC         .93    .94    .93    .93    .92    .94
  CSIQ     ALL          .75    .83    .81    .81    .83    .83
  Netflix  -            .85    .90    .90    .90    .91    .91

– Using the saliency-energy-based weighting (EWVQ2) shows a small improvement of 0.01 in the performance for H.264, HEVC, and all distortions combined (Table 4). This can be linked to the fact that, by using saliency, EWVQ2 covers both spatial and temporal aspects of the video quality.

– In the case of Combined VQ2 (Table 5), the results show an increase at the CONV1 and CONV2 layers. Comparing EWVQ2 to VWVQ2, an improvement can also be seen at the CONV3 and CONV4 layers. We can observe that Combined VQ2 has a better or equal performance in the case of the H.264, MJPEG, and HEVC compressions, which could imply that the mentioned compressions are easier for W_H to detect.

– Compared to other state-of-the-art VQMs (Table 6), the proposed approach performs as well as or better in the case of the H.264 and MJPEG compressions. This can be linked to the compatibility of our method with the structure of such compression methods and to how W_H is able to discriminate between these compression methods. In the case of the WNoise and PLoss distortions, our proposed approach does not show a competitive performance. This could be linked to the fact that the saliency methods used could not follow the imposed transformation loss as well as the selected compression distortions.
Table 5. Non-linear Pearson correlation values for different distortions using the Combined VQ2 values at different convolutional layers in the AlexNet model.
  dataset  distortion  CONV1  CONV2  CONV3  CONV4  CONV5  All
  CSIQ     H.264        .90    .95    .93    .94    .95    .95
  CSIQ     PLoss        .63    .73    .70    .71    .78    .72
  CSIQ     MJPEG        .50    .91    .90    .90    .91    .91
  CSIQ     Wavelet      .88    .91    .90    .90    .92    .90
  CSIQ     WNoise       .89    .91    .90    .91    .90    .91
  CSIQ     HEVC         .94    .94    .93    .94    .92    .94
  CSIQ     ALL          .78    .83    .81    .81    .82    .83
  Netflix  -            .84    .89    .90    .90    .91    .90

Table 6. Non-linear Pearson correlation values for different distortions in the CSIQ dataset in comparison with state-of-the-art VQMs.
  metric          H.264  PLoss  MJPEG  Wavelet  WNoise  HEVC  ALL
  SSIM             .95    .84    .80     .89     .97    .96   .76
  VIF [35]         .95    .92    .91     .92     .96    .96   .72
  STMAD [38]       .96    .87    .89     .87     .89    .92   .82
  ViS3 [37]        .93    .82    .81     .93     .93    .96   .81
  MOVIE [34]       .90    .88    .87     .89     .85    .93   .78
  V-BLIINDS [32]   .94    .76    .85     .90     .93    .92   .84
  SACONVA [22]     .91    .81    .85     .85     .90    .90   .86
  VQ1              .96    .79    .90     .86     .92    .91   .77
  VQ2              .94    .84    .90     .92     .95    .93   .82
  VWVQ2            .94    .80    .90     .91     .90    .93   .82
  EWVQ2            .95    .81    .88     .90     .92    .94   .83
  Combined VQ2     .95    .72    .91     .90     .91    .94   .83

– Finally, our experiments showed that, as in the case of the IQM introduced in [8], the depth of the network (in our case, the use of VGG-16 and VGG-19 [36]) did not have any significant impact on the performance of the proposed VQM.

3.3 Content and Compression Analysis

Our experiments show a link between the content and the compression method on the one hand and the video and image saliency, and thus the performance of our VQM, on the other. To be more specific, the difference between the video saliency and the image saliency can provide a better understanding of the content. Likewise, the difference between the image or video saliency of the test and reference videos provides information about the compression method used in the video. Thus, W_H includes information about video quality, content, and distortion. Experimental results show that, instead of having a value of zero or one for W_H, a fuzzy approach for selecting the value of W_H could improve the accuracy of our proposed VQM. That is, depending on the amount of temporal and structural variation in the video, W_H could take different values. For example, our initial study has shown that in the case of the CSIQ database, W_H would have a low value for the Flowervase video (Figure 2(a)), while the Chipmunks and Keiba videos (Figures 2(b) and (c), respectively) would be assigned high W_H values.

Fig. 2. Sample frames from three video clips in the CSIQ dataset: (a) Flowervase, (b) Chipmunks, (c) Keiba.

4 Conclusion and Future Works

In conclusion, we proposed a set of different VQMs inspired by a CNN-based IQM which assesses spatial features effectively. Saliency maps of videos add a spatial-temporal aspect to our method, yielding a series of quality scores for each frame in the video. Different pooling schemes are then applied to these quality scores to introduce two different video quality scores for the video. Finally, using a saliency-based approach to compare spatial and temporal distortions, one of the two mentioned scores is presented as the final video quality. The proposed measure was tested on the CSIQ and the Netflix public datasets. Our experimental results show that by simply differentiating between spatial and temporal distortions, our VQM can achieve a better accuracy.
The proposed approach performs well in the case of compression-based distortions, while its accuracy drops in the case of distortions affected by transformation loss. As discussed in Section 3.3, identifying the content and distortion type of a video based on spatial-temporal and spatial saliency could also improve the performance of the VQM. Further study of this issue, and selecting the best weighting function for the two spatial and spatial-temporal VQMs, will be part of the future work we plan to perform.

References

1. CSIQ video quality database, http://vision.eng.shizuoka.ac.jp
2. Ahn, S., Lee, S.: Deep blind video quality assessment based on temporal human perception. In: ICIP. pp. 619–623 (2018)
3. Amirshahi, S.A.: Towards a perceptual metric for video quality assessment. Master's thesis, Norwegian University of Science and Technology (NTNU) (2010)
4. Amirshahi, S.A.: Aesthetic quality assessment of paintings. Verlag Dr. Hut (2015)
5. Amirshahi, S.A., Kadyrova, A., Pedersen, M.: How do image quality metrics perform on contrast enhanced images? In: EUVIP. pp. 232–237 (2019)
6. Amirshahi, S.A., Larabi, M.C.: Spatial-temporal video quality metric based on an estimation of QoE. In: QoMEX. pp. 84–89 (2011)
7. Amirshahi, S.A., Pedersen, M.: Future directions in image quality. In: CIC. vol. 2019, pp. 399–403 (2019)
8. Amirshahi, S.A., Pedersen, M., Yu, S.X.: Image quality assessment by comparing CNN features between images. J ELECTRON IMAGING 2017(12), 42–51 (2017)
9. Amirshahi, S.A., Redies, C., Denzler, J.: How self-similar are artworks at different levels of spatial resolution? In: CAE. pp. 93–100 (2013)
10. Antkowiak, J., Jamal Baina, T., Baroncini, F.V., Chateau, N., FranceTelecom, F., Pessoa, A.C.F., Stephanie Colonnese, F., Contin, I.L., Caviedes, J., Philips, F.: Final report from the Video Quality Experts Group on the validation of objective models of video quality assessment, March 2000 (2000)
11. Bampis, C.G., Li, Z., Bovik, A.C.: Spatiotemporal feature integration and model fusion for full reference video quality assessment. IEEE T CIRC SYST VID 29(8), 2256–2270 (2018)
12. Bosch, A., Zisserman, A., Munoz, X.: Representing shape with a spatial pyramid kernel. In: CIVR. pp. 401–408 (2007)
13. Chan, A., Zeng, K., Mohapatra, P., Lee, S.J., Banerjee, S.: Metrics for evaluating video streaming quality in lossy IEEE 802.11 wireless networks. In: Infocom. pp. 1–9 (2010)
14. Dendi, S.V.R., Krishnappa, G., Channappayya, S.S.: Full-reference video quality assessment using deep 3D convolutional neural networks. In: NCC. pp. 1–5 (2019)
15. Freitas, P.G., Akamine, W.Y., Farias, M.C.: Using multiple spatio-temporal features to estimate video quality. Signal Process. Image Commun. 64, 1–10 (2018)
16. Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: NIPS. pp. 545–552 (2007)
17. Hekstra, A.P., Beerends, J.G., Ledermann, D., De Caluwe, F., Kohler, S., Koenen, R., Rihs, S., Ehrsam, M., Schlauss, D.: PVQM – a perceptual video quality measure. Signal Process. Image Commun. 17(10), 781–798 (2002)
18. Huynh-Thu, Q., Ghanbari, M.: Scope of validity of PSNR in image/video quality assessment. Electron. Lett. 44(13), 800–801 (2008)
19. Kim, W., Kim, J., Ahn, S., Kim, J., Lee, S.: Deep video quality assessor: From spatio-temporal visual sensitivity to a convolutional neural aggregation network. In: ECCV. pp. 219–234 (2018)
20. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS. pp. 1097–1105 (2012)
21. Li, X., Guo, Q., Lu, X.: Spatiotemporal statistics for video quality assessment. IEEE T IMAGE PROCESS 25(7), 3329–3342 (2016)
22. Li, Y., Po, L.M., Cheung, C.H., Xu, X., Feng, L., Yuan, F., Cheung, K.W.: No-reference video quality assessment with 3D shearlet transform and convolutional neural networks. IEEE T CIRC SYST VID 26(6), 1044–1057 (2015)
23. Li, Z., Aaron, A., Katsavounidis, I., Moorthy, A., Manohara, M.: Toward a practical perceptual video quality metric. The Netflix Tech Blog 6 (2016)
24. Liu, K.H., Liu, T.J., Liu, H.H., Pei, S.C.: Spatio-temporal interactive laws feature correlation method to video quality assessment. In: ICMEW. pp. 1–6 (2018)
25. Ma, Q., Zhang, L., Wang, B.: New strategy for image and video quality assessment. J ELECTRON IMAGING 19(1), 011019 (2010)
26. Maczyta, L., Bouthemy, P., Le Meur, O.: CNN-based temporal detection of motion saliency in videos. Pattern Recognit. Lett. 128, 298–305 (2019)
27. Men, H., Lin, H., Saupe, D.: Spatiotemporal feature combination model for no-reference video quality assessment. In: QoMEX. pp. 1–3 (2018)
28. Ninassi, A., Le Meur, O., Le Callet, P., Barba, D.: Considering temporal variations of spatial visual distortions in video quality assessment. IEEE J. Sel. Topics Signal Process. 3(2), 253–265 (2009)
29. Ong, E., Lin, W., Lu, Z., Yao, S.: Colour perceptual video quality metric. In: ICIP. vol. 3, pp. III–1172 (2005)
30. Redies, C., Amirshahi, S.A., Koch, M., Denzler, J.: PHOG-derived aesthetic measures applied to color photographs of artworks, natural scenes and objects. In: ECCV. pp. 522–531 (2012)
31. Roja, B., Sandhya, B.: Saliency based assessment of videos from frame-wise quality measures. In: IACC. pp. 639–644 (2017)
32. Saad, M.A., Bovik, A.C., Charrier, C.: Blind prediction of natural video quality. IEEE T IMAGE PROCESS 23(3), 1352–1365 (2014)
33. Sector, I.T.S.: Objective perceptual multimedia video quality measurement in the presence of a full reference. ITU-T Recommendation J 247, 18 (2008)
34. Seshadrinathan, K., Bovik, A.C.: Motion tuned spatio-temporal quality assessment of natural videos. IEEE T IMAGE PROCESS 19(2), 335–350 (2009)
35. Sheikh, H.R., Bovik, A.C.: Image information and visual quality. IEEE T IMAGE PROCESS 15(2), 430–444 (2006)
36. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
37. Vu, P.V., Chandler, D.M.: ViS3: an algorithm for video quality assessment via analysis of spatial and spatiotemporal slices. J ELECTRON IMAGING 23(1), 013016 (2014)
38. Vu, P.V., Vu, C.T., Chandler, D.M.: A spatiotemporal most-apparent-distortion model for video quality assessment. In: ICIP. pp. 2505–2508 (2011)
39. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE T IMAGE PROCESS 13(4), 600–612 (2004)
40. Winkler, S.: Digital video quality: vision models and metrics. John Wiley & Sons (2005)
41. You, J., Korhonen, J.: Deep neural networks for no-reference video quality assessment. In: ICIP. pp. 2349–2353 (2019)