Method for Visual Video Defect Detection Using Machine Learning

Dmytro Fedasyuk, Roman Lukomskyi, Tetyana Marusenkova
Lviv Polytechnic National University, St. Bandery str. 28a, Lviv, 79013, Ukraine

Abstract
The main problem addressed by this research is the imperfection of the video testing process. Today this process relies mainly on manual testing, which is inefficient due to the high probability of human error and significant time and material costs. To improve this process, we have created a convolutional neural network that detects defects in video frames with high accuracy. We have also built a prototype of a software system that automatically detects defects in video using the created and trained convolutional neural network.

Keywords
Defect detection, image recognition, machine learning, deep learning, convolutional neural networks, automation, video testing.

CMIS-2021: The Fourth International Workshop on Computer Modeling and Intelligent Systems, April 27, 2021, Zaporizhzhia, Ukraine
EMAIL: dmytro.v.fedasyuk@lpnu.ua (D. Fedasyuk); roman.lukomskyi.mnpz.2019@lpnu.ua (R. Lukomskyi); tetiana.a.marusenkova@lpnu.ua (T. Marusenkova)
ORCID: 0000-0003-3552-7454 (D. Fedasyuk); 0000-0002-0345-8290 (R. Lukomskyi); 0000-0003-4508-5725 (T. Marusenkova)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

The evolution of digital communication systems has led to the active development of multimedia systems and applications such as IPTV (Internet Protocol television), mobile multimedia, social networks, virtual reality games, video conferencing, and educational multimedia presentations. These multimedia applications have become an integral part of everyday life and are expected to grow rapidly in the future. Video is widely used in the above-mentioned application areas, which impose strict requirements on its quality, i.e., video content should be free of defects.

A visual defect, or perceptual artifact, in a video is a noticeable frame distortion. Such a distortion can appear as a result of errors in compression, transmission, or encoding. An example of a visual defect is shown in Figure 1.

Figure 1: An example of a perceptual artifact

Even minor perceptual artifacts can significantly reduce user satisfaction with videos and multimedia services. With that in mind, multimedia providers are paying ever more attention to ensuring the high quality of their video content. Nowadays, the major methods for video quality assessment involve manual testing, which is inefficient due to the human factor.

The purpose of this research is to improve the process of video testing. The developed method should reduce the reliance on manual testing of video content quality, because manual testing has many disadvantages. These include the relatively high cost of the process due to the significant cost of viewing equipment, premises, etc. Another important drawback is the high probability that a defect will be missed by testers: to save time, they usually review only small test fragments, which may not contain a particular visual defect. It is expected that the use of the method will increase the percentage of defects detected during testing and, accordingly, will help to improve the overall quality of video and service.
An important advantage is that the method takes as its input only the video under test. Unlike some existing methods of objective quality assessment, it allows the software system to be used in cases where the original video file is not available. This is an important factor, because in real conditions there is often no access to the original video. Given the rapid development of the video services market, we can conclude that the development of a method for detecting visual defects is quite promising. Even minor and short defects can significantly affect user satisfaction, so companies that provide services related to video distribution and viewing will be interested in improving their video testing process. The object of research is the process of testing video with the detection of visual defects. The subject of research is methods of detecting defects in video.

2. Related work

Visual quality assessment refers to gauging the probability that viewers perceive visual artifacts. Since humans are the ultimate consumers of video content, subjective methods of testing video quality require manual testing, i.e., involving people who view video fragments and give a possibly biased assessment of their quality on a particular rating scale [1, 2]. Because video content is voluminous, testers are usually unable to view it in its entirety. Various methods of subjective video quality assessment define the rules for selecting the optimal duration of test video fragments, the number of people conducting the assessment, and the metrics used to obtain the final result. For example, one recommendation suggests that the optimal test fragment length is 10 seconds [3]. However, studies have been conducted to optimize this duration [4]. These studies claim that, in some cases, reducing the duration of test fragments may not have a significant impact on the testing results while significantly reducing the time spent on testing. This is evidenced by previous studies showing that, firstly, testers become less attentive if the fragment length exceeds 10 seconds [5]; secondly, shorter test sequences contribute to more consistent results [6]; and, thirdly, the average shot length in most modern films is less than 10 seconds [7]. However, when optimizing test suites, there is a risk that defects will be contained in fragments that have been removed from the test suite. Besides, various laboratory factors come into play, including, for example, screen size, lighting, viewer-to-screen distance, and so on [8].

Subjective video testing provides reliable results in some cases, since it is conducted by humans. On the other hand, this approach has a number of drawbacks for the same reason. They include, but are not limited to:
● the need for significant human resources;
● significant costs for the viewing equipment;
● significant time consumption;
● the possibility that a fragment that actually contains a defect will not be selected into the set of test fragments, or that a short defect will go unnoticed by the tester.
Moreover, this approach cannot be used in continuous quality assessment systems. That is, if a new file is added to the video file database, it requires a separate testing session, which often cannot take place immediately after the file is received but must be scheduled for the future due to the human factor.
This also contributes to the risk that a video file will be skipped. Another significant disadvantage is that the same video file is very often encoded with different quality parameters for adaptive transmission, which greatly increases the number of test videos and, accordingly, the resources spent on testing.

Methods of objective video quality testing are currently being developed in order to solve these problems. They are often referred to as objective Quality of Experience (QoE) methods. Structural information in video is especially important for human visual perception. There is an approach to image quality assessment based on structural information, which determines the degradation of structural similarity derived from the statistical properties of local information between the reference and the distorted images [9]. This approach has also been proposed for video. In addition, an information fidelity criterion has been proposed for assessing image quality [10], based on statistical characteristics of the structural information of natural images. Motion information is also important for optimal, realistic quality assessment, and various researchers have proposed quality assessment methods based on it [11, 12]. In particular, in Motion-based Video Integrity Evaluation (MOVIE) [13], proposed for natural video, distorted and reference content is decomposed using Gabor spatio-temporal filters, so the quality index consists of two components: spatial and temporal. This spatio-temporal quality method provides a significant performance improvement in terms of statistical correlation between the results obtained with the objective method and subjective data obtained from humans. The authors note that the method is applicable mainly to videos with natural content. They also acknowledge that it is not computationally efficient: since the method decomposes the video using a large number of Gabor filters, which in turn require a large number of frames, it demands significant computing resources. Taking this into account, we can conclude that the possibilities for using this method are limited.

Despite some progress in the objective detection of defects in video, a number of issues still need to be addressed [14]. In particular, it is difficult to use the existing objective models in general cases, as each has its own characteristics associated with certain types of content, context, or defects. Because of this, a particular model may perform well on some types of video but yield very poor results on others, either because it was not configured for them or because they lack the statistical characteristics on which the model is built. Given the varied nature of multimedia data, the creation of a generalized model whose logic does not depend on the above factors would significantly improve the state of modern video quality assessment. In addition, traditional methods of quality assessment are often based on explicit modeling of human perception. As a result, systems based on these methods are prone to so-called overfitting and therefore produce questionable results on real data. Instead, machine learning-based methods could mimic human perception of quality rather than relying on a precise model of the human visual perception system. Moreover, such methods do not need the original media file in order to assess quality.
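To make the contrast with full-reference methods concrete: the structural-similarity approach of [9] requires both the reference and the distorted frame. The following minimal sketch illustrates this using the scikit-image library; the synthetic frames stand in for frames decoded from the reference and distorted videos:

```python
# A minimal illustration of full-reference structural similarity (SSIM) [9].
# The synthetic arrays below are placeholders for real decoded frames.
import numpy as np
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
reference = (rng.random((128, 128)) * 255).astype("uint8")  # stand-in frame
distorted = reference.copy()
distorted[32:64, 32:64] = 0  # simulate a block-corruption artifact

# SSIM is 1.0 for identical frames; lower values indicate stronger
# structural degradation of the distorted frame.
score = structural_similarity(reference, distorted)
print(f"SSIM: {score:.4f}")
```

The method proposed in this paper, by contrast, must work without access to the reference frame.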
3. Using convolutional neural networks

3.1. The neural network architecture

In order to check whether visual defects are present in a video, we can divide it into frames and analyze them separately. This approach yields detailed results that not only show whether a defect is present but also, if it is, indicate the time at which it occurred. This can be especially handy if a human needs to double-check whether a real defect exists before marking the video as defective.

We can use artificial neural networks (ANN) to detect visual defects in specific frames. Artificial neural networks are a software implementation of the logic of brain structures. We can train them by specifying input information and examples of the corresponding output information [15]. The biggest limitation of artificial neural networks, however, is that they inevitably require a large amount of computational resources when working with a large amount of input data. This might not be an issue for small black-and-white images, but the image data in real-world videos always contains a lot of input information (given its size and color), which renders traditional neural networks impractical for image recognition tasks. To resolve this problem, the input data passed to the fully connected layers needs to be optimized: we need to separate the image features that are truly important for the given recognition task.

Convolutional neural networks (CNN) are a class of artificial neural networks for deep learning that are often used in visual image analysis [16]. The architectural structure of a conventional neural network is fairly simple: each connection has a weight that is used during propagation. A convolutional neural network tries to optimize its input before passing it to a fully connected layer. When it receives large input data (such as a color image), it uses a number of convolution kernel matrices to retrieve important information from the given data. In the case of graphical input, the important data can be the relative locations of certain shapes such as circles, arcs, and lines, the distribution of color across the image, etc. When a fully connected layer receives information about the presence or absence of these features, as well as their relative locations, it can conduct the image recognition process effectively, since the volume of this input is tremendously smaller than that of the full input. A basic CNN architecture is shown in Figure 2. Usually, a CNN architecture contains several convolutional and pooling layers.

Figure 2: Basic CNN architecture

Naturally, in a convolutional neural network, a set of weights encodes important visual features of the image (color distribution, angles, lines, as well as their relative locations). The network's convolution kernels are formed during the training process [17]. They cannot be set in advance, since each image recognition task has its own important features (recognition of certain objects requires a different set of graphical features than recognition of defects). The pooling operation reduces the size of the formed feature maps. In this architecture, information about the presence of a specific feature is considered more important than exact knowledge of its coordinates, so the maximum of several neighboring neurons of the feature map is selected and taken as one neuron of a compact feature map of smaller size [16]. In addition to accelerating further computations, this operation makes the neural network more adaptable to the scale of the input image.
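To make this structure concrete, below is a minimal sketch of such a network in Keras, the framework used later in this work. The layer counts and sizes are illustrative assumptions and do not reproduce the exact architecture of Figure 4:

```python
# A minimal illustrative CNN in Keras (TensorFlow), in the spirit of
# Figure 2: stacked convolution and pooling layers followed by fully
# connected layers. Layer counts and sizes are assumptions made for
# illustration, not the exact architecture of Figure 4.
from tensorflow.keras import layers, models

model = models.Sequential([
    # Convolutional layers extract local visual features (edges, arcs,
    # color patterns) from the input frame.
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    # Max pooling keeps the strongest feature responses and shrinks
    # the feature maps, making the network more scale-tolerant.
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # The compact feature maps are flattened and passed to fully
    # connected layers for the final decision.
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    # A single sigmoid output: probability that the frame is damaged.
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```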
A convolutional neural network consists of input and output layers, as well as several hidden layers. Hidden CNN layers typically include convolutional, pooling, fully connected, and normalization layers.

Convolutional layers apply a convolution operation to the input data, passing the result to the next layer. The convolution operation simulates the response of a single neuron to a specific visual stimulus: each convolutional neuron processes data within its receptive field. The convolution operation solves the problem of the growing number of connections, as it reduces the number of parameters, allowing the network to be deeper [16]. The convolutional layer is the main building block of a CNN. Its parameters consist of a set of learnable filters, each with a small receptive field. During forward propagation, each filter is convolved across the width and height of the input volume, computing the scalar product of the filter and the input data and forming a two-dimensional activation map of that filter. As a result, the network learns which filters are activated when it detects a particular type of feature at a particular spatial position in the input data.

When processing multidimensional inputs (such as images), it is impractical to connect every neuron to all neurons of the previous layer, because such a network architecture ignores the spatial structure of the data. Convolutional networks exploit spatially local correlation by enforcing a local connectivity pattern between neurons of adjacent layers: each neuron is connected to only a small area of the input layer. The size of this area, the receptive field of the neuron, is a hyperparameter of the network. Connections are local in space (along width and height) but always extend along the entire depth of the input volume. The spatial size of the output volume is calculated using formula (1):

N = (W - F + 2P) / S + 1, (1)

where N is the output volume size, W is the input volume size, F is the kernel size, P is the amount of zero padding, and S is the stride. In order for the dimensions of the input and output matrices to be equal (provided that the filter stride equals one), the amount of zero padding is determined by formula (2):

P = (F - 1) / 2, (2)

where P is the amount of zero padding and F is the kernel size.
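As a quick illustration, formulas (1) and (2) can be expressed in a few lines of code; the example values below are arbitrary:

```python
# A small helper that evaluates formulas (1) and (2) for a convolutional
# layer; the example values below are arbitrary illustrations.
def conv_output_size(w: int, f: int, p: int, s: int) -> int:
    """Output volume size N = (W - F + 2P) / S + 1, as in formula (1)."""
    return (w - f + 2 * p) // s + 1

def same_padding(f: int) -> int:
    """Zero padding P = (F - 1) / 2 that preserves the spatial size for
    stride 1, as in formula (2) (assumes an odd kernel size F)."""
    return (f - 1) // 2

# E.g., a 224-wide input with a 3x3 kernel, padding 1, and stride 1
# keeps its spatial size: (224 - 3 + 2*1) / 1 + 1 = 224.
print(conv_output_size(w=224, f=3, p=same_padding(3), s=1))  # 224
```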
Convolutional and pooling layers are followed by fully connected layers (Figure 3). Neurons in a fully connected layer are connected to all neurons of the previous layer, just as in conventional artificial neural networks [18].

Figure 3: Fully connected layers

The loss layer determines how the training process penalizes the deviation between the predicted and expected results and is usually the final layer. Different loss functions can be used for different tasks, for example, the normalized exponential function (softmax). Because the fully connected layer has the most parameters, it is prone to overfitting. One of the most common methods of reducing overfitting is dropout. At each stage of training, individual nodes are, with a certain probability, either "excluded" from the network or kept in it, so that a reduced network results; the incoming and outgoing links of the excluded nodes are also deleted. In the next step, only the reduced version of the neural network is trained on the data. The removed nodes are then re-inserted into the network with their previous weights.

Taking the above facts into account, we chose the neural network architecture shown in Figure 4.

Figure 4: Architecture of the created convolutional neural network

3.2. Dataset construction

In order to construct the dataset, we selected several Creative Commons-licensed videos from the Internet. To provide the best learning results during training, the dataset needs to contain a wide variety of training data. To achieve this, we made sure that the selected videos cover a diverse set of content genres. In addition, the scenes in the videos include different types of camera motion: static, moving, and zoom. We also ensured that the videos are of pristine quality by carefully inspecting each of them; making sure that the training data is labeled correctly helps to avoid confusing the convolutional neural network during training.

To obtain distorted versions of the same videos, we simulated errors in the original videos. After that, we once again verified that the distorted videos indeed have visual defects in each frame. We then broke the videos down into separate frames so that they could be given to the CNN as its input. The final dataset consists of 129,092 video frames. Each frame in the dataset is labeled as either "damaged" or "original", as shown in Figure 5. For each damaged frame there is an original frame in the dataset, and vice versa. This way the convolutional neural network can learn most effectively, since it has similar reference images for both classes.

Figure 5: Original (on the left) and distorted (on the right) frames from the dataset

3.3. Training

Since convolutional neural networks require a lot of training data to perform well, it is common to use the existing entries in the dataset to generate new training examples. This technique is called data augmentation [18, 19]. Advanced data augmentation may involve a generative adversarial network (GAN) [20] or a balancing GAN (BAGAN) [21]. However, given the nature of our data, and especially the importance of keeping all training examples as close to natural samples as possible, these techniques are not needed in our case. Traditional augmentation applies a combination of transformations to the original data. Since we are working with images, such transformations may include:
● horizontal/vertical image flip;
● rotation;
● cropping;
● translation;
● shift;
● zoom;
● shading.
We discovered experimentally that, in the case of visual defect detection, image flipping performs best among the above approaches. This can be explained by the fact that the other techniques may introduce distortions into dataset images marked as "original" (for example, excessive pixelation after zooming in or out), which would have a significant negative effect on the training process and the final accuracy of the CNN. Taking this into account, we decided to apply data augmentation using the image flip technique.
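As an illustration, flip-based augmentation can be applied on the fly in Keras; the following is a minimal sketch in which the frames and labels are hypothetical placeholders:

```python
# A minimal sketch of flip-only data augmentation in Keras; the arrays
# below are hypothetical placeholders for the real dataset frames.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_frames = np.random.rand(8, 224, 224, 3)  # placeholder frames
train_labels = np.array([0, 1] * 4)            # placeholder 0/1 labels

# Only flipping is enabled: unlike zoom, shift, or rotation, a flip
# cannot introduce new distortions into frames labeled as "original".
datagen = ImageDataGenerator(horizontal_flip=True, vertical_flip=True)

# Yields batches of randomly flipped frames during training;
# model.fit(train_iterator, epochs=4) would then consume them.
train_iterator = datagen.flow(train_frames, train_labels, batch_size=32)
```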
In order to conduct the training, we used TensorFlow and Keras with Python. As a result of training on the previously created dataset, the neural network achieved an accuracy of 98.5%. The training process, with the respective changes in training and validation loss and accuracy, is shown in Table 1.

Table 1
The training process

Epoch   Training loss   Training accuracy   Validation loss   Validation accuracy
1       0.2121          0.9146              0.3368            0.9148
2       0.0973          0.9719              0.0710            0.9785
3       0.0845          0.9786              0.0663            0.9835
4       0.0826          0.9806              0.0443            0.9850

As can be seen from Table 1, the created convolutional neural network can detect visual artifacts in frames with high accuracy. We can now use it to detect visual defects in video. Using the trained model, we built a prototype of a software system that checks a video for visual defects by cyclically performing the following steps (see the sketch after Figure 7):
● retrieving a video frame: in this step, the prototype forms a video frame in its final image form, the same form in which it would be shown to the viewer;
● analyzing the video frame: the convolutional neural network analyzes the frame obtained in the previous step and outputs the probability that a visual distortion is present;
● saving the result of the analysis: the software system stores the analysis result for the particular video frame so that it can later be shown to the user in the graphical interface.
The described algorithm is shown in Figure 6.

Figure 6: Algorithm of the software system operation

After performing the steps described above, the prototype builds a diagram of the defects, as shown in Figure 7. Normally, at this point, the person responsible for the testing process would examine the diagram and investigate the defects that have been found.

Figure 7: The defects probability diagram
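The per-frame loop of Figure 6 can be sketched as follows. This is an illustrative sketch only, using OpenCV for frame retrieval; the file names, the 224x224 input size, and the storage format are assumptions, not the prototype's exact implementation:

```python
# An illustrative sketch of the per-frame analysis loop from Figure 6,
# using OpenCV for frame retrieval. File names, input size, and storage
# format are assumptions, not the prototype's exact implementation.
import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("defect_detector.h5")            # hypothetical model file
capture = cv2.VideoCapture("video_under_test.mp4")  # hypothetical video file

results = []  # (frame index, probability that the frame is damaged)
frame_index = 0
while True:
    # Step 1: retrieve the next video frame in its final viewing form.
    ok, frame = capture.read()
    if not ok:
        break
    # Step 2: the CNN estimates the probability of a visual defect.
    frame = cv2.resize(frame, (224, 224)).astype(np.float32) / 255.0
    probability = float(model.predict(frame[np.newaxis, ...], verbose=0)[0][0])
    # Step 3: store the result so it can later be shown to the user
    # as the defect probability diagram (Figure 7).
    results.append((frame_index, probability))
    frame_index += 1
capture.release()
```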
4. Conclusion

In this paper, we analyzed the problem of video testing. The video content in modern content delivery systems undergoes multiple stages that can introduce visual distortions, and such distortions can have tremendous effects on the end-user experience. Manual testing takes an unreasonable amount of time and carries a high probability of human error. In spite of the different methods and ideas for improving defect detection, a number of issues remain, such as the need for significant human resources, significant costs for viewing equipment, and limited applicability to continuous quality assessment systems. In addition, traditional methods of quality assessment are often based on explicit modeling of human perception; systems built on such an approach are prone to overfitting and produce questionable results on real data. Machine learning-based methods could solve this issue. Moreover, such methods do not need the original media file in order to assess quality.

To address these problems, we proposed using a convolutional neural network. Machine learning-based methods can mimic human perception of quality, which can increase the percentage of defects detected during testing. To train the CNN, we created a dataset of 129,092 frames from pristine and distorted videos. To provide even more training data to the CNN, we used data augmentation. As a result of training, the convolutional neural network achieved an accuracy of 98.5%. Taking into account the high accuracy of the trained model, it can be used to detect visual defects in video frames. The created model was used to build a prototype of a software system that detects defects in video with high accuracy. The developed software system can reduce the human factor in the video testing process, as well as significantly speed up this process and reduce its cost. Saving the results of the analysis and the defect diagram helps the user to easily analyze the detected defects.

5. References

[1] A. K. Moorthy, K. Seshadrinathan, A. C. Bovik, Image and Video Quality Assessment: Perception, Psychophysical Models, and Algorithms, Perceptual Digital Imaging: Methods and Applications (2017) 55-81.
[2] M. H. Pinson, S. Wolf, Comparing subjective video quality testing methodologies, Visual Communications and Image Processing (2003) 573-582. doi:10.1117/12.509908.
[3] ITU-R, Methodology for the Subjective Assessment of the Quality of Television Pictures, 2002. URL: http://www.gpds.ene.unb.br/databases/2012-UNB-Varium-Exp/Exp3-Delft/00-report-alexandre/Papers---Judith/Subjective%20Studies/ITU-Recommendation---BT500-11.pdf
[4] F. M. Moss, K. Wang, F. Zhang, R. Baddeley, D. R. Bull, On the optimal presentation duration for subjective video quality assessment, IEEE Transactions on Circuits and Systems for Video Technology (2015) 1977-1987.
[5] P. Fröhlich, S. Egger, R. Schatz, M. Mühlegger, K. Masuch, D. Gardlo, QoE in 10 seconds: Are short video clip lengths sufficient for quality of experience assessment?, 2012 Fourth International Workshop on Quality of Multimedia Experience, IEEE (2012) 242-247.
[6] B. W. Tatler, R. J. Baddeley, I. D. Gilchrist, Visual correlates of fixation selection: Effects of scale and time, Vision Research 45, no. 5 (2005) 643-659.
[7] J. E. Cutting, K. L. Brunick, J. E. DeLong, C. Iricinschi, A. Candan, Quicker, faster, darker: Changes in Hollywood film over 75 years, i-Perception 2 (2011) 569-576. doi:10.1068/i0441aap.
[8] Q. Huynh-Thu, M. Ghanbari, D. Hands, M. Brotherton, Subjective video quality evaluation for multimedia applications, Human Vision and Electronic Imaging XI, vol. 6057, p. 60571D, International Society for Optics and Photonics (2006). doi:10.1117/12.641703.
[9] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing 13(4) (2004) 600-612. doi:10.1109/TIP.2003.819861.
[10] H. R. Sheikh, A. C. Bovik, G. De Veciana, An information fidelity criterion for image quality assessment using natural scene statistics, IEEE Transactions on Image Processing 14, no. 12 (2005) 2117-2128. doi:10.1109/TIP.2005.859389.
[11] K. Seshadrinathan, A. C. Bovik, Motion tuned spatio-temporal quality assessment of natural videos, IEEE Transactions on Image Processing 19.2 (2009) 335-350. doi:10.1109/TIP.2009.2034992.
[12] R. Soundararajan, A. C. Bovik, Video quality assessment by reduced reference spatio-temporal entropic differencing, IEEE Transactions on Circuits and Systems for Video Technology (2013) 684-694. doi:10.1109/TCSVT.2012.2214933.
[13] K. Seshadrinathan, A. C. Bovik, Motion-based perceptual quality assessment of video, Human Vision and Electronic Imaging XIV (2009). doi:10.1117/12.811817.
[14] Z. Akhtar, T. H. Falk, Audio-Visual Multimedia Quality Assessment: A Comprehensive Survey, IEEE Access (2017). doi:10.1109/ACCESS.2017.2750918.
[15] C. Aggarwal, Neural Networks and Deep Learning: A Textbook, Springer, New York, NY, 2018.
[16] K. O'Shea, R. Nash, An Introduction to Convolutional Neural Networks, arXiv preprint arXiv:1511.08458 (2015).
[17] F. Millstein, Convolutional Neural Networks In Python: Beginner's Guide To Convolutional Neural Networks In Python, 2018.
[18] L. Perez, J. Wang, The effectiveness of data augmentation in image classification using deep learning, arXiv preprint arXiv:1712.04621 (2017).
[19] C. Shorten, T. M. Khoshgoftaar, A survey on image data augmentation for deep learning, Journal of Big Data 6.1 (2019) 1-48. doi:10.1186/s40537-019-0197-0.
[20] I. Goodfellow, NIPS 2016 tutorial: Generative adversarial networks, arXiv preprint arXiv:1701.00160 (2016).
[21] G. Mariani, F. Scheidegger, R. Istrate, C. Bekas, C. Malossi, BAGAN: Data augmentation with balancing GAN, arXiv preprint arXiv:1803.09655 (2018).