Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)


Shot Boundary Detection: Fundamental Concepts and Survey

1st Benoughidene Abdel halim
Department of computer science
University of Batna 2
Batna, Algeria
benouhalim@gmail.com

2nd Titouna Faiza
Department of computer science
University of Batna 2
Batna, Algeria
ftitouna@yahoo.fr


   Abstract—A large part of the Big Data surge in our digital environments takes the form of video. Automatic management of this massive growth in video content is therefore necessary. Current research topics in automatic video analysis include video abstraction or summarization, video classification, video annotation and content-based video retrieval. All of these applications require shot boundaries to be identified. Video shot boundary detection (SBD) is the process of segmenting a video sequence into smaller temporal units called shots, and it is the primary step for any further video analysis. This paper presents the fundamental theory of video shot boundaries and a brief overview of shot boundary detection approaches and their development. The advantages and disadvantages of each approach are comprehensively explored and open challenges are presented. In addition, we focus on machine learning technologies such as deep learning approaches for SBD, which could serve as new directions for future work.
   Index Terms—Shot Boundary Detection (SBD), Cut Transition (CT), Gradual Transition (GT), Temporal Video Segmentation, Video Content Analysis, Content Based Video Indexing and Retrieval (CBVIR), Feature Extraction, Machine Learning, Deep Learning, Convolutional Neural Networks (CNN), Multimedia Big Data.

                       I. INTRODUCTION

   With the rapid development of computer networks and multimedia technology, the amount of multimedia data produced every day is enormous and increasing at a high rate, as is the ease of access to and availability of multimedia sources, which leads to a multimedia big data revolution.
   Video is the most consumed data type on the Internet, through services such as YouTube, Vimeo, Dailymotion and Yahoo Video, and social networking sites such as Facebook, Twitter and Instagram. The explosive growth in video content leads to the problem of content management. People spend their time uploading and browsing huge numbers of videos to determine whether they are relevant, which is a difficult and stressful task for humans [1]. In such a scenario, it is necessary to have automated video analysis applications to represent the information stored in large multimedia data. Such techniques are grouped under the single concept of Content-Based Video Indexing and Retrieval (CBVIR) systems. These applications include browsing of video folders, news event analysis, intelligent management of videos, video surveillance [2], key frame extraction [3], and video event partitioning [4]. In addition, the video summary is the best and most effective solution for converting large, amorphous videos into structured, concise, clear and meaningful information. The main task in summarizing a video is to segment the original video into shots and extract key frames from the shots, which will be the most representative and concise frames of the entire video [5].
   Video shot boundary detection (SBD), also called shot segmentation, is the first process in video summarization, and its output significantly affects the subsequent processes. The main idea of shot boundary detection is to extract features from the video frames and then detect the transition type according to the feature differences. There are two kinds of shot boundary: Cut Transition (CT) and Gradual Transition (GT) [6]. In general, the performance of a shot boundary detection algorithm depends on its ability to detect transitions (shot boundaries) in the video sequence. The accuracy of shot boundary detection in turn depends on the extracted features and their effectiveness in representing the visual content of video frames, and on the computational cost of the algorithm, which needs to be reduced [7].
   In practice, several effects appear in video shots, such as flash lights or light variations, object and camera motion, camera operations (such as zooming, panning, and tilting), and similar backgrounds. Currently, no single algorithm offers a complete solution to these problems, or even to most of them. In other words, an effective method of detecting transitions between shots is still not available despite the increased attention devoted to shot boundary detection in the last two decades. This is due to the randomness and size of raw video data. Hence, a robust, efficient, automated shot boundary detection method is a necessary requirement [8].
   Most existing reviews do not cover recent directions in the field of shot boundary detection, such as deep learning. This paper mainly reviews and analyzes different kinds of shot boundary detection algorithms implemented in the uncompressed domain with respect to their accuracy rate, computational load, feature extraction technique, advantages, and disadvantages. Future research directions are also discussed.

        II. BASIC CONCEPTS OF SHOT BOUNDARY DETECTION

   Partitioning a video sequence into shots is the first step toward video summarization. A video shot is defined as a series of interrelated consecutive frames taken contiguously by a single camera and representing a continuous action in time and space. As such, a shot boundary is the transition between two shots. This section presents the main concepts for shot boundary detection in videos [9].
   1) Video Definition: A video is a collection of image frames arranged in a time-sequenced manner. The number of frames in a video depends on the size of the video, and these frames occupy a large amount of memory. The frame rate is about 20 to 30 frames per second [10].
   2) Video Hierarchy: A video can be broken down into scenes, shots and frames. A scene is a logical grouping of shots into a semantic unit. A shot is a sequence of frames captured by a single camera in a single continuous action. The frames within a shot (intra-shot frames) contain similar information and visual features with temporal variations. A frame is the smallest unit that constitutes a shot [10]. (see Fig. 1)

                Fig. 1: Video hierarchy.

   3) Shot transition types: The transition between one shot and the following one can be a cut or gradual. A cut occurs when two successive shots are concatenated directly without any editing (special effects). This type of transition is also known as an abrupt or hard transition. The cut is considered a sudden change from one shot to another. By contrast, a gradual transition occurs when two shots are combined by using special effects during production. A gradual transition may span two or more frames that are visually interdependent and contain truncated information [11]. Depending on the editing effect, there are several kinds of gradual transition, such as fade in/fade out, dissolve, and wipe [12]. (see Fig. 2)

             Fig. 2: Video shot transition types.

      a) A cut: A sudden change from one video shot to another [6]. (see Fig. 3)

             Fig. 3: (a) Cut Transition.

      b) A fade out: Occurs when the shot gradually turns into a single monochrome frame, usually dark [6]. (see Fig. 4)

             Fig. 4: (b) Fade out.

      c) A fade in: Takes place when the scene gradually appears on screen [6]. (see Fig. 5)

             Fig. 5: (c) Fade in.

      d) A dissolve: Happens when a shot gradually replaces another one. One disappears as the following appears, and for a few seconds they overlap and both are visible. During a dissolve, two adjacent shots are temporally as well as spatially associated [6]. (see Fig. 6)

             Fig. 6: (d) Dissolve.

      e) The wipe: Is more dynamic and is considered the most difficult to model and to detect. It happens when a shot pushes the other one off the screen. In this case, two adjacent shots are spatially separated at any time, but not temporally separated. Its difficulty lies in the number of types of wipe transition that exist. Indeed, when a shot moves off the screen (i.e. leaves its place to the incoming shot), the movement can be horizontal (e.g. from left to right), vertical (i.e. from bottom to top or vice versa), oblique (i.e. from a corner to the opposite one), starting from the center, going towards the center, or others [6]. (see Fig. 7)
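
The transition types a) to e) can be imitated synthetically, which helps make their definitions concrete. The following is a minimal sketch, assuming frames are NumPy arrays of equal shape, a linear blend for fades and dissolves, and a left-to-right wipe; the function name `synth_transition` and its parameters are illustrative and do not come from any of the cited works.

```python
import numpy as np

def synth_transition(a, b, kind, n):
    """Generate n transition frames between frame `a` (outgoing shot)
    and frame `b` (incoming shot).

    kind is one of "cut", "fade_out", "fade_in", "dissolve", "wipe".
    All blending rules here are simplifying assumptions.
    """
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    frames = []
    for i in range(n):
        t = (i + 1) / (n + 1)  # progress, strictly between 0 and 1
        if kind == "cut":
            # Abrupt change: no blended frames at all.
            f = a if t < 0.5 else b
        elif kind == "fade_out":
            # Outgoing shot turns gradually into a black frame.
            f = (1.0 - t) * a
        elif kind == "fade_in":
            # Incoming shot appears gradually from a black frame.
            f = t * b
        elif kind == "dissolve":
            # Both shots overlap and are visible at the same time.
            f = (1.0 - t) * a + t * b
        elif kind == "wipe":
            # Incoming shot pushes the outgoing one off, left to right:
            # the shots are spatially separated in every frame.
            f = a.copy()
            edge = int(round(t * a.shape[1]))
            f[:, :edge] = b[:, :edge]
        else:
            raise ValueError(f"unknown transition kind: {kind}")
        frames.append(f)
    return np.stack(frames)
```

In the dissolve every pixel mixes both shots, whereas in the wipe each pixel belongs to exactly one shot, which matches the "spatially separated but not temporally separated" description given for the wipe.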

        Fig. 7: (e) Various types of wipe transitions.

   4) Feature extraction: The process of representing a raw image in a reduced form to facilitate decision making such as pattern detection, classification or recognition. The features extracted from the video frames may be low-level, mid-level or high-level features [13].
      a) Low-level features: The low-level features are minor details of the image, like lines or dots, that do not take visual or semantic content into consideration. They consist of RGB values/histograms, intensity values, and the mean, variance and entropy of the pixel values, etc. [10].
      b) Mid-level features: The mid-level features are intermediate between the low-level features and high-level semantics. They consist of feature point detectors and descriptors. Although the feature points may be used for object identification in an image, they are not appropriate for high-level semantic description of the content depicted in an image [10].
      c) High-level features: High-level features are built to detect objects and larger shapes in the image, the trajectories followed by objects, motion vectors, etc. They may be used for high-level description of the content in an image [10].

   Because of the importance of SBD, many researchers have presented algorithms to boost the accuracy of SBD for Cut Transition (CT) and Gradual Transition (GT). We survey various SBD approaches below.

           III. SHOT BOUNDARY DETECTION METHODS

   Many researchers are currently working to develop more reliable and accurate algorithms that yield more precise shot boundaries. Several common families of methods deal with CT and/or GT:

A. Pixel-Based Methods
   In this method, pixel intensity is evaluated by taking two consecutive video frames and either comparing them pixel by pixel or measuring the percentage of pixels that have changed between the two frames. When the difference in pixel intensity exceeds a threshold, a shot change is declared [6].
   The main drawback of such approaches, whatever the metric used, is their sensitivity to fast object and camera movement and to camera panning or zooming. A further limitation is that the threshold has to be set manually.

B. Histogram-Based Methods
   The most popular metric for cut transition detection is the difference between the histograms of two consecutive frames. A histogram describes the distribution of gray level, color, shape and texture without taking their position into account, so the similarity between two images can be estimated through the similarity of their histograms. This method first extracts the histograms of the video frames and then calculates the distance between the histograms. When the distance exceeds a threshold, a shot change is declared. There are several ways to calculate the histogram distance, such as the Manhattan distance, the Euclidean distance and the chi-square distance. Several variants of histogram-based methods have been proposed in the literature.
   Lu et al. [14] employed Singular Value Decomposition (SVD) with a Hue Saturation Value (HSV) histogram to propose an SBD scheme with low computational complexity. Candidate segments are selected using an adaptive threshold. Color histograms are then extracted in HSV space from all frames in each candidate segment, forming a frame feature matrix. The SVD is then performed on the frame feature matrices of all candidate segments to reduce the feature dimension.
   Bendraou Youssef et al. [15] formulated a new approach for detecting both hard (CT) and gradual (GT) transitions. The proposed approach processes the video segment by segment and is composed of two main parts: static segment verification (for a candidate segment that does not contain a transition) and shot transition identification (for a candidate segment that may contain a CT or GT). Features are extracted from Concatenated Block Based Histograms (CBBH). For each non-static segment, the features of all frames in the segment form a frame feature matrix. The economy SVD is then performed on the feature matrix. An adaptive double thresholding process is employed for detecting hard cuts. For gradual transition detection, the folding-in technique, known as SVD-updating, is used for the first time in video shot boundary detection.
   Hong Shao et al. [16] exploited HSV color histogram and Histogram of Gradient (HOG) features to detect cut transitions. The HSV color histogram is used to detect the difference between two adjacent frames, while the HOG feature is adopted for a secondary detection to improve the performance of the algorithm.
   The study confirmed that histogram difference is less sensitive to object motion than the pair-wise comparison, since it ignores the spatial changes in a frame. However, histograms
may also produce missed shots when two frames with different content have similar histograms.

C. Edge-Based Methods
   Another choice for characterizing an image is its edge information. An edge is the boundary between an object and the background, and indicates the boundary between overlapping objects. In edge-based approaches, a transition is declared when the locations of the edges of the current frame exhibit a large difference from the edges of the previous frame that have disappeared. For example, Heng et al. [17] proposed an edge-based method. They presented the concept of an object's edge by considering the pixels close to the edge. The edges of an object were matched between two consecutive frames. Then, a transition was declared by using the ratio of the object's edges that persisted over time to the total number of edges. Zheng et al. [18] proposed an approach based on a Roberts edge detector for detecting fade-in and fade-out transitions. First, the authors identified the frame edges by comparing gradients with a fixed threshold. Second, they determined the total number of edges that appeared. When a frame without edges occurred, a fade in or fade out was declared.
   The advantage of this feature is that it is sufficiently invariant to illumination changes and several types of motion, and is related to the human visual perception of a scene. Its main disadvantages are computational cost and noise sensitivity.

D. Motion-Based Methods
   Motion is a key feature of videos and forms an integral part of them. Because shots with camera motion can be incorrectly classified as gradual transitions, detecting zooms and pans increases the accuracy of a shot boundary detection algorithm. Bruno et al. [19] proposed a linear motion prediction method based on wavelet coefficients, which were computed directly from two successive frames.
   For accurate motion estimation, each block should be matched with all blocks of the next frame, which leads to a large and unreasonable computational cost.

E. Deep Learning-Based Methods
   Recently, the use of deep learning algorithms in computer vision has received much attention from academics. The Convolutional Neural Network (CNN) is one of the most important deep learning algorithms because of its significant ability to extract high-level features from images and video frames [20].
   Tong et al. [21] used a CNN model to extract high-level interpretable features from the frames. The method is capable of detecting both CT and GT boundaries. An adaptive threshold process was employed as a preprocessing stage to select candidate segments. Taking one frame as input, the output of the network is a probability distribution over 1000 classes. The five classes with the highest probabilities are selected as the high-level features of the frame and are called the TAGs of the frame for simplicity. However, when changes within a GT are small and the background is similar, the semantics do not change at all, so the method cannot achieve high detection accuracy.
   Jingwei Xu et al. [22] used convolutional neural networks (CNNs) to extract typical features of frames. They adopted a candidate segment selection method to locate the positions of shot boundaries coarsely using adaptive thresholds and eliminate most non-boundary frames. Cut and gradual transitions are then obtained with a novel pattern-matching method based on a new similarity strategy that is partially inspired by [14].
   Hassanien et al. [23] presented a shot boundary detection method for huge video data sets based on a spatial-temporal CNN. The technique, named DeepSBD, takes a segment of fixed length as input and classifies it into 3 categories (cut, gradual, no transition). Its output is fed through an SVM classifier, which gives the first labeling estimate. Consecutive segments with the same label are merged and the result is passed to a post-processing step, which reduces false alarms of gradual transitions through a histogram-driven temporal differential measurement. However, the C3D ConvNet is more complex than a 2D ConvNet and requires considerable computational resources. In addition, the lengths of gradual transitions vary, but DeepSBD is not designed for multi-scale detection.
   Michael Gygli et al. [24] proposed to learn shot detection end to end, from pixels to final shot boundaries, using a fully convolutional neural network. To train this model, they treated all shot boundaries as generated: they created a dataset with one million frames and automatically generated transitions such as cuts, dissolves and fades. They framed the task as a binary classification problem: predicting whether a frame is part of the same shot as the previous frame or not. Their method obtains state-of-the-art results on the RAI data set while running at an unprecedented speed of more than 120x real-time. Currently, their model makes three main errors: (i) missing long dissolves, which it was not trained with, (ii) partial scene changes and (iii) fast scenes with motion blur.
   Shitao Tang et al. [25] presented a new cascade framework, a fast and accurate approach for shot boundary detection. The first stage applies adaptive thresholding to initially filter the whole video and selects the candidate segments for acceleration. In the second stage, a well-designed 2D ConvNet learns the similarity function between two images to locate the cut transitions. The third stage uses a novel C3D ConvNet model to locate the positions of gradual transitions.
   Lifang Wu et al. [26] presented a two-stage method for shot boundary detection (TSSBD). The first stage distinguishes cut shots by fusing color histogram (HSV) and deep (CNN) features, which divides the complete video into segments containing gradual transitions. Over these video segments, gradual shot change detection is implemented using a 3D convolutional neural network, which classifies clips into specific gradual shot change types. A majority voting strategy and gap filling are then used to distinguish the shot types of frames and locate shot boundaries.
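
Several of the methods above ([21], [22], [25]) first select candidate segments with an adaptive threshold before applying a network. As a rough illustration of that preprocessing idea, the sketch below flags candidate boundaries from the Manhattan distance between gray-level histograms of consecutive frames. The threshold rule (mean plus `k` standard deviations of the distances) is an assumption made for this example and stands in for the thresholds of the cited works, which are not specified here; the function names are likewise illustrative.

```python
import numpy as np

def gray_histogram(frame, bins=32):
    """Normalized gray-level histogram of one frame (values in 0..255)."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def frame_distances(frames, bins=32):
    """Manhattan (L1) distance between histograms of consecutive frames."""
    hists = [gray_histogram(f, bins) for f in frames]
    return np.array([np.abs(h1 - h0).sum()
                     for h0, h1 in zip(hists[:-1], hists[1:])])

def candidate_boundaries(dist, k=3.0):
    """Indices of frames that start a candidate new shot.

    The threshold is derived from the data (mean + k * std), which is an
    assumed rule for this sketch, not the rule of any cited paper.
    """
    threshold = dist.mean() + k * dist.std()
    # dist[i] compares frame i with frame i + 1, so a large dist[i]
    # marks frame i + 1 as the first frame of a candidate new shot.
    return [int(i) + 1 for i in np.flatnonzero(dist > threshold)]
```

Only the frames around the returned indices would then need to be passed to a more expensive classifier, which is where the speed-up of candidate selection comes from.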
   Rui Liang et al. [27] proposed a new video shot boundary detection method based on CNN features. The method extracts features for each frame using the AlexNet and ResNet-152 models and calculates the cosine similarity to describe the similarity of a pair of frames. For cut boundary detection, they used the similarity of local frames to improve accuracy, and they proposed a dual-threshold sliding window for gradual transition detection.
   Lifang Wu et al. [28] proposed a method for shot boundary detection that combines gradual shot detection based on spatial-temporal convolutional neural networks with histogram-based shot filtering. The cut shots are extracted from the whole video with histogram-based shot filtering. Then, a C3D deep model is constructed to extract features of frames and distinguish the shot types dissolve, swipe, fade in, fade out, and normal. For untrimmed videos, a frame-level merging strategy is constructed to help locate the boundaries of shots from neighboring frames.
   However, those methods use the CNN only for feature extraction and then use traditional classifiers to detect the scene change. Recently, with the development and popularity of deep learning, many efficient networks for various applications have been proposed. For example, networks based on the ResNet deep learning model can obtain very high accuracy in image classification and object detection on many large-scale image data sets. Therefore, they can be adopted to solve the problem of shot change detection. The downside of this approach revolves around the need for large annotated data sets. Moreover, real data can contain cuts between shots of the same scene, which rarely occur in synthetic data sets because of the way they are generated.

F. Other Approaches
   Thounaojam et al. [29] proposed a shot detection approach based on a genetic algorithm (GA) and fuzzy logic. A fuzzy system is used to classify the video frames into different types of transition (cut and gradual). Color histogram difference is used for feature extraction and for finding the differences between two consecutive frames in a video. The GA is used as an optimizer to find the optimal range of values of the fuzzy membership functions. The results show that this combination is efficient and that the accuracy increases with the number of GA iterations/generations.
   Jialei Bi et al. [30] proposed a novel cut detection method based on information theory using an SVM. They first compute the dissimilarity using information theory and construct a discriminative feature vector based on mutual information. Then a support vector machine is trained to classify the frames as cut or non-cut frames without using a traditional global or adaptive threshold.
   Junaid Baber et al. [31] proposed a method in which shot boundaries are extracted from videos using frame entropy and SURF descriptors. Cut boundaries are detected from the difference in entropy of the gray-scale intensity in adjacent frames, and fade boundaries are detected indiscriminately based on temporal changes in the entropy of the pixel intensity across the images. False detections can then be eliminated effectively by using local SURF descriptors.
   Sawitchaya Tippaya et al. [32] proposed an SBD framework based on multi-modal visual features. They adopted a candidate segment selection that works without threshold calculation. The discontinuity signal is calculated from the SURF matching score and the RGB histogram cosine distance.
   Finally, TABLE I compares different SBD algorithms in terms of the features employed, frame skipping, the data set used, and accuracy (precision, recall and F1 score). From the table, it can be observed that algorithms using the frame skipping technique have a low computational cost with acceptable accuracy, as in [14]. Some algorithms use frame skipping but still show a moderate computational cost because of the computational complexity of the features used, such as SURF in [32]. CNN-based SBD algorithms, which show a high computational cost, such as [27, 28, 29, 32, 36], achieve remarkable accuracy compared with other algorithms.

     IV. SHOT BOUNDARY DETECTION EVALUATION METRICS

   Two aspects need to be considered when evaluating the performance of SBD algorithms: accuracy and computational complexity. Usually, improving one comes at the cost of the other. Also, for the evaluation to be truly representative and reliable for comparing various techniques, it must be done under similar conditions and with very similar data sets. In this section, we discuss the common metrics (recall, precision, and F1 score) for measuring accuracy, and the computational complexity [33].
   1) Precision: The ratio of correctly detected transitions to all detected transitions (correct and false).

      $$\text{precision} = \frac{N_c}{N_c + N_f} \tag{1}$$

   2) Recall: The ratio of correctly detected transitions to all actual transitions (correct and missed).

      $$\text{recall} = \frac{N_c}{N_c + N_m} \tag{2}$$

   3) F1 score: Combines precision and recall into one score. It varies in the range [0, 1], where a score of 1 indicates the best efficacy of a system.

      $$F1 = \frac{2 \times \text{recall} \times \text{precision}}{\text{recall} + \text{precision}} \tag{3}$$

where $N_c$ is the number of transitions correctly reported, $N_m$ is the number of transitions missed, and $N_f$ is the number of falsely reported transitions.

                     V. OPEN CHALLENGES

   Although a large amount of work has been done on shot boundary detection, many issues are still open and deserve further research. We can conclude from this state of the art that a good video shot detection method highly depends on
               TABLE I: Comparison of different state-of-the-art SBD algorithms
          (CT: cut transitions; GT: gradual transitions; P: precision; R: recall)

                                                                             CT                    GT
 Ref    Method                                         Dataset         P      R      F1      P      R      F1
 [14]   SVD and HSV Histogram-Based                    TRECVID 2001   0.91   0.85   0.88    0.83   0.81   0.81
 [15]   Histogram-Based                                TRECVID 2001   0.97   0.95   0.96    0.87   0.93   0.90
 [21]   Deep Learning-Based (CNN)                      TRECVID 2001   0.99   0.87   0.92    0.87   0.83   0.87
 [22]   Deep Learning-Based (CNN)                      TRECVID 2001   1.00   0.98   0.99    0.99   0.95   0.97
 [23]   Deep Learning-Based (CNN)                      UCF101-SBD     0.98   1.00   0.99    0.99   0.99   0.99
 [25]   Deep Learning-Based (CNN)                      TRECVID 2007   0.98   1.00   0.99    0.84   0.84   0.84
 [27]   Deep Learning-Based (CNN)                      Other          0.95   0.97   0.96    0.86   0.91   0.87
 [29]   Genetic Algorithm (GA) and Fuzzy Logic Based   TRECVID 2001   0.88   0.92   0.90    0.86   0.78   0.82
 [30]   Information Theory and SVM Based               TRECVID 2002   0.98   0.97   0.98     -      -      -
 [32]   SURF Matching Score and RGB Histogram Based    Golf Video     1.00   0.98   0.99    0.89   0.81   0.85


features, similarity measure and thresholds used. We found that the major
challenges for detection techniques are illumination changes and object and
camera motion. For example, color histograms are robust to small camera
motion, but they are not able to differentiate shots within the same scene,
and they are sensitive to large camera motions. Edge features are more
invariant to illumination changes and motion than color histograms, and motion
features can effectively handle the influence of object and camera motion. If
we use only one kind of feature to detect shot boundaries, the result may not
be satisfactory, but if we use many kinds of features, detection becomes slow.
Another major challenge is determining an automatic threshold based on the
characteristics of the video; the difficulty lies in choosing the optimal
threshold. Efforts to replace thresholding with machine learning have begun
only recently, and importing these ideas may give new drive to the advance of
SBD.

                  VI. CONCLUSION AND FUTURE SCOPE
   Video shot boundary detection is the first step of video processing, and it
is also the most important one. Many studies on shot boundary detection have
been carried out to date. In this work, a comprehensive survey of shot
boundary detection (SBD) algorithms was performed. Video definitions,
transition types, and hierarchies were presented. The different techniques for
detecting a shot boundary were discussed according to the content of the video
and the changes in that content. Despite the extensive research on concrete
SBD techniques, SBD still has some problems that are relevant in practice for
different video scenarios and that need to be studied. These challenges
include sudden illuminance changes, dim lighting frames, comparable background
frames, object and camera motion, and changes in small regions. Solving these
challenges will surely improve the performance of SBD algorithms. Finally,
machine learning approaches have become popular and have received much
attention in computer vision applications. However, in the field of SBD,
efforts to replace thresholding with machine learning have begun only
recently, and the amount of research on SBD using machine learning is still
small. Exploring the benefit of new machine learning technologies such as deep
learning approaches for SBD could be a new direction for the future.
   In the sequential case, comparing frames and detecting shot boundaries
sounds simple, but processing multimedia big data in this way can take a
prohibitively long time. Performance on lengthy video data remains an open
area of research. Our future work will focus on deep learning approaches for
SBD using technologies for the analysis of multimedia big data.

                              REFERENCES
 [1] Deepika Bajaj and Shanu Sharma. Video depiction of key frames - a review.
     In Proceedings of the Sixth International Conference on Computer and
     Communication Technology 2015, ICCCT '15, pages 183–187, New York, NY,
     USA, 2015. ACM.
 [2] Weiming Hu, Nianhua Xie, Li Li, Xianglin Zeng, and S. Maybank. A survey
     on visual content-based video indexing and retrieval. IEEE Transactions
     on Systems, Man, and Cybernetics, Part C (Applications and Reviews),
     41(6):797–819, Nov. 2011.
 [3] Tiecheng Liu and John R. Kender. Computational approaches to temporal
     sampling of video sequences. ACM Transactions on Multimedia Computing,
     Communications, and Applications, 3(2):7–es, May 2007.
 [4] Remi Trichet, Ramakant Nevatia, and Brian Burns. Video event
     classification with temporal partitioning. In 2015 12th IEEE
     International Conference on Advanced Video and Signal Based Surveillance
     (AVSS). IEEE, Aug. 2015.
 [5] Shayok Chakraborty, Omesh Tickoo, and Ravi Iyer. Adaptive keyframe
     selection for video summarization. In 2015 IEEE Winter Conference on
     Applications of Computer Vision. IEEE, Jan. 2015.
 [6] Youssef Bendraou. Video shot boundary detection and key-frame extraction
     using mathematical models. Thesis, Université du Littoral Côte d'Opale,
     November 2017.
 [7] Jaydeb Mondal, Malay Kumar Kundu, Sudeb Das, and Manish Chowdhury. Video
     shot boundary detection using multiscale geometric analysis of NSCT and
     least squares support vector machine. Multimedia Tools and Applications,
     77(7):8139–8161, Apr. 2017.
 [8] Gautam Pal, Dwijen Rudrapaul, Suvojit Acharjee, Ruben Ray, Sayan
     Chakraborty, and Nilanjan Dey. Video shot boundary detection: A review.
     In Advances in Intelligent Systems and Computing, pages 119–127.
     Springer International Publishing, 2015.
 [9] A. Hanjalic. Shot-boundary detection: unraveled and resolved? IEEE
     Transactions on Circuits and Systems for Video Technology, 12(2):90–105,
     2002.
[10] Hrishikesh Bhaumik, Siddhartha Bhattacharyya, and Susanta Chakraborty.
     Content coverage and redundancy removal in video summarization. In
     Intelligent Analysis of Multimedia Information, pages 352–374. IGI
     Global.
[11] Guangyu Gao and Huadong Ma. To accelerate shot boundary detection by
     reducing detection region and scope. Multimedia Tools and Applications,
     71(3):1749–1770, Springer Science and Business Media, Dec. 2012.
[12] Zhonglan Wu and Pin Xu. Shot boundary detection in video retrieval. In
     2013 IEEE 4th International Conference on Electronics Information and
     Emergency Communication. IEEE, Nov. 2013.
[13] Heba Ahmed Elnemr, Nourhan Mohamed Zayed, and Mahmoud Abdelmoneim
     Fakhreldein. Feature extraction techniques. In Handbook of Research on
     Emerging Perspectives in Intelligent Pattern Recognition, Analysis, and
     Image Processing, pages 264–294. IGI Global, 2016.
[14] Zhe-Ming Lu and Yong Shi. Fast video shot boundary detection based on
     SVD and pattern matching. IEEE Transactions on Image Processing,
     22(12):5136–5145, Dec. 2013.
[15] Bendraou Youssef, Essannouni Fedwa, Aboutajdine Driss, and Salam Ahmed.
     Shot boundary detection via adaptive low rank and SVD-updating. Computer
     Vision and Image Understanding, 161:20–28, Aug. 2017.
[16] Hong Shao, Yang Qu, and Wencheng Cui. Shot boundary detection algorithm
     based on HSV histogram and HOG feature. In Proceedings of the 2015
     International Conference on Advanced Engineering Materials and
     Technology, pages 951–957. Atlantis Press, 2015.
[17] Wei Jyh Heng and King N. Ngan. An object-based shot boundary detection
     using edge tracing and tracking. Journal of Visual Communication and
     Image Representation, 12(3):217–239, Sep. 2001.
[18] Jie Zheng, Fengmei Zou, and Mandel Shi. An efficient algorithm for video
     shot boundary detection. In Proceedings of 2004 International Symposium
     on Intelligent Multimedia, Video and Speech Processing, 2004. IEEE.
[19] E. Bruno and D. Pellerin. Video shot detection based on linear
     prediction of motion. In Proceedings. IEEE International Conference on
     Multimedia and Expo 2002, volume 1, pages 289–292. IEEE, 2002.
[20] Eralda Nishani and Betim Cico. Computer vision approaches based on deep
     learning and neural networks: Deep neural networks for video analysis of
     human pose estimation. In 2017 6th Mediterranean Conference on Embedded
     Computing (MECO). IEEE, Jun. 2017.
[21] Wenjing Tong, Li Song, Xiaokang Yang, Hui Qu, and Rong Xie. CNN-based
     shot boundary detection and video annotation. In 2015 IEEE International
     Symposium on Broadband Multimedia Systems and Broadcasting. IEEE,
     Jun. 2015.
[22] Jingwei Xu, Li Song, and Rong Xie. Shot boundary detection using
     convolutional neural networks. In 2016 Visual Communications and Image
     Processing (VCIP). IEEE, Nov. 2016.
[23] Ahmed Hassanien, Mohamed A. Elgharib, Ahmed Selim, Mohamed Hefeeda, and
     Wojciech Matusik. Large-scale, fast and accurate shot boundary detection
     through spatio-temporal convolutional neural networks. CoRR,
     abs/1705.03281, 2017.
[24] Michael Gygli. Ridiculously fast shot boundary detection with fully
     convolutional neural networks. In 2018 International Conference on
     Content-Based Multimedia Indexing (CBMI). IEEE, Sep. 2018.
[25] Shitao Tang, Litong Feng, Zhanghui Kuang, Yimin Chen, and Wei Zhang.
     Fast video shot transition localization with deep structured models. In
     Computer Vision – ACCV 2018, pages 577–592, Cham, 2019. Springer
     International Publishing.
[26] Lifang Wu, Shuai Zhang, Meng Jian, Zhe Lu, and Dong Wang. Two stage shot
     boundary detection via feature fusion and spatial-temporal convolutional
     neural networks. IEEE Access, 7:77268–77276, 2019.
[27] Rui Liang, Qingxin Zhu, Honglei Wei, and Shujiao Liao. A video shot
     boundary detection approach based on CNN feature. In 2017 IEEE
     International Symposium on Multimedia (ISM). IEEE, Dec. 2017.
[28] Lifang Wu, Shuai Zhang, Meng Jian, Zhijia Zhao, and Dong Wang. Shot
     boundary detection with spatial-temporal convolutional neural networks.
     In Pattern Recognition and Computer Vision, pages 479–491. Springer
     International Publishing, 2018.
[29] Dalton Meitei Thounaojam, Thongam Khelchandra, Kh. Manglem Singh, and
     Sudipta Roy. A genetic algorithm and fuzzy logic approach for video shot
     boundary detection. Computational Intelligence and Neuroscience,
     2016:1–11, 2016.
[30] Jialei Bi, Xianglong Liu, and Bo Lang. A novel shot boundary detection
     based on information theory using SVM. In 2011 4th International
     Congress on Image and Signal Processing. IEEE, Oct. 2011.
[31] Junaid Baber, Nitin Afzulpurkar, Matthew N. Dailey, and Maheen Bakhtyar.
     Shot boundary detection from videos using entropy and local descriptor.
     In 2011 17th International Conference on Digital Signal Processing
     (DSP). IEEE, Jul. 2011.
[32] Sawitchaya Tippaya, Suchada Sitjongsataporn, Tele Tan, Masood Mehmood
     Khan, and Kosin Chamnongthai. Multi-modal visual features-based video
     shot boundary detection. IEEE Access, 5:12563–12575, 2017.
[33] Amr Ahmed. Video representation and processing for multimedia data
     mining. In Semantic Mining Technologies for Multimedia Databases. IGI
     Global, 2009.