Scene Separation & Data Selection: Temporal Segmentation
Algorithm for Real-Time Video Stream Analysis
Yuelin Xin1,2,*,†, Zihan Zhou1,2,† and Yuxuan Xia1,2,†
1 Southwest Jiaotong University, Chengdu, China
2 University of Leeds, Leeds, UK

STRL'22: First International Workshop on Spatio-Temporal Reasoning and Learning, July 24, 2022, Vienna, Austria
* Corresponding author.
† These authors contributed equally.
Email: sc20yx2@leeds.ac.uk (Y. Xin); sc20zz2@leeds.ac.uk (Z. Zhou); sc202yx@leeds.ac.uk (Y. Xia)
ORCID: 0000-0002-9732-2414 (Y. Xin); 0000-0003-2613-7569 (Z. Zhou); 0000-0002-1185-2722 (Y. Xia)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


Abstract
We present 2SDS (Scene Separation and Data Selection algorithm), a temporal segmentation algorithm used in real-time video stream interpretation. It complements CNN-based models to make use of temporal information in videos. 2SDS can detect the change between scenes in a video stream by comparing the image difference between two frames. It separates a video into segments (scenes), and by combining itself with a CNN model, 2SDS can select the optimal result for each scene. In this paper, we discuss the basic methods and concepts behind 2SDS and present some preliminary experimental results. During these experiments, 2SDS achieved an overall accuracy of over 90%.

Keywords
scene separation, temporal segmentation, real-time video analysis, dHash



1. Introduction
Image recognition models have become increasingly accurate in the past few years, yet video semantics tasks are still challenging. A detailed comprehension of video streams could play a significant part in video accessibility [1], surveillance footage auto-interpretation [2, 3], and so on. These technologies have already proven useful on large video platforms like YouTube, where they are used for real-time video interpretation and video topic analysis.

Figure 1: Overall effect of the scene separation procedure. The whole video stream is separated into scenes, in each of which the images in the video remain relatively stationary.

1.1. The Problem
In the processing of a video stream, a 2D CNN can be extended into a 3D CNN by adding a temporal dimension [4], but this approach can be hazardous if the video is too long or of indefinite length. However, a 2D CNN is still very usable in traditional image recognition and image segmentation tasks.

The problem is that 2D CNNs only recognise a video as discrete images, rather than as a continuous stream of images. This poses some issues. For example, a CNN model cannot resolve the motion of a person (e.g., walking, dancing) because the person appears stationary in every single frame, and this causes the loss of significant information in video analysis. So, we need to devise an implementation that complements the CNN model to solve this continuity issue. This implementation should group discrete frames (adjacent on the temporal axis) that look similar to each other into scenes; this procedure is what we call temporal segmentation (also referred to as scene separation in 2SDS, see Fig. 1 for an example).

1.2. Related Work
SlowFast Networks. The SlowFast Networks use a two-pathway architecture for video recognition: the slow pathway (low frame rate) captures spatial semantics, and the fast pathway (high frame rate) captures temporal semantics, such as motion, at a relatively fine temporal resolution [5].
1.3. Our Work
What we have achieved is to devise the temporal segmentation algorithm 2SDS, which stands for "Scene Separation and Data Selection algorithm". It slices the video stream into segments on the temporal axis, so the stream can be interpreted using 2D CNN models while preserving critical information on the temporal dimension. By combining 2SDS with a CNN model (Fig. 2), this implementation is similar to the SlowFast Networks in splitting the input into two pathways: 2SDS plays a role similar to the fast pathway of the SlowFast Networks, except that we do not introduce another neural network; we replace that network with the faster 2SDS, which guarantees an even better temporal resolution.

Figure 2: 2SDS used together with a CNN model. The 2SDS algorithm separates the scenes in a continuous video stream and selects the result produced by the CNN model; the two together produce a scene-separated recognition result.
2. Motivation: Why Not Neural Networks
Traditionally, RNN-based models have been quite successful in processing sequential information such as time. However, the usage of RNNs, or neural networks in general, is not practical in time-sensitive tasks like real-time object recognition and live video stream analysis, which require fast-responding algorithms; RNNs usually cannot meet those requirements.

RNN-based models like LSTM [6] generally have a longer response time even compared to CNN-based models (although the difference between them varies with different settings of hyperparameters). A CNN + RNN architecture would mean doubling the processing time, which is something we would rather avoid when dealing with video stream analysis tasks.

However, RNNs do have the advantage of acting upon temporal information, especially models like LSTM. So, we need to help CNN-based models preserve temporal information, and that is where we introduce our temporal segmentation algorithm, 2SDS.

By adding the 2SDS algorithm alongside a CNN model, we were able to achieve RNN-like results. In the meantime, by avoiding the introduction of another neural network, this implementation is also faster than the CNN + RNN architecture or the CNN + CNN architecture.
3. Method: 2SDS
2SDS stands for "Scene Separation and Data Selection algorithm". It works as a temporal segmentation and result selection algorithm that complements CNN-based models. It consists of a two-part procedure: separating the video stream into segments, and selecting a representative recognition result from the CNN model for output.

2SDS utilises the difference hash (dHash) method [7] to obtain the rough image difference between two frames; if two frames differ very little, they are grouped into the same scene. This method involves only a few simple calculation steps, and it is the most important method 2SDS uses to achieve scene separation. Because the calculation is relatively simple and straightforward, 2SDS is extremely fast at scene separation.

Also, 2SDS uses a pooling-layer-like data smoothing and data selection method to pick out the representative recognition result for a particular scene. This method can generally improve the accuracy of the output because it smooths out the data on undesired frame movement (e.g., camera shaking, broken frames). Alongside the data smoothing mechanism, a data selection mechanism is implemented to select the representative recognition result (referred to as the representative in the following sections) from the whole data segment of a scene (referred to as the candidates in the following sections).
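To make this two-part procedure concrete, the following is a minimal sketch of how 2SDS could wrap a per-frame 2D CNN. The helper names (preprocess, dhash_hex, hamming, lwap_smooth, select_representative) and the OpenCV capture loop are our own illustrative assumptions, not the authors' released code; they are fleshed out in the sketches accompanying the corresponding subsections below.

```python
# Minimal sketch of the 2SDS + CNN loop (illustrative only; helper names
# are placeholders defined in later sketches, not the authors' code).
import cv2  # assumed video-capture backend


def analyse_stream(source, cnn, threshold=5):
    """Yield one representative CNN result per detected scene."""
    cap = cv2.VideoCapture(source)
    prev_hash, scene_results = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h = dhash_hex(preprocess(frame))      # Sec. 3.1: 64-bit difference hash
        if prev_hash is not None and hamming(prev_hash, h) > threshold:
            # Scene boundary: smooth and select a representative (Sec. 3.2).
            yield select_representative(lwap_smooth(scene_results))
            scene_results = []
        scene_results.append(cnn(frame))      # per-frame 2D CNN output
        prev_hash = h
    if scene_results:
        yield select_representative(lwap_smooth(scene_results))
```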
3.1. Scene Separation: Based on dHash
The scene separation procedure of 2SDS (Fig. 3) is based on an improved version of the dHash algorithm, which was originally designed to judge the similarity of two images. By applying the scene separation procedure, the temporal information is preserved by the sequencing of the separated scenes. The exact workflow of the scene separation process is discussed below.

Figure 3: Image processing in 2SDS based on an improved dHash algorithm. The two image processing parts in 2SDS: the first step is down sampling, and the second step is gray scale conversion.
Down sampling. To make a rough comparison between two frames in a video, the frames need to be down sampled from their original size to an 8 by 9 (row by column) sub-image. This approach both simplifies the remaining calculation and makes the algorithm less sensitive to subtle changes between frames.
Gray scale. We apply a gray scale conversion to the down sampled sub-image using the Luminosity algorithm; this step is purely for reducing the complexity of calculating the difference on 3 channels. By converting the RGB channels into one gray scale channel, this approach dramatically lessens the complexity of the algorithm.
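These two pre-processing steps can be sketched as follows; the Pillow/NumPy toolchain and the particular luminosity weights are our assumptions, since the paper does not fix an implementation.

```python
# Sketch of the pre-processing: down sample to an 8x9 grid, then apply a
# luminosity gray scale (toolchain and exact weights are assumptions).
import numpy as np
from PIL import Image


def preprocess(frame):
    """frame: H x W x 3 RGB array; returns an 8 x 9 gray scale grid."""
    img = Image.fromarray(frame).resize((9, 8))   # width = 9, height = 8
    rgb = np.asarray(img, dtype=np.float32)
    # Luminosity method: a weighted sum of the R, G and B channels.
    return 0.21 * rgb[..., 0] + 0.72 * rgb[..., 1] + 0.07 * rgb[..., 2]
```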
Calculate Hash value. The derived gray scale image is converted into a single 64-bit hash value, written as 16 hexadecimal digits. The algorithm looks at all 8 rows separately; each row has 9 gray scale values from 0 to 255. These 9 values are converted into 8 binary digits under the following rules:
   (a) One binary value stands for the gray scale difference between two adjacent pixels.
   (b) If the gray scale value of the pixel on the left is greater than that of the pixel on the right, the binary value should be 1; otherwise, it should be 0.
   (c) Every row should end up with an 8-bit long binary sequence.
An example is given in Fig. 4.

Figure 4: Example of binary sequence conversion. The derived binary sequence of row 1 in this case is 00111010.

Using this method, we derive eight 8-bit long binary sequences, each of which can be represented by a two-digit hexadecimal value. By concatenating all the two-digit hexadecimal values, we obtain a 16-digit hexadecimal hash value, and this value represents the whole image (this is also the reason why the original image is down sampled into an 8 by 9 sub-image rather than an 8 by 8 one: with only 8 columns, each row would yield 7 comparisons, which does not convert neatly into hexadecimal digits).
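A direct reading of rules (a)-(c) gives the following sketch; the bit order within a row is an assumption on our part.

```python
# Sketch of the hash construction from the 8x9 gray scale grid: each row
# yields 8 comparison bits (left pixel brighter than its right neighbour
# -> 1), packed into two hex digits; the bit order is an assumption.
def dhash_hex(gray):
    """gray: 8x9 grid of gray scale values; returns 16 hex characters."""
    rows = []
    for r in range(8):
        bits = 0
        for c in range(8):                  # 9 values give 8 comparisons
            bits = (bits << 1) | (1 if gray[r][c] > gray[r][c + 1] else 0)
        rows.append(f"{bits:02x}")          # one row -> two hex digits
    return "".join(rows)                    # eight rows -> 16 hex digits
```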
Calculating the Hamming distance. By calculating the Hamming distance between the hash values of two adjacent frames, we can judge whether the two frames are in the same scene or not. If the Hamming distance is greater than a threshold (usually 5), we consider the two frames to be in two different scenes and separate them accordingly. The Hamming distance is obtained by applying an Exclusive Or (XOR) to the two hash values and counting the set bits of the result; in the example below, the Hamming distance is 7:

   popcount(c4e0d8988c989898 ⊕ eee6989c8c989898) = 7        (1)
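In code, the scene test reduces to a popcount of the XOR of the two 64-bit hashes; the same_scene helper name is ours.

```python
# Hamming distance between two 16-hex-digit hashes, and the scene test
# with the threshold of 5 mentioned above (helper names are ours).
def hamming(hash_a, hash_b):
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def same_scene(hash_a, hash_b, threshold=5):
    return hamming(hash_a, hash_b) <= threshold


# Reproduces the example in Eq. (1):
assert hamming("c4e0d8988c989898", "eee6989c8c989898") == 7
```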
3.2. Data Selection and Data Smoothing
When the scene separation process detects a new scene, the data collected on the previous scene is packed into an array. This array contains all the recognition results produced by the CNN model in the previous scene; the CNN model produces a recognition output on every frame in this scene.

To obtain a solid output for 2SDS, we need to perform 2 extra steps: data smoothing and data selection. The method implemented here is inspired by the pooling layer in a convolutional neural network.
Data smoothing procedure. This step is also called LWAP (Length Weighted Average Pooling). We start by segmenting the array containing all the recognition data into small groups of a defined size. Then, we apply the following formulas:

   WAL = Σ_{i=1..φ} (L_i × ω_i) / Σ_{i=1..φ} ω_i        (2)

   f(D) = min_i |card(D_i) − WAL|
   CI = [c ∈ D | card(c) = f(D)]                        (3)

Here, L_i stands for the length of each recognition data; for example, for "object 1, object 2, object 1", L_i = 3. ω_i stands for the weight of each recognition data, which is 0.1 × L_i. card(D_i) stands for the cardinality of set D_i, where D_i is one of the segments previously obtained by segmenting the original array.

This approach is inspired by the pooling layer in a CNN, but instead of a Max Pooling operation, the data smoothing procedure uses a Weighted Average Pooling operation. By using this data smoothing procedure, we can avoid unwanted recognition results caused by broken frames or camera flashes.
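One possible reading of Eqs. (2)-(3) is sketched below; the group size and the interpretation of card(D_i) as the total number of detections in a group are our assumptions, since the text leaves them open.

```python
# One possible reading of Eqs. (2)-(3): a length-weighted average (WAL)
# over the per-frame results, then the group whose detection count is
# closest to WAL is kept. Group size and the reading of card(D_i) as the
# total detection count of a group are assumptions.
def lwap_smooth(scene_results, group_size=5):
    """scene_results: one list of detections per frame in the scene."""
    lengths = [len(r) for r in scene_results]              # L_i
    weights = [0.1 * L for L in lengths]                    # w_i = 0.1 * L_i
    wal = (sum(L * w for L, w in zip(lengths, weights))
           / max(sum(weights), 1e-9))                       # Eq. (2)
    groups = [scene_results[i:i + group_size]               # segments D_i
              for i in range(0, len(scene_results), group_size)]
    # Eq. (3): keep the group whose detection count is closest to WAL.
    return min(groups, key=lambda g: abs(sum(len(r) for r in g) - wal))
```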
Data selection procedure. We apply a data selection procedure that uses a similar approach to the data smoothing procedure, namely a Weighted Average Pooling operation. This procedure selects the recognition result from the frame whose feature intensity is closest to the weighted average value of all the candidates (feature intensity refers to the number of different classes of objects in a particular frame).
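A simplified sketch of this selection step follows; using a plain mean of the feature intensities instead of the weighted average is a simplification on our part.

```python
# Sketch of the data selection step: feature intensity is the number of
# distinct object classes in a frame; the representative is the candidate
# closest to the average intensity (a plain mean is a simplification).
def select_representative(candidates):
    """candidates: per-frame results, each a list of class labels."""
    intensities = [len(set(r)) for r in candidates]
    avg = sum(intensities) / max(len(intensities), 1)
    best = min(range(len(candidates)),
               key=lambda i: abs(intensities[i] - avg))
    return candidates[best]
```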
Finally, we can output the result obtained in the previous steps as the representative of the whole scene. This particular recognition result is used to represent the whole scene it is located in, and with the help of NLP and other models, it can even be used to output a natural language interpretation of the video scene.
4. Experiments
Due to the lack of similar algorithms and datasets, we could only provide some preliminary and experimental usage of the 2SDS algorithm¹.

We chose YOLOv5s as our image recognition CNN for this experiment, and we built an experimental dataset on video object detection using selected YouTube videos from the YouTube-VOS dataset [8]. Although the YOLOv5s model is trained on the COCO dataset, it is still sufficiently usable in this experiment, as it is not the key focus of the experiment.

We are most interested in how 2SDS performs in scene separation (temporal segmentation) tasks. We classified the testing videos into 3 classes: interview, vibrant, and hybrid.

Interview videos are usually straightforward and easier for scene separation tasks. Vibrant videos are the more difficult ones due to their fast-moving images and transition effects that might seem deceptive to 2SDS. Hybrid videos sit in between the first two types; they have some features of the interview videos as well as features of the vibrant videos, so their difficulty should sit in the middle.

¹ We only did some preliminary experiments on the accuracy of 2SDS on scene separation (temporal segmentation) tasks; more detailed experiments still need to be conducted.

4.1. Interview Video Tests
We conducted 3 separate experiments using interview videos (Table 1). The total number of scenes in these 3 experiments is 82. The overall accuracy of 2SDS in these experiments is 90.10%. There are 2 cases where the 2SDS algorithm over-judged the transition between two scenes. This is potentially a sensitivity issue posed by the hard-coded threshold during scene separation.

Table 1
Interview video tests results.

   Experiment No.    Output - Truth    Accuracy
   Interview 1       25 - 25           100.00%
   Interview 2       35 - 29            82.86%
   Interview 3       31 - 28            90.32%

4.2. Vibrant Video Tests
We conducted 2 separate experiments using vibrant videos (Table 2). The total number of scenes in these 2 experiments is 51. The overall accuracy of 2SDS in the two experiments is 54.90%. The accuracy on vibrant videos is substantially lower than on interview videos because 2SDS is unable to separate two fast-moving scenes effectively. It is important to note that we used challenging videos such as sports videos and dynamic advertisement videos in this experiment, so the performance of 2SDS is expected to be much lower compared to the previous experiment. This is the biggest limitation of 2SDS, but the issue is addressable with future improvements of the algorithm.

Table 2
Vibrant video tests results.

   Experiment No.    Output - Truth    Accuracy
   Vibrant 1          9 - 13            69.23%
   Vibrant 2         19 - 38            50.00%

4.3. Hybrid Video Tests
We conducted one experiment using a long hybrid video (Table 3). The total number of scenes in this experiment is 106. The overall accuracy of 2SDS is 99.06%. Theoretically, the result of this experiment should sit between the previous two tests; however, an anomaly has arisen, most likely due to the lack of samples. A more detailed experiment should be conducted to further determine the accuracy of 2SDS on hybrid videos.

Table 3
Hybrid video test result.

   Experiment No.    Output - Truth    Accuracy
   Hybrid 1          105 - 106          99.06%

5. Bringing in Spatial Information
Bringing in spatial information and modelling techniques can potentially play a huge role in video interpretation. Previously difficult and untouchable problems like continuous gesture recognition and scene recognition are being cracked using CNN-based spatio-temporal reasoning models [9], and the 2SDS algorithm as well.

Our work has only utilised the temporal information in the video stream; our future work can make use of graphs and spatially model a frame as a graph, with the objects as the vertices and the spatial relations between the objects as the edges, like the MST-GNN [10] and the VRD-GCN [11]. Doing this, we can extract even more
information out of a video. For example, a person's gesture in a scene can be identified, and scenes with more significant camera or object movements (e.g., the vibrant and hybrid video tests) will not cause significant problems for the algorithm, because the spatial relations of the objects stay the same.
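As a purely illustrative example of that direction (not part of 2SDS), a frame could be modelled as a graph along these lines; the detection format and the toy spatial relation are assumptions.

```python
# Illustrative only: modelling a frame as a graph, with detected objects
# as vertices and pairwise spatial relations as edges. The detection
# format and the toy "left-of"/"right-of" relation are assumptions.
def frame_to_graph(detections):
    """detections: list of (label, (x, y, w, h)) boxes from the CNN."""
    vertices = [label for label, _ in detections]
    edges = []
    for i, (label_a, box_a) in enumerate(detections):
        for label_b, box_b in detections[i + 1:]:
            relation = "left-of" if box_a[0] < box_b[0] else "right-of"
            edges.append((label_a, relation, label_b))
    return vertices, edges
```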
This future work would bring immense potential with the use of spatial information, which will add a whole other dimension of usable information that can benefit video analysis with richer semantics and the ability to group fast-moving frames, bringing video interpretation models yet another step closer to how humans perceive visual information.

6. Conclusion
In the context of real-time video stream analysis using temporal segmentation methods, we devised 2SDS, a temporal segmentation algorithm that can be used alongside CNNs to compensate for the lack of temporal information handling in CNN-based models. We gave CNN models yet another powerful tool: the ability to take advantage of the inherent temporal aspect of videos. Video stream analysis is a completely different task compared to image recognition, and we are finally seeing some evidence that we can still use 2D CNNs to interpret video information.

The 2SDS algorithm utilises a refined difference hash method and a novel data smoothing and data selection technique to crack the temporal segmentation problem. Although there are still drawbacks with fast-moving frames in vibrant videos, the 2SDS algorithm already does a great job at separating relatively simple and stationary scenes in videos, and it gets the job done at a respectable speed, which allows 2SDS to reach a finer temporal resolution than neural networks.

For future work, some improvements to 2SDS (e.g., adding graphs to model spatial relations) can potentially boost the algorithm's performance on fast-moving scenes and smooth transitions.

Acknowledgments
We would like to dedicate our thanks to Dr Zhiguo Long, Dr John Stell, and Dr Liu Yang, who have been extremely generous and helpful throughout the course of our work.

References
[1] L. Stappen, A. Baird, E. Cambria, B. W. Schuller, Sentiment analysis and topic recognition in video transcriptions, IEEE Intell. Syst. 36 (2021) 88–95.
[2] A. S. Patel, R. Vyas, O. P. Vyas, M. Ojha, A study on video semantics; overview, challenges, and applications, Multim. Tools Appl. 81 (2022) 6849–6897.
[3] R. Pal, A. A. Sekh, D. P. Dogra, S. Kar, P. P. Roy, D. K. Prasad, Topic-based video analysis: A survey, ACM Comput. Surv. 54 (2021) 118:1–118:34.
[4] A. Diba, M. Fayyaz, V. Sharma, A. H. Karami, M. M. Arzani, R. Yousefzadeh, L. V. Gool, Temporal 3D convnets: New architecture and transfer learning for video classification, CoRR abs/1711.08200 (2017). arXiv:1711.08200.
[5] C. Feichtenhofer, H. Fan, J. Malik, K. He, SlowFast networks for video recognition, in: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, IEEE, 2019, pp. 6201–6210.
[6] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997) 1735–1780.
[7] N. Krawetz, Kind of like that, 2013. URL: http://www.hackerfactor.com/blog/?/archives/529-Kind-of-Like-That.html.
[8] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, T. S. Huang, YouTube-VOS: A large-scale video object segmentation benchmark, CoRR abs/1809.03327 (2018). arXiv:1809.03327.
[9] O. Köpüklü, F. Herzog, G. Rigoll, Comparative analysis of CNN-based spatiotemporal reasoning in videos, CoRR abs/1909.05165 (2019). arXiv:1909.05165.
[10] M. Li, S. Chen, Y. Zhao, Y. Zhang, Y. Wang, Q. Tian, Multiscale spatio-temporal graph neural networks for 3D skeleton-based motion prediction, IEEE Trans. Image Process. 30 (2021) 7760–7775.
[11] X. Qian, Y. Zhuang, Y. Li, S. Xiao, S. Pu, J. Xiao, Video relation detection with spatio-temporal graph, in: L. Amsaleg, B. Huet, M. A. Larson, G. Gravier, H. Hung, C. Ngo, W. T. Ooi (Eds.), Proceedings of the 27th ACM International Conference on Multimedia, MM 2019, Nice, France, October 21-25, 2019, ACM, 2019, pp. 84–93.