=Paper=
{{Paper
|id=Vol-3190/paper2
|storemode=property
|title=Scene Separation & Data Selection: Temporal Segmentation Algorithm for Real-time Video Stream Analysis
|pdfUrl=https://ceur-ws.org/Vol-3190/paper2.pdf
|volume=Vol-3190
|authors=Yuelin Xin,Zihan Zhou,Yuxuan Xia
|dblpUrl=https://dblp.org/rec/conf/ijcai/XinZX22
}}
==Scene Separation & Data Selection: Temporal Segmentation Algorithm for Real-time Video Stream Analysis==
Yuelin Xin¹,²,*,†, Zihan Zhou¹,²,†, Yuxuan Xia¹,²,†

¹ Southwest Jiaotong University, Chengdu, China
² University of Leeds, Leeds, UK

STRL'22: First International Workshop on Spatio-Temporal Reasoning and Learning, July 24, 2022, Vienna, Austria
* Corresponding author. † These authors contributed equally.
sc20yx2@leeds.ac.uk (Y. Xin); sc20zz2@leeds.ac.uk (Z. Zhou); sc202yx@leeds.ac.uk (Y. Xia)
ORCID: 0000-0002-9732-2414 (Y. Xin); 0000-0003-2613-7569 (Z. Zhou); 0000-0002-1185-2722 (Y. Xia)

Abstract

We present 2SDS (Scene Separation and Data Selection algorithm), a temporal segmentation algorithm used in real-time video stream interpretation. It complements CNN-based models to make use of the temporal information in videos. 2SDS detects the change between scenes in a video stream by comparing the image difference between two frames. It separates a video into segments (scenes), and by combining it with a CNN model, 2SDS can select the optimal result for each scene. In this paper, we discuss the basic methods and concepts behind 2SDS, and we present some preliminary experimental results. In these experiments, 2SDS achieved an overall accuracy of over 90%.

Keywords: scene separation, temporal segmentation, real-time video analysis, dHash

1. Introduction

Image recognition models have grown increasingly accurate in the past few years, yet video semantics tasks are still challenging. A detailed comprehension of a video stream could play a significant part in video accessibility [1], surveillance footage auto-interpretation [2, 3], and so on. These technologies have already proven useful on large video platforms like YouTube, where they are used for real-time video interpretation and video topic analysis.

Figure 1: Overall effect of the scene separation procedure. The whole video stream is separated into scenes, in each of which the images in the video remain relatively stationary.

1.1. The Problem

In the processing of a video stream, a 2D CNN can be extended into a 3D CNN by adding a temporal dimension [4], but this approach can be hazardous if the video is too long, or is of indefinite length. However, a 2D CNN is still very usable in a traditional image recognition or image segmentation task.

The problem is that 2D CNNs only recognise a video as discrete images, rather than as a continuous stream of images. This poses some issues. For example, a CNN model cannot resolve the motion of a person (e.g., walking, dancing) because the person is stationary in every frame, and this causes the loss of significant information in video analysis. So, we need to devise an implementation that complements the CNN model to solve this continuity issue. This implementation should group the discrete frames (adjacent on the temporal axis) that look similar to each other into scenes; this procedure is what we call temporal segmentation (also referred to as scene separation in 2SDS, see Fig. 1 for an example).

1.2. Related Work

SlowFast Networks. The SlowFast Networks use a two-pathway architecture for video recognition: the slow pathway (low frame rate) is used to capture spatial semantics, and the fast pathway (high frame rate) is used to capture temporal semantics, such as motion, at a relatively fine temporal resolution [5].

1.3. Our Work

What we have achieved is to devise the temporal segmentation algorithm 2SDS, which stands for "Scene Separation and Data Selection algorithm".
It can slice the video stream into segments on the temporal axis, so the stream can be interpreted using 2D CNN models while preserving critical information on the temporal dimension. By combining 2SDS with a CNN model (Fig. 2), this implementation is similar to the SlowFast Networks in splitting the input into two pathways, with 2SDS playing a role similar to the fast pathway of the SlowFast Networks, except that we do not introduce another neural network; we replace that network with the faster 2SDS, which guarantees an even better temporal resolution.

Figure 2: 2SDS used together with a CNN model. The 2SDS algorithm can separate the scenes in a continuous video stream and select the result produced by the CNN model; the two together can produce a scene-separated recognition result.

2. Motivation: Why Not Neural Networks

Traditionally, RNN-based models have been quite successful in processing sequential information like time. However, the usage of RNNs, or of neural networks in general, is not practical in time-sensitive tasks like real-time object recognition and live video stream analysis, which require fast-responding algorithms; RNNs usually cannot meet those requirements.

RNN-based models like LSTM [6] generally have a longer response time even compared to CNN-based models (although the difference between them varies with different settings of hyperparameters). A CNN + RNN architecture would mean doubling the processing time, which is something we would rather avoid when dealing with video stream analysis tasks.

However, RNNs do have the advantage of acting upon temporal information, especially models like LSTM. So, we need to help CNN-based models preserve temporal information, and that is where we introduce our temporal segmentation algorithm, 2SDS. By adding the 2SDS algorithm alongside a CNN model, we were able to achieve RNN-like results. In the meantime, by avoiding the introduction of another neural network, this implementation is also faster than the CNN + RNN architecture or the CNN + CNN architecture.

3. Method: 2SDS

2SDS stands for "Scene Separation and Data Selection algorithm". It works as a temporal segmentation and result selection algorithm that complements CNN-based models. It consists of a two-part procedure: separating the video stream into segments, and selecting a representative recognition result from the CNN model for output.

2SDS utilises the difference hash (dHash) method [7] to obtain the rough image difference between two frames; if two frames differ very little, they are grouped into the same scene. This method takes only a few simple steps to calculate, and it is the most important method 2SDS uses to achieve scene separation. Because the calculation is relatively simple and straightforward, 2SDS is extremely fast at scene separation.

2SDS also uses a pooling-layer-like data smoothing and data selection method to pick out the representative recognition result for a particular scene. This method can generally improve the accuracy of the output because it can smooth out the data over undesired frame artefacts (e.g., camera shaking, broken frames). Alongside the data smoothing mechanism, a data selection mechanism is implemented to select the representative recognition result (referred to as the representative in the following sections) from the whole data segment of a scene (whose elements are referred to as candidates in the following sections).
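To make this two-part procedure concrete, the following is a minimal sketch of how 2SDS could sit between a frame source and a per-frame CNN model. This is our illustrative reconstruction under the paper's description, not the authors' released code; the helper names (`dhash`, `hamming`, `smooth`, `select_representative`) are hypothetical, with corresponding sketches given in the sections below.

```python
# Minimal sketch of the 2SDS loop (hypothetical names; the paper does not
# publish reference code). dhash(), hamming(), smooth(), and
# select_representative() are sketched in Sections 3.1 and 3.2 below.
THRESHOLD = 5  # Hamming-distance threshold for a scene change (Section 3.1)

def run_2sds(frames, cnn_model):
    """Yield one representative CNN result per detected scene."""
    prev_hash = None
    scene_results = []           # per-frame CNN outputs for the current scene
    for frame in frames:
        cur_hash = dhash(frame)  # Section 3.1: 64-bit difference hash
        if prev_hash is not None and hamming(prev_hash, cur_hash) > THRESHOLD:
            # Scene boundary: smooth and select over the finished scene
            yield select_representative(smooth(scene_results))
            scene_results = []
        scene_results.append(cnn_model(frame))  # recognition on every frame
        prev_hash = cur_hash
    if scene_results:            # flush the last scene
        yield select_representative(smooth(scene_results))
```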
3.1. Scene Separation: Based on dHash

The scene separation procedure of 2SDS (Fig. 3) is based on an improved version of the dHash algorithm, which was originally used to judge the similarity of two images. By applying the scene separation procedure, the temporal information can be preserved through the ordering of the separated scenes. The exact workflow of the scene separation process is discussed below.

Figure 3: Image processing in 2SDS, based on an improved dHash algorithm. There are two image processing steps in 2SDS: the first is down-sampling, and the second is grayscale conversion.

Down-sampling. To make a rough comparison between two frames in a video, the frames need to be down-sampled from their original size to an 8 by 9 (row by column) sub-image. This both simplifies the remaining calculation and makes the algorithm less sensitive to subtle changes between frames.

Grayscale conversion. We apply a grayscale conversion to the down-sampled sub-image using the luminosity algorithm; this step exists purely to avoid the complexity of calculating the difference over 3 channels. By converting the RGB channels into one grayscale channel, this approach dramatically lessens the complexity of the algorithm.

Calculating the hash value. The derived grayscale image is converted into a single 16-digit hexadecimal (64-bit) hash value. The algorithm looks at each of the 8 rows separately; each row has 9 grayscale values from 0 to 255. These 9 values are converted to 8 binary digits under the following rules:

(a) Each binary value stands for the grayscale difference between two horizontally adjacent pixels.
(b) If the grayscale value of the pixel on the left is greater than that of the pixel on the right, the binary value is 1; otherwise, it is 0.
(c) Every row thus ends up with an 8-bit binary sequence.

An example is given in Fig. 4. Using this method, we derive eight 8-bit binary sequences, each of which can be represented by a 2-digit hexadecimal value. By concatenating all the 2-digit hexadecimal values, we obtain a 16-digit hexadecimal hash value that represents the whole image. (This is also the reason the original image is down-sampled into an 8 by 9 sub-image rather than an 8 by 8 one: an 8 by 8 image yields only 7 differences per row, which is inconvenient to convert into a hexadecimal hash value.)

Figure 4: Example of binary sequence conversion. The derived binary sequence of row 1 in this case is 00111010.
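The hashing step can be sketched in a few lines, assuming frames arrive as OpenCV-style BGR NumPy arrays. This is an illustrative reconstruction, not the authors' implementation; in particular, the luminosity weights (0.21 R, 0.72 G, 0.07 B) are a common convention and an assumption on our part, since the paper only names "the Luminosity algorithm".

```python
import cv2
import numpy as np

def dhash(frame_bgr: np.ndarray) -> int:
    """Improved dHash of Section 3.1: 8x9 down-sample, luminosity grayscale,
    row-wise left>right comparison -> one 64-bit hash value."""
    # Down-sample to 8 rows by 9 columns (cv2.resize takes (width, height))
    small = cv2.resize(frame_bgr, (9, 8), interpolation=cv2.INTER_AREA)
    # Luminosity grayscale (weights are an assumption; the paper only names
    # the algorithm). OpenCV stores channels in BGR order.
    b, g, r = small[..., 0], small[..., 1], small[..., 2]
    gray = 0.07 * b + 0.72 * g + 0.21 * r
    # Compare each pixel with its right neighbour: 9 values -> 8 bits per row
    bits = gray[:, :-1] > gray[:, 1:]        # boolean array of shape (8, 8)
    # Pack the 64 booleans into one integer, row by row, MSB first
    value = 0
    for bit in bits.flatten():
        value = (value << 1) | int(bit)
    return value

# The hash prints as the paper's 16-digit hexadecimal string:
# print(f"{dhash(frame):016x}")
```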
Calculating the Hamming distance. By calculating the Hamming distance between the hash values of two adjacent frames, we can judge whether the two frames are in the same scene or not. If the Hamming distance is greater than a threshold (usually 5), we consider the two frames to be in two different scenes, and we can separate them accordingly. To calculate the Hamming distance, we can simply apply an exclusive-or operator to the two hash values and count the differing bits. In the example below, the Hamming distance is 7:

$$\mathtt{c4e0d8988c989898} \oplus \mathtt{eee6989c8c989898} = 7 \tag{1}$$
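Since each hash fits in 64 bits, the Hamming distance is simply the population count of the XOR of the two hash values. A minimal sketch, reproducing the worked example of Eq. (1) with the hash values taken from the paper:

```python
def hamming(h1: int, h2: int) -> int:
    """Number of differing bits between two 64-bit hashes."""
    return bin(h1 ^ h2).count("1")

# Worked example from Eq. (1): the distance is 7, which exceeds the
# usual threshold of 5, so a scene change is detected.
h1 = int("c4e0d8988c989898", 16)
h2 = int("eee6989c8c989898", 16)
assert hamming(h1, h2) == 7
```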
3.2. Data Selection and Data Smoothing

When the scene separation process detects a new scene, the data collected on the previous scene is packed into an array. This array contains all the recognition results produced by the CNN model in the previous scene; the CNN model produces a recognition output on every frame of that scene.

To obtain a solid output from 2SDS, we need to perform 2 extra steps: data smoothing and data selection. The method implemented here is inspired by the pooling layer in a convolutional neural network.

Data smoothing procedure. This step is also called LWAP (Length Weighted Average Pooling). We start by segmenting the array containing all the recognition data into small groups of a defined size. Then, we apply the following formulas:

$$\mathrm{WAL} = \frac{\sum_{i=1}^{\phi} L_i \times \omega_i}{\sum_{i=1}^{\phi} \omega_i} \tag{2}$$

$$\begin{cases} f_i(D) = \min_i \left|\, \operatorname{card}(D_i) - \mathrm{WAL} \,\right| \\ CI = \left[\, c \in D \;\middle|\; \operatorname{card}(c) = f_i(D) \,\right] \end{cases} \tag{3}$$

Here, $L_i$ stands for the length of each recognition data item; for example, for "object 1, object 2, object 1", $L_i = 3$. $\omega_i$ stands for the weight of each recognition data item, which is $0.1 \times L_i$. $\operatorname{card}(D_i)$ stands for the cardinality of the set $D_i$, where the $D_i$ are the segments previously obtained by segmenting the original array.

This approach is inspired by the pooling layer in a CNN, but instead of a max pooling operation, the data smoothing procedure uses a weighted average pooling operation. By using this data smoothing procedure, we can avoid unwanted recognition results caused by, for example, broken frames or camera flashes.

Data selection procedure. We apply a data selection procedure that uses an approach similar to the one used in the data smoothing procedure, which is also a weighted average pooling operation. This procedure selects the recognition result from the frame whose feature intensity is the closest to the weighted average value over all the candidates (feature intensity refers to the number of different classes of objects in a particular frame).

Finally, we output the result obtained in the previous steps as the representative of the whole scene. This particular recognition result is used to represent the whole scene it is located in, and with the help of NLP and other models, it can even be used to output a natural-language interpretation of the video scene.
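The smoothing and selection steps can be sketched as follows. Eqs. (2)–(3) leave some room for interpretation, so this is only our reading of them (the group size, the interpretation of $\operatorname{card}(D_i)$ as the mean detection count of a segment, and the tie handling are all assumptions; the $0.1 \times L_i$ weight follows the definition above):

```python
def smooth(results: list, group_size: int = 4) -> list:
    """LWAP data smoothing, under our reading of Eqs. (2)-(3).

    `results` holds the per-frame CNN outputs for one scene; each item is a
    list of detected object labels, so L_i = len(item) is the number of
    detections in that frame ("object 1, object 2, object 1" -> L_i = 3).
    group_size is an assumption; the paper only says "a defined size".
    """
    # Segment the scene's results into small groups D_i
    groups = [results[i:i + group_size]
              for i in range(0, len(results), group_size)]
    lengths = [len(r) for r in results]
    weights = [0.1 * L for L in lengths]          # w_i = 0.1 * L_i
    total_w = sum(weights) or 1.0                 # guard: frames with no detections
    # Eq. (2): weighted average length over the scene
    wal = sum(L * w for L, w in zip(lengths, weights)) / total_w

    def card(group):
        # Our reading of card(D_i): mean detection count per frame in D_i
        return sum(len(frame) for frame in group) / len(group)

    # Eq. (3): keep the segment closest to WAL, smoothing out outlier
    # groups caused by broken frames or camera flashes
    return min(groups, key=lambda d: abs(card(d) - wal))

def select_representative(candidates: list):
    """Pick the result whose feature intensity (number of distinct object
    classes in the frame) is closest to the weighted average intensity
    over all candidates (Section 3.2)."""
    intensities = [len(set(c)) for c in candidates]
    weights = [0.1 * len(c) for c in candidates]
    total_w = sum(weights) or 1.0
    avg = sum(x * w for x, w in zip(intensities, weights)) / total_w
    return min(candidates, key=lambda c: abs(len(set(c)) - avg))
```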
4. Experiments

Due to the lack of similar algorithms and datasets, we can only provide some preliminary and experimental usage of the 2SDS algorithm (we only ran preliminary experiments on the accuracy of 2SDS on scene separation, i.e. temporal segmentation, tasks; more detailed experiments still need to be conducted). We chose YOLOv5s as our image recognition CNN for these experiments, and we built an experimental dataset for video object detection using selected YouTube videos from the YouTube-VOS dataset [8]. Although the YOLOv5s model is trained on the COCO dataset, this CNN model is still sufficiently usable in this experiment, as it is not the key focus of the experiment.

We are most interested in how 2SDS performs in scene separation (temporal segmentation) tasks. We classified the test videos into 3 classes: interview, vibrant, and hybrid. Interview videos are usually straightforward and easier to separate into scenes. Vibrant videos are the more difficult ones, due to their fast-moving images and transition effects that may deceive 2SDS. Hybrid videos sit in between the first two types: they have some features of interview videos as well as features of vibrant videos, so their difficulty should sit in the middle.

4.1. Interview Video Tests

We conducted 3 separate experiments using interview videos (Table 1). The total number of scenes in these 3 experiments is 82, and the overall accuracy of 2SDS across them is 90.10%. There are 2 cases where the 2SDS algorithm over-judged the transition between two scenes. This is potentially a sensitivity issue posed by the hard-coded threshold used during scene separation.

Table 1: Interview video test results.

  Experiment No.   Scenes (output - truth)   Accuracy
  Interview 1      25 - 25                   100.00%
  Interview 2      35 - 29                   82.86%
  Interview 3      31 - 28                   90.32%

4.2. Vibrant Video Tests

We conducted 2 separate experiments using vibrant videos (Table 2). The total number of scenes in these 2 experiments is 51, and the overall accuracy of 2SDS across the two experiments is 54.90%. The accuracy on vibrant videos is substantially lower than on interview videos because 2SDS is unable to separate two fast-moving scenes effectively. It is important to note that we used harsh videos, like sports videos and dynamic advertisement videos, in this experiment, so the performance of 2SDS was expected to be much lower compared to the previous experiment. This is the biggest limitation of 2SDS, but the issue is addressable with future improvements to the algorithm.

Table 2: Vibrant video test results.

  Experiment No.   Scenes (output - truth)   Accuracy
  Vibrant 1        9 - 13                    69.23%
  Vibrant 2        19 - 38                   50.00%

4.3. Hybrid Video Tests

We conducted one experiment using a long hybrid video (Table 3). The total number of scenes in this experiment is 106, and the overall accuracy of 2SDS is 99.06%. Theoretically, the result of this experiment should sit between those of the previous two tests; however, an anomaly has arisen, most likely due to the lack of samples. A more detailed experiment should be conducted to further determine the accuracy of 2SDS on hybrid videos.

Table 3: Hybrid video test result.

  Experiment No.   Scenes (output - truth)   Accuracy
  Hybrid 1         105 - 106                 99.06%

5. Bringing in Spatial Information

Bringing in spatial information and modelling techniques can potentially play a huge role in video interpretation. Previously difficult and untouchable problems like continuous gesture recognition and scene recognition are being cracked using CNN-based spatio-temporal reasoning models [9], and the 2SDS algorithm as well.

Our work has only utilised the temporal information in the video stream. Our future work can make use of graphs, spatially modelling a frame as a graph with the objects as the vertices and the spatial relations between the objects as the edges, like MST-GNN [10] and VRD-GCN [11]. Doing this, we can extract even more information out of a video. For example, a person's gesture in a scene can be identified, and scenes with more significant camera or object movements (e.g., the vibrant and hybrid video tests) will not cause significant problems for the algorithm, because the spatial relations of the objects stay the same.

This future work would bring immense potential with the use of spatial information, which will add a whole other dimension of usable information that can benefit video analysis with richer semantics and the ability to group fast-moving frames, bringing video interpretation models yet another step closer to how humans perceive visual information.

6. Conclusion

In the context of real-time video stream analysis using temporal segmentation methods, we devised 2SDS, a temporal segmentation algorithm that can be used alongside CNNs to compensate for the lack of temporal information handling ability in CNN-based models. We gave CNN models yet another powerful tool: the ability to take advantage of the inherent temporal aspect of videos. Video stream analysis is a completely different task compared to image recognition, and we are finally seeing some evidence that we can still use 2D CNNs to interpret video information.

The 2SDS algorithm utilises a refined difference hash method and a novel data smoothing and data selection technique to crack the temporal segmentation problem. Although there are still drawbacks with fast-moving frames in vibrant videos, the 2SDS algorithm already does a great job at separating relatively simple and stationary scenes in videos, and it gets the job done at a respectable speed, which allows 2SDS to reach a finer temporal resolution compared with neural networks.

For future work, some improvements to 2SDS (e.g., adding graphs to model spatial relations) can potentially boost the algorithm's performance on fast-moving scenes and smooth transitions.

Acknowledgments

We would like to dedicate our thanks to Dr Zhiguo Long, Dr John Stell, and Dr Liu Yang, who have been extremely generous and helpful throughout the course of our work.

References

[1] L. Stappen, A. Baird, E. Cambria, B. W. Schuller, Sentiment analysis and topic recognition in video transcriptions, IEEE Intell. Syst. 36 (2021) 88–95.
[2] A. S. Patel, R. Vyas, O. P. Vyas, M. Ojha, A study on video semantics; overview, challenges, and applications, Multim. Tools Appl. 81 (2022) 6849–6897.
[3] R. Pal, A. A. Sekh, D. P. Dogra, S. Kar, P. P. Roy, D. K. Prasad, Topic-based video analysis: A survey, ACM Comput. Surv. 54 (2021) 118:1–118:34.
[4] A. Diba, M. Fayyaz, V. Sharma, A. H. Karami, M. M. Arzani, R. Yousefzadeh, L. V. Gool, Temporal 3D ConvNets: New architecture and transfer learning for video classification, CoRR abs/1711.08200 (2017). arXiv:1711.08200.
[5] C. Feichtenhofer, H. Fan, J. Malik, K. He, SlowFast networks for video recognition, in: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, IEEE, 2019, pp. 6201–6210.
[6] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997) 1735–1780.
[7] N. Krawetz, Kind of like that, 2013. URL: http://www.hackerfactor.com/blog/?/archives/529-Kind-of-Like-That.html.
[8] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, T. S. Huang, YouTube-VOS: A large-scale video object segmentation benchmark, CoRR abs/1809.03327 (2018). arXiv:1809.03327.
[9] O. Köpüklü, F. Herzog, G. Rigoll, Comparative analysis of CNN-based spatiotemporal reasoning in videos, CoRR abs/1909.05165 (2019). arXiv:1909.05165.
[10] M. Li, S. Chen, Y. Zhao, Y. Zhang, Y. Wang, Q. Tian, Multiscale spatio-temporal graph neural networks for 3D skeleton-based motion prediction, IEEE Trans. Image Process. 30 (2021) 7760–7775.
[11] X. Qian, Y. Zhuang, Y. Li, S. Xiao, S. Pu, J. Xiao, Video relation detection with spatio-temporal graph, in: L. Amsaleg, B. Huet, M. A. Larson, G. Gravier, H. Hung, C. Ngo, W. T. Ooi (Eds.), Proceedings of the 27th ACM International Conference on Multimedia, MM 2019, Nice, France, October 21-25, 2019, ACM, 2019, pp. 84–93.