Video Scene Location Recognition with Neural Networks

Lukáš Korel¹, Petr Pulc¹, Jiří Tumpach², and Martin Holeňa³

¹ Faculty of Information Technology, Czech Technical University, Prague, Czech Republic
² Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
³ Institute of Computer Science, Academy of Sciences of the Czech Republic, Prague, Czech Republic

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: This paper provides an insight into the possibility of scene recognition from a video sequence with a small set of repeated shooting locations (such as in television series) using artificial neural networks. The basic idea of the presented approach is to select a set of frames from each scene, transform them by a pre-trained single-image pre-processing convolutional network, and classify the scene location with subsequent layers of the neural network. The considered networks have been tested and compared on a dataset obtained from The Big Bang Theory television series. We have investigated different neural network layers to combine individual frames, particularly AveragePooling, MaxPooling, Product, Flatten, LSTM, and Bidirectional LSTM layers. We have observed that only some of the approaches are suitable for the task at hand.

1 Introduction

People watching videos are able to recognize where the current scene is located. When watching a film or series, they are able to recognize that a new scene takes place in the same location they have already seen. Finally, people are able to understand the hierarchy of scenes. All this supports human comprehensibility of videos.

The role of location identification in scene recognition by humans motivated our research into scene location classification by artificial neural networks (ANNs). A more ambitious goal would be to make a system able to remember unknown video locations and, using this data, to identify video scenes located in the same place and mark them with the same label. This paper reports a work in progress in that direction. It describes the employed methodology and presents first experimental results obtained with six kinds of neural networks.

The rest of the paper is organized as follows. The next section is about existing approaches to this problem. Section 3 is divided into two parts: the first one is about data preparation before its usage in the ANNs, the second one is about the design of the ANNs in our experiments. Finally, Section 4, the last section before the conclusion, shows the results of our experiments with these ANNs.

2 ANN-Based Scene Classification

The problem of scene classification has been studied for many years. There are many approaches based on neural networks, in which an ANN trained on a huge amount of images learns to recognize the type of a given scene (for example, a kitchen, a bedroom, etc.). Several datasets are available for this case. One example is [11], but it does not specify locations, so this and similar datasets are not usable for our task.

However, our classification problem is different. We want to train an ANN able to recognize a particular location (for example "Springfield-EverGreenTerrace-742-floor2-bathroom"), which can be recorded by a camera from many angles (typically, some objects can be occluded by other objects from some angles). One approach using an ANN to solve this task is described in [1], where convolutional networks were used. The difference from our approach lies, on the one hand, in the extraction and usage of video images and, on the other hand, in the types of ANN layers.

Another approach is described in [4]. The authors propose a high-level image representation, called Object Bank, where an image is represented as a scale-invariant response map of a large number of pre-trained generic object detectors. Leveraging the Object Bank representation, good performance on high-level visual recognition tasks can be achieved with simple off-the-shelf classifiers such as logistic regression and linear SVM.
3 Methodology

3.1 Data Preparation

Video data consist of large video files. Therefore, the first task of video data preparation consists in loading the data that is currently needed.

We have evaluated the distribution of the data used for ANN training. We have found that there are some scenes with low occurrence, whereas others occur up to 30 times more frequently. Hence, the second task of video data preparation is to increase the uniformity of their distribution, to prevent biasing the ANN towards the most frequent classes. This is achieved by undersampling the frequent classes in the training data.

The input consists of video files and a text file. The video files are divided into independent episodes. The text file contains manually created metainformation about every scene; every row describes one scene. A scene is understood as a sequence of frames that is not interrupted by another frame with a different scene location label. Every row contains a relative path to the source video file, the frame number where the scene begins, and the number of its frames.

Figure 1: Input preparation for a neural network

Figure 1 outlines how frames are extracted and prepared for an ANN. For ANN training, we select from each target scene a constant number of 20 frames (denoted # frames in Figure 1). To get the most informative representation of the considered scene, frames for sampling are taken from the whole length of the scene. This, in particular, prevents selecting frames only within a short time interval. Each scene has its own frame distance computed from its frame count:

SL = SF / F

where SF is the number of frames of the scene, F is the considered constant number of selected frames, and SL is the distance between two selected frames in the scene. After frame extraction, every frame is reshaped to an input 3D matrix for the ANN. Finally, the reshaped frames are merged into one input matrix for the neural network.
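To make the sampling step concrete, the following minimal sketch shows how a fixed number of evenly spaced frames could be extracted from one scene with OpenCV, using the frame distance described above. The function name sample_scene_frames and its arguments are illustrative only and do not come from our actual code.

```python
import cv2
import numpy as np

def sample_scene_frames(video_path, first_frame, scene_frame_count, n_frames=20):
    """Select n_frames evenly spaced frames from one scene and resize them
    to the 224 x 224 BGR frames expected by the pretrained network."""
    step = max(1, scene_frame_count // n_frames)   # frame distance SL = SF / F
    capture = cv2.VideoCapture(video_path)
    frames = []
    for i in range(n_frames):
        capture.set(cv2.CAP_PROP_POS_FRAMES, first_frame + i * step)
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (224, 224)))  # OpenCV keeps BGR order
    capture.release()
    return np.stack(frames)                        # shape (n_frames, 224, 224, 3)
```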
3.2 Used Neural Networks and Their Design

Our first idea was to create a complex neural network composed of different layers. However, it had too many parameters to train in view of the amount of data that we had. Therefore, we have decided to use transfer learning from a pretrained network.

Because our data are images, we considered only ANNs pretrained on image datasets, in particular ResNet50 [9], ResNet101 [9] and VGGnet [2]. Finally, we have decided to use VGGnet due to its small size.

Hence, the ANNs which we trained on our data are composed of two parts. The first part, depicted in Figure 2, is based on the VGGnet. At the input, we have 20 frames (resolution 224 × 224, BGR colors) from one scene. They are processed by a pretrained VGG19 neural network without the two top layers; these layers were removed due to transfer learning. Its output for one frame is a vector of size 4096. For the 20 input frames we thus have 20 vectors of size 4096, which are merged into a 2D matrix of size 20 × 4096.

Figure 2: First, untrainable part of our neural network, where the Input Layer represents a frame with resolution 224 × 224 in BGR colors and the output is a vector of length 4096, i.e. the output of the VGG19 network without its last two layers

For the second part, forming the upper layers of the final network, we have considered six possibilities: a product layer, a flatten layer, an average-pooling layer, a max-pooling layer, an LSTM layer and a bidirectional LSTM layer. All of them, as well as the VGGnet, will be described below. Each of the listed layers is preceded by a Dense layer. The Dense layer returns a matrix of size 20 × 12, where 12 is equal to the number of classes. Starting with this output, every model works differently.

VGGnet  The VGGnets [2] were originally developed for object recognition and detection. They have deep convolutional architectures with small sizes of the convolutional kernel (3 × 3), stride (1 × 1), and pooling window (2 × 2). There are different network structures, ranging from 11 to 19 layers. The model capability increases when the network is deeper, but at a heavier computational cost. We have used the VGG19 model (a VGG network with 19 layers) from the Keras library. This model [3] won the 1st and 2nd place in the 2014 ImageNet Large Scale Visual Recognition Challenge in the two categories called object localization and image classification, respectively. It achieves 92.7 % top-5 test accuracy in image classification on the ImageNet dataset, which contains 14 million images belonging to 1000 classes. The architecture of the VGG19 model is depicted in Figure 3.

Figure 3: Architecture of the used VGG19 model [10]; in our network it is used without the FC1, FC2 and Softmax layers
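As an illustration of this untrainable first part, the sketch below builds a VGG19-based feature extractor in Keras whose output is the 4096-dimensional activation of a fully connected layer, i.e. the network cut before its two top layers. The exact layer chosen for the cut and the preprocessing details are assumptions made for the purpose of the example, not a verbatim reconstruction of our code.

```python
from tensorflow.keras.applications.vgg19 import VGG19, preprocess_input
from tensorflow.keras.models import Model

# VGG19 with ImageNet weights; cutting after the first fully connected layer
# drops the two top layers (fc2 and the softmax classifier), so every frame
# is mapped to a 4096-dimensional feature vector.
base = VGG19(weights="imagenet", include_top=True)
feature_extractor = Model(inputs=base.input, outputs=base.get_layer("fc1").output)
feature_extractor.trainable = False

def frames_to_features(frames_bgr):
    """frames_bgr: (n_frames, 224, 224, 3) array of BGR frames from one scene."""
    rgb = frames_bgr[..., ::-1].astype("float32")   # preprocess_input expects RGB
    return feature_extractor.predict(preprocess_input(rgb))   # (n_frames, 4096)
```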
3.2.1 Product array

In this approach, we apply a product array layer to all output vectors from the dense layer. A product array layer computes the product of all values along a chosen dimension of an n-dimensional array and returns an (n−1)-dimensional array. A model with a product layer is outlined in Figure 4. The output from the product layer is one number for each class, i.e. scene location, so the result is a vector with 12 numbers. It returns a probability distribution over the set of scene locations.

Figure 4: Trainable part of the neural network based on a product layer

3.2.2 Flatten

In this approach, we apply a flatten layer to all output vectors from the dense layer. A flatten layer creates one long vector from a matrix by putting all its rows in sequence. A model with a flatten layer is outlined in Figure 5. After the input and a dense layer, a flatten layer follows, which in this case returns a long vector with 12 ∗ 20 numbers. It is followed by a second dense layer, whose output has again a dimension equal to the number of classes and returns a probability distribution over the set of scene locations.

Figure 5: Trainable part of the neural network based on a flatten layer

3.2.3 Average Pooling

In this approach, we apply average pooling to all output vectors from the dense layer part of the network (Figure 6). An average-pooling layer computes the average of values assigned to subsets of its preceding layer that are such that:
• they partition the preceding layer, i.e., that layer equals their union and they are mutually disjoint;
• they are identically sized.
Taking into account these two conditions, the size p1 × ... × pD of the preceding layer and the size r1 × ... × rD of the sets forming its partition determine the size of the average-pooling layer.

Figure 6: Trainable part of the neural network based on an average-pooling layer

In this case, the size of the sets forming the partition is 20 × 1. Using this size in the average-pooling layer, we again get one number for each class, and the output returns a probability distribution over the set of scene locations.

Apart from average pooling, we have also tried max pooling. However, it led to substantially worse results. Its classification of the scene location was typically based on people or items in the foreground, not on the scene as a whole.

Although using the average-pooling layer is simple, it gives acceptable results. The number of trainable parameters of the network is low, which makes it suitable for our comparatively small dataset.
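A minimal Keras sketch of the trainable part based on an average-pooling layer is given below. The placement of the softmax activation and the use of GlobalAveragePooling1D (which averages over the whole frame axis, equivalent to an AveragePooling1D with pool size 20 here) are assumptions of this sketch; the optimizer and loss are only examples.

```python
from tensorflow.keras import layers, models

N_FRAMES, FEATURE_DIM, N_CLASSES = 20, 4096, 12   # values stated in the text

def build_average_pooling_head():
    """Trainable part of the network from Figure 6: a Dense layer applied to
    every frame vector, followed by averaging of the per-frame class scores."""
    inputs = layers.Input(shape=(N_FRAMES, FEATURE_DIM))
    per_frame = layers.Dense(N_CLASSES, activation="softmax")(inputs)   # (20, 12)
    pooled = layers.GlobalAveragePooling1D()(per_frame)                 # (12,)
    return models.Model(inputs, pooled)

model = build_average_pooling_head()
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```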
3.2.4 Long Short Term Memory

An LSTM layer is used for the classification of sequences of feature vectors, or equivalently, of multidimensional time series with discrete time. Alternatively, that layer can also be employed to obtain sequences of such classifications, i.e., in situations when the neural network input is a sequence of feature vectors and its output is a sequence of classes, in our case of scene locations. LSTM layers are intended for recurrent signal propagation, and differently from other commonly encountered layers, they consist not of simple neurons, but of units with their own inner structure. Several variants of such a structure have been proposed (e.g., [5, 8]), but all of them include at least the following four components:
• Memory cells can store values, aka cell states, for an arbitrary time. They have no activation function, thus their output is actually a biased linear combination of unit inputs and of the values coming through recurrent connections.
• Input gate controls the extent to which values from the previous unit within the layer or from the preceding layer influence the value stored in the memory cell. It has a sigmoidal activation function, which is applied to a biased linear combination of the input and recurrent connections, though its bias and synaptic weights are specific and in general different from the bias and synaptic weights of the memory cell.
• Forget gate controls the extent to which the memory cell state is suppressed. It again has a sigmoidal activation function, which is applied to a specific biased linear combination of input and recurrent connections.
• Output gate controls the extent to which the memory cell state influences the unit output. Also this gate has a sigmoidal activation function, which is applied to a specific biased linear combination of input and recurrent connections, and is subsequently composed either directly with the cell state or with its sigmoidal transformation, using a different sigmoid than is used by the gates.

Hence, using LSTM layers is a more sophisticated approach compared to simple average pooling. An LSTM layer can keep a hidden state through time with information about previous frames.

Figure 7 shows that the input to an LSTM layer is a 2D matrix. Its rows are ordered by the time of the frames from the input scene; every input frame is represented by one vector. The output from the LSTM layer is a vector of the same size as in the previous approaches, which returns a probability distribution over the set of scene locations.

Figure 7: Trainable part of the neural network based on an LSTM layer

3.2.5 Bidirectional Long Short Term Memory

An LSTM, due to its hidden state, preserves information from inputs that have already passed through it. A unidirectional LSTM only preserves information from the past, because the only inputs it has seen are from the past. A bidirectional LSTM runs the inputs in two ways, one from the past to the future and one from the future to the past. To this end, it combines two hidden states, one for each direction.

Figure 8 shows that the input to a bidirectional LSTM layer is the same as the input to an LSTM layer; every input frame is represented by one vector. The output from the bidirectional LSTM layer is a vector of the same size as in the previous approaches, which returns a probability distribution over the set of scene locations.

Figure 8: Trainable part of the neural network based on a bidirectional LSTM layer
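The following sketch outlines the trainable part based on a bidirectional LSTM layer in Keras. The merge mode of the two directions and the final softmax are assumptions made so that the output has one value per location class and forms a probability distribution; the unidirectional variant of Figure 7 is obtained by dropping the Bidirectional wrapper.

```python
from tensorflow.keras import layers, models

N_FRAMES, FEATURE_DIM, N_CLASSES = 20, 4096, 12

def build_bidirectional_lstm_head():
    """Trainable part from Figure 8: a Dense layer per frame, then a
    bidirectional LSTM reading the frame sequence in both directions."""
    inputs = layers.Input(shape=(N_FRAMES, FEATURE_DIM))
    per_frame = layers.Dense(N_CLASSES)(inputs)                      # (20, 12)
    merged = layers.Bidirectional(layers.LSTM(N_CLASSES),
                                  merge_mode="sum")(per_frame)       # (12,)
    outputs = layers.Softmax()(merged)       # probability over scene locations
    return models.Model(inputs, outputs)

# The unidirectional LSTM model replaces the Bidirectional wrapper with a
# plain layers.LSTM(N_CLASSES).
```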
4 Experiments

4.1 Experimental Setup

The ANNs for scene location classification were implemented in the Python language using the TensorFlow and Keras libraries. Neural network training was accelerated using an NVIDIA GPU. The versions of the employed hardware and software are listed in Table 1. For image preparation, OpenCV and Numpy were used. The routine for preparing frames is a generator. It has lower memory requirements, because data are loaded just in time when they are needed and the memory is released after the data have been used for the ANN. All non-image information about the inputs (video location, scene information, etc.) is processed in text format by Pandas.

Table 1: Versions of the employed hardware and software

CPU cores               2
GPU compute capability  3.5 and higher
OS                      Linux 5.4.0
CUDA                    11.3
Python                  3.8.6
TensorFlow              2.3.1
Keras                   2.4.0
OpenCV                  4.5.2

We have 17 independent datasets prepared by ourselves from proprietary videos of The Big Bang Theory series, thus the datasets cannot be made public. Each dataset originates from one episode of the series. Each experiment was trained with one dataset, so the results are independent as well, and we can compare the behavior of the models on different datasets.

Our algorithm to select data in the training routine is based on oversampling. It randomly selects a target class and then randomly selects, with replacement, a source scene of that class from the whole training dataset. This algorithm is applied due to the unbalanced proportion of the different target classes. Thanks to this method, all targets are distributed equally and the network does not overfit to a highly represented class.

4.2 Results

The differences between the models considered in the second, trained part of the network were tested for significance by the Friedman test. The basic null hypothesis that the mean classification accuracy of all 6 models coincides was strongly rejected, with the achieved significance p = 2.8 × 10⁻¹³. For the post-hoc analysis, we employed the Wilcoxon signed rank test with a two-sided alternative for all 15 pairs of the considered models, because of the inconsistency of the more commonly used mean-ranks post-hoc test, which Benavoli et al. recently pointed out [6]. As a correction for multiple hypotheses testing, we used the Holm method [7]. The results of the comparison between the models are included in Table 2.

Table 2: Comparison of accuracy results on all 17 episode datasets. The values in the table are the counts of datasets in which the model in the row has higher accuracy than the model in the column. If the difference is not significant according to the Wilcoxon test, the count is shown in italics; if it is significant, the higher count is shown in bold.

                   Product  Flatten  Average  Max  LSTM  BidirectionalLSTM  SummaryScore
Product                X       16        6     16     5          1                44
Flatten                1        X        0     10     0          0                11
Average               11       17        X     17     3          1                49
Max                    1        6        0      X     0          0                 7
LSTM                  12       17       14     17     X          3                63
BidirectionalLSTM     16       17       15     17    14          X                79
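The statistical comparison described above could be reproduced along the following lines with SciPy and statsmodels; the use of statsmodels for the Holm correction and the data layout (one list of per-episode accuracies per model) are assumptions of this sketch, not a description of our evaluation scripts.

```python
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

def compare_models(accuracies):
    """accuracies: dict mapping a model name to its list of per-episode
    mean accuracies (17 values per model in our setting)."""
    _, p_friedman = friedmanchisquare(*accuracies.values())   # global test
    pairs = list(combinations(accuracies, 2))                  # 15 model pairs
    p_raw = [wilcoxon(accuracies[a], accuracies[b],
                      alternative="two-sided").pvalue for a, b in pairs]
    reject, p_holm, _, _ = multipletests(p_raw, method="holm")
    return p_friedman, {pair: (p, r) for pair, p, r in zip(pairs, p_holm, reject)}
```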
Summary statistics of the predictive accuracy of classification over all 17 episode datasets are given in Table 3. Every experiment was performed on every dataset at least 7 times. The table is complemented with results for the individual episodes, depicted in the box plots in Figures 9–14.

Table 3: Aggregated predictive accuracy over all 17 datasets [%]

model               mean   std   25%   50%   75%
Product             43.7  38.4   4.6  32.4  85.2
Flatten             23.6  30.8   1.0   5.1  39.6
Average             32.2   8.1  26.5  31.5  37.1
Max                  9.3   2.9   8.1   9.3  10.9
LSTM                40.7  25.2  19.7  39.9  59.4
BidirectionalLSTM   47.8  25.1  29.6  50.5  67.7

The model with a max-pooling layer had the worst results (Figure 12) of all experiments. Its overall mean accuracy was around 10 %, which is only slightly higher than random choice (1/12). The model was not able to achieve an accuracy better than 20 %. Its results were stable and their standard deviation was very low.

Slightly better results (Figure 10) were obtained by the model with a flatten layer: it was sometimes able to achieve a high accuracy, but its standard deviation was very high. On the other hand, the results for some other episodes were not better than those of the max-pooling model.

A better solution is the product model, whose predictive accuracy (Figure 9) was higher than 80 % for several episodes. On the other hand, other episodes had only slightly better results than with the flatten model, and it had the highest standard deviation among all considered models.

The most stable results (Figure 11) with good accuracy were obtained by the model based on an average-pooling layer. Its mean accuracy was 32 % and for no episode was the accuracy substantially different.

The model with a unidirectional LSTM layer had the second highest mean accuracy of the considered models (Figure 13). Its internal memory brings an advantage over the previous approaches, with a mean accuracy over 40 %, though also a comparatively high standard deviation.

The highest mean accuracy was achieved by the model with a bidirectional LSTM layer (Figure 14). It had a similar standard deviation as the one with a unidirectional LSTM, but a mean accuracy of nearly 50 %.

Figure 9: Box plot with results obtained using the product model
Figure 10: Box plot with results obtained using the flatten model
Figure 11: Box plot with results obtained using the average-pooling model
Figure 12: Box plot with results obtained using the max-pooling model
Figure 13: Box plot with results obtained using the LSTM model
Figure 14: Box plot with results obtained using the bidirectional LSTM model

5 Conclusion and Future Research

In this paper, an insight was provided into the possibility of using artificial neural networks for scene location recognition from a video sequence with a small set of repeated shooting locations (such as in television series). Our idea was to select more than one frame from each scene and to classify the scene using that sequence of frames. We used a pretrained VGG19 network without its two last layers, and its outputs were used as input to the trainable part of our neural network architecture. We have designed six neural network models with different layer types, investigating different neural network layers to combine video frames, in particular average-pooling, max-pooling, product, flatten, LSTM, and bidirectional LSTM layers. The considered networks have been tested and compared on a dataset obtained from The Big Bang Theory television series.

The model with a max-pooling layer was not successful; its accuracy was the lowest of all models. The models with a flatten or product layer were very unstable, and their standard deviation was very large. The most stable among all models was the one with an average-pooling layer. The models with a unidirectional LSTM and a bidirectional LSTM had a similar standard deviation of the accuracy. The model with a bidirectional LSTM had the highest accuracy among all considered models; in our opinion, this is because its internal memory cells preserve information in both directions. Those results show that models with internal memory are able to classify with a higher accuracy than models without internal memory.

Our method may have limitations due to the chosen pretrained ANN and the low dimension of some parts of the network. In future research, it is desirable to achieve higher accuracy in scene location recognition. This task may need modifying the model parameters or using other architectures; it may also need other pretrained models or combining several pretrained models. It is also desirable that, if the ANN detects an unknown scene, it remembers it and next time properly recognizes a scene from the same location.

Acknowledgments

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 18-18080S.

Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.

Computational resources were provided by the ELIXIR-CZ project (LM2018131), part of the international ELIXIR infrastructure.

References

[1] Zhong, W., Kjellström, H.: Movie scene recognition with Convolutional Neural Networks. https://www.diva-portal.org/smash/get/diva2:859486/FULLTEXT01.pdf KTH Royal Institute of Technology (2015), pp. 5–36

[2] Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition.
https://arxiv.org/pdf/1409.1556v6.pdf Visual Geometry Group, Department of Engineering Science, University of Oxford (2015)

[3] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (2015), pp. 211–252

[4] Li, L.-J., Su, H., Fei-Fei, L., Xing, E.: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification. https://cs.stanford.edu/groups/vision/pdf/LiSuXingFeiFeiNIPS2010.pdf NIPS (2010)

[5] Gers, F. A., Schmidhuber, J., Cummins, F.: Learning to forget: Continual prediction with LSTM. In Proceedings of ICANN. ENNS (1999), pp. 850–855

[6] Benavoli, A., Corani, G., Mangili, F.: Should We Really Use Post-Hoc Tests Based on Mean-Ranks? Journal of Machine Learning Research 17 (2016), pp. 1–10

[7] García, S., Herrera, F.: An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons. Journal of Machine Learning Research 9 (2008), pp. 2677–2694

[8] Graves, A.: Supervised Sequence Labelling with Recurrent Neural Networks. Springer (2012)

[9] He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778

[10] Sudha, V., Ganeshbabu, T. R.: A Convolutional Neural Network Classifier VGG-19 Architecture for Lesion Detection and Grading in Diabetic Retinopathy Based on Deep Learning. http://www.techscience.com/cmc/v66n1/40483 Computers, Materials & Continua (2021), pp. 827–842

[11] Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: A 10 Million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018), pp. 1452–1464