<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Combining a mobile deep neural network and a recurrent layer for violence detection in videos</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Contardo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Selene Tomassini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Falcionelli</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aldo Franco Dragoni</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Sernani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Law, University of Macerata</institution>
          ,
          <addr-line>Piaggia dell'Università 2, 62100 Macerata</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Gabinetto Interregionale di Polizia Scientifica per le Marche e l'Abruzzo</institution>
          ,
          <addr-line>Via Gervasoni 19, 60129 Ancona</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Information Engineering Department, Università Politecnica delle Marche</institution>
          ,
          <addr-line>Via Brecce Bianche 12, 60131 Ancona</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Several techniques for the automatic detection of violent scenes in videos and security footage appeared in recent years, for example with the goal of unburdening authorities from the need of analyzing hours of Closed-Circuit TeleVision (CCTV) clips. In this regard, Deep Learning-based techniques such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) emerged as effective for violence detection. Nevertheless, most of such techniques require significant computational and memory resources to run the automatic detection of violence. Thus, we propose the combination of an established CNN designed for use in mobile and embedded devices, MobileNetV2, with a recurrent layer to extract the spatio-temporal features of the security videos. A lightweight model can run on embedded devices, in an edge computing fashion, for example to allow processing the videos near the camera recording them, to preserve privacy. Specifically, we exploit transfer learning, as we use a pre-trained version of MobileNetV2, and we propose two different models combining it with a Bidirectional Long Short-Term Memory (Bi-LSTM) and a Convolutional LSTM (ConvLSTM). The paper presents accuracy tests of the two models on the AIRTLab dataset and a comparison with more complex models developed in our previous work, in order to evaluate the drop of accuracy necessary to use a model compatible with limited resources. The network composed of MobileNetV2 and the ConvLSTM scores a 94.1% accuracy, against the 96.1% of a model based on a more complex 3D CNN.</p>
      </abstract>
      <kwd-group>
        <kwd>Violence Detection</kwd>
        <kwd>Convolutional Neural Network</kwd>
        <kwd>Long Short-Term Memory</kwd>
        <kwd>Action Recognition</kwd>
        <kwd>MobileNetV2</kwd>
        <kwd>Law Enforcement</kwd>
        <kwd>Crime Investigation</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>time [9], many techniques to automatically detect
violence in videos emerged in the scientific literature. In this
Closed-Circuit TeleVision (CCTV) emerged as one of the regard, the first studies focused on the use of flow
descripmainstream crime prevention techniques [1], providing tors and hand-crafted features (see, for example, [10, 11]).
abundant and precise information for security and law However, Deep Learning-based techniques demonstrated
enforcement applications [2, 3]. In fact, Artificial Intel- better accuracy in violence detection, proposing to use
ligence (AI) methodologies, especially those based on Recurrent Neural Networks (RNNs) and Convolutional
Deep Learning, are demonstrating their efectiveness in Neural Networks (CNNs) for such task [12]. These
techapplications that take advantages of CCTV footage, such niques are capable of modeling the spatio-temporal
inas weapon detection [4, 5], face recognition [6, 7], and formation included in the CCTV footage, i.e., features
accident detection [8]. With the goal of unburdening that represent the motion information contained in a
seauthorities from the need of manually analyzing hours of quence of frames, in addition to the spatial information
CCTV videos and allowing them to take decisions in short contained in a single frame.</p>
      <p>RTA-CSIT 2023: 5th International Conference Recent Trends and Appli- In our previous work [13], we tested 13 diferent Deep
cations In Computer Science And Information Technology, April 26–27, Neural Networks (DNNs) for the task of violence
detec2023, Tirana, Albania tion in videos. Specifically, we compared a pre-trained
* Corresponding author. 3D CNN, C3D [14], combined with a Support Vector
Ma$ p.contardo@pm.univpm.it (P. Contardo); chine (SVM) classifier, with C3D combined with fully
consn..tfoamlciaosnsienllii@@psmta.f.uunniivvppmm..iitt ((SN..TFoamlcaiossnienlil)i;); nected layers, with a trained-from-scratch Convolutional
a.f.dragoni@staf.univpm.it (A. F. Dragoni); paolo.sernani@unimc.it Long Short-Term Memory (ConvLSTM) [15] plus fully
(P. Sernani) connected layers, with other ten networks based on time
0000-0002-5605-4783 (P. Contardo); 0000-0002-1087-7004 distributed pre-trained 2D CNNs combined with
Bidirec(S. Tomassini); 0000-0002-1312-6310 (N. Falcionelli); tional LSTM (Bi-LSTM) [16] (5 networks) and ConvLSTM
(0P0.0S0e-0rn00a2n-i3)013-3424 (A. F. Dragoni); 0000-0001-7614-7154 (5 networks). The C3D-based models got the best
accu© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License racy results in detecting violence on diferent datasets,
CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org) taking advantage of the 3D architecture capable of
modeling the spatio-temporal features of the videos as well as and 98.3% accuracy on the Hockey Fight dataset.
Accatof the transfer learning. Nevertheless, 3D CNNs require toli et al. [24] and Ullah et al. [25] also based their work
computational and storage resources which are usually on a 3D CNN, but, instead of training it from scratch,
not compatible with mobile and embedded devices [17] they applied transfer learning. Accattoli et al. added a
i.e., for edge computing. SVM to the CNN, getting 99.2% accuracy on the Hockey</p>
      <p>To tackle such issue, in this paper we propose two Fight and 98.5% accuracy on the Crowd Violence. Instead,
models based on the combination of a CNN specifically Ullah et al. implemented a end-to-end neural network by
designed for mobile devices i.e., MobileNetV2 [18], with adding fully connected layers to the 3D CNN, getting 98%
a recurrent layer to extract the temporal information and accuracy on the Crowd Violence and 96% accuracy on
fully connected layers for the classification of the videos the Hockey Fight. Sernani et al. [13] compared 13
diferinto violent or not. Specifically, in one model we used ent Deep Neural Networks on the Hockey Fight, Crowd
the Bi-LSTM as the recurrent layers, whereas in the other Violence and AIRTLab datasets. Specifically, they tested
we used the ConvLSTM. To understand its efectiveness a pre-trained 3D CNN (C3D) combined with a SVM, C3D
and evaluate any potential drop of accuracy, we test the with fully connected layers, a ConvLSTM combined with
proposed networks on the AIRTLab dataset [19], com- fully connected layers, 5 time-distributed pre-trained 2D
paring the results with those obtained in our previous CNNs combined with the Bi-LSTM and the same 2D
work. As such, this paper contributes to the state of the CNNs combined with a ConvLSTM. They got the best
art in violence detection with: results with the two C3D-based networks, with 96.1%
accuracy on the AIRTLab dataset, 97.86% accuracy on the
• The proposal of using MobileNetV2, pre-trained Hockey Fight, and 99.6% accuracy on the Crowd Violence.
on the Imagenet dataset [20], by time distribut- Freire-Obregón et al. [26] used an Inflated 3D ConvNet
ing it over the frames of the security videos to to extract the spatio-temporal features on the output of
be classified into violent or not, in combination two person trackers to perform context-free violence
dewith a recurrent module to model the temporal tection, i.e., the violence detection applied to the subjects
information in addition to the spatial information in the videos only, discarding any background or
conof the videos. text information. They combined such feature extractor
• The comparison of the proposed networks with with diferent classiefirs, getting the best results with the
our previous tested models [13] to evaluate the Linear Regression, with 99.45% accuracy on the Crowd
drop of accuracy necessary to use a network tai- Violence dataset, 99.43% on the Hockey Fight, and 97.54%
lored for mobile and embedded devices i.e., Mo- on the AIRTLab.</p>
      <p>bileNetV2. Whereas the aforementioned techniques demonstrated
The rest of this paper is organized as follows. Section 2 efective in the task of automatically detecting violence
provides a literature review about Deep Learning tech- in diferent video databases, they are all high
demandniques applied in the violence detection task. Section 3 ing for computational and storage resources, making
describes the proposed networks, providing the neces- them inadequate to run in mobile and embedded devices
sary background and detailing the structure of the used i.e., for edge computing. In our previous work [13], we
dataset. Section 4 discusses the experimental evaluation demonstrated that pre-trained 2D CNNs, time distributed
and presents the main findings. Finally, Section 5 draws on the frames of the security videos and combined with
the conclusions of this study. Bi-LSTM, achieve a lower accuracy than 3D CNNs. For
example, VGG16 [27] combined with a Bi-LSTM, achieved
94.92% accuracy on the AIRTLab dataset, 95.47% on the
2. Related Works Hockey Fight, and 97.39% on the Crowd Violence.
Nevertheless, such accuracy in detecting violence might be still
Several violence detection techniques based on Deep acceptable, to get a compromise to run violence
detecNeural Networks and, specifically, Recurrent Neural Net- tion at the edge to avoid data transmission and preserve
works (such as LSTM, Bi-LSTM, ConvLSTM) and CNNs privacy. Therefore, given such results and the need for
demonstrated their efectiveness [ 12]. For example, Sud- models capable of running violence detection at the edge,
hakaran and Lanz. [21] combined the spatial features diferently from the listed works, we propose to
“timecomputed by 2D CNNs on the frames of the videos, with a distribute” MobileNetV2 [18], a 2D CNN specifically
deConvLSTM, to extract the temporal features as well. They signed for mobile devices, on the frames of the security
got 94.5% accuracy on the Crowd Violence dataset [10] videos. We combine it with a recurrent layer and fully
and 97.1% on the Hockey Fight dataset [22]. Li et al. [23] connected layers to perform the violence classification
proposed a 3D CNN composed of 10 layers, adding dense and test two diferent versions, one based on the Bi-LSTM
and transitional layers after the convolutional layers. and one on the ConvLSTM.</p>
      <p>They got 97.2% accuracy on the Crowd Violence dataset, In addition to the search for the best accuracy, the
scientific literature concerning the use of Deep Learning
techniques for the automatic detection of violent scenes
includes other studies. For example, Ciampi et al. [28]
tested some of the aforementioned techniques, such as
3D CNNs and ConvLSTM, on a novel dataset, the Bus
Violence, to study the behavior of the violence
detection methodologies based on Deep Learning when the
background and context information significantly varies.</p>
      <p>Silva et al. [29] proposed the use of a federated learning
approach to distribute the learning process across
diferent devices, preserving privacy, with a server combining
the locally trained model into a global model. However,
instead on relying on videos or on video portions, the
applied 2D CNNs to single frames, achieving the best
results with MobileNet (99.4% accuracy on the AIRTLab
dataset). Yang et al. [30] proposed a multimodal approach
(Multimodal Contrastive Learning – MCL) to use both
video and audio for the automatic detection of violence.</p>
      <p>They got 84.03% average precision on the XD-Violence
dataset [31], against the 83.19% of using the video only
and the 76.07% of using the audio only.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Materials and Methods</title>
      <p>• The pointwise convolution layer applied to the
output of the depthwise convolution layer using
a 1x1 convolution.</p>
      <sec id="sec-2-1">
        <title>In addition, in MobileNetV2, linear bottlenecks and resid</title>
        <p>ual connections follow the convolution. Specifically,
linear bottlenecks use a linear activation function instead
of a non-linear activation function, reducing the
computational cost of the network.</p>
        <p>As a traditional CNN, MobileNetV2 models the
spatial information of images i.e., the frames of the videos.
Therefore, we added a recurrent layer to the output of
MobileNetV2 to model the temporal information
available in the videos, using a Bi-LSTM and a ConvLSTM.
In the original LSTM architecture [33], a hidden unit is
composed by a self-recurrent cell, called memory cell,
whose input/output is regulated by three multiplicative
gates i.e., the input gate, the output gate, and the
forget gate [34]. Specifically, the output ℎ at time point 
of a LSTM hidden unit is given by the following
equations [34]:
 =  ( + ℎℎ− 1 + − 1 + )
 =  (  + ℎ ℎ− 1 +  − 1 +  )
(1)
(2)
 = − 1 +  tanh( + ℎℎ− 1 + ) (3)</p>
      </sec>
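        <p>For illustration only, the following sketch builds the depthwise-plus-pointwise decomposition described above with Keras layers. The input shape and the number of pointwise filters are arbitrary assumptions, and the snippet is not taken from the actual MobileNetV2 implementation.</p>
        <preformat># Sketch of a depthwise separable convolution block: a depthwise 3x3
# convolution (one filter per input channel) followed by a pointwise 1x1
# convolution that mixes the channels. Shapes and filter counts are examples.
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(224, 224, 32))
x = layers.DepthwiseConv2D(kernel_size=3, padding="same")(inputs)  # depthwise
x = layers.Conv2D(filters=64, kernel_size=1)(x)                    # pointwise
block = tf.keras.Model(inputs, x)
block.summary()</preformat>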
      <sec id="sec-2-2">
        <title>As explained in Sections 1 and 2, many studies about</title>
        <p>the use of Deep Neural Networks for the violence de-  =  ( + ℎℎ− 1 +  + ) (4)
tection in videos proposed complex architectures, such ℎ =  tanh() (5)
as 3D CNNs, requiring computational and memory
resources that are usually not compatible with mobile and where , , , and  are the activation vectors of the
embedded devices. To this end, we propose the use of input gate, forget gate, output gate, and memory cell at
MobileNetV2, time-distributed over 16-frames chunks time point ,  is the sigmoid function,  denotes the bias
of the videos, combined with a recurrent layer to model of each gate/cell, and  are diagonal weight matrices.
the temporal information of the sequence of frames, in In the original formulation, a LSTM processes input
addition to the spatial information. In the following, we data in ascending temporal order. However, the
recogprovide some background about MobileNetV2, the LSTM nition of a pattern might be more efective with the
architecture, and the ConvLSTM architecture (3.1). Then, use of future context as well. To this end, Bidirectional
we present the proposed neural networks (3.2) and de- RNNs [35] and, specifically, Bidirectional LSTMs [ 16]
scribe the dataset used for the tests (3.3). have been proposed. The basic idea of such models is to
present the training sequences both forwards and
back3.1. Background: MobileNetV2, LSTM, wards, using two separate recurrent nets, which are
connected to the same output layer. As such, we based one of
and ConvLSTM our models on the Bi-LSTM, as the videos are processed
In the original definition of LeCun and Bengio [ 32], a once recorded, taking advantage of both previous and
unit of a layer in a CNN receives inputs from a set of future context.
units in the local receptive field, via a convolution oper- For the ConvLSTM, we use the formulation of Shi et
ation with kernels composed of shared weights. In Mo- al. [15], who extended the LSTM architecture by adding
bileNetV2 [18] this concept is extended to cope with the convolutional structures to state transition. As Shi et al.
limited computational resources of mobile and embedded explained, the LSTM architecture is adequate to extract
devices. Instead of the traditional convolution operation temporal features, but includes too much redundancy for
of CNNs, MobileNetV2 decomposes convolutional layers spatial features. In this regard, they proposed to add
coninto two separate layers: volutional structures in the transitions between the input
gate and the memory cell, and in the self-recurrency of
• The depthwise convolution layer that applies a the memory cell, regulated by the forget gate. Therefore,
separate filter to each input channel. in a ConvLSTM, the output of a hidden unit is regulated
 = − 1 +  tanh( *  + ℎ * ℎ− 1 + )</p>
        <p>(8)
 =  ( *  + ℎ * ℎ− 1 +  + )
ℎ =  tanh()
(9)
(10)
where the activations of input gate, forget gate, output
gate, and memory cell (, , , and ), as well as input
and output (, ℎ) are 3D tensors. As such, we used the
ConvLSTM in the second of our proposed models.</p>
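        <p>As a concrete reading of Eqs. (1)-(5), the following NumPy sketch computes one LSTM step. The weight shapes, stored here in plain dictionaries, are illustrative assumptions and not part of the original formulation.</p>
        <preformat># One LSTM step following Eqs. (1)-(5). W maps each gate to its input,
# recurrent, and diagonal cell weights; b maps each gate/cell to its bias.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] * c_prev + b["i"])    # Eq. (1)
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["cf"] * c_prev + b["f"])    # Eq. (2)
    c_t = f_t * c_prev + i_t * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])  # Eq. (3)
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] * c_t + b["o"])       # Eq. (4)
    h_t = o_t * np.tanh(c_t)                                                       # Eq. (5)
    return h_t, c_t</preformat>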
        <sec id="sec-2-2-1">
          <title>3.2. Proposed Classification Architecture</title>
          <p>Layer Architecture Output Shape Params #
Time Distr. MobileNetV2 - (16, 7, 7, 1280) 2257984
Time Distr. Flatten - (16, 62720) 0
Bi-LSTM 128 units (256) 64357376
Dropout 0.5 rate (256) 0
Dense 128 units, ReLU (128) 32896
Dropout 0.5 rate (128) 0
Dense 1 units, Sigmoid (1) 129
Violent
Non-violent
16-frames chunk
Resized 16-frames chunk</p>
          <p>Feature Extraction Classifier
(Time Distributed 2D CNN + Bi-LSTM/ConvLSTM) (Ful y Connected Layers)</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>As depicted in the schematic in Figure 1, to classify the</title>
        <p>videos into violent or not, we propose two Deep Learning- Layer Architecture Output Shape Params #
based classifiers based on MobileNetV2, pre-trained on Time Distr. MobileNetV2 - (16, 7, 7, 1280) 2257984
the Imagenet dataset [20], followed by a recurrent layer ConvLSTM 64 3x3 filters, tanh (5, 5, 64) 3096832
and fully connected layer. The weights of MobileNetV2 FDlraotpteonut -0.5 rate ((11660000)) 00
are freezed on the Imagenet training. Instead, the Bi- Dense 256 units, ReLu (256) 409856
LSTM layer or the ConvLSTM layer and the fully con- DDreonpsoeut 01.5unraitt,eSigmoid ((21)56) 2507
nected layers are trained from scratch on the AIRTLab
dataset, as explained in Section 4 (Subsection 4.1). Given
that in our previous work we run the classification over Table 2 lists the layers composing the second proposed
16-frames chunks of the videos, in this work we use the model. A ConvLSTM composed of 64 3 × 3 filters with
same videos split into 16 frames chunks, in order to al- the tanh activation function follows the time-distributed
low a fair comparison between the classifiers. The video MobileNetV2. The network is completed by a 0.5 dropout,
of the AIRTLab dataset are resized at 224 x 224 pixels, a fully connected layer with 256 ReLU neurons, another
as this is the input shape in the original MobileNetV2 0.5 dropout and a fully connected sigmoid neuron to
implementation. perform the final classification into violent or not.</p>
        <p>Table 1 includes the layers composing the first pro- The Bi-LSTM-based model has a total of 66,648,385
paposed model. MobileNetV2, with its 2,257,984 freezed rameters. The weights of MobileNetV2 are freezed, which
weights, is time distributed over the 16 frames used as the means that the total number of trainable parameters is
input. The Bi-LSTM is composed of 128 hidden units, fol- 64,390,401 (corresponding to the 128 hidden units of the
lowed by a 0.5 dropout to limit the overfitting, a fully con- Bi-LSTM, the 128 ReLU neurons of the first fully
connected layer with 128 ReLU neurons, another 0.5 dropout nected layer, and the sigmoid neuron of the last layer).
Inand a fully connected sigmoid neuron for the final classi- stead, in the ConvLSTM-based model there are 5,764,929
ifcation. parameters (3,506,945 are trainable, corresponding to the
64 filters of the ConvLSTM layer, the 256 ReLU neurons
of the first fully connected layer, and the final sigmoid
neuron for the classification). Therefore, the model based
on the ConvLSTM requires less memory than the model
based on the Bi-LSTM, being more adequate for the use
in mobile and embedded devices.</p>
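        <p>For reference, the following Keras sketch reproduces the layer stacks of Tables 1 and 2. It is a reconstruction based on the description above, not the code released with the paper; layer options not stated in the text (e.g., the padding of the ConvLSTM) are assumptions chosen to match the reported output shapes.</p>
        <preformat># Reconstruction of the two proposed models: a frozen, time-distributed
# MobileNetV2 feature extractor followed by a Bi-LSTM or a ConvLSTM and
# fully connected layers, as listed in Tables 1 and 2.
import tensorflow as tf
from tensorflow.keras import layers, models

def frozen_mobilenetv2(size=224):
    base = tf.keras.applications.MobileNetV2(
        include_top=False, weights="imagenet", input_shape=(size, size, 3))
    base.trainable = False  # freeze the ImageNet weights
    return base

def build_bilstm_model(frames=16, size=224):
    inputs = layers.Input(shape=(frames, size, size, 3))
    x = layers.TimeDistributed(frozen_mobilenetv2(size))(inputs)  # (16, 7, 7, 1280)
    x = layers.TimeDistributed(layers.Flatten())(x)               # (16, 62720)
    x = layers.Bidirectional(layers.LSTM(128))(x)                 # (256)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return models.Model(inputs, outputs)

def build_convlstm_model(frames=16, size=224):
    inputs = layers.Input(shape=(frames, size, size, 3))
    x = layers.TimeDistributed(frozen_mobilenetv2(size))(inputs)         # (16, 7, 7, 1280)
    x = layers.ConvLSTM2D(64, kernel_size=(3, 3), activation="tanh")(x)  # (5, 5, 64)
    x = layers.Flatten()(x)                                              # (1600)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return models.Model(inputs, outputs)</preformat>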
        <sec id="sec-2-3-1">
          <title>3.3. Used Dataset</title>
          <p>layers, together with the fully connected layers, needed
To test the performance of the proposed classifiers and to be trained from scratch. Therefore, to run the training
compare them to our previous work, we run accuracy and test on the AIRTLab dataset, we applied a stratified
tests on the AIRTLab dataset. It contains 350 videos (MP4 shufle split cross-validation scheme. To this end, we
ifles with H.264 codec, mean length of 5.63 seconds). The repeated a randomized 80-20 split 5 times, using the 80%
frame rate is 30 fps and the frame resolution is 1920 of the data as the training set, and the 20% as the test set,
x 1080 pixels. The dataset includes 230 violent videos preserving the percentage of samples from each class, in
and 120 non-violent videos. The 230 violent videos rep- each split. The data splits were the same for both the
resent 115 violent actions recorded from two diferent proposed models and for the models of our previous work,
cameras placed into two diferent spots. Similarly, the to implement a fair comparison. Given that the inputs
120 non-violent videos represent 60 non-violent actions, for the models are sequences composed of 16 frames and
recorded from two diferent cameras placed into two dif- the videos in the dataset include a total of 3537 of such
ferent spots. All the videos were taken inside the same sequences, 2829 samples (i.e., 16-frames chunks) were
room. One camera was placed in the top left corner in used for training, and 708 for testing, in each split. The
front of the room door. The second camera was in the 12.5% of the training data i.e., the 10% of the entire dataset,
top right corner on the door side. was used as validation data.</p>
          <p>A group of non-professional actors played the violent Both the proposed models used the Binary
Crossand non-violent actions. The number of actors varied Entropy loss function, minimized with the Adam
opfrom 2 to 4 per video. In the violent videos, the actors timizer. We early stopped the training after 5 epochs
simulated actions frequent in scufles, such as punches, without any improvement on the minimum validation
kicks, beating with canes, slapping, gun shots, and stab- loss, restoring the weights corresponding to the best
bing. In the non-violent videos, the actors simulated epoch. To this end, Table 3 lists the number of training
actions which can result in false positives due to the sim- epochs in each split of the stratified shufle split cross
ilarity with violent actions (for example for the presence validation scheme, for each model. The average number
of fast movements). Specifically, the non-violent videos of training epochs was 22.4 (± 5.68) for the model using
contains actions such as exulting, hugging, gesticulating, the Bi-LSTM layer, and 17.8 (± 4.21) for the model based
and clapping and giving high fives. on the ConvLSTM. The batch size was 8 for both neural
networks.</p>
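        <p>As an illustration of the input preparation implied by Sections 3.2 and 3.3, the following sketch reads a video, resizes each frame to 224 x 224 pixels, and groups the frames into 16-frames chunks. The use of OpenCV and the function name are assumptions, as the paper does not detail the preprocessing code.</p>
        <preformat># Split a video into resized 16-frames chunks (sketch, assuming OpenCV).
import cv2
import numpy as np

def video_to_chunks(path, chunk_len=16, size=224):
    capture = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (size, size)))
    capture.release()
    n_chunks = len(frames) // chunk_len           # keep complete chunks only
    frames = np.asarray(frames[: n_chunks * chunk_len], dtype=np.float32)
    return frames.reshape(n_chunks, chunk_len, size, size, 3)</preformat>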
          <p>The tests ran on Google Colab Pro with the GPU
run4. Results and Discussion time (the GPU used for the tests was a Nvidia A100 SXM4
with 40 GB of RAM) and extended RAM (83.5 GB), using
Keras 2.11.0, TensorFlow 2.11.0, and Scikit-learn 1.2.1.</p>
          <p>Labeling as negative the 16-frames chunks of the
nonviolent videos and as positive the chunks of the violent
videos, we computed the following metrics over the test
set in each split of the stratified shufle split cross
validation scheme:
We tested the two proposed models with the same
protocol used in our previous work [13] i.e., by measuring
the classification results over the AIRTLab dataset. The
objective is to compare the accuracy performance of the
classifiers based on a 2D CNN designed for mobile and
embedded devices with those of classifiers requiring more
resources. Therefore, in the following subsections, we
describe the experimental protocol (4.1), discuss the
results (4.2), and present the limitations of our evaluation
(4.3).</p>
        </sec>
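        <p>A minimal scikit-learn sketch of this split scheme is shown below; the label array and the validation hold-out are placeholders illustrating the proportions stated above, not the exact splitting code used for the experiments.</p>
        <preformat># Five randomized, stratified 80-20 splits of the 3537 16-frames chunks.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

chunk_ids = np.arange(3537)                     # one id per 16-frames chunk
labels = np.random.randint(0, 2, size=3537)     # placeholder: 1 = violent

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
for split_id, (train_idx, test_idx) in enumerate(splitter.split(chunk_ids, labels), 1):
    # 12.5% of the training data (10% of the dataset) is kept for validation.
    n_val = int(0.125 * len(train_idx))
    val_idx, train_idx = train_idx[:n_val], train_idx[n_val:]
    print(split_id, len(train_idx), len(val_idx), len(test_idx))</preformat>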
        <sec id="sec-2-3-2">
          <title>4.1. Experimental Protocol and</title>
        </sec>
        <sec id="sec-2-3-3">
          <title>Evaluation Metrics</title>
          <p>Whereas MobileNetV2 was pre-trained on Imagenet and
its weights were freezed, the Bi-LSTM and ConvLSTM
• Sensitivity (True Positive Rate – TPR) i.e., the
portion of positives that are correctly identified
(over all the available positives).
• Specificity (True Negative Rate – TNR) i.e., the
portion of negatives that are correctly identified
(over all the available negatives).
• Accuracy i.e., the portion of samples that are
cor</p>
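        <p>The following sketch shows the corresponding Keras training configuration; the function arguments are placeholders for the model and the data of a split, and the epoch cap is an assumption, since training is actually ended by the early stopping.</p>
        <preformat># Training configuration: binary cross-entropy, Adam, early stopping with
# patience 5 restoring the best weights, batch size 8.
import tensorflow as tf

def train_model(model, x_train, y_train, x_val, y_val):
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=5, restore_best_weights=True)
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model.fit(x_train, y_train,
                     validation_data=(x_val, y_val),
                     epochs=100,          # upper bound; early stopping decides
                     batch_size=8,
                     callbacks=[early_stop])</preformat>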
          <p>rectly identified (over all the available samples).
• F1 score i.e., the armonic mean of precision (the
ratio between the positives correctly identified
and all the identified positives) and sensitivity.</p>
          <p>in Figure 2. In fact, the model using the Bi-LSTM as the
recurrent layer scores an average AUC equal to 94.38%
These metrics can be formulated in terms of true positives (± 2.98%), whereas the model using the ConvLSTM gets
(TP), true negatives (TN), false positives (FP), and false 98.26% (± 0.46%). This behavior might be due to the
negatives (FN) according to the following equations: diferent number of trainable parameters of the two
models. In the Bi-LSTM-based model there are 64,390,401
  trainable parameters. Instead, in the ConvLSTM-based
 =   +   (11) model, the number of trainable parameters is 3,506,945.</p>
          <p>As such, the Bi-LSTM-based model might be oversized
  =   +   (12) for the violence detection task on the AIRTLab dataset,
  +   struggling to converge to an acceptable classification
per = (13) formance. Therefore, the ConvLSTM-based model, that
  +   +   +   is the lightest in terms of required resources between the
1  =   (14) two proposed in this work, exhibits a better performance
  + 12 (  +   ) in terms of classification accuracy and generalization
Moreover, in each split, we computed the Receiver Oper- capability.
ating Characteristic (ROC) curve and the Area Under the Table 6 compares the performance of the two models
Curve (AUC), showing the TPR against the False Positive proposed in this paper with those based on C3D tested
Rate (FPR) when the classification threshold varies, to in our previous work. Even if lighter in terms of
reunderstand the diagnostic capability of each model. quired computational resources, the model based on
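        <p>A compact way to compute these metrics for a split is sketched below; the variable names are placeholders, with y_true holding the chunk labels and y_prob the sigmoid outputs of a model on the test set.</p>
        <preformat># Per-split metrics of Eqs. (11)-(14) plus the ROC AUC.
from sklearn.metrics import confusion_matrix, roc_auc_score

def split_metrics(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "sensitivity": tp / (tp + fn),                  # Eq. (11)
        "specificity": tn / (tn + fp),                  # Eq. (12)
        "accuracy": (tp + tn) / (tp + tn + fp + fn),    # Eq. (13)
        "f1": tp / (tp + 0.5 * (fp + fn)),              # Eq. (14)
        "auc": roc_auc_score(y_true, y_prob),
    }</preformat>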
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>Table 4 lists the metrics obtained by the model composed of MobileNetV2 and the Bi-LSTM over the five splits of the cross-validation performed on the AIRTLab dataset. The metrics significantly vary across the splits, showing a poor generalization capability. For example, in split 1, all the 708 samples of the test set are labeled as violent, causing 232 false positives. As such, the sensitivity is 100% whereas the specificity is 0%. The split where most of the negatives are correctly identified is the number 2: here, 204 negatives out of 232 are correctly classified (specificity 87.93%). In the same split, 433 violent chunks out of 476 are correctly classified. As such, the accuracy is 92.8%.</p>
        <p>Instead, the model based on MobileNetV2 and the ConvLSTM exhibits a better generalization capability than the previous one, as shown in Table 5. The sensitivity is greater than 94% across all the splits, and the lowest specificity is in split 3 (85.34%). The best split is the number 5, where the F1 score is 96.42%.</p>
        <p>The difference in the generalization capability of the two proposed models is highlighted by the ROC curves in Figure 2. In fact, the model using the Bi-LSTM as the recurrent layer scores an average AUC equal to 94.38% (± 2.98%), whereas the model using the ConvLSTM gets 98.26% (± 0.46%). This behavior might be due to the different number of trainable parameters of the two models. In the Bi-LSTM-based model there are 64,390,401 trainable parameters. Instead, in the ConvLSTM-based model, the number of trainable parameters is 3,506,945. As such, the Bi-LSTM-based model might be oversized for the violence detection task on the AIRTLab dataset, struggling to converge to an acceptable classification performance. Therefore, the ConvLSTM-based model, which is the lightest in terms of required resources between the two proposed in this work, exhibits a better performance in terms of classification accuracy and generalization capability.</p>
        <p>Table 6 compares the performance of the two models proposed in this paper with those based on C3D tested in our previous work. Even if lighter in terms of required computational resources, the model based on MobileNetV2 and the ConvLSTM gets an average AUC of 98%, against the 99% of the C3D-based models. The average accuracy and F1 score of the ConvLSTM-based model are 94.1% (± 0.91%) and 95.62% (± 0.67%), being only around 2% lower than those of the C3D + SVM model of our previous work. Therefore, limited resources such as those of mobile or embedded devices might justify the use of MobileNetV2 combined with the ConvLSTM, as the decrease in the accuracy metrics is limited.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Limitations</title>
        <p>The results of the research described in this paper are promising, but include some limitations. In fact, we focused on the accuracy of two models based on MobileNetV2, which is designed for mobile or embedded devices. Nevertheless, we ran our comparative tests in the cloud, using a GPU. Whereas the decrease in accuracy is limited and justifies the use of the best of the proposed models, tests on real mobile or embedded devices, i.e., at the edge, are needed to get more general conclusions. Moreover, our tests are based on a dataset of videos where the violence is simulated by actors. Tests on videos from real surveillance cameras are needed to confirm the accuracy results.</p>
        <p>In addition, we collected the metrics on 16-frames chunks taken from the short videos of the AIRTLab dataset (the average length is 5.6 seconds), to make this work comparable with our previous research. Whereas most of the related literature performs tests on short videos, the accuracy on full-length, real videos should be evaluated. Indeed, using the short chunks of frames taken from long videos as in our study might result in too many false positives. Thus, results on the chunks should be merged together with a proper strategy to maximize the accuracy on full-length videos. To this end, a simple solution is labeling a part of a long video as violent only when a fixed number of consecutive 16-frames chunks are labeled as violent, as in the sketch below.</p>
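        <p>A minimal sketch of such a merging strategy follows; the number of consecutive chunks required is a placeholder value, not a threshold evaluated in this paper.</p>
        <preformat># Label a long video as violent only when at least min_consecutive
# consecutive 16-frames chunks are classified as violent.
def is_violent_video(chunk_predictions, min_consecutive=3):
    consecutive = 0
    for chunk_is_violent in chunk_predictions:   # one boolean per chunk
        consecutive = consecutive + 1 if chunk_is_violent else 0
        if consecutive >= min_consecutive:
            return True
    return False</preformat>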
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>To be used in real applications, Artificial Intelligence and Deep Learning-based techniques need to take into account real-time performance and be capable of running in mobile and embedded devices, in an edge computing fashion. In fact, an intelligent answer preserves its importance only if given in time, as remarked in [36]. Hence, in this paper, we proposed two Deep Neural Networks for the classification of videos into violent or not. Both networks are based on MobileNetV2, a CNN specifically designed for mobile and embedded devices. Such a CNN is responsible for the extraction of the spatial features in the videos. We combined MobileNetV2 with a recurrent layer for the extraction of the temporal features as well. One of the two proposed models uses a Bi-LSTM layer as the recurrent module. Instead, the other uses a ConvLSTM.</p>
      <p>We ran comparative tests on the AIRTLab dataset. The model using the ConvLSTM, the lightest in terms of required computational and memory resources between the two proposed in this paper, got the best accuracy, with an average AUC equal to 98.26% (± 0.46%). Compared to the models of our previous work, based on a 3D CNN, the decrease of performance in terms of AUC is around 1%, and around 2% in terms of classification accuracy over the splits of the AIRTLab dataset. Such results encourage the use of mobile models for embedded devices. For example, this might be useful to process data directly near the camera that is recording the security video and, thus, preserve the privacy while addressing public security.</p>
      <p>Future works will address the identified limitations. In particular, tests on real mobile or embedded devices need to be performed to get more conclusive and general results.</p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <title>The presented research has been part of the Memoran</title>
        <p>dum of Understanding between the Università
Politecnica delle Marche, Centro “CARMELO” and the
Ministero dell’Interno, Dipartimento di Pubblica Sicurezza,
Direzione Centrale Anticrimine della Polizia di Stato.
might be useful to process data directly near the camera
that is recording the security video and, thus, preserve
the privacy while addressing public security.</p>
        <p>Future works will address the identified limitations.</p>
        <p>In particular, tests on real mobile or embedded devices
need to be performed to get more conclusive and general
results.
[17] W. Niu, M. Sun, Z. Li, J.-A. Chen, J. Guan, X. Shen, for violence detection, Machine Vision and
ApplicaY. Wang, S. Liu, X. Lin, B. Ren, RT3D: Achieving tions 33 (2022) 1–13.
doi:10.1007/s00138-021real-time execution of 3D convolutional neural net- 01264-9.
works on mobile devices, Proceedings of the AAAI [27] K. Simonyan, A. Zisserman, Very deep
convoluConference on Artificial Intelligence 35 (2021) 9179– tional networks for large-scale image recognition,
9187. doi:10.1609/aaai.v35i10.17108. CoRR abs/1409.1556 (2015). URL: https://arxiv.org/
[18] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, abs/1409.1556.</p>
        <p>L.-C. Chen, Mobilenetv2: Inverted residuals and [28] L. Ciampi, P. Foszner, N. Messina, M. Staniszewski,
linear bottlenecks, in: 2018 IEEE/CVF Conference C. Gennaro, F. Falchi, G. Serao, M. Cogiel, D. Golba,
on Computer Vision and Pattern Recognition, 2018, A. Szczęsna, G. Amato, Bus violence: An open
pp. 4510–4520. doi:10.1109/CVPR.2018.00474. benchmark for video violence detection on
pub[19] M. Bianculli, N. Falcionelli, P. Sernani, S. Tomassini, lic transport, Sensors 22 (2022). doi:10.3390/
P. Contardo, M. Lombardi, A. F. Dragoni, A dataset s22218345.
for automatic violence detection in videos, Data in [29] V. E. D. S. Silva, T. B. Lacerda, P. B. Miranda, A. C.
Brief 33 (2020) 106587. doi:10.1016/j.dib.2020. Nascimento, A. P. C. Furtado, Federated learning
106587. for physical violence detection in videos, in: 2022
[20] O. Russakovsky, J. Deng, H. Su, J. Krause, International Joint Conference on Neural Networks
S. Satheesh, S. Ma, Z. Huang, A. Karpathy, (IJCNN), 2022, pp. 1–8. doi:10.1109/IJCNN55064.
A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, Im- 2022.9892150.
ageNet Large Scale Visual Recognition Challenge, [30] L. Yang, Z. Wu, J. Hong, J. Long, MCL: A contrastive
International Journal of Computer Vision (IJCV) learning method for multimodal data fusion in
vi115 (2015) 211–252. doi:10.1007/s11263-015- olence detection, IEEE Signal Processing Letters
0816-y. (2022) 1–5. doi:10.1109/LSP.2022.3227818.
[21] S. Sudhakaran, O. Lanz, Learning to detect violent [31] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, Z. Yang,
videos using convolutional long short-term mem- Not only look, but also listen: Learning
multiory, in: 2017 14th IEEE International Conference modal violence detection under weak supervision,
on Advanced Video and Signal Based Surveillance in: Computer Vision – ECCV 2020, Springer
In(AVSS), 2017, pp. 1–6. doi:10.1109/AVSS.2017. ternational Publishing, Cham, 2020, pp. 322–339.
8078468. doi:10.1007/978-3-030-58577-8_20.
[22] E. Bermejo Nievas, O. Deniz Suarez, G. Bueno Gar- [32] Y. LeCun, Y. Bengio, Convolutional Networks for
cía, R. Sukthankar, Violence detection in video us- Images, Speech, and Time Series, MIT Press,
Caming computer vision techniques, in: P. Real, D. Diaz- bridge, MA, USA, 1998, p. 255–258.
Pernil, H. Molina-Abril, A. Berciano, W. Kropatsch [33] S. Hochreiter, J. Schmidhuber, Long short-term
(Eds.), Computer Analysis of Images and Pat- memory, Neural Computation 9 (1997) 1735–1780.
terns, Springer Berlin Heidelberg, Berlin, Hei- doi:10.1162/neco.1997.9.8.1735.
delberg, 2011, pp. 332–339. doi:10.1007/978-3- [34] A. Graves, N. Jaitly, A. Mohamed, Hybrid speech
642-23678-5_39. recognition with deep bidirectional LSTM, in:
[23] J. Li, X. Jiang, T. Sun, K. Xu, Eficient violence de- 2013 IEEE Workshop on Automatic Speech
Recognitection using 3D convolutional neural networks, tion and Understanding, 2013, pp. 273–278. doi:10.
in: 2019 16th IEEE International Conference on 1109/ASRU.2013.6707742.</p>
        <p>Advanced Video and Signal Based Surveillance [35] M. Schuster, K. K. Paliwal, Bidirectional
recur(AVSS), 2019, pp. 1–8. doi:10.1109/AVSS.2019. rent neural networks, IEEE Transactions on Signal
8909883. Processing 45 (1997) 2673–2681. doi:10.1109/78.
[24] S. Accattoli, P. Sernani, N. Falcionelli, D. N. Mekuria, 650093.</p>
        <p>A. F. Dragoni, Violence detection in videos by com- [36] A. F. Dragoni, P. Sernani, D. Calvaresi, When
rabining 3D convolutional neural networks and sup- tionality entered time and became real agent in a
port vector machines, Applied Artificial Intelli- cyber-society, in: Proceedings of the 3rd
Internagence 34 (2020) 329–344. doi:10.1080/08839514. tional Conference on Recent Trends and
Applica2020.1723876. tions in Computer Science and Information
Tech[25] F. U. M. Ullah, A. Ullah, K. Muhammad, I. U. Haq, nology, volume 2280 of CEUR Workshop
ProceedS. W. Baik, Violence detection using spatiotempo- ings, 2018, pp. 167–171. URL: http://ceur-ws.org/
ral features with 3D convolutional neural network, Vol-2280/paper-24.pdf.</p>
        <p>Sensors 19 (2019) 2472. doi:10.3390/s19112472.
[26] D. Freire-Obregón, P. Barra, M. Castrillón-Santana,</p>
        <p>M. D. Marsico, Inflated 3d convnet context analysis</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref17"><label>[17]</label><mixed-citation>W. Niu, M. Sun, Z. Li, J.-A. Chen, J. Guan, X. Shen, Y. Wang, S. Liu, X. Lin, B. Ren, RT3D: Achieving real-time execution of 3D convolutional neural networks on mobile devices, Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021) 9179–9187. doi:10.1609/aaai.v35i10.17108.</mixed-citation></ref>
      <ref id="ref18"><label>[18]</label><mixed-citation>M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, MobileNetV2: Inverted residuals and linear bottlenecks, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520. doi:10.1109/CVPR.2018.00474.</mixed-citation></ref>
      <ref id="ref19"><label>[19]</label><mixed-citation>M. Bianculli, N. Falcionelli, P. Sernani, S. Tomassini, P. Contardo, M. Lombardi, A. F. Dragoni, A dataset for automatic violence detection in videos, Data in Brief 33 (2020) 106587. doi:10.1016/j.dib.2020.106587.</mixed-citation></ref>
      <ref id="ref20"><label>[20]</label><mixed-citation>O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV) 115 (2015) 211–252. doi:10.1007/s11263-015-0816-y.</mixed-citation></ref>
      <ref id="ref21"><label>[21]</label><mixed-citation>S. Sudhakaran, O. Lanz, Learning to detect violent videos using convolutional long short-term memory, in: 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2017, pp. 1–6. doi:10.1109/AVSS.2017.8078468.</mixed-citation></ref>
      <ref id="ref22"><label>[22]</label><mixed-citation>E. Bermejo Nievas, O. Deniz Suarez, G. Bueno García, R. Sukthankar, Violence detection in video using computer vision techniques, in: P. Real, D. Diaz-Pernil, H. Molina-Abril, A. Berciano, W. Kropatsch (Eds.), Computer Analysis of Images and Patterns, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 332–339. doi:10.1007/978-3-642-23678-5_39.</mixed-citation></ref>
      <ref id="ref23"><label>[23]</label><mixed-citation>J. Li, X. Jiang, T. Sun, K. Xu, Efficient violence detection using 3D convolutional neural networks, in: 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2019, pp. 1–8. doi:10.1109/AVSS.2019.8909883.</mixed-citation></ref>
      <ref id="ref24"><label>[24]</label><mixed-citation>S. Accattoli, P. Sernani, N. Falcionelli, D. N. Mekuria, A. F. Dragoni, Violence detection in videos by combining 3D convolutional neural networks and support vector machines, Applied Artificial Intelligence 34 (2020) 329–344. doi:10.1080/08839514.2020.1723876.</mixed-citation></ref>
      <ref id="ref25"><label>[25]</label><mixed-citation>F. U. M. Ullah, A. Ullah, K. Muhammad, I. U. Haq, S. W. Baik, Violence detection using spatiotemporal features with 3D convolutional neural network, Sensors 19 (2019) 2472. doi:10.3390/s19112472.</mixed-citation></ref>
      <ref id="ref26"><label>[26]</label><mixed-citation>D. Freire-Obregón, P. Barra, M. Castrillón-Santana, M. D. Marsico, Inflated 3D ConvNet context analysis for violence detection, Machine Vision and Applications 33 (2022) 1–13. doi:10.1007/s00138-021-01264-9.</mixed-citation></ref>
      <ref id="ref27"><label>[27]</label><mixed-citation>K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1556 (2015). URL: https://arxiv.org/abs/1409.1556.</mixed-citation></ref>
      <ref id="ref28"><label>[28]</label><mixed-citation>L. Ciampi, P. Foszner, N. Messina, M. Staniszewski, C. Gennaro, F. Falchi, G. Serao, M. Cogiel, D. Golba, A. Szczęsna, G. Amato, Bus violence: An open benchmark for video violence detection on public transport, Sensors 22 (2022). doi:10.3390/s22218345.</mixed-citation></ref>
      <ref id="ref29"><label>[29]</label><mixed-citation>V. E. D. S. Silva, T. B. Lacerda, P. B. Miranda, A. C. Nascimento, A. P. C. Furtado, Federated learning for physical violence detection in videos, in: 2022 International Joint Conference on Neural Networks (IJCNN), 2022, pp. 1–8. doi:10.1109/IJCNN55064.2022.9892150.</mixed-citation></ref>
      <ref id="ref30"><label>[30]</label><mixed-citation>L. Yang, Z. Wu, J. Hong, J. Long, MCL: A contrastive learning method for multimodal data fusion in violence detection, IEEE Signal Processing Letters (2022) 1–5. doi:10.1109/LSP.2022.3227818.</mixed-citation></ref>
      <ref id="ref31"><label>[31]</label><mixed-citation>P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, Z. Yang, Not only look, but also listen: Learning multimodal violence detection under weak supervision, in: Computer Vision – ECCV 2020, Springer International Publishing, Cham, 2020, pp. 322–339. doi:10.1007/978-3-030-58577-8_20.</mixed-citation></ref>
      <ref id="ref32"><label>[32]</label><mixed-citation>Y. LeCun, Y. Bengio, Convolutional Networks for Images, Speech, and Time Series, MIT Press, Cambridge, MA, USA, 1998, pp. 255–258.</mixed-citation></ref>
      <ref id="ref33"><label>[33]</label><mixed-citation>S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780. doi:10.1162/neco.1997.9.8.1735.</mixed-citation></ref>
      <ref id="ref34"><label>[34]</label><mixed-citation>A. Graves, N. Jaitly, A. Mohamed, Hybrid speech recognition with deep bidirectional LSTM, in: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 2013, pp. 273–278. doi:10.1109/ASRU.2013.6707742.</mixed-citation></ref>
      <ref id="ref35"><label>[35]</label><mixed-citation>M. Schuster, K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing 45 (1997) 2673–2681. doi:10.1109/78.650093.</mixed-citation></ref>
      <ref id="ref36"><label>[36]</label><mixed-citation>A. F. Dragoni, P. Sernani, D. Calvaresi, When rationality entered time and became a real agent in a cyber-society, in: Proceedings of the 3rd International Conference on Recent Trends and Applications in Computer Science and Information Technology, volume 2280 of CEUR Workshop Proceedings, 2018, pp. 167–171. URL: http://ceur-ws.org/Vol-2280/paper-24.pdf.</mixed-citation></ref>
    </ref-list>
  </back>
</article>