<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Combining a mobile deep neural network and a recurrent layer for violence detection in videos</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Contardo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Selene Tomassini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Falcionelli</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aldo Franco Dragoni</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Sernani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Law, University of Macerata</institution>
          ,
          <addr-line>Piaggia dell'Università 2, 62100 Macerata</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Gabinetto Interregionale di Polizia Scientifica per le Marche e l'Abruzzo</institution>
          ,
          <addr-line>Via Gervasoni 19, 60129 Ancona</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Information Engineering Department, Università Politecnica delle Marche</institution>
          ,
          <addr-line>Via Brecce Bianche 12, 60131 Ancona</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Several techniques for the automatic detection of violent scenes in videos and security footage appeared in recent years, for example with the goal of unburdening authorities from the need of analyzing hours of Closed-Circuit TeleVision (CCTV) clips. In this regard, Deep Learning-based techniques such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) emerged as effective for violence detection. Nevertheless, most of such techniques require significant computational and memory resources to run the automatic detection of violence. Thus, we propose the combination of an established CNN designed for use in mobile and embedded devices, MobileNetV2, with a recurrent layer to extract the spatio-temporal features of the security videos. A lightweight model can run on embedded devices, in an edge computing fashion, for example to allow processing the videos near the camera recording them, to preserve privacy. Specifically, we exploit transfer learning, as we use a pre-trained version of MobileNetV2, and we propose two different models combining it with a Bidirectional Long Short-Term Memory (Bi-LSTM) and a Convolutional LSTM (ConvLSTM). The paper presents accuracy tests of the two models on the AIRTLab dataset and a comparison with more complex models developed in our previous work, in order to evaluate the drop of accuracy necessary to use a model compatible with limited resources. The network composed of MobileNetV2 and the ConvLSTM scores a 94.1% accuracy, against the 96.1% of a model based on a more complex 3D CNN.</p>
      </abstract>
      <kwd-group>
        <kwd>Violence Detection</kwd>
        <kwd>Convolutional Neural Network</kwd>
        <kwd>Long Short-Term Memory</kwd>
        <kwd>Action Recognition</kwd>
        <kwd>MobileNetV2</kwd>
        <kwd>Law Enforcement</kwd>
        <kwd>Crime Investigation</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>time [9], many techniques to automatically detect
violence in videos emerged in the scientific literature. In this
Closed-Circuit TeleVision (CCTV) emerged as one of the regard, the first studies focused on the use of flow
descripmainstream crime prevention techniques [1], providing tors and hand-crafted features (see, for example, [10, 11]).
abundant and precise information for security and law However, Deep Learning-based techniques demonstrated
enforcement applications [2, 3]. In fact, Artificial Intel- better accuracy in violence detection, proposing to use
ligence (AI) methodologies, especially those based on Recurrent Neural Networks (RNNs) and Convolutional
Deep Learning, are demonstrating their efectiveness in Neural Networks (CNNs) for such task [12]. These
techapplications that take advantages of CCTV footage, such niques are capable of modeling the spatio-temporal
inas weapon detection [4, 5], face recognition [6, 7], and formation included in the CCTV footage, i.e., features
accident detection [8]. With the goal of unburdening that represent the motion information contained in a
seauthorities from the need of manually analyzing hours of quence of frames, in addition to the spatial information
CCTV videos and allowing them to take decisions in short contained in a single frame.</p>
      <p>RTA-CSIT 2023: 5th International Conference Recent Trends and Appli- In our previous work [13], we tested 13 diferent Deep
cations In Computer Science And Information Technology, April 26–27, Neural Networks (DNNs) for the task of violence
detec2023, Tirana, Albania tion in videos. Specifically, we compared a pre-trained
* Corresponding author. 3D CNN, C3D [14], combined with a Support Vector
Ma$ p.contardo@pm.univpm.it (P. Contardo); chine (SVM) classifier, with C3D combined with fully
consn..tfoamlciaosnsienllii@@psmta.f.uunniivvppmm..iitt ((SN..TFoamlcaiossnienlil)i;); nected layers, with a trained-from-scratch Convolutional
a.f.dragoni@staf.univpm.it (A. F. Dragoni); paolo.sernani@unimc.it Long Short-Term Memory (ConvLSTM) [15] plus fully
(P. Sernani) connected layers, with other ten networks based on time
0000-0002-5605-4783 (P. Contardo); 0000-0002-1087-7004 distributed pre-trained 2D CNNs combined with
Bidirec(S. Tomassini); 0000-0002-1312-6310 (N. Falcionelli); tional LSTM (Bi-LSTM) [16] (5 networks) and ConvLSTM
(0P0.0S0e-0rn00a2n-i3)013-3424 (A. F. Dragoni); 0000-0001-7614-7154 (5 networks). The C3D-based models got the best
accu© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License racy results in detecting violence on diferent datasets,
CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org) taking advantage of the 3D architecture capable of
modeling the spatio-temporal features of the videos as well as and 98.3% accuracy on the Hockey Fight dataset.
Accatof the transfer learning. Nevertheless, 3D CNNs require toli et al. [24] and Ullah et al. [25] also based their work
computational and storage resources which are usually on a 3D CNN, but, instead of training it from scratch,
not compatible with mobile and embedded devices [17] they applied transfer learning. Accattoli et al. added a
i.e., for edge computing. SVM to the CNN, getting 99.2% accuracy on the Hockey</p>
      <p>To tackle such issue, in this paper we propose two Fight and 98.5% accuracy on the Crowd Violence. Instead,
models based on the combination of a CNN specifically Ullah et al. implemented a end-to-end neural network by
designed for mobile devices i.e., MobileNetV2 [18], with adding fully connected layers to the 3D CNN, getting 98%
a recurrent layer to extract the temporal information and accuracy on the Crowd Violence and 96% accuracy on
fully connected layers for the classification of the videos the Hockey Fight. Sernani et al. [13] compared 13
diferinto violent or not. Specifically, in one model we used ent Deep Neural Networks on the Hockey Fight, Crowd
the Bi-LSTM as the recurrent layers, whereas in the other Violence and AIRTLab datasets. Specifically, they tested
we used the ConvLSTM. To understand its efectiveness a pre-trained 3D CNN (C3D) combined with a SVM, C3D
and evaluate any potential drop of accuracy, we test the with fully connected layers, a ConvLSTM combined with
proposed networks on the AIRTLab dataset [19], com- fully connected layers, 5 time-distributed pre-trained 2D
paring the results with those obtained in our previous CNNs combined with the Bi-LSTM and the same 2D
work. As such, this paper contributes to the state of the CNNs combined with a ConvLSTM. They got the best
art in violence detection with: results with the two C3D-based networks, with 96.1%
accuracy on the AIRTLab dataset, 97.86% accuracy on the
• The proposal of using MobileNetV2, pre-trained Hockey Fight, and 99.6% accuracy on the Crowd Violence.
on the Imagenet dataset [20], by time distribut- Freire-Obregón et al. [26] used an Inflated 3D ConvNet
ing it over the frames of the security videos to to extract the spatio-temporal features on the output of
be classified into violent or not, in combination two person trackers to perform context-free violence
dewith a recurrent module to model the temporal tection, i.e., the violence detection applied to the subjects
information in addition to the spatial information in the videos only, discarding any background or
conof the videos. text information. They combined such feature extractor
• The comparison of the proposed networks with with diferent classiefirs, getting the best results with the
our previous tested models [13] to evaluate the Linear Regression, with 99.45% accuracy on the Crowd
drop of accuracy necessary to use a network tai- Violence dataset, 99.43% on the Hockey Fight, and 97.54%
lored for mobile and embedded devices i.e., Mo- on the AIRTLab.</p>
      <p>bileNetV2. Whereas the aforementioned techniques demonstrated
The rest of this paper is organized as follows. Section 2 efective in the task of automatically detecting violence
provides a literature review about Deep Learning tech- in diferent video databases, they are all high
demandniques applied in the violence detection task. Section 3 ing for computational and storage resources, making
describes the proposed networks, providing the neces- them inadequate to run in mobile and embedded devices
sary background and detailing the structure of the used i.e., for edge computing. In our previous work [13], we
dataset. Section 4 discusses the experimental evaluation demonstrated that pre-trained 2D CNNs, time distributed
and presents the main findings. Finally, Section 5 draws on the frames of the security videos and combined with
the conclusions of this study. Bi-LSTM, achieve a lower accuracy than 3D CNNs. For
example, VGG16 [27] combined with a Bi-LSTM, achieved
94.92% accuracy on the AIRTLab dataset, 95.47% on the
2. Related Works Hockey Fight, and 97.39% on the Crowd Violence.
Nevertheless, such accuracy in detecting violence might be still
Several violence detection techniques based on Deep acceptable, to get a compromise to run violence
detecNeural Networks and, specifically, Recurrent Neural Net- tion at the edge to avoid data transmission and preserve
works (such as LSTM, Bi-LSTM, ConvLSTM) and CNNs privacy. Therefore, given such results and the need for
demonstrated their efectiveness [ 12]. For example, Sud- models capable of running violence detection at the edge,
hakaran and Lanz. [21] combined the spatial features diferently from the listed works, we propose to
“timecomputed by 2D CNNs on the frames of the videos, with a distribute” MobileNetV2 [18], a 2D CNN specifically
deConvLSTM, to extract the temporal features as well. They signed for mobile devices, on the frames of the security
got 94.5% accuracy on the Crowd Violence dataset [10] videos. We combine it with a recurrent layer and fully
and 97.1% on the Hockey Fight dataset [22]. Li et al. [23] connected layers to perform the violence classification
proposed a 3D CNN composed of 10 layers, adding dense and test two diferent versions, one based on the Bi-LSTM
and transitional layers after the convolutional layers. and one on the ConvLSTM.</p>
      <p>They got 97.2% accuracy on the Crowd Violence dataset, In addition to the search for the best accuracy, the
scientific literature concerning the use of Deep Learning
techniques for the automatic detection of violent scenes
includes other studies. For example, Ciampi et al. [28]
tested some of the aforementioned techniques, such as
3D CNNs and ConvLSTM, on a novel dataset, the Bus
Violence, to study the behavior of the violence
detection methodologies based on Deep Learning when the
background and context information significantly varies.</p>
      <p>Silva et al. [29] proposed the use of a federated learning
approach to distribute the learning process across
diferent devices, preserving privacy, with a server combining
the locally trained model into a global model. However,
instead on relying on videos or on video portions, the
applied 2D CNNs to single frames, achieving the best
results with MobileNet (99.4% accuracy on the AIRTLab
dataset). Yang et al. [30] proposed a multimodal approach
(Multimodal Contrastive Learning – MCL) to use both
video and audio for the automatic detection of violence.</p>
      <p>They got 84.03% average precision on the XD-Violence
dataset [31], against the 83.19% of using the video only
and the 76.07% of using the audio only.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Materials and Methods</title>
      <p>• The pointwise convolution layer applied to the
output of the depthwise convolution layer using
a 1x1 convolution.</p>
      <sec id="sec-2-1">
        <title>In addition, in MobileNetV2, linear bottlenecks and resid</title>
        <p>ual connections follow the convolution. Specifically,
linear bottlenecks use a linear activation function instead
of a non-linear activation function, reducing the
computational cost of the network.</p>
        <p>As a traditional CNN, MobileNetV2 models the
spatial information of images i.e., the frames of the videos.
Therefore, we added a recurrent layer to the output of
MobileNetV2 to model the temporal information
available in the videos, using a Bi-LSTM and a ConvLSTM.
In the original LSTM architecture [33], a hidden unit is
composed by a self-recurrent cell, called memory cell,
whose input/output is regulated by three multiplicative
gates i.e., the input gate, the output gate, and the
forget gate [34]. Specifically, the output ℎ at time point 
of a LSTM hidden unit is given by the following
equations [34]:
 =  ( + ℎℎ− 1 + − 1 + )
 =  (  + ℎ ℎ− 1 +  − 1 +  )
(1)
(2)
 = − 1 +  tanh( + ℎℎ− 1 + ) (3)</p>
      </sec>
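        <p>For illustration only, the following sketch builds the depthwise-plus-pointwise decomposition described above with Keras layers. The input shape and the number of pointwise filters are arbitrary assumptions, and the snippet is not taken from the actual MobileNetV2 implementation.</p>
        <preformat># Sketch of a depthwise separable convolution block: a depthwise 3x3
# convolution (one filter per input channel) followed by a pointwise 1x1
# convolution that mixes the channels. Shapes and filter counts are examples.
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(224, 224, 32))
x = layers.DepthwiseConv2D(kernel_size=3, padding="same")(inputs)  # depthwise
x = layers.Conv2D(filters=64, kernel_size=1)(x)                    # pointwise
block = tf.keras.Model(inputs, x)
block.summary()</preformat>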
      <sec id="sec-2-2">
        <title>As explained in Sections 1 and 2, many studies about</title>
        <p>the use of Deep Neural Networks for the violence de-  =  ( + ℎℎ− 1 +  + ) (4)
tection in videos proposed complex architectures, such ℎ =  tanh() (5)
as 3D CNNs, requiring computational and memory
resources that are usually not compatible with mobile and where , , , and  are the activation vectors of the
embedded devices. To this end, we propose the use of input gate, forget gate, output gate, and memory cell at
MobileNetV2, time-distributed over 16-frames chunks time point ,  is the sigmoid function,  denotes the bias
of the videos, combined with a recurrent layer to model of each gate/cell, and  are diagonal weight matrices.
the temporal information of the sequence of frames, in In the original formulation, a LSTM processes input
addition to the spatial information. In the following, we data in ascending temporal order. However, the
recogprovide some background about MobileNetV2, the LSTM nition of a pattern might be more efective with the
architecture, and the ConvLSTM architecture (3.1). Then, use of future context as well. To this end, Bidirectional
we present the proposed neural networks (3.2) and de- RNNs [35] and, specifically, Bidirectional LSTMs [ 16]
scribe the dataset used for the tests (3.3). have been proposed. The basic idea of such models is to
present the training sequences both forwards and
back3.1. Background: MobileNetV2, LSTM, wards, using two separate recurrent nets, which are
connected to the same output layer. As such, we based one of
and ConvLSTM our models on the Bi-LSTM, as the videos are processed
In the original definition of LeCun and Bengio [ 32], a once recorded, taking advantage of both previous and
unit of a layer in a CNN receives inputs from a set of future context.
units in the local receptive field, via a convolution oper- For the ConvLSTM, we use the formulation of Shi et
ation with kernels composed of shared weights. In Mo- al. [15], who extended the LSTM architecture by adding
bileNetV2 [18] this concept is extended to cope with the convolutional structures to state transition. As Shi et al.
limited computational resources of mobile and embedded explained, the LSTM architecture is adequate to extract
devices. Instead of the traditional convolution operation temporal features, but includes too much redundancy for
of CNNs, MobileNetV2 decomposes convolutional layers spatial features. In this regard, they proposed to add
coninto two separate layers: volutional structures in the transitions between the input
gate and the memory cell, and in the self-recurrency of
• The depthwise convolution layer that applies a the memory cell, regulated by the forget gate. Therefore,
separate filter to each input channel. in a ConvLSTM, the output of a hidden unit is regulated
 = − 1 +  tanh( *  + ℎ * ℎ− 1 + )</p>
        <p>(8)
 =  ( *  + ℎ * ℎ− 1 +  + )
ℎ =  tanh()
(9)
(10)
where the activations of input gate, forget gate, output
gate, and memory cell (, , , and ), as well as input
and output (, ℎ) are 3D tensors. As such, we used the
ConvLSTM in the second of our proposed models.</p>
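        <p>As a concrete reading of Eqs. (1)-(5), the following NumPy sketch computes one LSTM step. The weight shapes, stored here in plain dictionaries, are illustrative assumptions and not part of the original formulation.</p>
        <preformat># One LSTM step following Eqs. (1)-(5). W maps each gate to its input,
# recurrent, and diagonal cell weights; b maps each gate/cell to its bias.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] * c_prev + b["i"])    # Eq. (1)
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["cf"] * c_prev + b["f"])    # Eq. (2)
    c_t = f_t * c_prev + i_t * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])  # Eq. (3)
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] * c_t + b["o"])       # Eq. (4)
    h_t = o_t * np.tanh(c_t)                                                       # Eq. (5)
    return h_t, c_t</preformat>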
        <sec id="sec-2-2-1">
          <title>3.2. Proposed Classification Architecture</title>
          <p>Layer Architecture Output Shape Params #
Time Distr. MobileNetV2 - (16, 7, 7, 1280) 2257984
Time Distr. Flatten - (16, 62720) 0
Bi-LSTM 128 units (256) 64357376
Dropout 0.5 rate (256) 0
Dense 128 units, ReLU (128) 32896
Dropout 0.5 rate (128) 0
Dense 1 units, Sigmoid (1) 129
Violent
Non-violent
16-frames chunk
Resized 16-frames chunk</p>
          <p>Feature Extraction Classifier
(Time Distributed 2D CNN + Bi-LSTM/ConvLSTM) (Ful y Connected Layers)</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>As depicted in the schematic in Figure 1, to classify the</title>
        <p>videos into violent or not, we propose two Deep Learning- Layer Architecture Output Shape Params #
based classifiers based on MobileNetV2, pre-trained on Time Distr. MobileNetV2 - (16, 7, 7, 1280) 2257984
the Imagenet dataset [20], followed by a recurrent layer ConvLSTM 64 3x3 filters, tanh (5, 5, 64) 3096832
and fully connected layer. The weights of MobileNetV2 FDlraotpteonut -0.5 rate ((11660000)) 00
are freezed on the Imagenet training. Instead, the Bi- Dense 256 units, ReLu (256) 409856
LSTM layer or the ConvLSTM layer and the fully con- DDreonpsoeut 01.5unraitt,eSigmoid ((21)56) 2507
nected layers are trained from scratch on the AIRTLab
dataset, as explained in Section 4 (Subsection 4.1). Given
that in our previous work we run the classification over Table 2 lists the layers composing the second proposed
16-frames chunks of the videos, in this work we use the model. A ConvLSTM composed of 64 3 × 3 filters with
same videos split into 16 frames chunks, in order to al- the tanh activation function follows the time-distributed
low a fair comparison between the classifiers. The video MobileNetV2. The network is completed by a 0.5 dropout,
of the AIRTLab dataset are resized at 224 x 224 pixels, a fully connected layer with 256 ReLU neurons, another
as this is the input shape in the original MobileNetV2 0.5 dropout and a fully connected sigmoid neuron to
implementation. perform the final classification into violent or not.</p>
        <p>Table 1 includes the layers composing the first pro- The Bi-LSTM-based model has a total of 66,648,385
paposed model. MobileNetV2, with its 2,257,984 freezed rameters. The weights of MobileNetV2 are freezed, which
weights, is time distributed over the 16 frames used as the means that the total number of trainable parameters is
input. The Bi-LSTM is composed of 128 hidden units, fol- 64,390,401 (corresponding to the 128 hidden units of the
lowed by a 0.5 dropout to limit the overfitting, a fully con- Bi-LSTM, the 128 ReLU neurons of the first fully
connected layer with 128 ReLU neurons, another 0.5 dropout nected layer, and the sigmoid neuron of the last layer).
Inand a fully connected sigmoid neuron for the final classi- stead, in the ConvLSTM-based model there are 5,764,929
ifcation. parameters (3,506,945 are trainable, corresponding to the
64 filters of the ConvLSTM layer, the 256 ReLU neurons
of the first fully connected layer, and the final sigmoid
neuron for the classification). Therefore, the model based
on the ConvLSTM requires less memory than the model
based on the Bi-LSTM, being more adequate for the use
in mobile and embedded devices.</p>
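        <p>For reference, the following Keras sketch reproduces the layer stacks of Tables 1 and 2. It is a reconstruction based on the description above, not the code released with the paper; layer options not stated in the text (e.g., the padding of the ConvLSTM) are assumptions chosen to match the reported output shapes.</p>
        <preformat># Reconstruction of the two proposed models: a frozen, time-distributed
# MobileNetV2 feature extractor followed by a Bi-LSTM or a ConvLSTM and
# fully connected layers, as listed in Tables 1 and 2.
import tensorflow as tf
from tensorflow.keras import layers, models

def frozen_mobilenetv2(size=224):
    base = tf.keras.applications.MobileNetV2(
        include_top=False, weights="imagenet", input_shape=(size, size, 3))
    base.trainable = False  # freeze the ImageNet weights
    return base

def build_bilstm_model(frames=16, size=224):
    inputs = layers.Input(shape=(frames, size, size, 3))
    x = layers.TimeDistributed(frozen_mobilenetv2(size))(inputs)  # (16, 7, 7, 1280)
    x = layers.TimeDistributed(layers.Flatten())(x)               # (16, 62720)
    x = layers.Bidirectional(layers.LSTM(128))(x)                 # (256)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return models.Model(inputs, outputs)

def build_convlstm_model(frames=16, size=224):
    inputs = layers.Input(shape=(frames, size, size, 3))
    x = layers.TimeDistributed(frozen_mobilenetv2(size))(inputs)         # (16, 7, 7, 1280)
    x = layers.ConvLSTM2D(64, kernel_size=(3, 3), activation="tanh")(x)  # (5, 5, 64)
    x = layers.Flatten()(x)                                              # (1600)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return models.Model(inputs, outputs)</preformat>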
        <sec id="sec-2-3-1">
          <title>3.3. Used Dataset</title>
          <p>layers, together with the fully connected layers, needed
To test the performance of the proposed classifiers and to be trained from scratch. Therefore, to run the training
compare them to our previous work, we run accuracy and test on the AIRTLab dataset, we applied a stratified
tests on the AIRTLab dataset. It contains 350 videos (MP4 shufle split cross-validation scheme. To this end, we
ifles with H.264 codec, mean length of 5.63 seconds). The repeated a randomized 80-20 split 5 times, using the 80%
frame rate is 30 fps and the frame resolution is 1920 of the data as the training set, and the 20% as the test set,
x 1080 pixels. The dataset includes 230 violent videos preserving the percentage of samples from each class, in
and 120 non-violent videos. The 230 violent videos rep- each split. The data splits were the same for both the
resent 115 violent actions recorded from two diferent proposed models and for the models of our previous work,
cameras placed into two diferent spots. Similarly, the to implement a fair comparison. Given that the inputs
120 non-violent videos represent 60 non-violent actions, for the models are sequences composed of 16 frames and
recorded from two diferent cameras placed into two dif- the videos in the dataset include a total of 3537 of such
ferent spots. All the videos were taken inside the same sequences, 2829 samples (i.e., 16-frames chunks) were
room. One camera was placed in the top left corner in used for training, and 708 for testing, in each split. The
front of the room door. The second camera was in the 12.5% of the training data i.e., the 10% of the entire dataset,
top right corner on the door side. was used as validation data.</p>
          <p>A group of non-professional actors played the violent Both the proposed models used the Binary
Crossand non-violent actions. The number of actors varied Entropy loss function, minimized with the Adam
opfrom 2 to 4 per video. In the violent videos, the actors timizer. We early stopped the training after 5 epochs
simulated actions frequent in scufles, such as punches, without any improvement on the minimum validation
kicks, beating with canes, slapping, gun shots, and stab- loss, restoring the weights corresponding to the best
bing. In the non-violent videos, the actors simulated epoch. To this end, Table 3 lists the number of training
actions which can result in false positives due to the sim- epochs in each split of the stratified shufle split cross
ilarity with violent actions (for example for the presence validation scheme, for each model. The average number
of fast movements). Specifically, the non-violent videos of training epochs was 22.4 (± 5.68) for the model using
contains actions such as exulting, hugging, gesticulating, the Bi-LSTM layer, and 17.8 (± 4.21) for the model based
and clapping and giving high fives. on the ConvLSTM. The batch size was 8 for both neural
networks.</p>
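        <p>As an illustration of the input preparation implied by Sections 3.2 and 3.3, the following sketch reads a video, resizes each frame to 224 x 224 pixels, and groups the frames into 16-frames chunks. The use of OpenCV and the function name are assumptions, as the paper does not detail the preprocessing code.</p>
        <preformat># Split a video into resized 16-frames chunks (sketch, assuming OpenCV).
import cv2
import numpy as np

def video_to_chunks(path, chunk_len=16, size=224):
    capture = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (size, size)))
    capture.release()
    n_chunks = len(frames) // chunk_len           # keep complete chunks only
    frames = np.asarray(frames[: n_chunks * chunk_len], dtype=np.float32)
    return frames.reshape(n_chunks, chunk_len, size, size, 3)</preformat>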
          <p>The tests ran on Google Colab Pro with the GPU
run4. Results and Discussion time (the GPU used for the tests was a Nvidia A100 SXM4
with 40 GB of RAM) and extended RAM (83.5 GB), using
Keras 2.11.0, TensorFlow 2.11.0, and Scikit-learn 1.2.1.</p>
          <p>Labeling as negative the 16-frames chunks of the
nonviolent videos and as positive the chunks of the violent
videos, we computed the following metrics over the test
set in each split of the stratified shufle split cross
validation scheme:
We tested the two proposed models with the same
protocol used in our previous work [13] i.e., by measuring
the classification results over the AIRTLab dataset. The
objective is to compare the accuracy performance of the
classifiers based on a 2D CNN designed for mobile and
embedded devices with those of classifiers requiring more
resources. Therefore, in the following subsections, we
describe the experimental protocol (4.1), discuss the
results (4.2), and present the limitations of our evaluation
(4.3).</p>
        </sec>
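        <p>A minimal scikit-learn sketch of this split scheme is shown below; the label array and the validation hold-out are placeholders illustrating the proportions stated above, not the exact splitting code used for the experiments.</p>
        <preformat># Five randomized, stratified 80-20 splits of the 3537 16-frames chunks.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

chunk_ids = np.arange(3537)                     # one id per 16-frames chunk
labels = np.random.randint(0, 2, size=3537)     # placeholder: 1 = violent

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
for split_id, (train_idx, test_idx) in enumerate(splitter.split(chunk_ids, labels), 1):
    # 12.5% of the training data (10% of the dataset) is kept for validation.
    n_val = int(0.125 * len(train_idx))
    val_idx, train_idx = train_idx[:n_val], train_idx[n_val:]
    print(split_id, len(train_idx), len(val_idx), len(test_idx))</preformat>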
        <sec id="sec-2-3-2">
          <title>4.1. Experimental Protocol and</title>
        </sec>
        <sec id="sec-2-3-3">
          <title>Evaluation Metrics</title>
          <p>Whereas MobileNetV2 was pre-trained on Imagenet and
its weights were freezed, the Bi-LSTM and ConvLSTM
• Sensitivity (True Positive Rate – TPR) i.e., the
portion of positives that are correctly identified
(over all the available positives).
• Specificity (True Negative Rate – TNR) i.e., the
portion of negatives that are correctly identified
(over all the available negatives).
• Accuracy i.e., the portion of samples that are
cor</p>
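        <p>The following sketch shows the corresponding Keras training configuration; the function arguments are placeholders for the model and the data of a split, and the epoch cap is an assumption, since training is actually ended by the early stopping.</p>
        <preformat># Training configuration: binary cross-entropy, Adam, early stopping with
# patience 5 restoring the best weights, batch size 8.
import tensorflow as tf

def train_model(model, x_train, y_train, x_val, y_val):
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=5, restore_best_weights=True)
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model.fit(x_train, y_train,
                     validation_data=(x_val, y_val),
                     epochs=100,          # upper bound; early stopping decides
                     batch_size=8,
                     callbacks=[early_stop])</preformat>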
          <p>rectly identified (over all the available samples).
• F1 score i.e., the armonic mean of precision (the
ratio between the positives correctly identified
and all the identified positives) and sensitivity.</p>
          <p>in Figure 2. In fact, the model using the Bi-LSTM as the
recurrent layer scores an average AUC equal to 94.38%
These metrics can be formulated in terms of true positives (± 2.98%), whereas the model using the ConvLSTM gets
(TP), true negatives (TN), false positives (FP), and false 98.26% (± 0.46%). This behavior might be due to the
negatives (FN) according to the following equations: diferent number of trainable parameters of the two
models. In the Bi-LSTM-based model there are 64,390,401
  trainable parameters. Instead, in the ConvLSTM-based
 =   +   (11) model, the number of trainable parameters is 3,506,945.</p>
          <p>As such, the Bi-LSTM-based model might be oversized
  =   +   (12) for the violence detection task on the AIRTLab dataset,
  +   struggling to converge to an acceptable classification
per = (13) formance. Therefore, the ConvLSTM-based model, that
  +   +   +   is the lightest in terms of required resources between the
1  =   (14) two proposed in this work, exhibits a better performance
  + 12 (  +   ) in terms of classification accuracy and generalization
Moreover, in each split, we computed the Receiver Oper- capability.
ating Characteristic (ROC) curve and the Area Under the Table 6 compares the performance of the two models
Curve (AUC), showing the TPR against the False Positive proposed in this paper with those based on C3D tested
Rate (FPR) when the classification threshold varies, to in our previous work. Even if lighter in terms of
reunderstand the diagnostic capability of each model. quired computational resources, the model based on
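        <p>A compact way to compute these metrics for a split is sketched below; the variable names are placeholders, with y_true holding the chunk labels and y_prob the sigmoid outputs of a model on the test set.</p>
        <preformat># Per-split metrics of Eqs. (11)-(14) plus the ROC AUC.
from sklearn.metrics import confusion_matrix, roc_auc_score

def split_metrics(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "sensitivity": tp / (tp + fn),                  # Eq. (11)
        "specificity": tn / (tn + fp),                  # Eq. (12)
        "accuracy": (tp + tn) / (tp + tn + fp + fn),    # Eq. (13)
        "f1": tp / (tp + 0.5 * (fp + fn)),              # Eq. (14)
        "auc": roc_auc_score(y_true, y_prob),
    }</preformat>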
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>Table 4 lists the metrics obtained by the model composed of MobileNetV2 and the Bi-LSTM over the five splits of the cross-validation performed on the AIRTLab dataset. The metrics significantly vary across the splits, showing a poor generalization capability. For example, in split 1, all the 708 samples of the test set are labeled as violent, causing 232 false positives. As such, the sensitivity is 100% whereas the specificity is 0%. The split where most of the negatives are correctly identified is the number 2: here, 204 negatives out of 232 are correctly classified (specificity 87.93%). In the same split, 433 violent chunks out of 476 are correctly classified. As such, the accuracy is 92.8%.</p>
        <p>Instead, the model based on MobileNetV2 and the ConvLSTM exhibits a better generalization capability than the previous one, as shown in Table 5. The sensitivity is greater than 94% across all the splits, and the lowest specificity is in split 3 (85.34%). The best split is the number 5, where the F1 score is 96.42%.</p>
        <p>The difference in the generalization capability of the two proposed models is highlighted by the ROC curves in Figure 2. In fact, the model using the Bi-LSTM as the recurrent layer scores an average AUC equal to 94.38% (± 2.98%), whereas the model using the ConvLSTM gets 98.26% (± 0.46%). This behavior might be due to the different number of trainable parameters of the two models. In the Bi-LSTM-based model there are 64,390,401 trainable parameters. Instead, in the ConvLSTM-based model, the number of trainable parameters is 3,506,945. As such, the Bi-LSTM-based model might be oversized for the violence detection task on the AIRTLab dataset, struggling to converge to an acceptable classification performance. Therefore, the ConvLSTM-based model, which is the lightest in terms of required resources between the two proposed in this work, exhibits a better performance in terms of classification accuracy and generalization capability.</p>
        <p>Table 6 compares the performance of the two models proposed in this paper with those based on C3D tested in our previous work. Even if lighter in terms of required computational resources, the model based on MobileNetV2 and the ConvLSTM gets an average AUC of 98%, against the 99% of the C3D-based models. The average accuracy and F1 score of the ConvLSTM-based model are 94.1% (± 0.91%) and 95.62% (± 0.67%), being only around 2% lower than those of the C3D + SVM model of our previous work. Therefore, limited resources such as those of mobile or embedded devices might justify the use of MobileNetV2 combined with the ConvLSTM, as the decrease in the accuracy metrics is limited.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Limitations</title>
        <p>The results of the research described in this paper are promising, but include some limitations. In fact, we focused on the accuracy of two models based on MobileNetV2, which is designed for mobile or embedded devices. Nevertheless, we ran our comparative tests in the cloud, using a GPU. Whereas the decrease in accuracy is limited and justifies the use of the best of the proposed models, tests on real mobile or embedded devices, i.e., at the edge, are needed to get more general conclusions. Moreover, our tests are based on a dataset of videos where the violence is simulated by actors. Tests on videos from real surveillance cameras are needed to confirm the accuracy results.</p>
        <p>In addition, we collected the metrics on 16-frames chunks taken from the short videos of the AIRTLab dataset (the average length is 5.6 seconds), to make this work comparable with our previous research. Whereas most of the related literature performs tests on short videos, the accuracy on full-length, real videos should be evaluated. Indeed, using the short chunks of frames taken from long videos as in our study might result in too many false positives. Thus, results on the chunks should be merged together with a proper strategy to maximize the accuracy on full-length videos. To this end, a simple solution is labeling a part of a long video as violent only when a fixed number of consecutive 16-frames chunks are labeled as violent, as in the sketch below.</p>
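        <p>A minimal sketch of such a merging strategy follows; the number of consecutive chunks required is a placeholder value, not a threshold evaluated in this paper.</p>
        <preformat># Label a long video as violent only when at least min_consecutive
# consecutive 16-frames chunks are classified as violent.
def is_violent_video(chunk_predictions, min_consecutive=3):
    consecutive = 0
    for chunk_is_violent in chunk_predictions:   # one boolean per chunk
        consecutive = consecutive + 1 if chunk_is_violent else 0
        if consecutive >= min_consecutive:
            return True
    return False</preformat>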
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>To be used in real applications, Artificial Intelligence and Deep Learning-based techniques need to take into account real-time performance and be capable of running in mobile and embedded devices, in an edge computing fashion. In fact, an intelligent answer preserves its importance only if given in time, as remarked in [36]. Hence, in this paper, we proposed two Deep Neural Networks for the classification of videos into violent or not. Both networks are based on MobileNetV2, a CNN specifically designed for mobile and embedded devices. Such a CNN is responsible for the extraction of the spatial features in the videos. We combined MobileNetV2 with a recurrent layer for the extraction of the temporal features as well. One of the two proposed models uses a Bi-LSTM layer as the recurrent module. Instead, the other uses a ConvLSTM.</p>
      <p>We ran comparative tests on the AIRTLab dataset. The model using the ConvLSTM, the lightest in terms of required computational and memory resources between the two proposed in this paper, got the best accuracy, with an average AUC equal to 98.26% (± 0.46%). Compared to the models of our previous work, based on a 3D CNN, the decrease of performance in terms of AUC is around 1%, and around 2% in terms of classification accuracy over the splits of the AIRTLab dataset. Such results encourage the use of mobile models for embedded devices. For example, this might be useful to process data directly near the camera that is recording the security video and, thus, preserve the privacy while addressing public security.</p>
      <p>Future works will address the identified limitations. In particular, tests on real mobile or embedded devices need to be performed to get more conclusive and general results.</p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <title>The presented research has been part of the Memoran</title>
        <p>dum of Understanding between the Università
Politecnica delle Marche, Centro “CARMELO” and the
Ministero dell’Interno, Dipartimento di Pubblica Sicurezza,
Direzione Centrale Anticrimine della Polizia di Stato.
might be useful to process data directly near the camera
that is recording the security video and, thus, preserve
the privacy while addressing public security.</p>
        <p>Future works will address the identified limitations.</p>
        <p>In particular, tests on real mobile or embedded devices
need to be performed to get more conclusive and general
results.
[17] W. Niu, M. Sun, Z. Li, J.-A. Chen, J. Guan, X. Shen, for violence detection, Machine Vision and
ApplicaY. Wang, S. Liu, X. Lin, B. Ren, RT3D: Achieving tions 33 (2022) 1–13.
doi:10.1007/s00138-021real-time execution of 3D convolutional neural net- 01264-9.
works on mobile devices, Proceedings of the AAAI [27] K. Simonyan, A. Zisserman, Very deep
convoluConference on Artificial Intelligence 35 (2021) 9179– tional networks for large-scale image recognition,
9187. doi:10.1609/aaai.v35i10.17108. CoRR abs/1409.1556 (2015). URL: https://arxiv.org/
[18] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, abs/1409.1556.</p>
        <p>L.-C. Chen, Mobilenetv2: Inverted residuals and [28] L. Ciampi, P. Foszner, N. Messina, M. Staniszewski,
linear bottlenecks, in: 2018 IEEE/CVF Conference C. Gennaro, F. Falchi, G. Serao, M. Cogiel, D. Golba,
on Computer Vision and Pattern Recognition, 2018, A. Szczęsna, G. Amato, Bus violence: An open
pp. 4510–4520. doi:10.1109/CVPR.2018.00474. benchmark for video violence detection on
pub[19] M. Bianculli, N. Falcionelli, P. Sernani, S. Tomassini, lic transport, Sensors 22 (2022). doi:10.3390/
P. Contardo, M. Lombardi, A. F. Dragoni, A dataset s22218345.
for automatic violence detection in videos, Data in [29] V. E. D. S. Silva, T. B. Lacerda, P. B. Miranda, A. C.
Brief 33 (2020) 106587. doi:10.1016/j.dib.2020. Nascimento, A. P. C. Furtado, Federated learning
106587. for physical violence detection in videos, in: 2022
[20] O. Russakovsky, J. Deng, H. Su, J. Krause, International Joint Conference on Neural Networks
S. Satheesh, S. Ma, Z. Huang, A. Karpathy, (IJCNN), 2022, pp. 1–8. doi:10.1109/IJCNN55064.
A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, Im- 2022.9892150.
ageNet Large Scale Visual Recognition Challenge, [30] L. Yang, Z. Wu, J. Hong, J. Long, MCL: A contrastive
International Journal of Computer Vision (IJCV) learning method for multimodal data fusion in
vi115 (2015) 211–252. doi:10.1007/s11263-015- olence detection, IEEE Signal Processing Letters
0816-y. (2022) 1–5. doi:10.1109/LSP.2022.3227818.
[21] S. Sudhakaran, O. Lanz, Learning to detect violent [31] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, Z. Yang,
videos using convolutional long short-term mem- Not only look, but also listen: Learning
multiory, in: 2017 14th IEEE International Conference modal violence detection under weak supervision,
on Advanced Video and Signal Based Surveillance in: Computer Vision – ECCV 2020, Springer
In(AVSS), 2017, pp. 1–6. doi:10.1109/AVSS.2017. ternational Publishing, Cham, 2020, pp. 322–339.
8078468. doi:10.1007/978-3-030-58577-8_20.
[22] E. Bermejo Nievas, O. Deniz Suarez, G. Bueno Gar- [32] Y. LeCun, Y. Bengio, Convolutional Networks for
cía, R. Sukthankar, Violence detection in video us- Images, Speech, and Time Series, MIT Press,
Caming computer vision techniques, in: P. Real, D. Diaz- bridge, MA, USA, 1998, p. 255–258.
Pernil, H. Molina-Abril, A. Berciano, W. Kropatsch [33] S. Hochreiter, J. Schmidhuber, Long short-term
(Eds.), Computer Analysis of Images and Pat- memory, Neural Computation 9 (1997) 1735–1780.
terns, Springer Berlin Heidelberg, Berlin, Hei- doi:10.1162/neco.1997.9.8.1735.
delberg, 2011, pp. 332–339. doi:10.1007/978-3- [34] A. Graves, N. Jaitly, A. Mohamed, Hybrid speech
642-23678-5_39. recognition with deep bidirectional LSTM, in:
[23] J. Li, X. Jiang, T. Sun, K. Xu, Eficient violence de- 2013 IEEE Workshop on Automatic Speech
Recognitection using 3D convolutional neural networks, tion and Understanding, 2013, pp. 273–278. doi:10.
in: 2019 16th IEEE International Conference on 1109/ASRU.2013.6707742.</p>
        <p>Advanced Video and Signal Based Surveillance [35] M. Schuster, K. K. Paliwal, Bidirectional
recur(AVSS), 2019, pp. 1–8. doi:10.1109/AVSS.2019. rent neural networks, IEEE Transactions on Signal
8909883. Processing 45 (1997) 2673–2681. doi:10.1109/78.
[24] S. Accattoli, P. Sernani, N. Falcionelli, D. N. Mekuria, 650093.</p>
        <p>A. F. Dragoni, Violence detection in videos by com- [36] A. F. Dragoni, P. Sernani, D. Calvaresi, When
rabining 3D convolutional neural networks and sup- tionality entered time and became real agent in a
port vector machines, Applied Artificial Intelli- cyber-society, in: Proceedings of the 3rd
Internagence 34 (2020) 329–344. doi:10.1080/08839514. tional Conference on Recent Trends and
Applica2020.1723876. tions in Computer Science and Information
Tech[25] F. U. M. Ullah, A. Ullah, K. Muhammad, I. U. Haq, nology, volume 2280 of CEUR Workshop
ProceedS. W. Baik, Violence detection using spatiotempo- ings, 2018, pp. 167–171. URL: http://ceur-ws.org/
ral features with 3D convolutional neural network, Vol-2280/paper-24.pdf.</p>
        <p>Sensors 19 (2019) 2472. doi:10.3390/s19112472.
[26] D. Freire-Obregón, P. Barra, M. Castrillón-Santana,</p>
        <p>M. D. Marsico, Inflated 3d convnet context analysis</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref17"><label>[17]</label><mixed-citation>W. Niu, M. Sun, Z. Li, J.-A. Chen, J. Guan, X. Shen, Y. Wang, S. Liu, X. Lin, B. Ren, RT3D: Achieving real-time execution of 3D convolutional neural networks on mobile devices, Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021) 9179–9187. doi:10.1609/aaai.v35i10.17108.</mixed-citation></ref>
      <ref id="ref18"><label>[18]</label><mixed-citation>M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, MobileNetV2: Inverted residuals and linear bottlenecks, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520. doi:10.1109/CVPR.2018.00474.</mixed-citation></ref>
      <ref id="ref19"><label>[19]</label><mixed-citation>M. Bianculli, N. Falcionelli, P. Sernani, S. Tomassini, P. Contardo, M. Lombardi, A. F. Dragoni, A dataset for automatic violence detection in videos, Data in Brief 33 (2020) 106587. doi:10.1016/j.dib.2020.106587.</mixed-citation></ref>
      <ref id="ref20"><label>[20]</label><mixed-citation>O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV) 115 (2015) 211–252. doi:10.1007/s11263-015-0816-y.</mixed-citation></ref>
      <ref id="ref21"><label>[21]</label><mixed-citation>S. Sudhakaran, O. Lanz, Learning to detect violent videos using convolutional long short-term memory, in: 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2017, pp. 1–6. doi:10.1109/AVSS.2017.8078468.</mixed-citation></ref>
      <ref id="ref22"><label>[22]</label><mixed-citation>E. Bermejo Nievas, O. Deniz Suarez, G. Bueno García, R. Sukthankar, Violence detection in video using computer vision techniques, in: P. Real, D. Diaz-Pernil, H. Molina-Abril, A. Berciano, W. Kropatsch (Eds.), Computer Analysis of Images and Patterns, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 332–339. doi:10.1007/978-3-642-23678-5_39.</mixed-citation></ref>
      <ref id="ref23"><label>[23]</label><mixed-citation>J. Li, X. Jiang, T. Sun, K. Xu, Efficient violence detection using 3D convolutional neural networks, in: 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2019, pp. 1–8. doi:10.1109/AVSS.2019.8909883.</mixed-citation></ref>
      <ref id="ref24"><label>[24]</label><mixed-citation>S. Accattoli, P. Sernani, N. Falcionelli, D. N. Mekuria, A. F. Dragoni, Violence detection in videos by combining 3D convolutional neural networks and support vector machines, Applied Artificial Intelligence 34 (2020) 329–344. doi:10.1080/08839514.2020.1723876.</mixed-citation></ref>
      <ref id="ref25"><label>[25]</label><mixed-citation>F. U. M. Ullah, A. Ullah, K. Muhammad, I. U. Haq, S. W. Baik, Violence detection using spatiotemporal features with 3D convolutional neural network, Sensors 19 (2019) 2472. doi:10.3390/s19112472.</mixed-citation></ref>
      <ref id="ref26"><label>[26]</label><mixed-citation>D. Freire-Obregón, P. Barra, M. Castrillón-Santana, M. D. Marsico, Inflated 3D ConvNet context analysis for violence detection, Machine Vision and Applications 33 (2022) 1–13. doi:10.1007/s00138-021-01264-9.</mixed-citation></ref>
      <ref id="ref27"><label>[27]</label><mixed-citation>K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1556 (2015). URL: https://arxiv.org/abs/1409.1556.</mixed-citation></ref>
      <ref id="ref28"><label>[28]</label><mixed-citation>L. Ciampi, P. Foszner, N. Messina, M. Staniszewski, C. Gennaro, F. Falchi, G. Serao, M. Cogiel, D. Golba, A. Szczęsna, G. Amato, Bus violence: An open benchmark for video violence detection on public transport, Sensors 22 (2022). doi:10.3390/s22218345.</mixed-citation></ref>
      <ref id="ref29"><label>[29]</label><mixed-citation>V. E. D. S. Silva, T. B. Lacerda, P. B. Miranda, A. C. Nascimento, A. P. C. Furtado, Federated learning for physical violence detection in videos, in: 2022 International Joint Conference on Neural Networks (IJCNN), 2022, pp. 1–8. doi:10.1109/IJCNN55064.2022.9892150.</mixed-citation></ref>
      <ref id="ref30"><label>[30]</label><mixed-citation>L. Yang, Z. Wu, J. Hong, J. Long, MCL: A contrastive learning method for multimodal data fusion in violence detection, IEEE Signal Processing Letters (2022) 1–5. doi:10.1109/LSP.2022.3227818.</mixed-citation></ref>
      <ref id="ref31"><label>[31]</label><mixed-citation>P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, Z. Yang, Not only look, but also listen: Learning multimodal violence detection under weak supervision, in: Computer Vision – ECCV 2020, Springer International Publishing, Cham, 2020, pp. 322–339. doi:10.1007/978-3-030-58577-8_20.</mixed-citation></ref>
      <ref id="ref32"><label>[32]</label><mixed-citation>Y. LeCun, Y. Bengio, Convolutional Networks for Images, Speech, and Time Series, MIT Press, Cambridge, MA, USA, 1998, pp. 255–258.</mixed-citation></ref>
      <ref id="ref33"><label>[33]</label><mixed-citation>S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780. doi:10.1162/neco.1997.9.8.1735.</mixed-citation></ref>
      <ref id="ref34"><label>[34]</label><mixed-citation>A. Graves, N. Jaitly, A. Mohamed, Hybrid speech recognition with deep bidirectional LSTM, in: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 2013, pp. 273–278. doi:10.1109/ASRU.2013.6707742.</mixed-citation></ref>
      <ref id="ref35"><label>[35]</label><mixed-citation>M. Schuster, K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing 45 (1997) 2673–2681. doi:10.1109/78.650093.</mixed-citation></ref>
      <ref id="ref36"><label>[36]</label><mixed-citation>A. F. Dragoni, P. Sernani, D. Calvaresi, When rationality entered time and became a real agent in a cyber-society, in: Proceedings of the 3rd International Conference on Recent Trends and Applications in Computer Science and Information Technology, volume 2280 of CEUR Workshop Proceedings, 2018, pp. 167–171. URL: http://ceur-ws.org/Vol-2280/paper-24.pdf.</mixed-citation></ref>
    </ref-list>
  </back>
</article>