<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Conference on Digital Technologies in Education, Science and
Industry, December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Violence Recognition in Surveillance Videos with Dense Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Askarbek Assubayev</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aizhan Altaibek</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marat Nurtas</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aigerim Altayeva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Al-Farabi Kazakh National University</institution>
          ,
          <addr-line>al-Farabi Avenue 71, Almaty, 050040</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Ionosphere</institution>
          ,
          <addr-line>Gardening community IONOSPHERE 117, Almaty, 050020</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>International Information Technology University</institution>
          ,
          <addr-line>Manas St. 34/1, Almaty, 050000</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>0</volume>
      <fpage>6</fpage>
      <lpage>07</lpage>
      <abstract>
        <p>Security camera-based surveillance systems are becoming the main tool for safety monitoring in public settings, utilizing computer vision and machine learning techniques for diverse applications. In this paper, we propose a novel architecture, called DenseNet-LSTM, for violence detection in video data from surveillance cameras. The model uses DenseNet-121 for spatial feature extraction, followed by a convolutional LSTM for temporal feature extraction and classification, and demonstrates efficient computational performance while achieving good results. The model is evaluated on a diverse set of violent action datasets, including the Hockey, Movies, Violent Flow, and Real-Life Violence Situations (RLVS) datasets. DenseNet-LSTM delivers near state-of-the-art accuracy on most of them while requiring low computational time.</p>
      </abstract>
      <kwd-group>
        <kwd>Violence recognition</kwd>
        <kwd>surveillance</kwd>
        <kwd>deep learning</kwd>
        <kwd>computer vision</kwd>
        <kwd>DenseNet</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Video surveillance, commonly referred to as CCTV (Closed Circuit Television) [4], has become an integral tool for enhancing security and monitoring in various environments. In urban settings such as cities and town centres, video surveillance is deployed to deter criminal activities, including theft, vandalism, and public disturbances. The presence of visible cameras acts as a powerful deterrent, discouraging potential wrongdoers from engaging in illicit activities [2]. In commercial establishments like malls, banks, and retail stores, video surveillance serves both as a proactive security measure and a monitoring tool. It helps in safeguarding assets, ensuring the safety of employees and customers, and investigating any incidents that occur within the premises. Video surveillance is thus a versatile and indispensable technology, playing a vital role in maintaining security and offering valuable insights for effective monitoring across a wide range of environments. Data captured by video surveillance systems can serve as a powerful tool to detect and prevent violent events. Nonetheless, many individuals perceive video surveillance as an intrusion into their privacy [1]. There is apprehension regarding the potential uses or abuses of video recordings, particularly with advancements in contemporary image-processing technologies.</p>
      <p>This research work proposes a lightweight, efficient deep-learning method to classify violent and non-violent actions in security camera footage. The primary objective is to conduct an in-depth exploration and analysis of the DenseNet [3] network for spatial feature extraction. This entails a comprehensive examination of the network's capabilities and characteristics in the context of spatial feature extraction. Additionally, the research aims to identify and propose optimal solutions for seamlessly integrating the spatial feature extraction model with a temporal feature extraction model, achieving an efficient and effective fusion of spatial and temporal information for subsequent violent event recognition.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed method</title>
      <p>The main goal of the proposed method is to conduct an analysis of the DenseNet architecture for video classification that is comparable in efficiency with state-of-the-art models while maintaining a low classification time per frame.</p>
      <p>The algorithm under consideration primarily consists of three main stages:
1. Spatial feature extractor
2. Temporal feature extractor
3. Classifier</p>
      <p>Fig. 1 shows the architecture of the proposed method. First, pre-processing steps are applied to the input video frames. Next, two consecutive feature extraction stages are applied: a DenseNet-121 stage responsible for spatial feature extraction from each frame, and a ConvLSTM [5] stage that works as a temporal feature extractor. Finally, the extracted features are fed to Dense layers for classification.</p>
      <p>DenseNet is a feed-forward Convolutional Neural Network (CNN) [6] architecture that connects each layer to every other layer. The DenseNet design is founded on a straightforward principle: by concatenating the feature maps of all previous layers, a dense block [3] allows each layer to access the features of all preceding levels, whereas in classic CNNs each layer only has access to the features of the layer immediately before it. The architecture is arranged as dense blocks separated by transition layers. Inside a dense block, each convolutional layer is connected to every other layer within the block, while the transition layers shrink the feature maps between dense blocks, letting the network grow effectively.</p>
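      <p>To make this connectivity pattern concrete, the following is a minimal Keras sketch of one dense block and one transition layer; the layer count, growth rate, and compression factor here are illustrative values, not the exact DenseNet-121 configuration.</p>
      <preformat>
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=32):
    # Each layer receives the concatenation of the feature maps of
    # all preceding layers within the block (dense connectivity).
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.ReLU()(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)
        x = layers.Concatenate()([x, y])   # concatenate, not add
    return x

def transition_layer(x, compression=0.5):
    # Transition layers shrink the feature maps between dense blocks,
    # keeping the network's growth under control.
    channels = int(x.shape[-1] * compression)
    x = layers.Conv2D(channels, 1)(x)
    return layers.AveragePooling2D(2)(x)
      </preformat>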
      <p>Figure 2 in the paper Densely Connected Convolutional Networks [3] shows a comparison between a traditional convolutional network and the DenseNet architecture. In the DenseNet architecture, each layer is connected to all subsequent layers. This dense connectivity pattern allows a better flow of information and gradients throughout the network, making it easier to train. Each layer has direct access to the gradients from the loss function and to the original input signal, leading to implicit deep supervision, which helps in training deeper network architectures. Further, dense connections have a regularizing effect, which reduces overfitting on tasks with smaller training set sizes.</p>
      <p>The benefits of using DenseNet as a feature extractor for computer vision tasks are that
DenseNets naturally integrate the properties of identity mappings, deep supervision, and
diversified depth. They allow feature reuse throughout the networks and can consequently learn
more compact and, according to the experiments conducted by the authors, more accurate
models. Because of their compact internal representations and reduced feature redundancy,
DenseNets are a good feature extractor for violence recognition.</p>
      <p>ConvLSTM [5] is a type of neural network architecture that extends the idea of LSTM (Long Short-Term Memory) by introducing convolutional structures in both the input-to-state and state-to-state transitions. This allows the network to capture spatiotemporal correlations in the input data, making it particularly useful for spatiotemporal sequence forecasting problems such as precipitation nowcasting. The ConvLSTM layer preserves the advantages of FC-LSTM while being suitable for spatiotemporal data thanks to its inherent convolutional structure. By stacking multiple ConvLSTM layers and forming an encoding-forecasting structure, the network can learn complex spatiotemporal patterns in the dataset through its nonlinear and convolutional structure. In the original application, this allows the network to handle the strong spatial correlation in radar maps and discover sudden changes in the encoding network, which is difficult to achieve with other methods such as optical flow and semi-Lagrangian advection-based methods.</p>
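      <p>As a minimal illustration of this idea, the Keras sketch below stacks two ConvLSTM2D layers over a 5-dimensional input of shape (batch, time, height, width, channels); all sizes here are illustrative.</p>
      <preformat>
import tensorflow as tf
from tensorflow.keras import layers

# A 20-step sequence of 64x64 single-channel maps.
inputs = tf.keras.Input(shape=(20, 64, 64, 1))

# return_sequences=True keeps the full time dimension, so another
# ConvLSTM layer can be stacked on top (the encoding structure).
x = layers.ConvLSTM2D(filters=32, kernel_size=3, padding="same",
                      return_sequences=True)(inputs)
# The second layer returns only the final hidden state.
x = layers.ConvLSTM2D(filters=32, kernel_size=3, padding="same")(x)

model = tf.keras.Model(inputs, x)   # output shape: (batch, 64, 64, 32)
      </preformat>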
      <p>The sequence-to-sequence learning [7] framework can be applied to a wide range of spatiotemporal sequence forecasting problems, including video classification and action recognition in computer vision. The ConvLSTM network can be used as a building block in this framework to extract temporal features from the input sequence, which can then be fed into other layers of the network for further processing. By stacking multiple ConvLSTM layers and forming an encoding-forecasting structure, the network can learn complex spatiotemporal patterns in the dataset and generate accurate predictions.</p>
      <sec id="sec-2-1">
        <title>2.1. Data Preprocessing</title>
        <p>The videos from the dataset are input with shape (V, F, 128, 128, 3), where V is the number of videos, F is the number of frames per video, (128, 128) is the frame height and width, and 3 is the number of colour channels. The input videos are processed in the following pipeline: every 3rd frame is skipped to lower the number of duplicate frames; each frame is resized to a (128, 128, 3) shape; data augmentation consists of converting frames to grayscale and vertical flipping. Thereafter, the dataset is split into 80% for the training set and 20% for the validation set.</p>
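        <p>A minimal sketch of this pipeline using OpenCV is shown below; the sequence length and the helper names are illustrative, since the text does not fix them per dataset.</p>
        <preformat>
import cv2
import numpy as np

SEQUENCE_LENGTH = 25        # frames kept per video (dataset-dependent)
FRAME_SIZE = (128, 128)

def extract_frames(video_path):
    # Decode a video, skip every 3rd frame, resize, and scale to [0, 1].
    capture = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok or len(frames) == SEQUENCE_LENGTH:
            break
        index += 1
        if index % 3 == 0:          # drop every 3rd frame
            continue
        frame = cv2.resize(frame, FRAME_SIZE)
        frames.append(frame.astype(np.float32) / 255.0)
    capture.release()
    return np.asarray(frames)       # shape: (F, 128, 128, 3)

def flip_vertically(frames):
    # Augmentation: flip along the height axis; grayscale conversion
    # is applied analogously with cv2.cvtColor.
    return frames[:, ::-1, :, :]
        </preformat>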
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Feature extraction</title>
        <p>In this stage, two different feature types are extracted consecutively. First, spatial features are extracted by a DenseNet-121 pre-trained on the ImageNet dataset, with a global average pooling (GAP) layer applied at its end. Global average pooling reduces the spatial dimensions of the feature maps to a 1x1xN tensor, where N is the number of channels; this operation summarizes the spatial features and creates a compact representation. This step contributes 7,037,504 trainable parameters and produces features of shape (V, F, 4, 4, 1024), where 1024 is the number of feature channels extracted from each frame. Because violent actions performed by humans unfold over time across the sequence of video frames, a second type of feature set is required.</p>
        <p>Thus, the fresh set of frame features is forwarded to the second stage to extract temporal features. In this stage, ConvLSTM is utilized to capture sequential information along the video frames. The setup uses a ConvLSTM2D layer with 64 filters as the dimensionality of its output space, and a Flatten layer is used after the ConvLSTM2D.</p>
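        <p>The sketch below assembles the two stages in tf.keras: applying DenseNet-121 per frame via TimeDistributed yields the 4x4x1024 per-frame maps mentioned above, which the ConvLSTM2D stage then consumes. The frame count and any detail not stated in the text are illustrative assumptions.</p>
        <preformat>
import tensorflow as tf
from tensorflow.keras import layers

NUM_FRAMES = 25   # frames per clip; dataset-dependent

# Spatial stage: DenseNet-121 pre-trained on ImageNet, applied per frame.
backbone = tf.keras.applications.DenseNet121(
    include_top=False, weights="imagenet", input_shape=(128, 128, 3))

inputs = tf.keras.Input(shape=(NUM_FRAMES, 128, 128, 3))
x = layers.TimeDistributed(backbone)(inputs)    # (batch, F, 4, 4, 1024)

# Temporal stage: a ConvLSTM2D with 64 filters, then Flatten.
x = layers.ConvLSTM2D(filters=64, kernel_size=3, padding="same")(x)
features = layers.Flatten()(x)

feature_extractor = tf.keras.Model(inputs, features)
        </preformat>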
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Classification</title>
        <p>In this step, Dense layers are used for classification: the first layer has N neurons, the second layer has N neurons, and the last layer contains only 2 neurons for the violence and non-violence classes.</p>
        <p>The first and second layers use the Rectified Linear Unit (ReLU) activation function as in (1), while the last layer uses the softmax activation function as in (2).</p>
        <p>ReLU(x) = max(0, x) (1)</p>
        <p>softmax(z_i) = e^{z_i} / Σ_j e^{z_j} (2)</p>
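        <p>A sketch of this head in tf.keras follows; the hidden width N is not specified in the text, so 128 is used purely as a placeholder.</p>
        <preformat>
from tensorflow.keras import layers

def classification_head(features):
    # Two ReLU layers as in Eq. (1); N = 128 is a placeholder width.
    x = layers.Dense(128, activation="relu")(features)
    x = layers.Dense(128, activation="relu")(x)
    # Softmax over the two classes as in Eq. (2).
    return layers.Dense(2, activation="softmax")(x)
        </preformat>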
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments and results</title>
      <p>The proposed model is evaluated against three state-of-the-art benchmark datasets: the hockey fight [8], violent flow [9], and movie [8] datasets. The Real-Life Violence Situations (RLVS) [10] benchmark is the most diverse in terms of environments and actions; thus, it is used for both testing and fine-tuning of the proposed model.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset preparation</title>
      </sec>
      <sec id="sec-3-2">
        <title>3.1.1. Hockey dataset</title>
      </sec>
      <sec id="sec-3-3">
        <title>3.1.2. Movie dataset</title>
        <p>The Hockey Dataset [8] was assembled from 1000 National Hockey League games. While 500
of them are non-violent 500 of them contain hockey players fighting. All frames from videos were
extracted without pre-processing due to all of them having consistent backgrounds, during
experiments 25 frames were extracted skipping every 3rd frame to reduce duplicate frames.</p>
        <p>The Movie Dataset [8] consists of 200 videos divided into 100 violent and 100 non-violent videos. The violent videos were collected from movie scenes, while the non-violent videos were collected from other actions. Unlike the hockey dataset, the movie dataset exhibits diverse backgrounds. During the experiments conducted, 15 frames were selected from each video.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.1.3. Violent flow dataset</title>
        <p>Violent flow [9] comprises 246 videos depicting crowd scenes featuring fights that occurred
between individuals. These videos were sourced from violent incidents during football matches.
In the performed experiments, 20 frames were utilized as input for our proposed model.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.1.4. Real life violent situations dataset</title>
        <p>The Real-Life Violence Situations [10] benchmark consists of 2000 videos divided into 1000 violence clips and 1000 non-violence clips. The violent clips involve fights in many different environments such as streets, prisons, and schools. The non-violence videos contain other human actions such as playing tennis, football, or basketball, swimming, and eating. In creating the RLVS dataset, some videos were captured manually, while others were sourced from YouTube, to prevent redundancy of persons and environments in the captured videos. Long videos are cut into short clips with a maximum duration of 7 seconds, a minimum duration of 3 seconds, and an average duration of 5 seconds. The segments have relatively high resolution (480p–720p) and include people of various races, ages, and genders in different environments.</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.2. Evaluation parameters</title>
        <p>The main metrics used to determine the best type of LSTM network to combine with DenseNet-121 are accuracy and loss. In future work, we will also use the recall, precision, and F1-score metrics to evaluate the quality of the model's predictions. The proposed method was tested on the Hockey, Movie, Violent Flow, and Real-Life Violence Situations datasets. All models were trained on each dataset using Adam as the optimizer, with the learning rate set to 0.0001, for 20 epochs.</p>
        <p>The experiments performed so far were aimed at determining the type of LSTM network that works best with DenseNet-121. Table 1 shows the accuracy and loss on each dataset.</p>
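        <p>Under the stated settings (Adam, learning rate 0.0001, 20 epochs), a training sketch would look as follows; the loss choice and the dataset objects are assumptions, as the text does not name them, and the model is assembled from the hypothetical feature_extractor and classification_head sketches above.</p>
        <preformat>
import tensorflow as tf

# Continuing from the feature_extractor and classification_head
# sketches above (hypothetical names).
outputs = classification_head(feature_extractor.output)
model = tf.keras.Model(feature_extractor.input, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # lr = 0.0001
    loss="categorical_crossentropy",   # assumed loss for 2-class softmax
    metrics=["accuracy"],
)
# train_ds / val_ds: assumed tf.data pipelines yielding (frames, labels).
history = model.fit(train_ds, validation_data=val_ds, epochs=20)
        </preformat>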
      </sec>
      <sec id="sec-3-7">
        <title>3.3. Results</title>
        <p>Accuracy is the proportion of correctly predicted samples and is the metric associated with the overall or final model performance:</p>
        <p>Accuracy = (TruePositive + TrueNegative) / (TruePositive + TrueNegative + FalsePositive + FalseNegative)</p>
        <p>where TruePositive is a sample correctly predicted as positive, TrueNegative is a sample correctly predicted as negative, FalsePositive is a sample incorrectly labelled positive when it is originally negative, and FalseNegative is a sample incorrectly labelled negative when it is originally positive.</p>
      </sec>
      <sec id="sec-3-8">
        <title>Loss</title>
        <p>A loss function, sometimes called a cost function, considers the probabilities or the level of uncertainty in a prediction by measuring how much the prediction deviates from the actual value. This provides a more detailed assessment of the model's performance. Unlike accuracy, which is expressed as a percentage, loss is the cumulative measure of the errors made on individual samples in the training or validation datasets. Loss is typically employed in the training phase to determine the optimal parameter values for the model, such as the weights in a neural network; in the training process, the objective is to minimize this value.</p>
        <p>Loss = −(1/N) Σ_{i=1}^{N} [ y_i · log2(y_pred_i) + (1 − y_i) · log2(1 − y_pred_i) ]</p>
        <p>where N is the output size (the number of samples), y_i is the true label of sample i, y_pred_i is the predicted value from the model, and the leading minus sign turns the expression into a quantity to be minimized.</p>
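        <p>The formula can be checked numerically with the short numpy sketch below; the base-2 logarithm follows the formula as reconstructed here, whereas most frameworks default to the natural logarithm, and the sample values are illustrative.</p>
        <preformat>
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    eps = 1e-7                              # guard against log2(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log2(y_pred)
                    + (1 - y_true) * np.log2(1 - y_pred))

# Illustrative labels and predicted probabilities.
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.1, 0.8])))
        </preformat>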
        <p>A widely used tool for characterizing the performance of a classification model is the confusion matrix (Figure 4), typically presented as a table. It allows a comparison between predicted and actual values to assess the model's performance.</p>
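        <p>For instance, with scikit-learn the matrix can be built directly from label arrays; the values below are illustrative.</p>
        <preformat>
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0]    # actual labels: 1 = violence, 0 = non-violence
y_pred = [1, 0, 0, 1, 0]    # model predictions

# Rows correspond to actual classes, columns to predicted classes.
print(confusion_matrix(y_true, y_pred))
        </preformat>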
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and future work</title>
      <p>We proposed the DenseNet-LSTM network specifically for violence detection and classification. Because of their reduced feature redundancy and compact internal representations, DenseNets are viable extractors of convolutional features, and transfer of their pre-trained features allows them to be reused across computer vision tasks. Combining them with ConvLSTM therefore allowed us to achieve high accuracy on various datasets. In future work, we will introduce other benchmarks in order to test and analyze more diverse settings and complex actions between humans and potentially dangerous objects.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Acknowledgements</title>
    </sec>
    <sec id="sec-6">
      <title>6. References</title>
      <p>[1] Chiba, N., &amp; Hino, K. (2017). CCTV installation in public areas balancing privacy and security. Reports of the City Planning Institute of Japan, 16(2), 124–128. https://doi.org/10.11361/reportscpij.16.2_124</p>
      <p>[2] Hino, K. (2022). Changes in public attitudes toward CCTV installations in residential areas between 2008 and 2019. Cities, 128, 103810. https://doi.org/10.1016/j.cities.2022.103810</p>
      <p>[3] Huang, G., Liu, Z., Van Der Maaten, L., &amp; Weinberger, K. Q. (2016). Densely connected convolutional networks. arXiv. https://doi.org/10.48550/arxiv.1608.06993</p>
      <p>[4] Piza, E. L., Welsh, B. C., Farrington, D. P., &amp; Thomas, A. L. (2019). CCTV surveillance for crime prevention. Criminology &amp; Public Policy, 18(1), 135–159. https://doi.org/10.1111/1745-9133.12419</p>
      <p>[5] Shi, X., Chen, Z., Wang, H., Yeung, D., Wong, W., &amp; Woo, W. (2015). Convolutional LSTM network: A machine learning approach for precipitation nowcasting. arXiv. https://doi.org/10.48550/arxiv.1506.04214</p>
      <p>[6] Teuwen, J., &amp; Moriakov, N. (2020). Convolutional neural networks. In Elsevier eBooks (pp. 481–501). https://doi.org/10.1016/b978-0-12-816176-0.00025-9</p>
      <p>[7] Sutskever, I., Vinyals, O., &amp; Le, Q. V. (2014). Sequence to sequence learning with neural networks. arXiv. https://arxiv.org/abs/1409.3215</p>
      <p>[8] Gracia, I. S., Deniz, O., García, G. B., &amp; Kim, T. (2015). Fast fight detection. PLOS ONE, 10(4), e0120448. https://doi.org/10.1371/journal.pone.0120448</p>
      <p>[9] Hassner, T., Itcher, Y., &amp; Kliper-Gross, O. (2012). Violent flows: Real-time detection of violent crowd behavior. 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. https://doi.org/10.1109/cvprw.2012.6239348</p>
      <p>[10] Soliman, M., Kamal, M. H., Nashed, M., Mostafa, Y. M., Chawky, B. S., &amp; Khattab, D. (2019). Violence recognition from videos using deep learning techniques. 2019 Ninth International Conference on Intelligent Computing and Information Systems (ICICIS). https://doi.org/10.1109/icicis46948.2019.9014714</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>