Violence detection in videos using Conv2D VGG-19 architecture and LSTM network

J.V. Vidhya and R. Annie Uthra
SRM Institute of Science and Technology, Kattankulathur, Kancheepuram, Tamilnadu, 603203, India
EMAIL: vidhyaj@srmist.edu.in (J. V. Vidhya); ORCID: 0000-0002-9196-2867 (J. V. Vidhya)
Algorithms, Computing and Mathematics Conference, August 19 – 20, 2021, Chennai, India.
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract
Violence identification from surveillance videos can be considered a special form of activity recognition that targets recognizing human actions in public places. A video sequence is a collection of consecutive frames sampled in both the temporal and spatial directions; the given input video is converted into frames and preprocessing is done at the frame level. For feature extraction, a 2D convolutional neural network (Conv2D) is used: it adapts the layers of the VGG-19 architecture with global average pooling and learns the spatial information in the given video. The extracted features are then combined using a Long Short Term Memory (LSTM) network, which learns the temporal information in the video. The model is validated using the Hockey dataset, achieving a loss of 0.02 and an accuracy of 98%.

Keywords
Violence identification, activity recognition, preprocessing, feature extraction, VGG-19, global average pooling, Long Short Term Memory (LSTM).

1. Introduction
Video technology is changing how events around the world are observed and recorded. With rising crime rates, the use of Closed-Circuit Television (CCTV), a popular security-enhancing technology, as an effective security measure is on the rise around the world; about 770 million CCTV cameras have been installed worldwide so far. Violence identification from surveillance videos can be considered a special form of activity recognition [10], which targets recognizing human actions in public places. Manually monitoring suspicious human activity throughout the day is a tedious task [4], [5], which leads to the need for methods that detect abnormal human activity automatically. Video recognition of human behavior has been carried out using machine learning and computer vision techniques [1]-[3]. In past years, several studies on activity recognition have been carried out and tested on quite simple datasets containing actions simulated by actors in a controlled environment [6]-[9]. A few factors differentiate abnormal and violent activity: activities that differ from normal activity, such as beating, stealing, and harassment, are termed abnormal, while fighting is an example of a violent activity [13]-[15]. Automatic recognition of violence in videos is becoming essential as it can reduce time and labour. Many approaches and methods have been built to detect brutal events and other unusual patterns in videos [11], [12]. Conventional feature extraction methods with classifiers, as well as deep learning frameworks, can be used for this purpose; machine learning and deep learning provide effective ways to detect violence in videos and classify it with high accuracy and low response time. Traditional methods used in earlier stages are STIPs [16]-[17] and MoSIFT [18]-[20]. Most machine learning algorithms used to recognize or classify objects or persons tend to overfit on the training data. Visual data are complex in nature; due to this complexity, the models tend to require high-dimensional inputs and a large number of parameters to fit the model.
Overfitting happens when the training data is limited; as a result, such models cannot be generalized. Deep learning eliminates the need for hand-crafted features [21], [22], and a model can be generalized well using a large training dataset. Deep learning methods have been successful in recognizing video-based activities [23], [24]. ImageNet contains real-world images of various classes; it holds over 14 million images organized into about twenty thousand classes of objects and scenes, and it is used as a benchmark dataset for training deep neural networks [24].
This work is organized as follows. Section 2 reviews several algorithms that have been used in earlier years to classify violent and non-violent actions. Section 3 presents the implementation of the proposed approach, comprising pre-processing, classification, and prediction of whether a video contains violence. Section 4 contains the experimental results of the model evaluated on the Hockey dataset.

2. Related Works
A statistical technique based on optical flow to identify violent behavior in a crowd scene is proposed in [25]. The statistical characteristics of the optical flow (SCOF) descriptor are used to represent the video frame sequence, and a linear Support Vector Machine categorizes the video as violent or non-violent. The model was tested on the Hockey and Crowd datasets, where accuracies of 86.9% and 86.37% are observed, respectively.
Zhang et al. [26] suggested a novel method for localization and detection of violence in surveillance videos. Candidate violence regions are extracted using a Gaussian model of optical flow (GMOF), and a Gaussian Mixture Model (GMM) is adopted to capture crowd behavior features mined from the optical flow. Violence is detected by sampling the candidate regions with a multi-scale scanning window. An orientation histogram of optical flow (OHOF) descriptor is fed into a linear Support Vector Machine, which classifies the event as violent or non-violent. The algorithm is validated on the BEHAVE, CAVIAR, and Crowd Violence datasets, with accuracies of 88.78%, 89.68%, and 86.59%, respectively.
In [27], optical flow and the Harris 3D spatio-temporal interest point detector are combined to detect violence in the frames of a video. Harris 3D selects as base data the regions where both the temporal and spatial domains change. The pyramidal Lucas-Kanade algorithm captures large motion, whereas the standard Lucas-Kanade optical flow algorithm captures small motion in the video; together these algorithms describe object movement intensities over a particular duration. The proposed method is tested using a Logitech C270 camera, motion intensity is estimated using a motion coefficient, and a threshold value indicates the occurrence of violence.
The work proposed in [28] presented a new Histogram of Optical flow Magnitude and Orientation (HOMO) feature descriptor, in which the optical flow between two successive frames is calculated. Six binary indicators that reflect orientation and magnitude deviations between consecutive frames are obtained, and the histograms of these binary indicators are combined to form the HOMO descriptor used to train an SVM classifier to detect violence or non-violence.
Accuracies of 89.3% and 76.3% are observed on the Hockey and Violent Flow datasets, respectively.
Zhou et al. [29] proposed violence detection using low-level features. Motion regions are segmented based on optical flow fields; within these regions, the dynamics and appearance of violent actions are mined using two low-level feature descriptors, the Local Histogram of Optical Flow (LHOF) and the Local Histogram of Oriented Gradients (LHOG). The mined features are encoded using a Bag of Words (BoW) model, and a Support Vector Machine (SVM) classifies the resulting vector as violent or not.
Serrano et al. [30] proposed a Hough Forests model that provides, for each class, a weighted image built from the relevant motion parts while eliminating noise and the static background. A representative image is obtained from the video sequence by accumulating frames associated with their temporal positions, and a 2D convolutional neural network classifies this image as violent or non-violent. The proposed method is validated on the Hockey dataset and an accuracy of 94.6% is acquired.
The model proposed in [31] captures spatial information using a convolutional neural network and temporal information using an LSTM. A modified CNN (VGG19), in which an additional dense layer is appended to the final output layer instead of a global average pooling layer, acts as the spatial feature extractor for the LSTM cells. An accuracy of 96.33% is observed on the Hockey dataset using this framework.

3. Proposed Model
In the proposed model, raw videos are preprocessed and fed to a convolutional network to obtain spatial information from the video frames. The resulting features are then processed by a Long Short Term Memory (LSTM) network to analyze the temporal information in the video.

3.1. Dataset Used
The dataset used for the implementation of the proposed model is the "Hockey Fight Dataset". It contains a total of 1000 videos gathered from National Hockey League (NHL) games. Each clip is around 1.6 to 1.96 seconds long. The dimension of the video segments is 720x576, and each extracted image frame has a resolution of 360 x 288. Annotation is done at the video level: each video, consisting of almost 50 frames, is classified as fight or non-fight.

3.2. Preprocessing
The dataset comprises videos captured in hockey stadiums. A video sequence consists of a set of frames; the image frames are extracted from the video and preprocessed before being given to the neural network. The frames are initially in BGR (Blue, Green, Red) format and are converted to RGB (Red, Green, Blue) format for further processing. Figure 1 shows the steps involved in the preprocessing stage.
Figure 1: Preprocessing

3.2.1. Sampling
Sampling denotes the resizing of an image. By resampling, a higher-resolution image can be attained with little loss of image quality. All image frames in a video are transformed to a dimension of (224, 224), which is the input shape of the first convolutional layer of the VGG19 model. The frames are upsampled using bicubic interpolation, which considers only a (4 x 4) neighborhood of 16 pixels at a time and preserves fine details in the frames. Images resampled with bicubic interpolation are smoother and have only a few interpolation artifacts, yielding considerably better results.

3.2.2. Denoising
Noise present in the image frames reduces the clarity of the video, which in turn affects model performance. Denoising is performed using a median blur, which smooths the frames with a median filter of kernel size (3 x 3). Here, the central pixel is replaced with the median of all pixels in the kernel window, eliminating noise from the frame while preserving edges. After removing the noise, the image data is normalized using the highest pixel value and fed to the convolutional neural network.
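As a concrete illustration, the preprocessing stage described above can be sketched with OpenCV as follows; the function names, the per-video frame limit, and the use of OpenCV itself are illustrative assumptions rather than details taken from the paper.

```python
import cv2
import numpy as np

def preprocess_frame(frame_bgr):
    """Preprocess one frame: BGR -> RGB, bicubic resize, median blur, normalize."""
    # Convert from OpenCV's BGR ordering to RGB
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    # Resize to the VGG-19 input shape (224, 224) using bicubic interpolation
    resized = cv2.resize(rgb, (224, 224), interpolation=cv2.INTER_CUBIC)
    # Remove noise with a 3 x 3 median filter while preserving edges
    denoised = cv2.medianBlur(resized, 3)
    # Normalize by the highest pixel value (255 for 8-bit frames)
    return denoised.astype(np.float32) / 255.0

def load_video_frames(path, max_frames=50):
    """Read up to max_frames preprocessed frames from a video file (limit assumed)."""
    cap = cv2.VideoCapture(path)
    frames = []
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(preprocess_frame(frame))
    cap.release()
    return np.stack(frames)  # shape: (num_frames, 224, 224, 3)
```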
3.3. CNN
The CNN architecture used is VGG-19, pretrained on the ImageNet dataset. The VGG network is a deep network with small filters: the VGG-19 architecture has 19 weight layers, uses small 3 x 3 convolutional filters with periodic pooling throughout the network, and comprises 16 convolutional layers and 3 fully connected layers. The input layer takes an image of size 224 x 224 with depth 3. Layers 1 and 2 are Conv2D layers of depth 64, where depth represents the number of filters used to generate feature maps. Each filter corresponds to a different pattern in the input; as the filter convolves over the image, the dot products between the kernel and local patches of the input produce the feature map. The Rectified Linear Unit (ReLU) activation is used in the Conv2D layers to improve classification and computational time. MaxPooling2D is used as the pooling layer in the subsequent layer; max pooling of size (2 x 2) with stride [2 2] and padding [0 0 0 0] is adopted. Convolutional layers of depth 128 are used in the next two consecutive layers with the ReLU activation function, generating 128 feature maps at each level; the succeeding layer is a MaxPooling2D layer with pool dimension 2 x 2. Conv2D layers of depth 256 form the next four layers of the network, and max pooling is applied after the final convolutional layer of depth 256. The resulting feature map is given as input to the next four convolutional layers of depth 512, after which max pooling with stride 2 is applied. The resulting feature map is passed to a further block that mirrors these four convolutional layers of depth 512, again followed by max pooling. In the standard VGG-19, the pooled output is then flattened into a 1D feature vector and given to the fully connected dense layers; two fully connected layers are adopted there, and they hold a huge number of parameters because of the dense connections. To get better results, ensembling is done on the features of the final fully connected layer before the 1000 ImageNet classes, and the fully connected layer FC2 of dimension 4096 is commonly used for feature extraction as it represents the features well. In the proposed model, global average pooling is instead applied to the output of the last convolutional block, and hence the final result of the feature extractor is a 2D tensor. Figure 2 shows the architecture of VGG-19.
Figure 2: VGG-19 Architecture
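A minimal sketch of the spatial feature extractor described in this section, assuming a Keras/TensorFlow implementation (the paper does not name the framework); freezing the pretrained weights and the helper name build_feature_extractor are illustrative assumptions, and the sketch follows the global average pooling variant emphasized in the abstract.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_feature_extractor():
    """VGG-19 convolutional base with global average pooling as the spatial feature extractor."""
    # VGG-19 pretrained on ImageNet, without the fully connected classifier head
    base = tf.keras.applications.VGG19(
        weights="imagenet",
        include_top=False,
        input_shape=(224, 224, 3),
    )
    base.trainable = False  # reuse the pretrained convolutional features as-is (assumption)
    # Global average pooling collapses each final feature map to one value,
    # giving a single 512-dimensional feature vector per frame
    return models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
    ])

feature_extractor = build_feature_extractor()
# frames: array of shape (num_frames, 224, 224, 3) from the preprocessing step
# features = feature_extractor.predict(frames)   # shape: (num_frames, 512)
```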
3.4. LSTM
The extracted features are given as sequences to the Long Short Term Memory (LSTM) network. In the proposed model, 20 frames per second are extracted from the video, and the features extracted from these 20 frames are given to the LSTM layer at a time. The LSTM remembers values over specific time intervals, which helps the model retain temporal features while performing the required analysis on the given video. A sequential model is created using the LSTM; it comprises dense layers and activation layers. First, a dense layer of 1024 units is added, with which the data is categorized, and the ReLU activation function is applied to this dense layer. This enables the model to learn faster and perform well by overcoming the vanishing gradient problem. The subsequent layer is a dense layer of 50 units. The model uses the sigmoid function as the activation in this layer, as it is efficient: it is a probabilistic function whose value ranges between 0 and 1. The last layer is a dense layer with 2 units that predicts whether violence is present using the softmax activation function, which normalizes the output for each class between 0 and 1 and divides by their sum, providing the probability that the input video contains violence. The model uses Adamax as the optimization function and mean squared error as the loss function. Figure 3 shows the architecture of the LSTM model employed.
Figure 3: LSTM Architecture
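The temporal model above can be sketched as follows, again assuming Keras/TensorFlow; the sequence length, layer widths, activations, optimizer, and loss follow the text, while the LSTM hidden size, batch size, and epoch count are assumptions since the paper does not state them.

```python
from tensorflow.keras import layers, models

SEQ_LEN = 20       # frames fed to the LSTM at a time, as described in Section 3.4
FEATURE_DIM = 512  # per-frame feature size from the VGG-19 + global average pooling extractor

def build_temporal_model():
    """LSTM over per-frame CNN features, followed by the dense classification head."""
    model = models.Sequential([
        layers.Input(shape=(SEQ_LEN, FEATURE_DIM)),
        layers.LSTM(128),                        # hidden size not stated in the paper; assumed
        layers.Dense(1024, activation="relu"),   # dense layer of 1024 units with ReLU
        layers.Dense(50, activation="sigmoid"),  # dense layer of 50 units with sigmoid
        layers.Dense(2, activation="softmax"),   # violence / non-violence probabilities
    ])
    model.compile(optimizer="adamax",            # Adamax optimizer, as stated in the text
                  loss="mean_squared_error",     # mean squared error loss, as stated
                  metrics=["accuracy"])
    return model

temporal_model = build_temporal_model()
# feature_sequences: array of shape (num_videos, SEQ_LEN, FEATURE_DIM)
# labels: one-hot array of shape (num_videos, 2)
# temporal_model.fit(feature_sequences, labels, epochs=20, batch_size=16)  # settings assumed
```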
4. Experimental Results
The proposed model attained an accuracy of 98% on the Hockey Fight Dataset. The accuracy measures how often the model detects violence or non-violence correctly and is calculated using the formula

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP is the number of true positives, TN true negatives, FP false positives, and FN false negatives.
Figure 4: Training accuracy versus validation accuracy of the proposed model
The mean squared error loss of the model on the Hockey Fight Dataset is shown in Figure 5; the loss observed with this model is 0.02.
Figure 5: Training loss versus validation loss of the proposed model

5. Conclusion
In this paper, violence is detected in videos using a modified convolutional network and an LSTM model. Videos are pre-processed by converting the image frames into RGB format and resampling each frame to the size (224x224x3). Median blur denoising is applied to the frames to remove noise. The resulting preprocessed sequence of image frames is fed to a convolutional neural network that uses the VGG-19 architecture with global average pooling, and spatial information is learned from the features extracted by the CNN. The extracted features are given to an LSTM, from which the temporal information of the video sequence is learned. The proposed technique delivers an accuracy of 98% and a mean squared error of 0.02.

6. Acknowledgements
Mrs. J.V. Vidhya is working as an Assistant Professor and pursuing a Ph.D. in the Department of Computer Science and Engineering at SRM Institute of Science and Technology. Her research interests include video image processing, machine learning, and deep learning. Dr. R. Annie Uthra is currently working as an Associate Professor in the Department of Computer Science and Engineering at SRM Institute of Science and Technology. Additionally, she serves as an Adjunct Associate Teaching Professor in the Institute for Software Research, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA. She is a graduate of SRM University's Master of Engineering in Computer Science and Engineering program and received her Ph.D. degree from SRM University. Her research interests include wireless sensor networks, machine learning, positioning and navigation, IoT, and energy-aware routing techniques.

7. References
[1] P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, "Machine recognition of human activities: A survey," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 11, pp. 1473–1488, Nov. 2008.
[2] R. Poppe, "A survey on vision-based human action recognition," Image Vis. Comput., vol. 28, no. 6, pp. 976–990, 2010.
[3] S.-R. Ke, H. L. U. Thuc, Y.-J. Lee, J.-N. Hwang, J.-H. Yoo, and K.-H. Choi, "A review on video-based human activity recognition," Computers, vol. 2, no. 2, pp. 88–131, 2013.
[4] I. S. Gracia, O. D. Suarez, G. B. Garcia, and T.-K. Kim, "Fast fight detection," PLoS ONE, vol. 10, no. 4, Apr. 2015, Art. no. e0120448.
[5] O. Deniz, I. Serrano, G. Bueno, and T.-K. Kim, "Fast violence detection in video," in Proc. Int. Conf. Comput. Vis. Theory Appl. (VISAPP), vol. 2, Jan. 2014, pp. 478–485.
[6] Barrett, D.P., Siskind, J.M.: Action recognition by time series of retinotopic appearance and motion features. IEEE Trans. Circuits Syst. Video Technol. 26(12), 2250–2263 (2015).
[7] Rodriguez, M., et al.: One-shot learning of human activity with an MAP adapted GMM and simplex-HMM. IEEE Trans. Cybern. 47(7), 1769–1780 (2017).
[8] Zhang, T., et al.: Discriminative dictionary learning with motion weber local descriptor for violence detection. IEEE Trans. Circuits Syst. Video Technol. 27(3), 696–709 (2017).
[9] Wang, S., et al.: Anomaly detection in crowded scenes by SL-HOF descriptor and foreground classification. In: 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE (2016).
[10] L. Tian, H. Wang, Y. Zhou, and C. Peng, "Video big data in smart city: Background construction and optimization for surveillance video processing," Future Gener. Comput. Syst., vol. 86, pp. 1371–1382, Sep. 2018.
[11] C. Dhiman and D. K. Vishwakarma, "A review of state-of-the-art techniques for abnormal human activity recognition," Eng. Appl. Artif. Intell., vol. 77, pp. 21–45, Jan. 2018.
[12] P. Zhou, Q. Ding, H. Luo, and X. Hou, "Violent interaction detection in video based on deep learning," J. Phys., Conf. Ser., vol. 844, no. 1, 2017, Art. no. 12044.
[13] S. Chaudhary, M. A. Khan, and C. Bhatnagar, "Multiple anomalous activity detection in videos," Procedia Comput. Sci., vol. 125, pp. 336–345, Jan. 2018.
[14] T. Zhang, Z. Yang, W. Jia, B. Yang, J. Yang, and X. He, "A new method for violence detection in surveillance scenes," Multimedia Tools Appl., vol. 75, no. 12, pp. 7327–7349, 2016.
[15] M. Alvar, A. Torsello, A. Sanchez-Miralles, and J. M. Armingol, "Abnormal behavior detection using dominant sets," Mach. Vis. Appl., vol. 25, no. 5, pp. 1351–1368, Jul. 2014.
[16] Laptev I, Lindeberg T. Space-time interest points. In: 9th International Conference on Computer Vision, Nice, France. IEEE conference proceedings; 2003. p. 432–439.
[17] De Souza FDM, Chávez GC, Valle E, de Albuquerque Araújo A. Violence Detection in Video Using Spatio-Temporal Features. In: SIBGRAPI; 2010.
[18] Yu Chen M, Hauptmann A. MoSIFT: Recognizing Human Actions in Surveillance Videos; 2009.
[19] Bermejo Nievas E, Deniz Suarez O, Bueno García G, Sukthankar R. Violence detection in video using computer vision techniques. In: Computer Analysis of Images and Patterns. Springer; 2011. p. 332–339.
[20] Xu L, Gong C, Yang J, Wu Q, Yao L. Violent video detection based on MoSIFT feature and sparse coding. In: Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE; 2014. p. 3538–3542.
[21] P. Bilinski, F. Bremond, Human violence recognition and detection in surveillance videos. in 2016 13th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (IEEE, 2016), pp. 30–36.
[22] E.B. Nievas, O.D. Suarez, G.B. García, R. Sukthankar, Violence detection in video using computer vision techniques. in International Conference on Computer Analysis of Images and Patterns (Springer, Berlin, Heidelberg, 2011), pp. 332–339.
[23] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., Lake Tahoe, NV, USA, 2012, pp. 1097–1105.
[25] J. Huang and S. Chen, "Detection of violent crowd behavior based on statistical characteristics of the optical flow," 2014 11th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Xiamen, 2014, pp. 565–569, doi: 10.1109/FSKD.2014.6980896.
[26] Zhang, T., Yang, Z., Jia, W. et al. A new method for violence detection in surveillance scenes. Multimed Tools Appl 75, 7327–7349 (2016). https://doi.org/10.1007/s11042-015-2648-8
[27] Y. Lyu and Y. Yang, "Violence Detection Algorithm Based on Local Spatio-temporal Features and Optical Flow," 2015 International Conference on Industrial Informatics - Computing Technology, Intelligent Technology, Industrial Information Integration, Wuhan, 2015, pp. 307–311, doi: 10.1109/ICIICII.2015.157.
[28] Mahmoodi, Javad & Salajeghe, Afsane. (2019). A classification method based on optical flow for violence detection. Expert Systems with Applications. 127. 10.1016/j.eswa.2019.02.032.
[29] Zhou P, Ding Q, Luo H, Hou X (2018) Violence detection in surveillance video using low-level features. PLoS ONE 13(10): e0203668. https://doi.org/10.1371/journal.pone.0203668
[30] I. Serrano, O. Deniz, J. L. Espinosa-Aranda and G. Bueno, "Fight Recognition in Video Using Hough Forests and 2D Convolutional Neural Network," in IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 4787–4797, Oct. 2018, doi: 10.1109/TIP.2018.2845742.
[31] A. R. Abdali and R. F. Al-Tuma, "Robust Real-Time Violence Detection in Video Using CNN And LSTM," 2019 2nd Scientific Conference of Computer Sciences (SCCS), Baghdad, Iraq, 2019, pp. 104–108, doi: 10.1109/SCCS.2019.8852616.