A Deep Learning Based Approach to Detect Suspicious Weapons
Prashant Varshney, Harsh Tyagi, Nikhil Kr. Lohia, Abhishek Kajla and Palak Girdhar
Computer Science & Engineering Department, Bhagwan Parshuram Institute of Technology, Guru Gobind Singh
Indraprastha University, Sector 17, Rohini, New Delhi, 110089, India


Abstract
Over the past few decades, the world has witnessed many terrorist and criminal activities.
Public surveillance systems have gained importance as a means to counter these activities.
Various state governments have started to install cameras in their densely populated and
important cities to safeguard their citizens. To cover a complete city with a surveillance
network of thousands of cameras, hundreds of security personnel are needed to monitor the
video feeds in real time. To keep this task cost-effective and feasible, each operator
typically monitors 6 to 8 cameras, which often leads to threats going undetected. One slip
of concentration can endanger many lives. This research paper seeks an optimized, efficient,
and fast way to detect commonly used weapons like AK47, Hand Revolver, Pistol, Combat Knife,
Grenade, etc. in a live video feed.

Keywords
Object Detection, Neural Network, Deep Learning, Computer Vision, mAP Score


1. Introduction
    Object Detection is a field of Artificial Intelligence associated with Digital Image
Processing and Computer Vision, which deals with detecting instances of objects such as cars,
humans, or weapons that share features with the trained object classes [1]. Object Detection
methods are generally categorized into ML-based and DL-based approaches, depending on the
complexity of the object class. Machine Learning based approaches require features to be
defined beforehand using methods like Haar Cascade or SIFT, which are then used with the
support vector machine (SVM) technique to detect the object class. Deep Learning based
approaches use Artificial Neural Networks to perform end-to-end object detection without
explicitly defined features.

    Recent growth in the field of Artificial Intelligence has contributed a lot to solving
major crises all over the world. This research paper mainly focuses on Object Detection
models in Deep Learning, namely RCNN (Region-based Convolutional Neural Network) [2], SSD
(Single Shot Detector) [3], and YOLO (You Only Look Once) [4, 5, 6, 7], with which the
possession of suspicious weapons is detected in video surveillance. The primary goal of this
paper is to analyze the performance of these models and determine the most efficient and
reliable model amongst them for surveillance purposes.

    The main inspiration for this research paper came from the attack on the Chhatrapati
Shivaji Terminus railway station in Mumbai, where a couple of terrorists entered the station
with automatic assault rifles and started shooting indiscriminately, which killed 58

ACI’21: Workshop on Advances in Computational Intelligence at
ISIC 2021, February 25–27, 2021, Delhi, India
EMAIL: pv03158@gmail.com (P. Varshney);
tyagih1699@gmail.com (H. Tyagi); nikhillohia6128@gmail.com
(N. Lohia); abhishekkajla5511@gmail.com (A. Kajla);
palakgirdhar@bpitindia.com (P. Girdhar)
ORCID: 0000-0003-0497-6214 (P. Varshney); 0000-0001-5919-
811X (H. Tyagi); 0000-0002-7338-7619 (N. Lohia); 0000-0002-
4042-6001 (P. Girdhar)
                            © 2021 Copyright for this paper by its authors. Use permitted under Creative
                            Commons License Attribution 4.0 International (CC BY 4.0).
                            CEUR Workshop Proceedings (CEUR-WS.org)
civilians and caused more than 100 casualties [8]. Had an AI-based weapon detection
technology been monitoring the city at that time, the Police Control Room would have known
about the attackers beforehand. We decided to build such a system, which might add an extra
layer of security at public places and help prevent such threatening activities.

2. Related Work

    The detection of threatening weapons in a surveillance system is a challenging task. To
cover a complete city with a surveillance network of thousands of cameras, hundreds of
security personnel are needed to monitor the video feeds in real time. To keep this task
cost-effective and feasible, each operator monitors nearly 6 to 8 cameras, which leads to
threats going undetected at their initial stage and results in a delayed response, causing
casualties.

    According to research by Velastin et al. of the Queen Mary University of London, carried
out in 2006, operators in many instances fail to notice the presence of threatening objects
in a video feed after about 20 minutes of monitoring [9]. Further research found that after
12 minutes of monitoring a video feed, an operator is likely to miss up to 45% of suspicious
activities, and after 22 minutes of monitoring, up to 95% of suspicious activities go
unnoticed [10]. Thus the most optimal solution to this problem could be to eliminate humans
from the equation as much as possible.

    In the year 2001, Paul Viola and Michael Jones proposed the first robust, efficient, and
real-time Machine Learning based Object Detection Framework in their paper "Rapid Object
Detection using a Boosted Cascade of Simple Features" [11,12]. This framework can be trained
to detect a variety of objects by using many images that contain the object which we want
our classifier to detect (positive images) and images that do not contain that object
(negative images). However, this approach cannot be used to identify the presence of complex
objects in different orientations and sizes.

    Deep Learning based Object Detection frameworks mainly consist of two types:

    1. Region proposal-based frameworks, which include models like RCNN, Fast RCNN, and
       Faster RCNN.
    2. Regression-based frameworks, which include models like YOLO and SSD.

    Region proposal-based algorithms first propose candidate regions of the image and then
extract features from each of them. In the year 2014, Ross Girshick presented the RCNN model
based on this approach, which obtained an mAP of 53.3% on the PASCAL VOC dataset, an
improvement of more than 30% over the previous best result. In the follow-up Fast RCNN
model, the whole image is processed with a Convolutional Neural Network to produce a feature
map, and a fixed-length feature vector is then extracted from each region proposal with a
Region of Interest (RoI) pooling layer.

3. Proposed Methodology
3.1. Flowchart

Figure 1: Flowchart of the proposed weapon detection system

3.2. Scraping images of commonly used weapons

    To build the DL-based object detection model, nearly 5000 images of commonly used
weapons like AK47, Hand Pistol, Revolver, Shotgun, Combat Knife, etc., in various sizes and
orientations, were scraped and later pre-processed to build the dataset. These images were
gathered from different sources available on the internet using automation scripts.

3.3. Building labelled dataset

    After pre-processing and scaling, the scraped images were labelled using the "LabelImg"
software. This software helps in labelling class objects by manually marking the object in
the image with a selection tool. The software then creates a text (*.txt) file in which the
exact 2D coordinates of the bounding box are stored along with the class identifier. These
files, combined with the images, give a complete labelled dataset that can be used to train
any desired DL-based object detection model.

3.4. Model training

    Different Deep Learning frameworks and models, namely RCNN, SSD, and YOLO, were then
trained using the dataset built above. The dataset was randomly divided in the ratio 80:20.
80 percent of the dataset was used to train the model and the remaining 20 percent was used
for testing. Google Colab was used to train these DL-based models because of its free and
powerful GPUs.

4. Experimental Results

    After training on 80 percent of the dataset, the desired weights were obtained. The
remaining 20 percent of the dataset was tested against these weights, giving a complete and
detailed analysis of the accuracy and precision of the models, which was recorded for
comparison and, eventually, for determining the best detection model.

4.1. Evaluation Metrics in an Object Detection Model

    In computer vision, Mean Average Precision (mAP) is used to evaluate an Object Detection
Model [13]. It measures accuracy from the number of correct predictions that the model made.
To find the mAP score of a model, we first have to find the values of Intersection over
Union (IoU), precision, and recall.

    Object detection models generate predictions in terms of a class label and a bounding
box. For every prediction, we measure IoU as the ratio of the area of overlap between the
predicted bounding box and the ground truth bounding box to the area of union of both
bounding boxes. Mathematically,

$$\mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} \tag{1}$$

    We find the values of Recall and Precision using this IoU value for a given threshold.
For example, if the IoU threshold is set to 0.7 and a prediction achieves an IoU of 0.8,
which is greater than or equal to the threshold, the prediction is classified as a True
Positive (TP); otherwise the prediction is classified as a False Positive (FP).

Figure 2: Marking a predicted box as True Positive or False Positive based on IoU value

    The precision of an object detection model is the ratio of the number of True Positives
to the total number of True Positives and False Positives. Mathematically,

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

where TP is the number of true positives, i.e. predicted as positive and actually positive,
and FP is the number of false positives, i.e. predicted as positive but actually negative.
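
    As an illustration of Equations (1) and (2), the following minimal Python sketch
computes the IoU of two boxes, applies the IoU threshold to label each prediction as TP or
FP, and derives the precision. The (x_min, y_min, x_max, y_max) box layout, the function
names, and the sample boxes are assumptions made for this example; they are not taken from
the models or the dataset evaluated in this paper.

```python
from typing import Sequence, Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes, as in Equation (1)."""
    # Corners of the overlapping rectangle (it may be empty).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    overlap = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0


def count_tp_fp(
    pairs: Sequence[Tuple[Box, Box]], threshold: float = 0.7
) -> Tuple[int, int]:
    """Count TP and FP for (predicted box, ground truth box) pairs.

    A prediction is a True Positive when its IoU with the ground truth is
    greater than or equal to the threshold, otherwise a False Positive.
    """
    tp = sum(1 for pred, truth in pairs if iou(pred, truth) >= threshold)
    return tp, len(pairs) - tp


def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP), as in Equation (2)."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


# Hypothetical example: one ground truth box and two predictions for it.
ground_truth: Box = (0.0, 0.0, 10.0, 10.0)
good_prediction: Box = (1.0, 0.0, 10.0, 10.0)   # overlaps most of the object
poor_prediction: Box = (5.0, 0.0, 15.0, 10.0)   # overlaps only half of it

tp, fp = count_tp_fp(
    [(good_prediction, ground_truth), (poor_prediction, ground_truth)],
    threshold=0.7,
)
print("TP:", tp, "FP:", fp, "precision:", precision(tp, fp))
```

    In this sketch every prediction is already paired with one ground truth box; a full
evaluation would also have to match predictions to objects and count the objects that were
missed.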

    Recall measures the fraction of relevant objects that were predicted by the object
detection model. It is the ratio of the number of True Positives to the total number of True
Positives and False Negatives. Mathematically,

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$

where TP is the number of true positives, i.e. predicted as positive and actually positive,
and FN is the number of false negatives, i.e. the model failed to predict an object that is
present.

    The Average Precision (AP) is the area under the Precision vs Recall curve [14]. Mean
Average Precision (mAP) is the average of the Average Precision (AP) values over all object
classes.

4.2. Analysis of Average Precision (AP) and GPU time to process one frame (in ms)

Figure 3: Graph of GPU time vs Average Precision for RCNN Model

Figure 4: Graph of GPU time vs Average Precision for SSD Model

Figure 5: Graph of GPU time vs Average Precision for YOLO Model

Table 1
Comparison of mAP Score and GPU Time of different object detection models
    Model     mAP Score     GPU time (in ms)
    RCNN      33.325        874
    SSD       20.58         86.8
    YOLO      33.1          394

5. Conclusion

    A comparative study of various DL-based object detectors that use Artificial Neural
Networks to classify and localize visual data was conducted. It was tested on a custom
dataset of commonly used weapons. Among the DL-based Object Detection Frameworks compared,
SSD has the lowest mAP score (20.58), but its computational time to detect an object was the
shortest. So, it can be a better choice if
we need a fast object detector and can trade off some accuracy. YOLO and RCNN provided
similar mAP scores of about 33 (33.1 and 33.325 respectively, Table 1), which gives better
accuracy in detecting the object. The YOLO-trained model, however, is comparatively faster
than the RCNN model, making it an efficient and reliable object detector.

Figure 6: Assault rifle AK47 detected in a captured video

6. Future Scope

    The scope of this work can be extended, and its limitations addressed, in the following
ways:

    1. This research paper is limited to weapons like AK47, Shotguns, 9mm Hand Pistols,
       Combat Knives, and Hand Grenades, which are some of the most commonly used weapons
       amongst criminals. There is still a large number of dangerous and illegal weapons
       whose possession by any individual can cause a problem. Our target will therefore be
       to add labelled images of more such weapons to the training dataset to ensure maximum
       accuracy.
    2. This project can be enhanced into a collision detection system in which all sorts of
       vehicle accidents are detected and reported to the nearby police station to prevent
       any further damage.
    3. Another application could be to detect major bloodshed in the monitored area and
       alert a nearby hospital to avoid any fatality.
    4. Another possible application could be the detection of fire at any place, which upon
       detection can be reported directly to a fire department, ensuring that there is
       minimum damage around that area.
    5. The system could also be used to monitor traffic: the cameras would detect all
       vehicles breaking any law and report them to a traffic control department, helping it
       to resolve traffic issues.
    6. One of the limitations of our project is that there is currently no way to detect a
       weapon that is hidden by the criminal in a pocket or a suitcase. We are thinking of a
       way to overcome this problem and build a better and safer environment for citizens.

7. Acknowledgement

    The Graphics Processing Unit (GTX 1050 Ti) used in this research paper is provided by
the Computer Science & Engineering Department, Bhagwan Parshuram Institute of Technology,
Delhi, India.

8. References

[1] Dasiopoulou, Stamatia, et al. "Knowledge-assisted semantic video object detection." IEEE
    Transactions on Circuits and Systems for Video Technology 15.10 (2005): 1210–1224.
[2] Girshick, Ross (2014). "Rich feature hierarchies for accurate object detection and
    semantic segmentation". Proceedings of the IEEE Conference on Computer Vision and
    Pattern Recognition. IEEE: 580–587. arXiv:1311.2524. doi:10.1109/CVPR.2014.81.
    ISBN 978-1-4799-5118-5. S2CID 215827080.
[3] Liu, Wei (October 2016). "SSD: Single shot multibox detector". Computer Vision – ECCV
    2016. European Conference on Computer Vision. Lecture Notes in Computer Science. 9905.
    pp. 21–37. arXiv:1512.02325. doi:10.1007/978-3-319-46448-0_2. ISBN 978-3-319-46447-3.
    S2CID 2141740.
[4] Redmon, Joseph (2016). "You only look once: Unified, real-time object detection".
    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
    arXiv:1506.02640. Bibcode:2015arXiv150602640R.
[5] Redmon, Joseph (2017). "YOLO9000: better, faster, stronger". arXiv:1612.08242 [cs.CV].
[6] Redmon, Joseph (2018). "Yolov3: An incremental improvement". arXiv:1804.02767 [cs.CV].
[7] Zhang, Shifeng (2018). "Single-Shot Refinement Neural Network for Object Detection".
    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition:
    4203–4212. arXiv:1711.06897. Bibcode:2017arXiv171106897Z.
[8] "26/11 Mumbai Terror Attacks Aftermath: Security Audits Carried Out On 227 Non-Major
    Seaports Till Date". NDTV. Press Trust of India. 26 November 2017. Retrieved 7 December
    2017.
[9] S. A. Velastin, B. A. Boghossian, M. A. Vicencio-Silva, A motion-based image processing
    system for detecting potentially dangerous situations in underground railway stations,
    Transportation Research Part C: Emerging Technologies 14 (2) (2006) 96–113.
[10] T. Ainsworth, Buyer beware, Security Oz 19 (2002) 18–26.
[11] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features
    (2001).
[12] Viola, Jones: Robust Real-time Object Detection, IJCV 2001 (pages 1,3).
[13] Hughes, G. (1968). On the mean accuracy of statistical pattern recognizers. IEEE
    transactions on information theory, 14(1), 55-63.
[14] Buckland, M., & Gey, F. (1994). The relationship between recall and precision. Journal
    of the American society for information science, 45(1), 12-19.
[15] Ahmad Salihi Ben Musa, Sanjay Kumar Singh, Prateek Agrawal, "Suspicious Human Activity
    Recognition for Video Surveillance System", International Conference on Control,
    Instrumentation, Communication & Computational Technologies ICCICCT-2014, IEEEXplore.
[16] Nivid Limbasiya, Prateek Agrawal, "Bidirectional Long Short Term Memory Based
    Spatio-Temporal in Community Question Answering", A book on Deep learning based
    approaches for sentiment analysis, pp. 291-310, Jan 2020, Springer.
[17] Prateek Agrawal, Deepak Chaudhary, Vishu Madaan, Anatoliy Zabrovskiy, Radu Prodan,
    Dragi Kimovski, Christian Timmerer, "Automated Bank Cheque Verification Using Image
    Processing and Deep Learning Methods", Multimedia tools and applications (MTAP), 80(1),
    pp. 1-32. https://doi.org/10.1007/s11042-020-09818-1.
[18] Prateek Agrawal, Deepak Chaudhary, Vishu Madaan, Anatoliy Zabrovskiy, Radu Prodan,
    Dragi Kimovski, Christian Timmerer, "Automated Bank Cheque Verification Using Image
    Processing and Deep Learning Methods", Multimedia tools and applications (MTAP), 80(1),
    pp. 1-32. https://doi.org/10.1007/s11042-020-09818-1.
[19] Neha Bhadwal, Prateek Agrawal, Vishu Madaan, Awadhesh Shukla, Anuj Kakran, "Smart
    Border Surveillance System using Wireless Sensor Network and Computer Vision",
    International Conference on Automation, Computational and Technology Management
    (ICACTM'19), pp. 183-190, IEEEXplore.